bioconductor v3.9.0 HDF5Array

Implements the HDF5Array and TENxMatrix classes, 2 convenient

Link to this section Summary

Functions

HDF5ArraySeed objects

HDF5 datasets as DelayedArray objects

TENxMatrixSeed objects

10x Genomics datasets as DelayedMatrix objects

HDF5 dump management

An alternative to rhdf5::h5read

Save/load an HDF5-based SummarizedExperiment object

Write an array-like object to an HDF5 file

Write a matrix-like object as an HDF5-based sparse matrix

Link to this section Functions

Link to this function

HDF5ArraySeed_class()

HDF5ArraySeed objects

Description

HDF5ArraySeed is a low-level helper class for representing a pointer to an HDF5 dataset. HDF5ArraySeed objects are not intended to be used directly. Most end-users should create and manipulate HDF5Array objects instead. See ? for more information.

Usage

## Constructor function:
HDF5ArraySeed(filepath, name, type=NA)

Arguments

ArgumentDescription
filepath, name, typeSee ? for a description of these arguments.

Details

No operation can be performed directly on an HDF5ArraySeed object. It first needs to be wrapped in a DelayedArray object. The result of this wrapping is an HDF5Array object (an HDF5Array object is just an HDF5ArraySeed object wrapped in a DelayedArray object).

Value

An HDF5ArraySeed object.

Seealso

  • HDF5Array objects.

  • The list("rhdf5") package on top of which HDF5ArraySeed objects are implemented.

Examples

library(h5vcData)
tally_file <- system.file("extdata", "example.tally.hfs5",
package="h5vcData")

library(rhdf5)  # for h5ls()
h5ls(tally_file)

x <- HDF5ArraySeed(tally_file, "/ExampleStudy/16/Coverages")
x
path(x)
dim(x)
chunkdim(x)
Link to this function

HDF5Array_class()

HDF5 datasets as DelayedArray objects

Description

The HDF5Array class is a DelayedArray subclass for representing a conventional (i.e. dense) HDF5 dataset.

All the operations available for DelayedArray objects work on HDF5Array objects.

Usage

## Constructor function:
HDF5Array(filepath, name, type=NA)

Arguments

ArgumentDescription
filepathThe path (as a single character string) to the HDF5 file where the dataset is located.
nameThe name of the dataset in the HDF5 file.
typeNA or the R atomic type (specified as a single string) corresponding to the type of the HDF5 dataset.

Value

An HDF5Array object.

Seealso

Note

The 1.3 Million Brain Cell Dataset and other datasets published by 10x Genomics use an HDF5-based sparse matrix representation instead of the conventional (i.e. dense) HDF5 representation.

If your dataset uses the conventional (i.e. dense) HDF5 representation, use the HDF5Array() constructor.

If your dataset uses the HDF5-based sparse matrix representation from 10x Genomics, use the TENxMatrix constructor.

Examples

## ---------------------------------------------------------------------
## CONSTRUCTION
## ---------------------------------------------------------------------
library(h5vcData)
tally_file <- system.file("extdata", "example.tally.hfs5",
package="h5vcData")

library(rhdf5)  # for h5ls()
h5ls(tally_file)

## Pick up "Coverages" dataset for Human chromosome 16:
cov0 <- HDF5Array(tally_file, "/ExampleStudy/16/Coverages")
cov0

is(cov0, "DelayedArray")  # TRUE
seed(cov0)
path(cov0)
chunkdim(cov0)

## ---------------------------------------------------------------------
## dim/dimnames
## ---------------------------------------------------------------------
dim(cov0)

dimnames(cov0)
dimnames(cov0) <- list(paste0("s", 1:6), c("+", "-"), NULL)
dimnames(cov0)

## ---------------------------------------------------------------------
## SLICING (A.K.A. SUBSETTING)
## ---------------------------------------------------------------------
cov1 <- cov0[ , , 29000001:29000007]
cov1

dim(cov1)
as.array(cov1)
stopifnot(identical(dim(as.array(cov1)), dim(cov1)))
stopifnot(identical(dimnames(as.array(cov1)), dimnames(cov1)))

cov2 <- cov0[ , "+", 29000001:29000007]
cov2
as.matrix(cov2)

## ---------------------------------------------------------------------
## SummarizedExperiment OBJECTS WITH DELAYED ASSAYS
## ---------------------------------------------------------------------

## DelayedArray objects can be used inside a SummarizedExperiment object
## to hold the assay data and to delay operations on them.

library(SummarizedExperiment)

pcov <- cov0[ , 1, ]  # coverage on plus strand
mcov <- cov0[ , 2, ]  # coverage on minus strand

nrow(pcov)  # nb of samples
ncol(pcov)  # length of Human chromosome 16

## The convention for a SummarizedExperiment object is to have 1 column
## per sample so first we need to transpose 'pcov' and 'mcov':
pcov <- t(pcov)
mcov <- t(mcov)
se <- SummarizedExperiment(list(pcov=pcov, mcov=mcov))
se
stopifnot(validObject(se, complete=TRUE))

## A GPos object can be used to represent the genomic positions along
## the dataset:
gpos <- GPos(GRanges("16", IRanges(1, nrow(se))))
gpos
rowRanges(se) <- gpos
se
stopifnot(validObject(se))
assays(se)$pcov
assays(se)$mcov
Link to this function

TENxMatrixSeed_class()

TENxMatrixSeed objects

Description

TENxMatrixSeed is a low-level helper class for representing a pointer to an HDF5-based sparse matrix like one used by 10x Genomics for the 1.3 Million Brain Cell Dataset. TENxMatrixSeed objects are not intended to be used directly. Most end-users should create and manipulate TENxMatrix objects instead. See ? for more information.

Usage

## Constructor function:
TENxMatrixSeed(filepath, group="mm10")

Arguments

ArgumentDescription
filepath, groupSee ? for a description of these arguments.

Details

No operation can be performed directly on a TENxMatrixSeed object. It first needs to be wrapped in a DelayedMatrix object. The result of this wrapping is a TENxMatrix object (a TENxMatrix object is just a TENxMatrixSeed object wrapped in a DelayedMatrix object).

Value

TENxMatrixSeed() returns a TENxMatrixSeed object.

See ? for the value returned by sparsity() and extractNonzeroDataByCol() .

Seealso

  • TENxMatrix objects.

  • The list("rhdf5") package on top of which TENxMatrixSeed objects are implemented.

  • The TENxBrainData dataset (in the TENxBrainData package).

Examples

## The 1.3 Million Brain Cell Dataset from 10x Genomics is available
## via ExperimentHub:
library(ExperimentHub)
hub <- ExperimentHub()
query(hub, "TENxBrainData")
fname <- hub[["EH1039"]]

## The structure of this HDF5 file can be seen using the h5ls() command
## from the rhdf5 package:
library(rhdf5)
h5ls(fname)

## The 1.3 Million Brain Cell Dataset is represented by the "mm10"
## group. We point the TENxMatrixSeed() constructor to this group
## to create a TENxMatrixSeed object representing the dataset:
x <- TENxMatrixSeed(fname, "mm10")
x
path(x)
dim(x)
sparsity(x)
Link to this function

TENxMatrix_class()

10x Genomics datasets as DelayedMatrix objects

Description

The 1.3 Million Brain Cell Dataset and other datasets published by 10x Genomics use an HDF5-based sparse matrix representation instead of the conventional (i.e. dense) HDF5 representation.

The TENxMatrix class is a DelayedMatrix subclass for representing an HDF5-based sparse matrix like one used by 10x Genomics for the 1.3 Million Brain Cell Dataset.

All the operations available for DelayedMatrix objects work on TENxMatrix objects.

Usage

## Constructor functions:
TENxMatrix(filepath, group="mm10")
## sparsity() and a convenient data extractor:
sparsity(x)
extractNonzeroDataByCol(x, j)

Arguments

ArgumentDescription
filepathThe path (as a single character string) to the HDF5 file where the 10x Genomics dataset is located.
groupThe name of the group in the HDF5 file containing the 10x Genomics data.
xA TENxMatrix (or TENxMatrixSeed ) object.
jAn integer vector containing valid column indices.

Value

TENxMatrix : A TENxMatrix object.

sparsity : The number of zero-valued matrix elements in the object divided by its total number of elements (a.k.a. its length).

extractNonzeroDataByCol : A NumericList or IntegerList object parallel to j i.e. with one list element per column index in j . The row indices of the values are not returned. Furthermore, the values within a given list element can be returned in any order. In particular you should not assume that they are ordered by ascending row index.

Seealso

Note

If your dataset uses the HDF5-based sparse matrix representation from 10x Genomics, use the TENxMatrix() constructor.

If your dataset uses the conventional (i.e. dense) HDF5 representation, use the HDF5Array constructor.

Examples

## ---------------------------------------------------------------------
## THE "1.3 Million Brain Cell Dataset" AS A DelayedMatrix OBJECT
## ---------------------------------------------------------------------
## The 1.3 Million Brain Cell Dataset from 10x Genomics is available
## via ExperimentHub:
library(ExperimentHub)
hub <- ExperimentHub()
query(hub, "TENxBrainData")
fname <- hub[["EH1039"]]

## The structure of this HDF5 file can be seen using the h5ls() command
## from the rhdf5 package:
library(rhdf5)
h5ls(fname)

## The 1.3 Million Brain Cell Dataset is represented by the "mm10"
## group. We point the TENxMatrix() constructor to this group to
## create a TENxMatrix object representing the dataset:
oneM <- TENxMatrix(fname, "mm10")
oneM

is(oneM, "DelayedMatrix")  # TRUE
seed(oneM)
path(oneM)
sparsity(oneM)

## Some examples of delayed operations:
oneM != 0
oneM^2

## ---------------------------------------------------------------------
## SOME EXAMPLES OF ROW/COL SUMMARIZATION
## ---------------------------------------------------------------------
## In order to reduce computation times, we'll use only the first
## 50000 columns of the 1.3 Million Brain Cell Dataset:
oneM50k <- oneM[ , 1:50000]

## Row/col summarization methods like rowSums() use a block-processing
## mechanism behind the scene that can be controlled via global
## settings. 2 important settings that can have a strong impact on
## performance are the automatic number of workers and automatic block
## size, controlled by setAutoBPPARAM() and setAutoBlockSize()
## respectively. On a modern Linux laptop with 8 core (as reported
## by parallel::detectCores()) and 16 Gb of RAM, reasonably good
## performance is achieved by setting the automatic number of workers
## to 6 and automatic block size to 500 Mb:
workers <- 6
block_size <- 5e8  # 500 Mb
if (.Platform$OS.type != "windows") {
setAutoBPPARAM(MulticoreParam(workers))
} else {
## MulticoreParam() is not supported on Windows so we use SnowParam()
## on this platform. Also we reduce the block size to 250 Mb on
## 32-bit Windows to avoid memory allocation problems (they tend to
## be common there because a process cannot use more than 2 Gb of
## memory).
setAutoBPPARAM(SnowParam(workers))
if (.Platform$r_arch == "i386")
block_size <- 2.5e8  # 250 Mb
}
setAutoBlockSize(block_size)
DelayedArray:::set_verbose_block_processing(TRUE)

## We're ready to compute the library sizes, number of genes expressed
## per cell, and average expression across cells:
system.time(lib_sizes <- colSums(oneM50k))
system.time(n_exprs <- colSums(oneM50k != 0))
system.time(ave_exprs <- rowMeans(oneM50k))

## Note that the 3 computations above load the data in oneM50k 3 times
## in memory. This can be avoided by computing the 3 summarizations in
## a single pass with blockApply(). First we define the function that
## we're going to apply to each block of data:
FUN <- function(block)
list(colSums(block), colSums(block != 0), rowSums(block))

## Then we call blockApply() to apply FUN() to each block. The blocks
## are defined by the grid passed to the 'grid' argument. In this case
## we supply a grid made with colGrid() to generate blocks of full
## columns (see ?colGrid for more information):
system.time({
block_results <- blockApply(oneM50k, FUN, grid=colGrid(oneM50k))
})

## 'block_results' is a list with 1 list element per block in
## colGrid(oneM50k). Each list element is the result that was obtained
## by applying FUN() on the block so is itself a list of length 3.
## Let's combine the results:
lib_sizes2 <- unlist(lapply(block_results, `[[`, 1L))
n_exprs2 <- unlist(lapply(block_results, `[[`, 2L))
block_rowsums <- unlist(lapply(block_results, `[[`, 3L), use.names=FALSE)
tot_exprs <- rowSums(matrix(block_rowsums, nrow=nrow(oneM50k)))
ave_exprs2 <- setNames(tot_exprs / ncol(oneM50k), rownames(oneM50k))

## Sanity checks:
stopifnot(all.equal(lib_sizes, lib_sizes2))
stopifnot(all.equal(n_exprs, n_exprs2))
stopifnot(all.equal(ave_exprs, ave_exprs2))

## Reset automatic number of workers and automatic block size to factory
## settings:
setAutoBPPARAM()
setAutoBlockSize()
DelayedArray:::set_verbose_block_processing(FALSE)

## ---------------------------------------------------------------------
## extractNonzeroDataByCol()
## ---------------------------------------------------------------------
## extractNonzeroDataByCol() provides a convenient and very efficient
## way to extract the nonzero data in a compact form:
nonzeroes <- extractNonzeroDataByCol(oneM, 1:50000)  # takes < 5 sec.

## The data is returned as an IntegerList object with one list element
## per column and no row indices associated to the values in the object.
## Furthermore, the values within a given list element can be returned
## in any order:
nonzeroes

names(nonzeroes) <- colnames(oneM50k)

## This can be used to compute some simple summaries like the library
## sizes and the number of genes expressed per cell. For these use
## cases, it is a lot more efficient than using colSums(oneM50k) and
## colSums(oneM50k != 0):
lib_sizes3 <- sum(nonzeroes)
n_exprs3 <- lengths(nonzeroes)

## Sanity checks:
stopifnot(all.equal(lib_sizes, lib_sizes3))
stopifnot(all.equal(n_exprs, n_exprs3))
Link to this function

dump_management()

HDF5 dump management

Description

A set of utilities to control the location and physical properties of automatically created HDF5 datasets.

Usage

setHDF5DumpDir(dir)
setHDF5DumpFile(filepath)
setHDF5DumpName(name)
setHDF5DumpChunkLength(length=1000000L)
setHDF5DumpChunkShape(shape="scale")
setHDF5DumpCompressionLevel(level=6L)
getHDF5DumpDir()
getHDF5DumpFile(for.use=FALSE)
getHDF5DumpName(for.use=FALSE)
getHDF5DumpChunkLength()
getHDF5DumpChunkShape()
getHDF5DumpCompressionLevel()
lsHDF5DumpFile()
showHDF5DumpLog()
## For developers:
getHDF5DumpChunkDim(dim)
appendDatasetCreationToHDF5DumpLog(filepath, name, dim, type,
                                   chunkdim, level)

Arguments

ArgumentDescription
dirThe path (as a single string) to the current HDF5 dump directory , that is, to the (new or existing) directory where HDF5 dump files with automatic names will be created. This is ignored if the user specified an HDF5 dump file with setHDF5DumpFile . If dir is missing, then the HDF5 dump directory is set back to its default value i.e. to some directory under tempdir() (call getHDF5DumpDir() to get the exact path).

|filepath | For setHDF5DumpFile : The path (as a single string) to the current list("HDF5 dump file") , that is, to the (new or existing) HDF5 file where the list("next ", " automatic HDF5 datasets") will be written. If filepath is missing, then a new file with an automatic name will be created (in getHDF5DumpDir() ) and used for each new dataset. For appendDatasetCreationToHDF5DumpLog : See the Note TO DEVELOPERS below. | |name | For setHDF5DumpName : The name of the next automatic HDF5 dataset to be written to the current HDF5 dump file . For appendDatasetCreationToHDF5DumpLog : See the Note TO DEVELOPERS below. | |length | The maximum length of the physical chunks of the list("next automatic ", " HDF5 dataset") to be written to the current list("HDF5 dump file") . | |shape | A string specifying the shape of the physical chunks of the list("next ", " automatic HDF5 dataset") to be written to the current list("HDF5 dump file") . See makeCappedVolumeBox in the list("DelayedArray") package for a description of the supported shapes. | |level | For setHDF5DumpCompressionLevel : The compression level to use for writing automatic HDF5 datasets to disk. See the level argument in ?rhdf5:: (in the rhdf5 package) for more information about this. For appendDatasetCreationToHDF5DumpLog : See the Note TO DEVELOPERS below. | |for.use | Whether the returned file or dataset name is for use by the caller or not. See below for the details. | |dim | The dimensions of the HDF5 dataset to be written to disk, that is, an integer vector of length one or more giving the maximal indices in each dimension. See the dims argument in ?rhdf5:: (in the rhdf5 package) for more information about this. | |type | The type (a.k.a. storage mode) of the data to be written to disk. Can be obtained with type on an array-like object (which is equivalent to storage.mode() or typeof() on an ordinary array). This is typically what an application writing datasets to the HDF5 dump should pass to the storage.mode argument of its call to rhdf5:: . See the Note TO DEVELOPERS below for more information. | |chunkdim | The dimensions of the chunks. |

Details

Calling getHDF5DumpFile() and getHDF5DumpName() with no argument should be list("informative") only i.e. it's a mean for the user to know where the list("next automatic HDF5 dataset") will be written. Since a given file/name combination can be used only once, the user should be careful to not use that combination to explicitely create an HDF5 dataset because that would get in the way of the creation of the list("next ", " automatic HDF5 dataset") . See the Note TO DEVELOPERS below if you actually need to use this file/name combination.

lsHDF5DumpFile() is a just convenience wrapper for rhdf5:: .

Value

getHDF5DumpDir returns the absolute path to the directory where list("HDF5 dump files") with automatic names will be created. Only meaningful if the user did NOT specify an list("HDF5 dump file") with setHDF5DumpFile .

getHDF5DumpFile returns the absolute path to the HDF5 file where the list("next automatic HDF5 dataset") will be written.

getHDF5DumpName returns the name of the list("next automatic HDF5 ", " dataset") .

getHDF5DumpCompressionLevel returns the compression level currently used for writing list("automatic HDF5 datasets") to disk.

showHDF5DumpLog returns the dump log in an invisible data frame.

getHDF5DumpChunkDim returns the dimensions of the physical chunks that will be used to write the dataset to disk.

Seealso

Note

TO DEVELOPERS:

If your application needs to write its own dataset to the HDF5 dump then it should:

  • Get a file/name combination by calling getHDF5DumpFile(for.use=TRUE) and getHDF5DumpName(for.use=TRUE) .

  • [OPTIONAL] Call getHDF5DumpChunkDim(dim) to get reasonable chunk dimensions to use for writing the dataset to disk. Or choose your own chunk dimensions.

  • Add an entry to the dump log by calling appendDatasetCreationToHDF5DumpLog . Typically, this should be done right after creating the dataset (e.g. with rhdf5::h5createDataset ) and before starting to write the dataset to disk. The values passed to appendDatasetCreationToHDF5DumpLog via the filepath , name , dim , type , chunkdim , and level arguments should be those that were passed to rhdf5::h5createDataset via the file , dataset , dims , storage.mode , chunk , and level arguments, respectively. Note that appendDatasetCreationToHDF5DumpLog uses a lock mechanism so is safe to use in the context of parallel execution.
    This is actually what the coercion method to HDF5Array does internally.

Examples

getHDF5DumpDir()
getHDF5DumpFile()

## Use setHDF5DumpFile() to change the current HDF5 dump file.
## If the specified file exists, then it must be in HDF5 format or
## an error will be raised. If it doesn't exist, then it will be
## created.
#setHDF5DumpFile("path/to/some/HDF5/file")

lsHDF5DumpFile()

a <- array(1:600, c(150, 4))
A <- as(a, "HDF5Array")
lsHDF5DumpFile()
A

b <- array(runif(6000), c(4, 2, 150))
B <- as(b, "HDF5Array")
lsHDF5DumpFile()
B

C <- (log(2 * A + 0.88) - 5)^3 * t(B[ , 1, ])
as(C, "HDF5Array")  # realize C on disk
lsHDF5DumpFile()

## Matrix multiplication is not delayed: the output matrix is realized
## block by block. The current "realization backend" controls where
## realization happens e.g. in memory if set to NULL or in an HDF5 file
## if set to "HDF5Array". See '?realize' in the DelayedArray package for
## more information about "realization backends".
setRealizationBackend("HDF5Array")
m <- matrix(runif(20), nrow=4)
P <- C %*% m
lsHDF5DumpFile()

## See all the HDF5 datasets created in the current session so far:
showHDF5DumpLog()

## Wrap the call in suppressMessages() if you are only interested in the
## data frame version of the dump log:
dump_log <- suppressMessages(showHDF5DumpLog())
dump_log

An alternative to rhdf5::h5read

Description

h5mread is the result of experimenting with alternative rhdf5::h5read implementations.

It should still be considered experimental!

Usage

h5mread(filepath, name, starts, counts=NULL, noreduce=FALSE,
        as.integer=FALSE, method=0L)
get_h5mread_returned_type(filepath, name, as.integer=FALSE)

Arguments

ArgumentDescription
filepathThe path (as a single character string) to the HDF5 file where the dataset to read from is located.
nameThe name of the dataset in the HDF5 file.
starts, counts2 lists specifying the array selection. The 2 lists must have one list element per dimension in the dataset. Each list element in starts must be a vector of valid positive indices along the corresponding dimension in the dataset. An empty vector ( integer(0) ) is accepted and indicates an empty selection along that dimension. A NULL is accepted and indicates a full selection along the dimension so has the same meaning as a missing subscript when subsetting an array-like object with [ . (Note that for [ a NULL subscript indicates an empty selection.) Each list element in counts must be NULL or a vector of non-negative integers of the same length as the corresponding list element in starts . Each value in the vector indicates how many positions to select starting from the associated start value. A NULL indicates that a single position is selected for each value along the corresponding dimension. If counts is NULL, then each index in each starts list element indicates a single position selection along the corresponding dimension. Note that in this case the starts argument is equivalent to the index argument of h5read and extract_array (with the caveat that h5read doesn't accept empty selections). Finally note that when counts is not NULL then the selection described by starts and counts must be strictly ascending along each dimension.
noreduceTODO
as.integerTODO
methodTODO

Details

COMING SOON...

Value

An array for h5mread .

The type of the array that will be returned by h5mread for get_h5mread_returned_type . Equivalent to: list(" typeof(h5mread(filepath, name, rep(list(integer(0)), ndim))) ", " ") where ndim is the number of dimensions (a.k.a. the list("rank") in hdf5 jargon) of the dataset. get_h5mread_returned_type is provided for convenience.

Seealso

Examples

## ---------------------------------------------------------------------
## BASIC USAGE
## ---------------------------------------------------------------------
m0 <- matrix((runif(600) - 0.5) * 10, ncol=12)
M0 <- writeHDF5Array(m0, name="M0")

m <- h5mread(path(M0), "M0", starts=list(NULL, c(3, 12:8)))
stopifnot(identical(m0[ , c(3, 12:8)], m))

m <- h5mread(path(M0), "M0", starts=list(integer(0), c(3, 12:8)))
stopifnot(identical(m0[NULL , c(3, 12:8)], m))

m <- h5mread(path(M0), "M0", starts=list(1:5, NULL), as.integer=TRUE)
storage.mode(m0) <- "integer"
stopifnot(identical(m0[1:5, ], m))

m1 <- matrix(1:60, ncol=6)
M1 <- writeHDF5Array(m1, filepath=path(M0), name="M1")
h5ls(path(M1))

m <- h5mread(path(M1), "M1", starts=list(c(2, 7), NULL),
counts=list(c(4, 2), NULL))
stopifnot(identical(m1[c(2:5, 7:8), ], m))

## ---------------------------------------------------------------------
## PERFORMANCE
## ---------------------------------------------------------------------
library(ExperimentHub)
hub <- ExperimentHub()

## With the "sparse" TENxBrainData dataset
## ---------------------------------------
fname0 <- hub[["EH1039"]]
h5ls(fname0)  # all datasets are 1D datasets

index <- list(77 * sample(34088679, 5000, replace=TRUE))
## h5mread() about 3x faster than h5read():
system.time(a <- h5mread(fname0, "mm10/data", index))
system.time(b <- h5read(fname0, "mm10/data", index=index))
stopifnot(identical(a, b))

index <- list(sample(1306127, 7500, replace=TRUE))
## h5mread() about 20x faster than h5read():
system.time(a <- h5mread(fname0, "mm10/barcodes", index))
system.time(b <- h5read(fname0, "mm10/barcodes", index=index))
stopifnot(identical(a, b))

## With the "dense" TENxBrainData dataset
## ---------------------------------------
fname1 <- hub[["EH1040"]]
h5ls(fname1)  # "counts" is a 2D dataset

index <- list(sample(  27998, 250, replace=TRUE),
sample(1306127, 250, replace=TRUE))
## h5mread() about 2x faster than h5read():
system.time(a <- h5mread(fname1, "counts", index))
system.time(b <- h5read(fname1, "counts", index=index))
stopifnot(identical(a, b))

## The bigger the selection, the greater the speedup between
## h5read() and h5mread():
index <- list(sample(  27998, 1000, replace=TRUE),
sample(1306127, 1000, replace=TRUE))
## h5mread() about 8x faster than h5read() (22s vs 3min):
system.time(a <- h5mread(fname1, "counts", index))
system.time(b <- h5read(fname1, "counts", index=index))
stopifnot(identical(a, b))
Link to this function

saveHDF5SummarizedExperiment()

Save/load an HDF5-based SummarizedExperiment object

Description

saveHDF5SummarizedExperiment and loadHDF5SummarizedExperiment can be used to save/load an HDF5-based SummarizedExperiment object to/from disk.

NOTE: These functions use functionalities from the SummarizedExperiment package internally and so require this package to be installed.

Usage

saveHDF5SummarizedExperiment(x, dir="my_h5_se", prefix="", replace=FALSE,
                                chunkdim=NULL, level=NULL, verbose=FALSE)
loadHDF5SummarizedExperiment(dir="my_h5_se", prefix="")
quickResaveHDF5SummarizedExperiment(x, verbose=FALSE)

Arguments

ArgumentDescription
xA SummarizedExperiment object or derivative. For quickResaveHDF5SummarizedExperiment the object must have been previously saved with saveHDF5SummarizedExperiment (and has been possibly modified since then).
dirThe path (as a single string) to the directory where to save the HDF5-based SummarizedExperiment object or to load it from. When saving, the directory will be created if it doesn't already exist. If the directory already exists and no prefix is specified and replace is set to TRUE , then it's replaced with an empty directory.
prefixAn optional prefix to add to the names of the files created inside dir . Allows saving more than one object in the same directory.
replaceWhen no prefix is specified, should a pre-existing directory be replaced with a new empty one? The content of the pre-existing directory will be lost!
chunkdim, levelThe dimensions of the chunks and the compression level to use for writing the assay data to disk. Passed to the internal calls to writeHDF5Array . See ? for more information.
verboseSet to TRUE to make the function display progress.

Details

list(" ", " ", list(list(list("saveHDF5SummarizedExperiment()"), ":"), list(" ", " Creates the directory specified thru the ", list("dir"), " argument and ", " populates it with the HDF5 datasets (one per assay in ", list("x"), ") ", " plus a serialized version of ", list("x"), " that contains pointers to ", " these datasets. This directory provides a self-contained HDF5-based ", " representation of ", list("x"), " that can then be loaded back in R with ", " ",

list("loadHDF5SummarizedExperiment"), ".

", " ", " Note that this directory is ", list("relocatable"), " i.e. it can be moved ", " (or copied) to a different place, on the same or a different ", " computer, before calling ", list("loadHDF5SummarizedExperiment"), " on it. ", " For convenient sharing with collaborators, it is suggested to turn ", " it into a tarball (with Unix command ", list("tar"), "), or zip file, ", " before the transfer. ", " ", " Please keep in mind that ",

list("saveHDF5SummarizedExperiment"), " and

", " ", list("loadHDF5SummarizedExperiment"), " don't know how to produce/read ", " tarballs or zip files at the moment, so the process of ", " packaging/extracting the tarball or zip file is entirely the user ", " responsibility. This is typically done from outside R. ", " ", " Finally please note that, depending on the size of the data to ", " write to disk and the performance of the disk, ", " ", list("saveHDF5SummarizedExperiment"),

" can take a long time to complete.

", " Use ", list("verbose=TRUE"), " to see its progress. ", " ")), " ", " ", list(list(list("loadHDF5SummarizedExperiment()"), ":"), list(" ", " Typically very fast, even if the assay data is big, because all ", " the assays in the returned object are ", list("HDF5Array"), " objects ", " pointing to the on-disk HDF5 datasets located in ", list("dir"), ". ", " ", list("HDF5Array"), " objects are typically light-weight in memory. ",

"    ")), "

", " ", list(list(list("quickResaveHDF5SummarizedExperiment()"), ":"), list(" ", " Preserves the HDF5 file and datasets that the assays in ", list("x"), " ", " are already pointing to (and which were created by an earlier call ", " to ", list("saveHDF5SummarizedExperiment"), "). All it does is re-serialize ", " ", list("x"), " on top of the ", list(".rds"), " file that is associated with ", " this HDF5 file (and which was created by an earlier call to ",

"      ", list("saveHDF5SummarizedExperiment"), " or

", " ", list("quickResaveHDF5SummarizedExperiment"), "). Because the delayed ", " operations possibly carried by the assays in ", list("x"), " are not ", " realized, this is very fast. ", " ")), " ", " ")

Value

saveHDF5SummarizedExperiment returns an invisible SummarizedExperiment object that is the same as what loadHDF5SummarizedExperiment will return when loading back the object. All the assays in the object are HDF5Array objects pointing to datasets in the HDF5 file saved in dir .

Seealso

Note

The files created by saveHDF5SummarizedExperiment in the user-specified directory dir should not be renamed.

The user-specified directory created by saveHDF5SummarizedExperiment is relocatable i.e. it can be renamed and/or moved around, but not the individual files in it.

Author

Hervé Pagès

Examples

## ---------------------------------------------------------------------
## saveHDF5SummarizedExperiment() / loadHDF5SummarizedExperiment()
## ---------------------------------------------------------------------
library(SummarizedExperiment)

nrow <- 200
ncol <- 6
counts <- matrix(as.integer(runif(nrow * ncol, 1, 1e4)), nrow)
colData <- DataFrame(Treatment=rep(c("ChIP", "Input"), 3),
row.names=LETTERS[1:6])
se0 <- SummarizedExperiment(assays=list(counts=counts), colData=colData)
se0

## Save 'se0' as an HDF5-based SummarizedExperiment object:
dir <- tempfile("h5_se0_")
h5_se0 <- saveHDF5SummarizedExperiment(se0, dir)
list.files(dir)

h5_se0
assay(h5_se0, withDimnames=FALSE)   # HDF5Matrix object

h5_se0b <- loadHDF5SummarizedExperiment(dir)
h5_se0b
assay(h5_se0b, withDimnames=FALSE)  # HDF5Matrix object

## Sanity checks:
stopifnot(is(assay(h5_se0, withDimnames=FALSE), "HDF5Matrix"))
stopifnot(identical(assay(se0), as.matrix(assay(h5_se0))))
stopifnot(is(assay(h5_se0b, withDimnames=FALSE), "HDF5Matrix"))
stopifnot(identical(assay(se0), as.matrix(assay(h5_se0b))))

## ---------------------------------------------------------------------
## More sanity checks
## ---------------------------------------------------------------------

## Make a copy of directory 'dir':
somedir <- tempfile("somedir")
dir.create(somedir)
file.copy(dir, somedir, recursive=TRUE)
dir2 <- list.files(somedir, full.names=TRUE)

## 'dir2' contains a copy of 'dir'. Call loadHDF5SummarizedExperiment()
## on it.
h5_se0c <- loadHDF5SummarizedExperiment(dir2)

stopifnot(is(assay(h5_se0c, withDimnames=FALSE), "HDF5Matrix"))
stopifnot(identical(assay(se0), as.matrix(assay(h5_se0c))))

## ---------------------------------------------------------------------
## Using a prefix
## ---------------------------------------------------------------------

se1 <- se0[51:100, ]
saveHDF5SummarizedExperiment(se1, dir, prefix="xx_")
list.files(dir)
loadHDF5SummarizedExperiment(dir, prefix="xx_")

## ---------------------------------------------------------------------
## quickResaveHDF5SummarizedExperiment()
## ---------------------------------------------------------------------

se2 <- loadHDF5SummarizedExperiment(dir, prefix="xx_")
se2 <- se2[1:14, ]
assay1 <- assay(se2, withDimnames=FALSE)
assays(se2) <- c(assays(se2), list(score=assay1/100))
rowRanges(se2) <- GRanges("chr1", IRanges(1:14, width=5))
rownames(se2) <- letters[1:14]
se2

## This will replace saved 'se1'!
quickResaveHDF5SummarizedExperiment(se2, verbose=TRUE)
list.files(dir)
loadHDF5SummarizedExperiment(dir, prefix="xx_")
Link to this function

writeHDF5Array()

Write an array-like object to an HDF5 file

Description

A function for writing an array-like object to an HDF5 file.

Usage

writeHDF5Array(x, filepath=NULL, name=NULL, chunkdim=NULL, level=NULL,
               verbose=FALSE)

Arguments

ArgumentDescription
xThe array-like object to write to an HDF5 file. If x is a DelayedArray object, writeHDF5Array realizes it on disk, that is, all the delayed operations carried by the object are executed while the object is written to disk. See "On-disk realization of a DelayedArray object as an HDF5 dataset" section below for more information.

|filepath | NULL or the path (as a single string) to the (new or existing) HDF5 file where to write the dataset. If NULL , then the dataset will be written to the current list("HDF5 ", " dump file") i.e. to the file whose path is getHDF5DumpFile . | |name | NULL or the name of the HDF5 dataset to write. If NULL , then the name returned by getHDF5DumpName will be used. | |chunkdim | The dimensions of the chunks to use for writing the data to disk. By default, getHDF5DumpChunkDim(dim(x)) will be used. See ? for more information. | |level | The compression level to use for writing the data to disk. By default, getHDF5DumpCompressionLevel() will be used. See ? for more information. | |verbose | Set to TRUE to make the function display progress. |

Details

Please note that, depending on the size of the data to write to disk and the performance of the disk, writeHDF5Array can take a long time to complete. Use verbose=TRUE to see its progress.

Use setHDF5DumpFile and setHDF5DumpName to control the location of automatically created HDF5 datasets.

Use setHDF5DumpChunkLength , setHDF5DumpChunkShape , and setHDF5DumpCompressionLevel , to control the physical properties of automatically created HDF5 datasets.

Value

An HDF5Array object pointing to the newly written HDF5 dataset on disk.

Seealso

Examples

## ---------------------------------------------------------------------
## WRITE AN ORDINARY ARRAY TO AN HDF5 FILE
## ---------------------------------------------------------------------
m <- matrix(runif(350, min=-1), nrow=25)
out_file <- tempfile()

M1 <- writeHDF5Array(m, out_file, name="M1", chunkdim=c(5, 5))
M1
chunkdim(M1)

## ---------------------------------------------------------------------
## WRITE A DelayedArray OBJECT TO AN HDF5 FILE
## ---------------------------------------------------------------------
M2 <- log(t(DelayedArray(m)) + 1)
M2 <- writeHDF5Array(M2, out_file, name="M2", chunkdim=c(5, 5))
M2
chunkdim(M2)

library(rhdf5)
library(h5vcData)

tally_file <- system.file("extdata", "example.tally.hfs5",
package="h5vcData")
h5ls(tally_file)

cov0 <- HDF5Array(tally_file, "/ExampleStudy/16/Coverages")

cov1 <- cov0[ , , 29000001:29000007]

writeHDF5Array(cov1, out_file, "cov1")
h5ls(out_file)
Link to this function

writeTENxMatrix()

Write a matrix-like object as an HDF5-based sparse matrix

Description

The 1.3 Million Brain Cell Dataset and other datasets published by 10x Genomics use an HDF5-based sparse matrix representation instead of the conventional (i.e. dense) HDF5 representation.

writeTENxMatrix writes a matrix-like object to this format.

IMPORTANT NOTE: Only use writeTENxMatrix if the matrix-like object to write is sparse, that is, if most of its elements are zero. Using writeTENxMatrix on dense data is very inefficient! In this case, you should use writeHDF5Array instead.

Usage

writeTENxMatrix(x, filepath=NULL, group=NULL, level=NULL, verbose=FALSE)

Arguments

ArgumentDescription
xThe matrix-like object to write to an HDF5 file. The object to write should typically be sparse, that is, most of its elements should be zero. If x is a DelayedMatrix object, writeTENxMatrix realizes it on disk, that is, all the delayed operations carried by the object are executed while the object is written to disk.

|filepath | NULL or the path (as a single string) to the (new or existing) HDF5 file where to write the data. If NULL , then the data will be written to the current list("HDF5 ", " dump file") i.e. to the file whose path is getHDF5DumpFile . | |group | NULL or the name of the HDF5 group where to write the data. If NULL , then the name returned by getHDF5DumpName will be used. | |level | The compression level to use for writing the data to disk. By default, getHDF5DumpCompressionLevel() will be used. See ? for more information. | |verbose | Set to TRUE to make the function display progress. |

Details

Please note that, depending on the size of the data to write to disk and the performance of the disk, writeTENxMatrix can take a long time to complete. Use verbose=TRUE to see its progress.

Use setHDF5DumpFile and setHDF5DumpName to control the location of automatically created HDF5 datasets.

Value

A TENxMatrix object pointing to the newly written HDF5 data on disk.

Seealso

Examples

## ---------------------------------------------------------------------
## A SIMPLE EXAMPLE
## ---------------------------------------------------------------------
m0 <- matrix(0L, nrow=25, ncol=12,
dimnames=list(letters[1:25], LETTERS[1:12]))
m0[cbind(2:24, c(12:1, 2:12))] <- 100L + sample(55L, 23, replace=TRUE)
out_file <- tempfile()
M0 <- writeTENxMatrix(m0, out_file, group="m0")
M0
sparsity(M0)

path(M0)  # same as 'out_file'

## Use the h5ls() command from the rhdf5 package to see the structure of
## the file:
library(rhdf5)
h5ls(path(M0))

## ---------------------------------------------------------------------
## USING THE "1.3 Million Brain Cell Dataset"
## ---------------------------------------------------------------------

## The 1.3 Million Brain Cell Dataset from 10x Genomics is available via
## ExperimentHub:
library(ExperimentHub)
hub <- ExperimentHub()
query(hub, "TENxBrainData")
fname <- hub[["EH1039"]]
oneM <- TENxMatrix(fname, "mm10")  # see ?TENxMatrix for the details
oneM

## Note that the following transformation preserves sparsity:
M2 <- log(oneM + 1)  # delayed
M2                   # a DelayedMatrix instance

## In order to reduce computation times, we'll write only the first
## 5000 columns of M2 to disk:
out_file <- tempfile()
M3 <- writeTENxMatrix(M2[ , 1:5000], out_file, group="mm10", verbose=TRUE)
M3                   # a TENxMatrix instance