HDF5 is an excellent format for storing large, multi-dimensional
numerical arrays. h5lite simplifies the process of reading
and writing matrices and arrays by handling the complex memory layout
differences between R and HDF5 automatically.
This vignette covers writing matrices, preserving dimension names
(dimnames), and understanding how h5lite
manages dimension ordering.
In R, matrices are simply 2-dimensional arrays. You can write them
directly using h5_write(). h5lite preserves
the dimensions exactly as they appear in R.
The same logic applies to arrays with 3 or more dimensions.
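For instance (a minimal sketch; the temporary file path and dataset names below are only illustrative, and the same file object is reused in the examples that follow):

library(h5lite)

# A temporary HDF5 file used for the examples in this vignette
file <- tempfile(fileext = ".h5")

# A 3x4 matrix is stored with its dimensions preserved as 3x4
mat <- matrix(rnorm(12), nrow = 3, ncol = 4)
h5_write(mat, file, "examples/matrix")
dim(h5_read(file, "examples/matrix"))
#> [1] 3 4

# A 2x3x4 array works the same way
arr <- array(rnorm(24), dim = c(2, 3, 4))
h5_write(arr, file, "examples/array3d")
dim(h5_read(file, "examples/array3d"))
#> [1] 2 3 4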
R objects often carry metadata in the form of dimnames
(row names, column names, etc.). HDF5 does not have a native “row name”
concept for numerical arrays, but it supports Dimension
Scales.
h5lite automatically converts R dimnames
into HDF5 Dimension Scales. This allows your row and column names to
survive the round-trip to disk and back.
# Create a matrix with row and column names
data <- matrix(rnorm(6), nrow = 2)
rownames(data) <- c("Sample_A", "Sample_B")
colnames(data) <- c("Gene_1", "Gene_2", "Gene_3")
h5_write(data, file, "genetics/expression")
# Read back
data_in <- h5_read(file, "genetics/expression")
print(data_in)
#>             Gene_1     Gene_2      Gene_3
#> Sample_A 1.3613766 -1.0888527 -0.01808857
#> Sample_B 0.4309375  0.1311432  1.62414883

Technical Note: In the HDF5 file, the names are stored as
separate datasets (e.g., _rownames, _colnames)
and linked to the main dataset using HDF5 Dimension Scale
attributes.
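If you want to see this layout directly, any generic HDF5 browser will show the extra name datasets alongside the main one. One option from R is rhdf5::h5ls() (a sketch; the paths mentioned in the comments are assumptions based on the note above, not a guaranteed layout):

# Requires the Bioconductor package rhdf5
rhdf5::h5ls(file)
# Expect to see /genetics/expression plus companion string datasets
# (e.g. /genetics/_rownames and /genetics/_colnames) linked to it
# via Dimension Scale attributes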
One of the most confusing aspects of HDF5 for R users is dimension ordering.
To ensure that a 3x4 matrix in R looks like a
3x4 dataset in HDF5 tools (like h5dump or
HDFView), h5lite physically
transposes the data during read/write operations.
On write, h5lite converts R’s column-major memory layout to HDF5’s
row-major layout. On read, h5lite converts the data back to
column-major for R. This ensures that indexing is preserved:
x[2, 1] in R refers to the exact same value after reading
it back from HDF5.
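You can verify this round-trip guarantee yourself (a quick sketch reusing the file from above; the dataset name is arbitrary):

x <- matrix(rnorm(12), nrow = 3, ncol = 4)
h5_write(x, file, "layout/check")
x_in <- h5_read(file, "layout/check")

dim(x_in)
#> [1] 3 4
all.equal(x[2, 1], x_in[2, 1])   # same element in the same position
#> [1] TRUE
all.equal(x, x_in)               # the whole matrix survives unchanged
#> [1] TRUE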
Because h5lite writes data in C order (row-major), matching HDF5’s
native convention, files created with h5lite are read
correctly by Python tools such as h5py or
pandas.
A matrix written with dim() equal to (3, 4) in R is seen as a dataset of shape (3, 4) by h5py.

Note: Some other R packages create HDF5 files by swapping the
dimensions (writing a 3x4 matrix as 4x3) to avoid the cost of
transposing data. h5lite prioritizes correctness and
interoperability over raw write speed.
Matrices and arrays benefit significantly from compression. When you
enable compression, h5lite automatically “chunks” the
dataset (breaks it into smaller tiles).
# Large matrix of zeros (highly compressible)
sparse_mat <- matrix(0, nrow = 1000, ncol = 1000)
sparse_mat[1:10, 1:10] <- 1
# Write with compression (zlib level 5)
h5_write(sparse_mat, file, "compressed/matrix", compress = TRUE)
# Write with high compression (zlib level 9)
h5_write(sparse_mat, file, "compressed/matrix_max", compress = 9)

h5lite is designed for simplicity and currently
reads/writes full datasets at once. It does not support
partial I/O (hyperslabs), such as reading only rows 1-10 of a 1,000,000
row matrix.
If you need to read specific subsets of data that are too large to
fit in memory, you should consider using the rhdf5 or
hdf5r packages.
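For reference, a partial read with rhdf5 looks roughly like this (a sketch, not part of h5lite; it assumes the compressed 1000x1000 matrix written above, and note that rhdf5 applies its own dimension-ordering convention, so double-check orientation when mixing packages):

# Read only rows 1-10 (all columns) without loading the full dataset
subset_rows <- rhdf5::h5read(file, "compressed/matrix", index = list(1:10, NULL))
dim(subset_rows)
#> [1]   10 1000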