Tutorial 1: Using the ondisc_matrix class

This tutorial shows how to use ondisc_matrix, the core class implemented by ondisc. An ondisc_matrix is an R object that represents an expression matrix stored on disk rather than in memory. We cover initialization, querying basic information, subsetting, pulling submatrices into memory, and saving and loading. We begin by loading the ondisc package.

library(ondisc)

Initialization

ondisc ships with several example datasets, stored in the “extdata” subdirectory of the package.

raw_data_dir <- system.file("extdata", package = "ondisc")
list.files(raw_data_dir)
#> [1] "cell_barcodes.tsv"   "gene_expression.mtx" "genes.tsv"          
#> [4] "guides.tsv"          "perturbation.mtx"

The files “gene_expression.mtx”, “cell_barcodes.tsv”, and “genes.tsv” together define a gene-by-cell expression matrix. We save the full paths to these files in the variables mtx_fp, barcodes_fp, and features_fp.

mtx_fp <- paste0(raw_data_dir, "/gene_expression.mtx")
barcodes_fp <- paste0(raw_data_dir, "/cell_barcodes.tsv")
features_fp <- paste0(raw_data_dir, "/genes.tsv")
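To make the roles of the three input files concrete, the following sketch writes a tiny, made-up dataset in the same three-file layout using the Matrix package. The matrix values, the toy_* names, and the two-column layout of the features file (ID, tab, name) are assumptions chosen for illustration; they are not taken from the ondisc example data.

```r
library(Matrix)

# Toy 3-gene by 4-cell count matrix (values are made up for illustration).
toy_counts <- sparseMatrix(i = c(1, 2, 3, 3),
                           j = c(1, 2, 2, 4),
                           x = c(3, 2, 8, 1),
                           dims = c(3, 4))

toy_dir <- file.path(tempdir(), "toy_mtx")
dir.create(toy_dir, showWarnings = FALSE)

# Expression data in Matrix Market coordinate format.
toy_mtx_fp <- file.path(toy_dir, "gene_expression.mtx")
writeMM(toy_counts, file = toy_mtx_fp)

# One cell barcode per line.
toy_barcodes_fp <- file.path(toy_dir, "cell_barcodes.tsv")
writeLines(paste0("CELL", 1:4, "-1"), toy_barcodes_fp)

# One feature per line: ID, then a tab, then a human-readable name
# (this column layout is an assumption).
toy_features_fp <- file.path(toy_dir, "genes.tsv")
writeLines(paste(c("ENSG_A", "ENSG_B", "ENSG_C"),
                 c("GeneA", "GeneB", "GeneC"),
                 sep = "\t"),
           toy_features_fp)

# Reading the .mtx file back recovers the same matrix.
toy_roundtrip <- readMM(toy_mtx_fp)
```

The number of lines in the barcodes file matches the number of columns of the matrix, and the number of lines in the features file matches the number of rows.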

An ondisc_matrix consists of two parts: an HDF5 (i.e., .h5) file that stores the expression data on disk in a novel format, and an in-memory object that allows us to interact with the expression data from within R. The easiest way to initialize an ondisc_matrix is by calling the function create_ondisc_matrix_from_mtx. We pass to this function (i) a file path to the .mtx file storing the expression data, (ii) a file path to the .tsv file storing the cell barcodes, and (iii) a file path to the .tsv file storing the feature IDs and human-readable feature names. We can optionally specify the directory in which to store the initialized .h5 file; in this tutorial, we use the temporary directory.

temp_dir <- tempdir()
exp_mat_list <- create_ondisc_matrix_from_mtx(mtx_fp = mtx_fp,
                                              barcodes_fp = barcodes_fp,
                                              features_fp = features_fp,
                                              on_disk_dir = temp_dir)
#> (progress bar output omitted)
#> Writing CSC data.
#> Writing CSR data.

By default, create_ondisc_matrix_from_mtx returns a list of three elements: (i) an ondisc_matrix representing the expression data, (ii) a cell-wise covariate matrix, and (iii) a feature-wise covariate matrix. The exact cell-wise and feature-wise covariate matrices that are computed depend on the inputs to create_ondisc_matrix_from_mtx (see the documentation via ?create_ondisc_matrix_from_mtx for full details). The advantage of computing the cell-wise and feature-wise covariates at initialization is that it obviates the need to load the entire dataset into memory a second time.

expression_mat <- exp_mat_list$ondisc_matrix
head(expression_mat)
#> Showing 5 of 300 featuress and 6 of 900 cells:
#> Loading required package: Matrix
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    3    0    0    0    0    5
#> [2,]    0    2    0    0    0    0
#> [3,]    0    8    0    0    0    0
#> [4,]    0    0    0    0    0    0
#> [5,]    0    0    0    0    0    0
cell_covariates <- exp_mat_list$cell_covariates
head(cell_covariates)
#>   n_nonzero n_umis     p_mito
#> 1        43    214 0.04672897
#> 2        26    169 0.00000000
#> 3        22    116 0.05172414
#> 4        37    258 0.08139535
#> 5        36    224 0.08035714
#> 6        31    147 0.07482993
feature_covariates <- exp_mat_list$feature_covariates
head(feature_covariates)
#>   mean_expression coef_of_variation n_nonzero
#> 1       0.7577778          2.981871       114
#> 2       0.5977778          3.302883        96
#> 3       0.5788889          3.539932        85
#> 4       0.6533333          3.341677        91
#> 5       0.5522222          3.578487        82
#> 6       0.5455556          3.541223        84
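For intuition about what these covariates measure, here is a minimal in-memory sketch that computes columns with the same names on a small, made-up matrix. The definitions used are assumptions: p_mito is taken to be the fraction of a cell's UMIs that come from genes whose names start with "MT-", and coef_of_variation is taken to be the standard deviation divided by the mean. The package may define or compute them differently, and it does so without holding the whole matrix in memory.

```r
library(Matrix)

# Toy gene-by-cell count matrix; the last row plays the role of a mitochondrial gene.
toy_counts <- sparseMatrix(i = c(1, 1, 2, 3, 3),
                           j = c(1, 3, 2, 1, 3),
                           x = c(3, 1, 2, 1, 4),
                           dims = c(3, 3))
toy_gene_names <- c("TOX3", "CA5A", "MT-CO1")
is_mito <- grepl("^MT-", toy_gene_names)

# Cell-wise covariates (one row per cell).
n_umis <- colSums(toy_counts)
toy_cell_covariates <- data.frame(
  n_nonzero = colSums(toy_counts != 0),
  n_umis = n_umis,
  p_mito = colSums(toy_counts[is_mito, , drop = FALSE]) / n_umis
)

# Feature-wise covariates (one row per feature).
# Converting to a dense matrix is fine for a toy example only.
dense <- as.matrix(toy_counts)
mean_expression <- rowMeans(dense)
toy_feature_covariates <- data.frame(
  mean_expression = mean_expression,
  coef_of_variation = apply(dense, 1, sd) / mean_expression,
  n_nonzero = rowSums(dense != 0)
)
```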

The initialized HDF5 file is named ondisc_matrix_1.h5 and is located in the temporary directory.

"ondisc_matrix_1.h5" %in% list.files(temp_dir)
#> [1] TRUE

A strength of create_ondisc_matrix_from_mtx is that it does not assume that the entire expression matrix fits into memory. The optional argument n_lines_per_chunk can be used to specify the number of lines to read from the .mtx file at a time. Additionally, create_ondisc_matrix_from_mtx is fast: the novel algorithm that underlies this function is highly efficient and implemented in C++ for maximum speed. Typically, create_ondisc_matrix_from_mtx takes about 4-8 minutes/GB to run. Finally, for a given dataset, create_ondisc_matrix_from_mtx needs to be run only once, even after closing and opening new R sessions.
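The idea of reading a fixed number of lines at a time can be illustrated in base R. The sketch below is not the package's algorithm (which is implemented in C++); it only shows how a running per-cell total can be accumulated from a file of "row column value" triplets while holding one chunk in memory. The toy file and the local variable n_lines_per_chunk, named after the argument above, are made up for illustration.

```r
# Write a small coordinate-format file: one "row column value" triplet per line.
toy_fp <- tempfile(fileext = ".txt")
writeLines(c("1 1 3", "2 2 2", "3 2 8", "1 3 5", "3 3 1"), toy_fp)

n_cells <- 3
n_lines_per_chunk <- 2  # hypothetical chunk size for this toy file
umis_per_cell <- numeric(n_cells)

con <- file(toy_fp, open = "r")
repeat {
  # Read the next chunk; an empty result means the end of the file.
  chunk <- readLines(con, n = n_lines_per_chunk)
  if (length(chunk) == 0) break
  # Parse only the lines in the current chunk.
  triplets <- read.table(text = chunk, col.names = c("row", "col", "value"))
  # Sum the values within this chunk by column index, then update the running totals.
  chunk_totals <- rowsum(triplets$value, triplets$col)
  idx <- as.integer(rownames(chunk_totals))
  umis_per_cell[idx] <- umis_per_cell[idx] + chunk_totals[, 1]
}
close(con)
```

Peak memory use in this sketch depends on the chunk size and the number of cells, not on the number of lines in the file.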

Querying basic information

We can use the functions get_feature_ids, get_feature_names, and get_cell_barcodes to obtain the feature IDs, feature names (if applicable), and cell barcodes, respectively, of an ondisc_matrix.

feature_ids <- get_feature_ids(expression_mat)
feature_names <- get_feature_names(expression_mat)
cell_barcodes <- get_cell_barcodes(expression_mat)

head(feature_ids)
#> [1] "ENSG00000198060" "ENSG00000237832" "ENSG00000267543" "ENSG00000103460"
#> [5] "ENSG00000229637" "ENSG00000174990"
head(feature_names)
#> [1] "MARCH5"     "AL138808.1" "AC015802.3" "TOX3"       "PRAC2"     
#> [6] "CA5A"
head(cell_barcodes)
#> [1] "GCTTTCGTCTAGACCA-1" "ACGGTCGTCGTTAGAC-1" "TTTACGTTCACCTCGT-1"
#> [4] "TGGATCATCCTTCAGC-1" "ACAGGGAAGACGCCCT-1" "ACCTACCAGTGTTCCA-1"

Additionally, we can use dim, nrow, and ncol to obtain the dimension, number of rows (i.e., number of features), and number of columns (i.e., number of cells) of an ondisc_matrix.

dim(expression_mat)
#> [1] 300 900
nrow(expression_mat)
#> [1] 300
ncol(expression_mat)
#> [1] 900

Subsetting

We can subset an ondisc_matrix to obtain a new ondisc_matrix that is a submatrix of the original. To subset an ondisc_matrix, apply the [ operator and pass a numeric, logical, or character vector indicating the cells or features to keep. Character vectors are assumed to refer to feature IDs (for rows) and cell barcodes (for columns).

# numeric vector examples
# keep genes 100-110
x <- expression_mat[100:110,]
# keep all cells except 10 and 20
x <- expression_mat[,-c(10,20)]
# keep genes 50-100 and 200-250 and cells 300-500
x <- expression_mat[c(50:100, 200:250), 300:500]

# character vector examples
# keep genes ENSG00000107581, ENSG00000286857, and ENSG00000266371
x <- expression_mat[c("ENSG00000107581", "ENSG00000286857", "ENSG00000266371"),]
# keep cells CGTTGGGCATGGCTGC-1 and GTAACCAGTACAGTTC-1 
x <- expression_mat[,c("CGTTGGGCATGGCTGC-1", "GTAACCAGTACAGTTC-1")]

# logical vector example
# keep all genes except ENSG00000237832 and ENSG00000229637
keep <- !(get_feature_ids(expression_mat) %in% c("ENSG00000237832", "ENSG00000229637"))
x <- expression_mat[keep,]

Subsetting an ondisc_matrix leaves the original object unchanged.

expression_mat
#> An ondisc_matrix with 300 features and 900 cells.

This important property, called object persistence, makes programming with ondisc_matrices intuitive. The underlying HDF5 file is not copied upon subset; instead, information is shared across ondisc_matrix objects, making subsets fast.
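One way to picture why subsetting can be fast without copying the file is a handle that stores a file path plus the row and column indices currently in view. The toy code below illustrates that idea in base R under that assumption; it is not the internal representation used by ondisc, and the function names and file name are hypothetical.

```r
# Toy stand-in for an on-disk matrix handle: a path plus the indices currently in view.
new_toy_handle <- function(path, n_row, n_col) {
  list(path = path, row_idx = seq_len(n_row), col_idx = seq_len(n_col))
}

# Subsetting returns a new handle with narrowed indices;
# nothing on disk is read or copied.
subset_toy_handle <- function(handle, rows = NULL, cols = NULL) {
  if (!is.null(rows)) handle$row_idx <- handle$row_idx[rows]
  if (!is.null(cols)) handle$col_idx <- handle$col_idx[cols]
  handle
}

full <- new_toy_handle("hypothetical_file.h5", n_row = 300, n_col = 900)
sub <- subset_toy_handle(full, rows = 100:110, cols = -c(10, 20))

# The original handle is unchanged, and both handles point at the same file.
length(full$row_idx)
length(sub$row_idx)
identical(full$path, sub$path)
```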

Pulling a submatrix into memory

We can pull a submatrix of an ondisc_matrix into memory, allowing us to perform computations on a subset of the data. To pull a submatrix into memory, use the [[ operator, passing a numeric, character, or logical vector indicating the cells or features to access. The data structure that underlies an ondisc_matrix enables fast access to both rows and columns of the matrix.

# numeric vector examples
# pull gene 6
m <- expression_mat[[6,]]
# pull cells 200 - 250
m <- expression_mat[[,200:250]]
# pull genes 50 - 100 and cells 200 - 250
m <- expression_mat[[50:100, 200:250]]

# character vector examples
# pull genes ENSG00000107581 and ENSG00000286857
m <- expression_mat[[c("ENSG00000107581", "ENSG00000286857"),]]
# pull cells CGTTGGGCATGGCTGC-1 and GTAACCAGTACAGTTC-1
m <- expression_mat[[,c("CGTTGGGCATGGCTGC-1", "GTAACCAGTACAGTTC-1")]]

# logical vector examples
# subset the matrix, keeping genes ENSG00000107581, ENSG00000286857, and ENSG00000266371
x <- expression_mat[c("ENSG00000107581", "ENSG00000286857", "ENSG00000266371"),]
# pull all genes except ENSG00000107581
m <- x[[get_feature_ids(x) != "ENSG00000107581",]]

The last example demonstrates that we can pull a submatrix of an ondisc_matrix into memory after having subset the matrix.
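The initialization output above mentions writing both CSC and CSR data. As a rough illustration of why keeping both layouts makes column access and row access cheap, the sketch below uses the in-memory compressed formats from the Matrix package. It says nothing about how the HDF5 file itself is organized; the toy matrix and the helper get_nonzeros are made up for this example.

```r
library(Matrix)

toy <- sparseMatrix(i = c(1, 3, 2, 1, 3),
                    j = c(1, 1, 2, 3, 3),
                    x = c(3, 1, 2, 5, 4),
                    dims = c(3, 3))

csc <- as(toy, "CsparseMatrix")  # compressed by column
csr <- as(toy, "RsparseMatrix")  # compressed by row

# In a compressed layout, the nonzero values of the k-th column (CSC) or
# k-th row (CSR) occupy the contiguous positions p[k] + 1 through p[k + 1] of x.
get_nonzeros <- function(m, k) {
  start <- m@p[k]
  end <- m@p[k + 1]
  if (start == end) return(numeric(0))
  m@x[(start + 1):end]
}

col_3_values <- get_nonzeros(csc, 3)  # nonzero values of column 3
row_1_values <- get_nonzeros(csr, 1)  # nonzero values of row 1
```

Each lookup touches one contiguous slice, so reading a column from the column-compressed copy or a row from the row-compressed copy does not require scanning the rest of the matrix.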

One can remember the difference between [ and [[ by recalling R lists: [ is used to subset a list, and [[ is used to access elements stored within a list. Similarly, [ is used to subset an ondisc_matrix, and [[ is used to access a submatrix stored within an ondisc_matrix.
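The list analogy can be checked directly in base R with a small, made-up list:

```r
toy_list <- list(a = 1:3, b = "text", c = TRUE)

# `[` returns a shorter list.
sub_list <- toy_list[c("a", "b")]
is.list(sub_list)

# `[[` returns the element stored inside the list.
element <- toy_list[["a"]]
is.list(element)
```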

Saving and loading an ondisc_matrix

As discussed previously, there are two components to an ondisc_matrix: the HDF5 file stored on-disk, and the R object stored in memory. The latter contains a file path to the former, allowing us to interact with the expression data from within R.

To save an ondisc_matrix, simply call saveRDS on the ondisc_matrix R object to create an .rds file.

saveRDS(object = expression_mat, file = paste0(temp_dir, "/expression_matrix.rds"))
rm(expression_mat)

We can then load the ondisc_matrix by calling readRDS on the .rds file.

expression_mat <- readRDS(paste0(temp_dir, "/expression_matrix.rds"))

We can also use the constructor of the ondisc_matrix class to create an ondisc_matrix from an already-initialized HDF5 file.

h5_file <- paste0(temp_dir, "/ondisc_matrix_1.h5")
expression_mat <- ondisc_matrix(h5_file)
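Because the R object holds a file path rather than the data, a saved .rds file is presumably only useful while the backing file remains at that path. The toy sketch below illustrates this with a plain list and a stand-in binary file; it does not use ondisc, and all names in it are hypothetical.

```r
toy_dir <- file.path(tempdir(), "toy_save")
dir.create(toy_dir, showWarnings = FALSE)

# Stand-in for the on-disk data file.
backing_fp <- file.path(toy_dir, "toy_backing_file.bin")
writeBin(as.raw(1:10), backing_fp)

# Stand-in for the in-memory object: it records only where the data live.
toy_handle <- list(path = backing_fp)
rds_fp <- file.path(toy_dir, "toy_handle.rds")
saveRDS(toy_handle, rds_fp)

# After loading, the handle still points at the same backing file.
restored <- readRDS(rds_fp)
backing_file_found <- file.exists(restored$path)
```

A simple check such as file.exists on the stored path can catch a moved or deleted backing file before any data access is attempted.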