--- title: "Get Started with SCIntRuler" author: - name: Yue Lyu ORCID logo email: yuelyu0521@gmail.com package: SCIntRuler description: > Guiding the integration of multiple single-cell RNA-seq datasets with a novel statistical metric. output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Integrating Single-Cell RNA-seq Datasets with SCIntRuler} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introducrtion The accumulation of single-cell RNA-seq (scRNA-seq) studies highlights the potential benefits of integrating multiple data sets. By augmenting sample sizes and enhancing analytical robustness, integration can lead to more insightful biological conclusions. However, challenges arise due to the inherent diversity and batch discrepancies within and across studies. **SCIntRuler**, a novel R package, addresses these challenges by guiding the integration of multiple scRNA-seq data sets. [SCIntRuler](https://github.com/yuelyu21/SCIntRuler) is an R package developed for single-cell RNA-seq analysis. It was designed using the [Seurat](https://satijalab.org/seurat/) framework, and offers existing and novel single-cell analytic work flows. Integrating scRNA-seq data sets can be complex due to various factors, including batch effects and sample diversity. Key decisions – whether to integrate data sets, which method to choose for integration, and how to best handle inherent data discrepancies – are crucial. `SCIntRuler` offers a statistical metric to aid in these decisions, ensuring more robust and accurate analyses. - **Informed Decision Making**: Helps researchers decide on the necessity of data integration and the most suitable method. - **Flexibility**: Suitable for various scenarios, accommodating different levels of data heterogeneity. - **Robustness**: Enhances analytical robustness in joint analyses of merged or integrated scRNA-seq data sets. - **User-Friendly**: Streamlines decision-making processes, simplifying the complexities involved in scRNA-seq data integration. ## 1. Installation To install the package, you need to install the `batchelor` and `MatrixGenerics` package from Bioconductor. ```{r install Depedency, eval = FALSE} # Check if BioManager is installed, install if not if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") # Check if 'batchelor' is installed, install if not if (!requireNamespace("batchelor", quietly = TRUE)) BiocManager::install("batchelor") # Check if 'MatrixGenerics' is installed, install if not if (!requireNamespace("MatrixGenerics", quietly = TRUE)) BiocManager::install("MatrixGenerics") ``` The `SCIntRuler` can be installed by the following commands, the source code can be found at [GitHub](https://github.com/yuelyu21/SCIntRuler). ```{r install SCIntRuler, eval = FALSE} BiocManager::install("SCIntRuler") ``` After the installation, the package can be loaded with ```{r setup, message=FALSE, warning=FALSE} library(SCIntRuler) library(Seurat) library(dplyr) library(ggplot2) ``` ## 2. Explore with an example data Let's start with an example data. We conducted a series of simulation studies to assess the efficacy of `SCIntRuler` in guiding the integration selection under different scenarios with varying degrees of shared information among data sets. We generated the simulation data based on a real `Peripheral Blood Mononuclear Cells (PBMC)` scRNA-seq dataset. ### Overview of the data This dataset is a subset of what we used in our **Simulation 2**, where we have three studies. In each study, we randomly drew different numbers of CD4 T helper cells, B cells, CD14 monocytes, and CD56 NK cells to mimic four real-world scenarios with three data sources **Simulation 2** introduces a moderate overlap, with 20.3% cells sharing the same cell type identity. There are 2000 B cells and 400 CD4T cells in the first study, 700 CD14Mono cells and 400 CD4T cells in the second study, 2000 CD56NK cells and 400 CD4T cells in the third study. This data is already in `Seurat` format and can be found under `/data`. There are 32738 genes and 5900 cells in simulation 2. Here, we subset 800 cells with 3000 genes. ```{r load the data, message=FALSE, warning=FALSE} data("sim_data_sce", package = "SCIntRuler") sim_data <- as.Seurat(sim_data_sce) ``` ```{r show some data, message=FALSE, warning=FALSE} head(sim_data[[]]) ``` ### Data pre-process and visulization with Seurat Followed by the tutorial of [Seurat](https://satijalab.org/seurat/articles/pbmc3k_tutorial), we first pre-processed the data by the functions `NormalizeData`, `FindVariableFeature`, `ScaleData`, `RunPCA`, `FindNeighbors`, `FinsClusters` and `RunUMAP` from `Seurat` and then draw the UMAP by using `DimPlot` stratified by **Study** and **Cell Type**. ```{r pre-process , message=FALSE, warning=FALSE} # Normalize the data sim_data <- NormalizeData(sim_data) # Identify highly variable features sim_data <- FindVariableFeatures(sim_data, selection.method = "vst", nfeatures = 2000) # Scale the data all.genes <- rownames(sim_data) sim_data <- ScaleData(sim_data, features = all.genes) # Perform linear dimensional reduction sim_data <- RunPCA(sim_data, features = VariableFeatures(object = sim_data)) # Cluster the cells sim_data <- FindNeighbors(sim_data, dims = 1:20) sim_data <- FindClusters(sim_data, resolution = 0.5) sim_data <- RunUMAP(sim_data, dims = 1:20) ``` #### UMAP separated by Study ```{r UMAP , message=FALSE, warning=FALSE} p1 <- DimPlot(sim_data, reduction = "umap", label = FALSE, pt.size = .5, group.by = "Study", repel = TRUE) p1 ``` #### UMAP separated by cell type ```{r UMAP2 , message=FALSE, warning=FALSE} p2 <- DimPlot(sim_data, reduction = "umap", label = TRUE, pt.size = .8, group.by = "CellType", repel = TRUE) p2 ``` ### Try different data integration methods To further illustrate which integration method is more suitable under different settings, we visualize the data without integration (simply merging the single cell objects) and after applying three popular data integration methods: CCA, Harmony and Scanorama. The UMAP visualizations across the simulations indicate that the choice of integration method significantly impacts the resulting data integration. #### Run Seurat CCA Seurat CCA can be directly applied by functions `FindIntegrationAnchors` and `IntegrateData` in `Seurat` package. For more information, please see[Tutorial of SeuratCCA](https://satijalab.org/seurat/articles/integration_introduction.html). ```{r CCA, message=FALSE, warning=FALSE} ### CCA sim.list <- SplitObject(sim_data, split.by = "Study") sim.anchors <- FindIntegrationAnchors(object.list = sim.list, dims = 1:30, reduction = "cca") sim.int <- IntegrateData(anchorset = sim.anchors, dims = 1:30, new.assay.name = "CCA") # run standard analysis workflow sim.int <- ScaleData(sim.int, verbose = FALSE) sim.int <- RunPCA(sim.int, npcs = 30, verbose = FALSE) sim.int <- RunUMAP(sim.int, dims = 1:30, reduction.name = "umap_cca") ``` #### Run Harmony Harmony is an algorithm for performing integration of single cell genomics data sets. To run harmony, we need to install `harmony` package. For more information, please see [Quick start to Harmony](https://portals.broadinstitute.org/harmony/articles/quickstart.html). ```{r harmony, message=FALSE, warning=FALSE} ### Harmony # install.packages("harmony") sim.harmony <- harmony::RunHarmony(sim_data, group.by.vars = "Study", reduction.use = "pca", #dims.use = 1:20, assay.use = "RNA" ) sim.int[["harmony"]] <- sim.harmony[["harmony"]] sim.int <- RunUMAP(sim.int, dims = 1:20, reduction = "harmony", reduction.name = "umap_harmony") ``` #### UMAP figures for all integrated data ```{r integration, fig.height=8, fig.width=14} p5 <- DimPlot(sim_data, reduction = "umap", group.by = "Study") + theme(legend.position = "none", # axis.line.y = element_line( size = 2, linetype = "solid"), # axis.line.x = element_line( size = 2, linetype = "solid"), axis.text.y = element_text( color="black", size=20), axis.title.y = element_text(color="black", size=20), axis.text.x = element_text( color="black", size=20), axis.title.x = element_text(color="black", size=20)) p6 <- DimPlot(sim.int, reduction = "umap_cca", group.by = "Study") + theme(legend.position = "none", # axis.line.y = element_line( size = 2, linetype = "solid"), # axis.line.x = element_line( size = 2, linetype = "solid"), axis.text.y = element_text( color="black", size=20), axis.title.y = element_text(color="black", size=20), axis.text.x = element_text( color="black", size=20), axis.title.x = element_text(color="black", size=20)) p7 <- DimPlot(sim.int, reduction = "umap_harmony", group.by = "Study") + theme(legend.position = "none", # axis.line.y = element_line( size = 2, linetype = "solid"), # axis.line.x = element_line( size = 2, linetype = "solid"), axis.text.y = element_text( color="black", size=20), axis.title.y = element_text(color="black", size=20), axis.text.x = element_text( color="black", size=20), axis.title.x = element_text(color="black", size=20)) leg <- cowplot::get_legend(p5) gridExtra::grid.arrange(gridExtra::arrangeGrob(p5 + NoLegend() + ggtitle("Unintegrated"), p6 + NoLegend() + ggtitle("Seurat CCA") , p7 + NoLegend() + ggtitle("Harmony"), #p8 + NoLegend() + ggtitle("Scanorama"), nrow = 2), leg, ncol = 2, widths = c(20, 5)) ``` ## 3. Applying SCIntRuler to example data We first split the `sim_data` by `Study` and then run `GetCluster` and `NormData` to get Louvain clusters and normalized count matrix for each study. Furthermore, to perform the permutation test of relative-between cluster distance, `FindNNDist` can be then applied. We also have another function `FindNNDistC` that based on `Rcpp` and `C++` for faster matrix calculation. ```{r step 1, warning=FALSE,message=FALSE} sim.list <- SplitObject(sim_data, split.by = "Study") fullcluster <- GetCluster(sim.list,50,200) normCount <- NormData(sim.list) #distmat <- FindNNDist(fullcluster, normCount, 20) distmat <- FindNNDistC(fullcluster, normCount, 20) ``` ### Calculate SCIntRuler and and an associated visualization In this example data, we got a SCIntRuler score of 0.57, there is a noticeable but not large overlap of cell types across the data sets, showing a moderate level of shared information. The integration is essential to adjust for these effects and align the shared cell populations, ensuring that the integrated dataset accurately reflects the biological composition. Thus, the methods which can offer a balance between correcting for batch effects and maintaining biological variation would be the best. The UMAP visualizations across the simulations indicate that the choice of integration method significantly impacts the resulting data integration. ```{r step 2, warning=FALSE,message=FALSE} testres <- PermTest(fullcluster,distmat,15) PlotSCIR(fullcluster,sim.list,testres,"") sim_result <- list(fullcluster,normCount,distmat,testres) ``` ## SessionInfo ```{r,echo=FALSE} sessionInfo() ```