scOntoMatch_vignette

Yuyao Song

2023-10-27

Installation

## install from source
## library(devtools)
## devtools::install_github("YY-SONG0718/scOntoMatch")
library(scOntoMatch)
library(ontologyIndex)

Load data

We use the Tabula Muris and Tabula Sapiens Smart-seq2 lung datasets as an example. scOntoMatch works on any number of input datasets. Two demo Seurat objects are attached in inst/extdata; each contains two sampled cells per cell type (original annotation), since we focus on the cell type hierarchy in the two datasets.

metadata = '../inst/extdata/metadata.tsv'

anno_col = 'cell_ontology_class'
onto_id_col = 'cell_ontology_id'

obo_file = '../inst/extdata/cl-basic.obo'
propagate_relationships = c('is_a', 'part_of')
ont <- ontologyIndex::get_OBO(obo_file, propagate_relationships = propagate_relationships)

List each dataset's name and path in the first and second columns of a metadata file. Store the Seurat objects in RDS format and use getSeuratRds to read them in.
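As a rough sketch of the layout described above, the metadata file can be written from a two-column data frame. The paths below are hypothetical placeholders, and the column names `name` and `path` are illustrative. Whether getSeuratRds expects a header row is not stated here, so compare the result with the bundled inst/extdata/metadata.tsv before relying on it.

```r
# Hypothetical dataset names and RDS paths; replace with your own files
meta_tbl <- data.frame(
  name = c("TM_lung", "TS_lung"),
  path = c("path/to/TM_lung.rds", "path/to/TS_lung.rds"),
  stringsAsFactors = FALSE
)

# Write a tab-separated file: dataset name in column 1, path in column 2
metadata_file <- tempfile(fileext = ".tsv")
write.table(meta_tbl, file = metadata_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back to confirm the layout
check <- read.delim(metadata_file, stringsAsFactors = FALSE)
print(check)
```

The dataset names in the first column are what later appear as list names, for example `obj_list$TM_lung`.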

obj_list = getSeuratRds(metadata = metadata, sep = "\t")
## start loading seurat rds objects
levels(factor((obj_list$TM_lung@meta.data$cell_ontology_class)))
##  [1] "B cell"                                         
##  [2] "NA"                                             
##  [3] "T cell"                                         
##  [4] "ciliated columnar cell of tracheobronchial tree"
##  [5] "classical monocyte"                             
##  [6] "epithelial cell of lung"                        
##  [7] "leukocyte"                                      
##  [8] "lung endothelial cell"                          
##  [9] "monocyte"                                       
## [10] "myeloid cell"                                   
## [11] "natural killer cell"                            
## [12] "stromal cell"
levels(factor((obj_list$TS_lung@meta.data$cell_ontology_class)))
##  [1] "adventitial cell"                      
##  [2] "alveolar fibroblast"                   
##  [3] "b cell"                                
##  [4] "basal cell"                            
##  [5] "basophil"                              
##  [6] "bronchial smooth muscle cell"          
##  [7] "capillary aerocyte"                    
##  [8] "capillary endothelial cell"            
##  [9] "cd4-positive, alpha-beta t cell"       
## [10] "cd8-positive, alpha-beta t cell"       
## [11] "classical monocyte"                    
## [12] "club cell"                             
## [13] "dendritic cell"                        
## [14] "endothelial cell of artery"            
## [15] "endothelial cell of lymphatic vessel"  
## [16] "fibroblast"                            
## [17] "lung ciliated cell"                    
## [18] "lung microvascular endothelial cell"   
## [19] "macrophage"                            
## [20] "mesothelial cell"                      
## [21] "neutrophil"                            
## [22] "nk cell"                               
## [23] "non-classical monocyte"                
## [24] "pericyte cell"                         
## [25] "plasma cell"                           
## [26] "plasmacytoid dendritic cell"           
## [27] "respiratory goblet cell"               
## [28] "serous cell of epithelium of bronchus" 
## [29] "type i pneumocyte"                     
## [30] "type ii pneumocyte"                    
## [31] "vascular associated smooth muscle cell"
## [32] "vein endothelial cell"

Match ontology annotation

Trim the ontology tree per dataset

Within a single dataset, cell types often stand in parent-child relationships. This happens because some cells can be classified into finer-grained groups, while others are only recognized as the parent cell type.

This is not a problem when analyzing individual datasets, where we do want to keep rare, identifiable cell populations distinct. It can become a problem when mapping annotations across datasets, because it is unclear which populations a parent term covers in each dataset.

We provide ontoMultiMinimal to merge descendant terms into ancestor terms that already exist in the same dataset, giving a minimal ontology representation of the cell type tree.

Note that trimming the ontology tree is optional, and the original annotation can always be recovered later in the analysis.
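To make the idea of trimming concrete, here is a minimal base-R sketch on a hand-written toy hierarchy. It is not the implementation inside ontoMultiMinimal; `parent_of`, `ancestors_of` and `to_base` are illustrative names, and the toy child-to-parent map is a simplification rather than something read from cl-basic.obo. Each annotated term is replaced by its most general ancestor that is also annotated in the same dataset.

```r
# Toy child -> parent map (hand-written for illustration only)
parent_of <- c(
  "T cell"             = "leukocyte",
  "B cell"             = "leukocyte",
  "monocyte"           = "myeloid cell",
  "classical monocyte" = "monocyte"
)

# Walk up the toy hierarchy and collect all ancestors, nearest first
ancestors_of <- function(term) {
  out <- character(0)
  while (term %in% names(parent_of)) {
    term <- parent_of[[term]]
    out <- c(out, term)
  }
  out
}

# Map a term to its most general ancestor that is annotated in the dataset;
# a term without an annotated ancestor is kept unchanged
to_base <- function(term, present) {
  hits <- ancestors_of(term)
  hits <- hits[hits %in% present]
  if (length(hits) == 0) term else hits[length(hits)]
}

annotated <- c("T cell", "B cell", "leukocyte", "monocyte",
               "classical monocyte", "myeloid cell", "stromal cell")
base <- vapply(annotated, to_base, character(1), present = annotated)
print(sort(unique(unname(base))))
```

In this toy case the seven annotated labels collapse to three non-overlapping ones: leukocyte, myeloid cell and stromal cell.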

obj_list_minimal = scOntoMatch::ontoMultiMinimal(obj_list = obj_list, ont = ont, anno_col = anno_col, onto_id_col = onto_id_col)
## translate annotation to ontology id
## translating TM_lung
## warning: some cell type annotations do not have corresponding ontology id, consider manual re-annotate
## NA
## translating TS_lung
## warning: some cell type annotations do not have corresponding ontology id, consider manual re-annotate
## alveolar fibroblast, capillary aerocyte, nk cell
## Loading required package: SeuratObject
## Loading required package: sp
## 
## Attaching package: 'SeuratObject'
## The following object is masked from 'package:base':
## 
##     intersect
## mapping from name: lung endothelial cell to name: epithelial cell of lung
## mapping from name: classical monocyte to name: myeloid cell
## mapping from name: T cell to name: leukocyte
## mapping from name: B cell to name: leukocyte
## mapping from name: monocyte to name: myeloid cell
## mapping from name: natural killer cell to name: leukocyte
## after matching to base level ontology, TM_lung has cell types:
## NA, ciliated columnar cell of tracheobronchial tree, epithelial cell of lung, leukocyte, myeloid cell, stromal cell
## mapping from name: plasmacytoid dendritic cell to name: dendritic cell
## after matching to base level ontology, TS_lung has cell types:
## adventitial cell, alveolar fibroblast, b cell, basal cell, basophil, bronchial smooth muscle cell, capillary aerocyte, capillary endothelial cell, cd4-positive, alpha-beta t cell, cd8-positive, alpha-beta t cell, classical monocyte, club cell, dendritic cell, endothelial cell of artery, endothelial cell of lymphatic vessel, fibroblast, lung ciliated cell, lung microvascular endothelial cell, macrophage, mesothelial cell, neutrophil, nk cell, non-classical monocyte, pericyte cell, plasma cell, respiratory goblet cell, serous cell of epithelium of bronchus, type i pneumocyte, type ii pneumocyte, vascular associated smooth muscle cell, vein endothelial cell

Some cell types in TS_lung cannot be matched to an ontology term, so consider re-annotating them manually. Always check the literature before manual curation, and make sure the ontology annotation is the one you want!

obj_list$TS_lung@meta.data[[anno_col]] = as.character(obj_list$TS_lung@meta.data[[anno_col]])

## nk cell can certainly be matched
obj_list$TS_lung@meta.data[which(obj_list$TS_lung@meta.data[[anno_col]] == 'nk cell'), anno_col] = 'natural killer cell'

## type 1 and type 2 alveolar fibroblasts both belong to fibroblast of lung

obj_list$TS_lung@meta.data[which(obj_list$TS_lung@meta.data[[anno_col]] == 'alveolar fibroblast'), anno_col] = 'fibroblast of lung'

## capillary aerocyte is a recently discovered lung-specific cell type, so we keep it as is
## Gillich, A., Zhang, F., Farmer, C.G. et al. Capillary cell-type specialization in the alveolus. Nature 586, 785–789 (2020). https://doi.org/10.1038/s41586-020-2822-7
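When more than a couple of labels need fixing, the same re-annotation can be expressed as a lookup table. The following is a minimal sketch on toy metadata; `relabel` and `meta` are illustrative names and not part of scOntoMatch, and the data frame only stands in for `obj_list$TS_lung@meta.data`.

```r
# Lookup table: original label -> ontology-compatible label
relabel <- c(
  "nk cell"             = "natural killer cell",
  "alveolar fibroblast" = "fibroblast of lung"
)

# Toy metadata standing in for a Seurat object's meta.data slot
meta <- data.frame(
  cell_ontology_class = c("nk cell", "alveolar fibroblast",
                          "capillary aerocyte", "nk cell"),
  stringsAsFactors = FALSE
)

# Replace only the labels that appear in the lookup table;
# everything else (e.g. capillary aerocyte) is left untouched
old <- meta$cell_ontology_class
hit <- old %in% names(relabel)
meta$cell_ontology_class[hit] <- unname(relabel[old[hit]])
print(meta$cell_ontology_class)
```

Keeping the lookup table in one place makes the manual curation decisions easy to review alongside the literature that supports them.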

Now we can trim again.

obj_list_minimal = scOntoMatch::ontoMultiMinimal(obj_list = obj_list, ont = ont, anno_col = anno_col, onto_id_col = onto_id_col)
## translate annotation to ontology id
## translating TM_lung
## warning: some cell type annotations do not have corresponding ontology id, consider manual re-annotate
## NA
## translating TS_lung
## warning: some cell type annotations do not have corresponding ontology id, consider manual re-annotate
## capillary aerocyte
## mapping from name: lung endothelial cell to name: epithelial cell of lung
## mapping from name: classical monocyte to name: myeloid cell
## mapping from name: T cell to name: leukocyte
## mapping from name: B cell to name: leukocyte
## mapping from name: monocyte to name: myeloid cell
## mapping from name: natural killer cell to name: leukocyte
## after matching to base level ontology, TM_lung has cell types:
## NA, ciliated columnar cell of tracheobronchial tree, epithelial cell of lung, leukocyte, myeloid cell, stromal cell
## mapping from name: fibroblast of lung to name: fibroblast
## mapping from name: plasmacytoid dendritic cell to name: dendritic cell
## after matching to base level ontology, TS_lung has cell types:
## adventitial cell, b cell, basal cell, basophil, bronchial smooth muscle cell, capillary aerocyte, capillary endothelial cell, cd4-positive, alpha-beta t cell, cd8-positive, alpha-beta t cell, classical monocyte, club cell, dendritic cell, endothelial cell of artery, endothelial cell of lymphatic vessel, fibroblast, lung ciliated cell, lung microvascular endothelial cell, macrophage, mesothelial cell, natural killer cell, neutrophil, non-classical monocyte, pericyte cell, plasma cell, respiratory goblet cell, serous cell of epithelium of bronchus, type i pneumocyte, type ii pneumocyte, vascular associated smooth muscle cell, vein endothelial cell

Ontology tree for individual dataset

Functions are provided to plot the cell type tree. Before trimming, there are parent-child relationships within both datasets.

## compute the ontology ids once and reuse them for both arguments
tm_ids = names(getOntologyId(obj_list$TM_lung@meta.data[['cell_ontology_class']], ont = ont))
plotOntoTree(ont = ont, onts = tm_ids, ont_query = tm_ids,
             plot_ancestors = TRUE, roots = 'CL:0000548', fontsize = 25)

plot of chunk plotOntoTree

## compute the ontology ids once and reuse them for both arguments
ts_ids = names(getOntologyId(obj_list$TS_lung@meta.data[['cell_ontology_class']], ont = ont))
plotOntoTree(ont = ont, onts = ts_ids, ont_query = ts_ids,
             plot_ancestors = TRUE, roots = 'CL:0000548', fontsize = 25)

plot of chunk plotOntoTree_two

After trimming, we get a minimal representation of cell type hierarchy per dataset.

## compute the ontology ids once and reuse them for both arguments
tm_min_ids = names(getOntologyId(obj_list_minimal$TM_lung@meta.data[['cell_ontology_base']], ont = ont))
plotOntoTree(ont = ont, onts = tm_min_ids, ont_query = tm_min_ids,
             plot_ancestors = TRUE, roots = 'CL:0000548', fontsize = 25)

plot of chunk plotOntoTree_minimal

## compute the ontology ids once and reuse them for both arguments
ts_min_ids = names(getOntologyId(obj_list_minimal$TS_lung@meta.data[['cell_ontology_base']], ont = ont))
plotOntoTree(ont = ont, onts = ts_min_ids, ont_query = ts_min_ids,
             plot_ancestors = TRUE, roots = 'CL:0000548', fontsize = 25)

plot of chunk plotOntoTree_minimal_two

Now, each cell type in the two datasets is a leaf node in the cell type tree. They are ready to be mapped.

Match ontology annotation across datasets

The core functionality of scOntoMatch is to find the layer of the cell type hierarchy at which cell types match one-to-one across datasets. The key idea is to look at the cell type hierarchies of these datasets together, find the last common ancestor cell types, and merge descendants into those ancestors. We provide ontoMultiMatch for this purpose.
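The matching idea can be sketched in base R on a toy hierarchy. This is only one reading of the approach, under the assumption that terms from all datasets are pooled and each term is then lifted to its most general ancestor found in that pool. It is not the code inside ontoMultiMatch, and `parent_of`, `ancestors_of` and `match_across` are illustrative names.

```r
# Toy child -> parent map (hand-written for illustration only)
parent_of <- c(
  "type I pneumocyte"  = "epithelial cell of lung",
  "type II pneumocyte" = "epithelial cell of lung",
  "classical monocyte" = "monocyte",
  "monocyte"           = "myeloid cell",
  "macrophage"         = "myeloid cell"
)

# Collect all ancestors of a term, nearest first
ancestors_of <- function(term) {
  out <- character(0)
  while (term %in% names(parent_of)) {
    term <- parent_of[[term]]
    out <- c(out, term)
  }
  out
}

# Pool the terms of all datasets, then lift every term to its most
# general ancestor that is annotated in at least one dataset
match_across <- function(datasets) {
  pooled <- unique(unlist(datasets, use.names = FALSE))
  lapply(datasets, function(terms) {
    mapped <- vapply(terms, function(term) {
      hits <- ancestors_of(term)
      hits <- hits[hits %in% pooled]
      if (length(hits) == 0) term else hits[length(hits)]
    }, character(1))
    sort(unique(unname(mapped)))
  })
}

datasets <- list(
  A = c("epithelial cell of lung", "myeloid cell", "stromal cell"),
  B = c("type I pneumocyte", "type II pneumocyte",
        "classical monocyte", "macrophage", "club cell")
)
matched <- match_across(datasets)
print(matched$B)
print(intersect(matched$A, matched$B))
```

After matching, the fine-grained labels of dataset B sit at the same level as the coarser labels of dataset A, so the two share "epithelial cell of lung" and "myeloid cell", while dataset-specific labels such as "club cell" are kept.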

## perform ontoMultiMatch on the trimmed (minimal) annotation

obj_list_matched = scOntoMatch::ontoMultiMatch(obj_list = obj_list_minimal, anno_col = 'cell_ontology_base', onto_id_col = onto_id_col, ont = ont)
## translate annotation to ontology id
## translating TM_lung
## warning: some cell type annotations do not have corresponding ontology id, consider manual re-annotate
## NA
## translating TS_lung
## warning: some cell type annotations do not have corresponding ontology id, consider manual re-annotate
## capillary aerocyte
## intersection terms:
## processing TM_lung
## mapping from name: type I pneumocyte to name: epithelial cell of lung
## mapping from name: type II pneumocyte to name: epithelial cell of lung
## mapping from name: lung microvascular endothelial cell to name: epithelial cell of lung
## mapping from name: macrophage to name: myeloid cell
## mapping from name: B cell to name: leukocyte
## mapping from name: dendritic cell to name: leukocyte
## mapping from name: natural killer cell to name: leukocyte
## mapping from name: CD4-positive, alpha-beta T cell to name: leukocyte
## mapping from name: CD8-positive, alpha-beta T cell to name: leukocyte
## mapping from name: basophil to name: myeloid cell
## mapping from name: neutrophil to name: myeloid cell
## mapping from name: plasma cell to name: leukocyte
## mapping from name: classical monocyte to name: myeloid cell
## mapping from name: non-classical monocyte to name: myeloid cell
## after matching across datasets, TM_lung has cell types:
## NA, ciliated columnar cell of tracheobronchial tree, epithelial cell of lung, leukocyte, myeloid cell, stromal cell
## processing TS_lung
## mapping from name: type I pneumocyte to name: epithelial cell of lung
## mapping from name: type II pneumocyte to name: epithelial cell of lung
## mapping from name: lung microvascular endothelial cell to name: epithelial cell of lung
## mapping from name: macrophage to name: myeloid cell
## mapping from name: B cell to name: leukocyte
## mapping from name: dendritic cell to name: leukocyte
## mapping from name: natural killer cell to name: leukocyte
## mapping from name: CD4-positive, alpha-beta T cell to name: leukocyte
## mapping from name: CD8-positive, alpha-beta T cell to name: leukocyte
## mapping from name: basophil to name: myeloid cell
## mapping from name: neutrophil to name: myeloid cell
## mapping from name: plasma cell to name: leukocyte
## mapping from name: classical monocyte to name: myeloid cell
## mapping from name: non-classical monocyte to name: myeloid cell
## after matching across datasets, TS_lung has cell types:
## adventitial cell, basal cell, bronchial smooth muscle cell, capillary aerocyte, capillary endothelial cell, club cell, endothelial cell of artery, endothelial cell of lymphatic vessel, epithelial cell of lung, fibroblast, leukocyte, lung ciliated cell, mesothelial cell, myeloid cell, pericyte cell, respiratory goblet cell, serous cell of epithelium of bronchus, vascular associated smooth muscle cell, vein endothelial cell

Finally, we plot a combined cell type tree, highlighting the existing cell types of each dataset.

plts = plotMatchedOntoTree(ont = ont, obj_list = obj_list_matched,
                                 anno_col = 'cell_ontology_mapped', 
                                 onto_id_col = onto_id_col,
                                 roots = 'CL:0000548', fontsize=25)
plts[[1]]

plot of chunk unnamed-chunk-2

plts[[2]]

plot of chunk plotMatchedOntoTree_two

Utility functions

getOntologyId and getOntologyName

getOntologyName(onto_id = c("CL:0000082"), ont = ont)
##                CL:0000082 
## "epithelial cell of lung"
getOntologyId(obj_list$TM_lung@meta.data[[anno_col]], ont = ont)
##                                        CL:0000082 
##                         "epithelial cell of lung" 
##                                        CL:0000084 
##                                          "T cell" 
##                                        CL:0000236 
##                                          "B cell" 
##                                        CL:0000499 
##                                    "stromal cell" 
##                                        CL:0000576 
##                                        "monocyte" 
##                                        CL:0000623 
##                             "natural killer cell" 
##                                        CL:0000738 
##                                       "leukocyte" 
##                                        CL:0000763 
##                                    "myeloid cell" 
##                                        CL:0000860 
##                              "classical monocyte" 
##                                        CL:0002145 
## "ciliated columnar cell of tracheobronchial tree" 
##                                        CL:1001567 
##                           "lung endothelial cell"
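For intuition, the two lookups can be mimicked on a plain named character vector. This is a sketch only: `term_names` is a toy stand-in for the name table of an ontology object, `id_to_name` and `name_to_id` are illustrative helpers rather than scOntoMatch functions, and the case-insensitive matching is an assumption suggested by the log above, where lowercase labels such as "b cell" raise no warning.

```r
# Toy stand-in for an ontology's name table: term names, named by term id
term_names <- c(
  "CL:0000082" = "epithelial cell of lung",
  "CL:0000084" = "T cell",
  "CL:0000236" = "B cell"
)

# id -> name: index the table by id
id_to_name <- function(ids) term_names[ids]

# name -> id: match labels to term names, ignoring case;
# labels without a match come back as NA
name_to_id <- function(labels) {
  idx <- match(tolower(labels), tolower(term_names))
  names(term_names)[idx]
}

print(id_to_name("CL:0000082"))
ids <- name_to_id(c("b cell", "T cell", "capillary aerocyte"))
print(ids)
```

Labels that come back as NA are the ones that would need manual re-annotation before trimming or matching.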
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] C/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] SeuratObject_5.0.0 sp_2.1-1           ontologyIndex_2.11 scOntoMatch_0.1.1 
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.4         progressr_0.14.0    cli_3.6.1          
##  [4] knitr_1.44          rlang_1.1.1         xfun_0.40          
##  [7] purrr_1.0.2         generics_0.1.3      dotCall64_1.1-0    
## [10] glue_1.6.2          future.apply_1.11.0 listenv_0.9.0      
## [13] graph_1.78.0        stats4_4.3.1        grid_4.3.1         
## [16] evaluate_0.22       lifecycle_1.0.3     compiler_4.3.1     
## [19] codetools_0.2-19    Rcpp_1.0.11         rstudioapi_0.15.0  
## [22] future_1.33.0       lattice_0.22-5      Rgraphviz_2.44.0   
## [25] digest_0.6.33       parallelly_1.36.0   paintmap_1.0       
## [28] parallel_4.3.1      spam_2.10-0         magrittr_2.0.3     
## [31] ontologyPlot_1.6    Matrix_1.6-1.1      tools_4.3.1        
## [34] globals_0.16.2      BiocGenerics_0.46.0