# Package datasets

#### 2022-05-08

This vignette shows how the package datasets were created. The package data includes a list of cancer-related Reactome pathways, p53_pathways, and an example gene expression dataset, glioma. These are provided to be used in examples and for illustrative purposes.

# 1 Reactome pathways

A list of Reaectome pathways is provided for Homo sapiens (human). This list was initialized using the dnapath::get_reactome_pathways() function.

pathway_list <- dnapath::get_reactome_pathways("human")
## Obtaining reactome pathway information for species: Homo sapiens
## Combining pathways that have greater than 90 % overlap.

This list contains the gene sets for 885 pathways. Here are the first two pathways:

pathway_list[1:2]
## $Purine salvage ## [1] "100" "132" "161823" "1633" "1716" "270" "271" "272" ## [9] "2766" "3251" "353" "4860" "51292" ## ##$Nucleotide salvage
##  [1] "100"    "132"    "151531" "161823" "1633"   "1716"   "1890"   "270"
##  [9] "271"    "272"    "2766"   "3251"   "353"    "4860"   "51292"  "54963"
## [17] "7083"   "7084"   "7371"   "7378"   "8226"   "83549"  "978"

The list names correspond to the Reactome pathway name. Each pathway provides the entrezgene IDs of all genes in that pathway. For example, from the output above, we see that the first pathway in the list is the “purine salvage” pathway, and it contains the genes: 100, 132, 161823, etc.

In practice, all of these pathways can be used in the dnapath() analysis. However, in some studies we may only be interested in a subset of this list. For example, if we are studying a cancer-related gene expression dataset, we may have a specific interest in cancer-related pathways. There are two ways to obtain a subset from the pathway list: (1) using the pathway names or (2) using the pathway genes. These two approaches are demonstrated in turn.

## 1.1 Subsetting pathways using pathway names

Suppose we are interested in pathways that are related to “p53” or “pi3k” regulatory processes (these are known cancer-related pathways). One option is to subset pathways based on whether or not the pathway names include these keywords. The grep() function is used to do this:

# Obtain index of pathways containing "p53" or "pi3k" in their name.
index_pathways <- grep("((p53)|(pi3k))", names(pathway_list),
ignore.case = TRUE)

grep uses the regular expression “((p53)|(pi3k))”, which says to identify all pathway names that include the phrase “p53” or “pi3k”. However, we don’t want to leave out pathways that use capital “P53” or “PI3K”, so we set ignore.case = TRUE to ignore the case of the pathway names.

# Subset onto the "p53" and "pi3k" pathways.
cancer_pathways <- pathway_list[index_pathways]
cancer_pathways[1:3] # Print out the first three
## $CD28 dependent PI3K/Akt signaling ## [1] "10000" "117145" "1326" "207" "208" "2475" "253260" "2534" ## [9] "3932" "5170" "5290" "5295" "5296" "55615" "57761" "64223" ## [17] "79109" "8503" "9020" "940" "941" "942" ## ##$G beta:gamma signalling through PI3Kgamma
##  [1] "10000"  "10681"  "146850" "207"    "208"    "23533"  "2782"   "2783"
##  [9] "2784"   "2785"   "2786"   "2787"   "2788"   "2790"   "2791"   "2792"
## [17] "2793"   "387"    "5170"   "51764"  "5294"   "54331"  "55970"  "59345"
## [25] "94235"
##
## $Regulation of TP53 Degradation (See also: Regulation of TP53 Expression and Degradation) ## [1] "10000" "1017" "1029" "11200" "117584" "1616" "207" "208" ## [9] "2475" "253260" "4193" "4194" "472" "51230" "5170" "5515" ## [17] "5516" "5518" "5519" "5527" "55615" "6233" "64223" "6446" ## [25] "7157" "7311" "7314" "7316" "7874" "79109" "80196" "890" ## [33] "8900" "900" "9099" "983" "639" Subsetting on these cancer-related pathways results in 19 pathways. In general, the pathway list obtained from the dnapath::get_reactome_pathways() function can be subset based on the pathway names, as illustrated in this example. Here, we focused on cancer pathways, but the regular expression “((p53)|(pi3k))” can be replaced to search for any desired pathways. To learn more about the grep() function and regular expression, see chapter 17 of R Programming for Data Science by Roger Peng. ## 1.2 Subsetting pathways using genes As in the previous example, suppose we are interested in pathways that are related to “p53” or “pi3k”. However, this time we want to identify all pathways that involve the genes, not just pathways that are named after them. (Note that some pathways may contain the p53 gene but not have “p53” in its name.) Since the pathway list uses entrezgene IDs, the first step is to find the entrezgene ID for each gene of interest. The ID for P53 is 7157. PI3K actually refers to a family of enzymes; one class of genes in this family include PIK3CA, PIK3CB, PIK3CD, and PIK3CG, which have the entrezgene IDs 5290, 5291, 5293, and 5294. # Make a vector of all entrezgene IDs that we are interested in. genes_of_interest <- c(7157, 5290, 5291, 5293, 5294) # Obtain index of pathways containing any genes of interest. index_pathways <- which(sapply(pathway_list, function(pathway) { any(genes_of_interest %in% pathway) })) The sapply() function is used to loop over each pathway in the pathway list, and for each we determine if any of the genes of interest are in the pathway’s gene set. We end up with a set of indices identifying which pathways contain at least one of the five targeted genes. # Subset onto pathways containing p53 gene or pi3k family of genes. cancer_pathways <- pathway_list[index_pathways] cancer_pathways[1:3] # Print out the first three ##$Activation of BH3-only proteins
##  [1] "10000"  "10018"  "10971"  "140735" "1869"   "207"    "208"    "23368"
##  [9] "27113"  "2810"   "5366"   "5533"   "5534"   "5599"   "572"    "572"
## [17] "596"    "637"    "7027"   "7029"   "7157"   "7159"   "7161"   "7529"
## [25] "7531"   "7532"   "7533"   "7534"   "8626"   "8655"   "90427"
##
## $Signaling by ERBB2 ## [1] "10000" "10193" "10273" "10718" "10718" "11140" "145957" "145957" ## [9] "1729" "1839" "1839" "1950" "1950" "1956" "1956" "2064" ## [17] "2064" "2065" "2065" "2066" "2066" "2069" "2069" "207" ## [25] "208" "2534" "2549" "26469" "2885" "2885" "2886" "3084" ## [33] "3084" "3265" "3265" "3320" "3845" "3845" "387" "4145" ## [41] "4893" "4893" "51072" "5290" "5295" "5335" "5578" "5580" ## [49] "5581" "55914" "5753" "5782" "5782" "6233" "6464" "6464" ## [57] "6654" "6654" "6714" "685" "685" "7311" "7314" "7316" ## [65] "7525" "8065" "9101" "9542" "9542" ## ##$CD28 co-stimulation
##  [1] "10000"  "117145" "1326"   "207"    "208"    "2475"   "253260" "2534"
##  [9] "2885"   "3932"   "4067"   "5058"   "5062"   "5063"   "5170"   "5290"
## [17] "5295"   "5296"   "55615"  "57761"  "5879"   "64223"  "6714"   "7409"
## [25] "7525"   "79109"  "8503"   "9020"   "940"    "9402"   "941"    "942"
## [33] "998"

Subsetting results in 80 pathways that contain at least one of the five cancer-related genes of interest. In general, any set of entrezgene IDs can be used to subset the pathways obtained from the dnapath::get_reactome_pathways() function, as illustrated in this example.

## 1.3p53_pathways data

The dnapath package provides an example pathway list named p53_pathways. This is a small list provided for illustrative purposes and for use in examples. It includes T53 Reactome pathways, which was created using the steps described previously.

# Obtain index of pathways containing "p53" in their name.
index_pathways <- grep("(p53)", names(pathway_list),ignore.case = TRUE)
p53_pathways <- pathway_list[index_pathways]

The dataset was compressed and saved for the package using the code:

# Not run: Only for building the package.
usethis::use_data(p53_pathways)
tools::resaveRdaFiles(file.path("data"), compress = "auto")
tools::checkRdaFiles(file.path("data"))

# 2 Gene expression example

A small gene expression dataset is provided in dnapath. This example data is a Mesothelioma dataset containing gene expression and clinical data that were generated by The Cancer Genome Atlas (TCGA) and is downloaded using the LinkedOmics portal.

## 2.1meso data

"TCGA_MESO__MS__Clinical__Clinical__01_28_2016__BI__",
"Clinical__Firehose.tsi")
"TCGA_MESO__UNC__RNAseq__HiSeq_RNA__01_28_2016__BI__Gene",
"__Firehose_RSEM_log2.cct.gz")

### 2.1.1 Clinical dataset

The differential network analysis performed using dnapath() compares the gene-gene association structure between two groups (populations). In the Mesothelioma dataset, the clinical data can be used to identify two subgroups of interest. In this example dataset, the tumor stage (classified as stage i, ii, iii, or iv) will be used to partition the data. In practice, dnapath() could be used make pair-wise comparisons across these tumor stages, or perhaps some stages can be combined together. But for the purpose of creating a small, example dataset, we will only look at tumor stages ii and iv.

The first step is to process the downloaded data so that it contains one column of sample IDs and one column of tumor stage.

# First, we can glimpse at the raw data to see how it is structured.
clinical[1:5, 1:3]
## # A tibble: 5 × 3
##   attrib_name       TCGA.3H.AB3O TCGA.3H.AB3S
##   <chr>             <chr>        <chr>
## 1 years_to_birth    58           69
## 2 pathologic_stage  stageiii     stageii
## 3 pathology_T_stage t3           t2
## 4 pathology_N_stage n0           n0
## 5 pathology_M_stage m0           m0
# The columns contain samples, and rows contain different variables
# We want the transpose of this.
variable_names <- clinicalattrib_name # Store the variable names. # Transpose the matrix so that columns correspond to variables. clinical <- t(clinical[, -1]) # Don't include the "attrib_name" column as a row. colnames(clinical) <- variable_names # Rename columns using variable names. # Glimpse at the transposed dataset: clinical[1:4, 1:4] ## years_to_birth pathologic_stage pathology_T_stage ## TCGA.3H.AB3O "58" "stageiii" "t3" ## TCGA.3H.AB3S "69" "stageii" "t2" ## TCGA.3U.A98E "71" "stageiv" "t4" ## TCGA.3U.A98G "71" "stageiv" "t4" ## pathology_N_stage ## TCGA.3H.AB3O "n0" ## TCGA.3H.AB3S "n0" ## TCGA.3U.A98E "n2" ## TCGA.3U.A98G "n0" # Now that the rows correspond to subjects and columns to variables, # the next step is to add a new column that contains the sample IDs. clinical <- cbind(ID = rownames(clinical), clinical) rownames(clinical) <- NULL # Remove the row names. # Glimpse at the data. clinical[1:4, 1:4] ## ID years_to_birth pathologic_stage pathology_T_stage ## [1,] "TCGA.3H.AB3O" "58" "stageiii" "t3" ## [2,] "TCGA.3H.AB3S" "69" "stageii" "t2" ## [3,] "TCGA.3U.A98E" "71" "stageiv" "t4" ## [4,] "TCGA.3U.A98G" "71" "stageiv" "t4" # For the example dataset, we are only interested in ID and tumor stage. clinical <- clinical[, c("ID", "pathologic_stage")] # The two groups that will be compared are stage 2 and stage 4. # Subset rows onto those that have stage 2 or stage 4. clinical <- clinical[clinical[, "pathologic_stage"] %in% c("stageii", "stageiv"), ] clinical[1:5, ] ## ID pathologic_stage ## [1,] "TCGA.3H.AB3S" "stageii" ## [2,] "TCGA.3U.A98E" "stageiv" ## [3,] "TCGA.3U.A98G" "stageiv" ## [4,] "TCGA.SC.AA5Z" "stageiv" ## [5,] "TCGA.UT.A88D" "stageii" # We are left with only stage 2 and stage 4 samples: table(clinical[, 2]) ## ## stageii stageiv ## 16 16 The final dataset contains 16 samples for group 1 (tumor stage ii) and 16 samples for group 2 (tumor stage iv). Note that the groups don’t need equal sample sizes, but they happen to be equal in these data. ### 2.1.2 RNA-seq dataset The second step is to process the gene expression dataset. The LinkedOmics portal provides normalized counts from the RNA-seq experiments (the data are not raw reads, which must be aligned and annotated in order to obtain gene expression counts). For the purpose of creating a small example dataset, these data will be subset onto the 32 samples obtained in the previous section, and onto the 160 genes in the cancer_pathway list. # First, we can glimps at the raw data to see how it is structured. rnaseq[1:5, 1:5] ## # A tibble: 5 × 5 ## attrib_name TCGA.3H.AB3K TCGA.3H.AB3L TCGA.3H.AB3M TCGA.3H.AB3O ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 A1BG 6.25 5.02 8.02 6.57 ## 2 A1CF 0.687 0 0.462 0 ## 3 A2BP1 6.21 3.74 6.92 1.30 ## 4 A2LD1 5.26 5.70 7.49 7.03 ## 5 A2ML1 0.687 0 0 0 # As with the clinical data, we need to transpose the raw data so that # samples are in the rows. gene_symbols <- rnaseqattrib_name # Store the gene names

rnaseq <- t(rnaseq[, -1])
colnames(rnaseq) <- gene_symbols

rnaseq[1:5, 1:5]
##                A1BG   A1CF  A2BP1  A2LD1  A2ML1
## TCGA.3H.AB3K 6.2508 0.6869 6.2128 5.2615 0.6869
## TCGA.3H.AB3L 5.0232 0.0000 3.7424 5.6967 0.0000
## TCGA.3H.AB3M 8.0151 0.4622 6.9154 7.4918 0.0000
## TCGA.3H.AB3O 6.5719 0.0000 1.2995 7.0304 0.0000
## TCGA.3H.AB3S 5.2770 0.0000 6.7363 7.0906 0.0000
# Check if all 32 clinical samples contain gene expression data.
all(clinical[, "ID"] %in% rownames(rnaseq))
## [1] TRUE
# Subset the rnaseq data onto those in the clinical data
rnaseq <- rnaseq[rownames(rnaseq) %in% clinical[, "ID"], ]

# Finally, we must make sure the two dataset are aligned.
# There are many ways to do this, here is one:
#   For each ID in the clinical dataset, find the corresponding row in rnaseq.
#   The rows are then reordered to match the IDs in clinical.
rnaseq <- rnaseq[sapply(clinical[, "ID"],
function(ID) which(rownames(rnaseq) == ID)), ]

# Check that the IDs match:
all(rownames(rnaseq) == clinical[, "ID"])
## [1] TRUE
rnaseq[1:5, 1:5]
##                A1BG   A1CF  A2BP1  A2LD1 A2ML1
## TCGA.3H.AB3S 5.2770 0.0000 6.7363 7.0906     0
## TCGA.3U.A98E 4.1398 0.0000 6.0186 5.8955     0
## TCGA.3U.A98G 7.9146 0.0000 2.7608 6.8913     0
## TCGA.SC.AA5Z 5.8400 0.0000 7.3606 5.7489     0
## TCGA.UT.A88D 6.1513 0.6561 4.2798 5.1120     0

The final step of processing the gene expression data is to subset the columns to the genes contained in the cancer_pathway list. However, the raw data provide gene symbols rather than entrezgene IDs (which the Reactome pathways use). So, we must either relabel the gene symbols in the rnaseq dataset to entrezgene IDs, or relabel the entrezgene IDs in cancer_pathway to contain gene symbols. Either of these operations can be easily carried out using the symbol_to_entrez(), entrez_to_symbol(), and rename_genes() functions provided in the package. These two approaches are briefly demonstrated next.

Note: Internet connection is required to connect to biomaRt, which the entrez_to_symbol and symbol_to_entrez methods use to map entrezgene IDs and gene symbols. The dir_save argument can be set when running these to save the ID mapping obtained from biomaRt in the specified directory. This way, the mapping can be obtained once (using the internet connection) and accessed from memory in all future calls of these functions. The temporary directory tempdir() is used in this example.

First, we show how to convert gene symbols in the gene expression dataset to entrezgene IDs.

# Convert gene symbols -> entrezgene IDs
gene_symbols <- colnames(rnaseq) # Extract the gene symbols to relabel.
gene_mat <- symbol_to_entrez(gene_symbols, "human",
dir_save = tempdir()) # Obtain mapping.
gene_mat[1:5, ]
##   hgnc_symbol entrezgene_id
## 1        A1BG             1
## 2        A1CF         29974
## 3       A2BP1            -1
## 4       A2LD1            -1
## 5       A2ML1        144568
# The rename_genes() method is a multipurpose function that can be used to rename
# the genes in a vector (the current case), a pathway list, or the 'dnapath_list'
# object that is returned from dnapath() after performing the differential
# network analysis.
gene_entrez <- rename_genes(gene_symbols, gene_mat) # Rename the symbols.
colnames(rnaseq) <- gene_entrez # Update the columns using the entrezgene IDs.
rnaseq[1:5, 1:5] # Columns now contain entrezgene IDs
##                   1  29974     -1     -1 144568
## TCGA.3H.AB3S 5.2770 0.0000 6.7363 7.0906      0
## TCGA.3U.A98E 4.1398 0.0000 6.0186 5.8955      0
## TCGA.3U.A98G 7.9146 0.0000 2.7608 6.8913      0
## TCGA.SC.AA5Z 5.8400 0.0000 7.3606 5.7489      0
## TCGA.UT.A88D 6.1513 0.6561 4.2798 5.1120      0

This is the approach used to create the meso data that is provided in this package. In general, it is advised to use entrezgene IDs throughout the analysis pipeline, and only convert to symbols at the very end. Entrezgene IDs are unique identifiers for each gene, whereas gene symbols can sometimes be ambiguous. This ambiguity can lead to information loss if used early on the analysis. For example, notice the third column in above output, the “A2BP1” gene, was not mapped to an entrezgene ID. This is because “A2BP1” is an alias for the RBFOX1 gene. The symbol_to_entrez() function obtains a mapping for RBFOX1 but not for the alias A2BP1. Because an entrezgene ID was not identified for A2BP1, it will not be identified in any of the pathways during the differential network analysis.

Gene symbols should only be considerd at the end of the analysis. It can be conventient to rename the IDs into symbols, since symbols are more recognizable for viewing and interpreting the results.

Important note: In the Mesothelioma dataset, we are utilizing summarized gene expression counts as opposed to raw RNA-sequencing reads. If the raw reads are unavailable, then the user may not be able to control whether or not the genes are entrezgene IDs or gene symbols. In this case, we must be aware of the limitations and take caution if many gene symbols fail to map to entrezgene IDs. However, if raw reads are available, then it is recommended to align and annotate the reads using entrezgene IDs when obtaining the expression counts.

To finish the processing, the columns of rnaseq are subset onto the genes in p53_pathways.

# Subset the columns onto only those genes contained in 'p53_pathways'
index_genes <- which(colnames(rnaseq) %in% get_genes(p53_pathways))
rnaseq <- rnaseq[, index_genes]
dim(rnaseq) # 32 samples with 150 genes.
## [1]  32 143

For completeness, we will show how to convert the entrezgene IDs in the pathway list into gene symbols.

# Convert entrezgene IDs -> gene symbols
gene_entrez <- get_genes(p53_pathways) # Extract genes from pathway list.
gene_mat <- entrez_to_symbol(gene_entrez, "human",
dir_save = tempdir()) # Obtain mapping.
gene_mat[1:5, ]
##   entrezgene_id hgnc_symbol
## 1         10000        AKT3
## 2          1017        CDK2
## 3          1029      CDKN2A
## 4         11200       CHEK2
## 5        117584        RFFL
# Convert the entrezgene IDs into gene symbols in the pathway list.
new_pathway_list <- rename_genes(p53_pathways, gene_mat)
new_pathway_list[1:2] # Print the first two pathways.
## $Regulation of TP53 Degradation (See also: Regulation of TP53 Expression and Degradation) ## [1] "AKT3" "CDK2" "CDKN2A" "CHEK2" "RFFL" "DAXX" "AKT1" ## [8] "AKT2" "MTOR" "RICTOR" "MDM2" "MDM4" "ATM" "PHF20" ## [15] "PDPK1" "PPP2CA" "5516" "PPP2R1A" "PPP2R1B" "PPP2R5C" "PRR5" ## [22] "RPS27A" "MLST8" "SGK1" "TP53" "UBA52" "UBB" "7316" ## [29] "USP7" "MAPKAP1" "RNF34" "CCNA2" "CCNA1" "CCNG1" "USP2" ## [36] "CDK1" "PRDM1" ## ##$Regulation of TP53 Activity through Acetylation
##  [1] "AKT3"    "CHD3"    "CHD4"    "EP300"   "AKT1"    "AKT2"    "BRD1"
##  [8] "BRPF3"   "BRD7"    "HDAC1"   "HDAC2"   "ING2"    "PIN1"    "PIP4K2A"
## [22] "MEAF6"   "TP53"    "BRPF1"   "PIP4K2C" "KAT6A"   "PIP4K2B" "ING5"
## [29] "PIP4P1"  "MTA2"

### 2.1.3 Finalizing

The dnapath() function requires either a list of two gene expression datasets (each corresponding to a different group), or a single gene expression data and a ‘group’ vector that indicates which rows belong to which group. The meso data is formatted using the second option. The current rnaseq dataset that we have processed contains samples from both groups; the group information can be obtained from the clinical dataset. These two pieces of information can be conveniently stored as a list.

meso <- list(gene_expression = rnaseq,
groups = clinical[, "pathologic_stage"])

The list was compressed and saved for the package using the code:

# Not run: Only for building the package.
usethis::use_data(meso)
tools::resaveRdaFiles(file.path("data"), compress = "auto")
tools::checkRdaFiles(file.path("data"))

# 3 biomaRt

The biomaRt package is used to link gene symbols with entrezgene IDs. The entrez_to_symbol() and symbol_to_entrez() functions require an internet connection to connect with the biomaRt database.

This package contains an instance of the mapping between HGNC gene symbols and entregene IDs, which will be used if internet access is not available or if the biomaRt package is not installed. However, this dataset will not be applicable to species other than hsapiens and may not contain the most up-to-date mapping. A warning is given when it is used.

# Not run: Internet access and biomaRt package is required.
mart <- biomaRt::useMart(biomart = "ensembl",
dataset = "hsapiens_gene_ensembl")

biomart_hsapiens <- biomaRt::getBM(attributes = c("entrezgene_id", "hgnc_symbol"),
mart = mart)

library(dplyr) # For pipe operator "%>%".
biomart_hsapiens %>%
dplyr::filter(!is.na(entrezgene_id)) %>%
dplyr::group_by(entrezgene_id) %>%
dplyr::summarise(across(everything(), dplyr::first)) ->
biomart_hsapiens

This mapping was compressed and saved for the package using the code:

# Not run: Only for building the package.
usethis::use_data(biomart_hsapiens)
tools::resaveRdaFiles(file.path("data"), compress = "auto")
tools::checkRdaFiles(file.path("data"))