--- title: "Gene Set Enrichment Meta-Analysis with GSEMA package" author: - name: Juan Antonio Villator-García affiliation: - Department of Statistics and Operational Research. University of Granada - GENYO, Centre for Genomics and Oncological Research email: javillatoro@ugr.es - name: Pablo Jurado-Bascón affiliation: - Department of Statistics and Operational Research. University of Granada - GENYO, Centre for Genomics and Oncological Research - name: Pedro Carmona-Sáez affiliation: - Department of Statistics and Operational Research. University of Granada - GENYO, Centre for Genomics and Oncological Research email: pcarmona@ugr.es abstract: > GSEMA (Gene Set Enrichment Meta-Analysis) performs all the necessary steps of a gene set enrichment meta-analysis based on the combination of effect size. In preparation for calculating different effect sizes, first, a single sample enrichment analysis is performed. Based on the various enrichment values, the effect size is calculated as a difference of means. Subsequently, the package allows for the common steps of a meta-analysis based on the effect sizes. date: "`r BiocStyle::doc_date()`" package: "`r pkg_ver('GSEMA')`" vignette: > %\VignetteIndexEntry{Gene Set Enrichment Meta-Analysis with GSEMA package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteKeywords{Gene set Enrichment Analysis, Meta-analysis} output: BiocStyle::html_document: toc: true toc_float: true number_sections: true bibliography: GSEMA_refs.bib --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Introduction *GSEMA* is a package designed to perform gene set enrichment meta-analysis. The gene set enrichment meta-analysis allows for obtaining the differentially regulated gene sets or pathways that are shared across various studies. Specifically, GSEMA applies a meta-analysis based on the combination of effect sizes obtained from single sample enrichment (SSE) analysis applied to individual the different studies. The different effect sizes of each study for each of the gene sets are calculated using a difference of means based on the enrichment scores obtained from SSE analysis. GSEMA offers various methods to obtain the statistics for the difference of means. Subsequently, GSEMA allows for applying the various steps typical of a meta-analysis, in a similar way to the gene expression meta-analysis (@Toro_2020), although different adjustments to the calculated statistics are performed in order to correct the possible existence of false positive bias. This vignette provides a tutorial-style introduction to all the steps necessary to properly conduct gene set enrichment meta-analysis using the *GSEMA* package. # Previous steps: Meta-analysis object *GSEMA* uses a specific object as input, which is a list of nested lists where each nested list corresponds to a study. This object can be created directly by the users or they can use **createObjectMApath()** function to create it. For the examples that are going to be shown, synthetic data will be used. We load the sample data into our R session. ```{r} library(GSEMA) data("simulatedData") ``` * study1Ex, study2Ex: two expression matrices. * study1Pheno, study2Pheno: two phenodata objects. * GeneSets: a list of gene sets with each element are the genes that belong to a pathway. * objectMApathSim: the meta-analysis object created from the different expression matrices and phenodatas. ## Meta-analysis object creation (objectMApath) As previously stated, the meta-analysis input in GSEMA is a list of nested lists. Each nested list contains two elements: * A gene set matrix with gene sets in rows and samples in columns *A vector of 0 and 1 indicating the group of each sample. 0 represents reference group (usually controls) and 1 represents experimental group (usually cases). This object can be created directly by the user or we can make use of **createObjectMApath()** function, which creates the *objectMApath* after indicating: * the reference group (refGroups) and the experimental group (expGroups) in the phenodata. * the gene sets (geneSets) that will be consider. * the single sample enrichment method to applied to calculate the gene sets matrices **createObjectMApath()** function allows to create the object needed to perform meta-analysis. In this case, it is necessary to indicate as input of the function the variables that contain the experimental and reference groups: * listEX: a list of dataframes or matrix (genes in rows and samples in columns). A list of ExpressionSets can be used too: ```{r} #List of expression matrices listMatrices <- list(study1Ex, study2Ex) ``` * listPheno: a list of phenodatas (samples in rows and covariables in columns). If the object listEX is a list of ExpressionSets this element can be null. ```{r} listPhenodata <- list(study1Pheno, study2Pheno) ``` * namePheno: a list or vector of the different column names or column positions from the pheno used for perfoming the comparison between groups. Each element of namePheno correspont to its equivalent element in the listPheno. (default a vector of 1, all the first columns of each elements of listPheno are selected) * expGroups: a list or vector of the group names or positions from namePheno variable used as experimental group (cases) to perform the comparison (default a vector of 1, all the first groups are selected). * refGroups: a list or vector of the group names or positions from namePheno variable used as reference group (controls) to perform the comparison (default a vector of 2, all the second groups are selected). It is important to note that if any element does not belong to the experimental or the reference group, that sample is not taken into account in the creation of meta-analysis object. Here, we have included an example to show how exactly the function is used: Since this function can be a bit complicated if there are many datasets, we recommend creating a vector to keep the column names of the phenodatas that contains the variable that identifies the groups to compare (**namePheno** argument). Moreover, we should create two others lits to indicate how to identify experimental (cases) and reference (controls) groups in these variables (**expGroups** and **refGroups** arguments). If we look at the example phenodatas we can observed that the groups variable is "condition". Experimental group is named as "Case" and reference group as "Healthy". ```{r} study1Pheno$Condition ``` * geneSets: a list of gene sets with each element are the genes that belong to a pathway: ```{r} head(names(GeneSets)) GeneSets[[1]] ``` * pathMethod: a character string indicating the method to calculate the gene sets matrices. The available methods are: + "GSVA": Gene Set Variation method (@Hanzelmann_2013) + "Zscore": Z-score method (@Lee_2008) + "ssGSEA": Single Sample Gene Set Enrichment Analysis method (@Barbie_2009) + "Singscore": Single sample scoring of molecular phenotypes (@Foroutan_2018) * minSize: Minimum size of the resulting gene sets after gene identifier mapping. By default, the minimum size is 7. * kdcf: Only neccesary for the GSVA method. Character vector of length 1 denoting the kernel to use during the non -parametric estimation of the cumulative distribution function of expression levels across samples. By default, kcdf="Gaussian" which is suitable when input expression values are continuous, such as microarray. When input expression values are integer counts, such as those derived from RNA-seq experiments, then this argument should be set to kcdf="Poisson". * normalize: boolean specifying if the gen set matrices should be normalized. Default value "TRUE". * n.cores: Number of cores to use in the parallelization of the datsets. By default, n.cores=1. * internal.n.cores: Number of cores to use in the parallelization of the single sample enrichment methods. By default internal.n.cores= 1. In parallelization, several aspects must be considered. "n.cores" refers to the parallelization of studies or datasets. Therefore, if we have 3 studies, the maximum number for "n.cores" will be 3. internal.n.cores refers to the parallelization of single sample enrichment methods. This is especially recommended for the ssGSEA method. For Singscore and GSVA, it may also be advisable. The process is parallelized based on the samples in each study. Therefore, the larger the number of samples, the slower the process will be. The number of cores that the computer will use is the multiplication of both parameters n.cores * internal.n.cores = total cores. We all this information we can create the vector for **namePheno** argument and the two list for **expGroups** and **refGroups**: ```{r} listMatrices <- list(study1Ex, study2Ex) listPhenodata <- list(study1Pheno, study2Pheno) phenoGroups <- c("Condition","Condition", "Condition") phenoCases <- list("Case", "Case", "Case") phenoControls <- list("Healthy", "Healthy", "Healthy") objectMApathSim <- createObjectMApath( listEX = listMatrices, listPheno = listPhenodata, namePheno = phenoGroups, expGroups = phenoCases, refGroups = phenoControls, geneSets = GeneSets, pathMethod = "Zscore", n.cores = 1, internal.n.cores = 1, minSize = 7) ``` The result obtained is the proper object to perform meta-analysis (**objectMApath**). # Performing Meta-analysis *GSEMA* package has implemented gene set enrichment meta-analysis techniques based on the combination of effect sizes in the **metaAnalysisDEpath()** function. Although this function can be applied directly, it is advisable to consider some previous steps. ## Filtering paths with low expression Before performing the meta-analysis, it is important to filter out those pathways with low expression in the datasets. This is because the results of the meta-analysis could be biased by the presence of pathways with low expression, which may lead to an increase in the number of false positives. *GSEMA* provides a function called **filterPaths()* that allows to filter out those pathways with low expression. The inputs of this function are: * **objectMApath**: the meta-analysis object of *GSEMA* package. * **threshold**:A number that indicates the threshold to eliminate a gene set. For a eliminate a gene set is necessary that the median for both groups are less than the threshold. If threshold = "sd" the threshold will be the standard deviation of the gene set. The default value is 0.65. ```{r} objectMApathSim <- filteringPaths(objectMApathSim, threshold = 0.65) ``` ## Calculation of the effects sizes *GSEMA* has implemented three different estimator for calculating the effect sizes in each study: * The **"limma"** method used the **limma** package (@limmaR_2020) to calculate the effect size and the variance of the effect size. The effect size is calculated from the moderated Student's t computed by **limma**. From it, the estimator of Hedges'g and its corresponding variance are obtained. In this way, some of the false positives obtained by the "SMD" method are reduced. * The **"SMD"** (Standardized mean different) method calculates the effect size using the Hedges'g estimator @Hedges_1981. * The **"MD"** (raw mean different) calculates the effects size as the difference between the means of the two groups (@Borenstein_2021). ## Performing meta-analysis: **metaAnalysisDEpath()** The **metaAnalysisDEpath()** function allows to perform a meta-analysis in only one step, needing only the meta-analysis object created previously. This function has as input: * objectMApath: The meta-analysis object of *GSEMA* package. * effectS: A list of two elements. The first element is a dataframe with gene sets in rows and studies in columns. Each component of the dataframe is the effect of a gene set in a study. The second element of the list is also a dataframe with the same structure, but in this case each component of the dataframe represent the variance of the effect of a gene set in a study. This argument should be only used in the case that objectMApath argument is null. * measure: A character string that indicates the type of effect size to be calculated. The options are "limma", "SMD" and "MD". The default value is "limma". See details for more information. * WithinVarCorrect A logical value that indicates if the within variance correction should be applied. The default value is TRUE. See details for more information (cite) * typeMethod: a character that indicates the method to be performed: + "FEM": Fixed Effects model. + "REM": Random Effects model. * missAllow: a number between 0 and 1 that indicates the maximum proportion of missing values allows in a sample. If the sample has more proportion of missing values, the sample will be eliminated. In the other case, the missing values will be imputed by using the K-NN algorithm included in **impute** package (@Impute_2019). In case the **objectMA** has been previously imputed, this element is not necessary. * numData: a number between 0 and 1 that indicates the minimum number of datasets in which a pathway must be contained to be included. In the following example, we have applied a Random Effect model to the *GSEMA* object ("objectMApathSim") we have been working with so far. In addition we have applied the "limma" method to calculate the effect size. Finally, we have allowed a 0.3 proportion of missing values in a sample and a gene set must have been contained in all studies: ```{r} results <- metaAnalysisESpath(objectMApath = objectMApathSim, measure = "limma", typeMethod = "REM", missAllow = 0.3, numData = length(objectMApathSim)) ``` The output of this function is a dataframe with the results of the meta-analysis where rows are the genes and columns are the different variables provided by the meta-analysis: ```{r} results[1,] results[nrow(results),] ``` ## Meta analysis results The results are provided in a dataframe with the variables: * Com.ES: combined effect of the gene set. * ES.var: variance of the combined effect of the gene set. * Qval: total variance of the gene set. * tau2: between-study variance of the gene set. * zval: combined effect value for a standard normal. I can be use in order to find out if the gene set is overexpressed (positive value) or underexpressed (negative value). * Pval: P-value of the meta-analysis for the gene set. * FDR: P-value adjusted of the meta-analysis for the gene set. * numDataset: Number of the datasets in which the gene set is included. ## Visualization of the results: heatmap Finally, we can represent in a heatmap the significant gene sets in order to observe how they are regulated in each of the studies. In **heatmapPaths()** function we have to include both the object that has been used in the meta-analysis, the result of it and the applied method. In addition, this package offers three different scaling approaches (**scaling**) in order to compare properly the gene set enrichment scores of the studies in the heatmap: * "zscor": It calculates a z-score value for each gene, that is, the mean gene expression from each gene is subtracted from each gene expression value and then it is divided by the standard deviation. * "swr": Scaling relative to reference dataset approach @Lazar_2013. * "rscale": It uses the rescale function of the **scales** package to scale the gene expresion @scalesR_2020. * "none": no scaling approach is applied. Moreover, in **regulation** argument, we can choose if we want to represent the over regulated or under regulated gene sets: * "up": only up-regulated gene sets are represented. * "down: only down-regulated gene sets are represented * "all": up-regulated and down-regulated gene sets are represented. We can choose the number of significant gene sets (**numSig**) that we want to be shown on the graph and the adjusted p-value from which a gene set is considered as significant (**fdrSig**). In addition, the gene sets that are not presented in one sample are represented in gray. Here we present an example of the heatmap which have been obtained from the result of applying a random effects model to the object "**objectMApathSim**" and making use of a "zscor" scaling approach. ```{r} res <- heatmapPaths(objectMA=objectMApathSim, results, scaling = "zscor", regulation = "all",breaks=c(-2,2), fdrSig = 0.05, comES_Sig = 1, numSig=20, fontsize = 5) ``` # Additional information ## Calculating individual Effects size The **calculateESpath()** function returns the effects size in each of the studies. Moreover, it calculates the variance of each of the effects. ```{r} #Effects <- calculateESpath(objectMApath = objectMApathSim, measure = "limma") #Effects[[1]]["Simulated_Pathway",] #Effects[[2]]["Simulated_Pathway",] ``` #Session info ```{r} sessionInfo() ``` # References