---
title: "An Introduction to SOHPIE"
author: "Seungjun Ahn"
date: "October 20th, 2023"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{An Introduction to SOHPIE}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

We introduce a Statistical Approach via Pseudo-value Information and Estimation (SOHPIE; pronounced as "Sofie"). This is a regression modeling method for differential network (DN) analysis that can include covariate information in analyzing microbiome data. SOHPIE is a software extension of our methodology work [1].

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


## Requirements
Please install these R packages prior to use SOHPIE.
```{r}
# library(robustbase) # To fit a robust regression.
# library(parallel) # To use mclapply() when reestimating the association matrix.
# library(dplyr)  # For the convenience of tabulating p-values, coefficients, and q-values.
# library(fdrtool) # For false discovery rate control.
# library(gtools) # To estimate an association matrix via SparCC.
```


```{r setup}
library(SOHPIE)
```

### Load the study data from SOHPIE R package:
Two sample datasets are available in this package. One (`combinedamgut`) is from the American Gut Project [2] and contains 138 taxa and 268 subjects. In this vignette, the first 30 out of 138 taxa will be used for the simple demonstration purpose. The other (`combineddietswap`) is from the geographical epidemiology study of diet swap intervention [3] that includes 112 taxa with 37 subjects (20 African Americans from Pittsburgh and 17 rural South Africans). The full data of each study are available in the SpiecEasi and microbiome R packages, respectively.

### Example I: American Gut Project Data
```{r}
set.seed(20050505)
data(combinedamgut) # A complete data containing columns with taxa and clinical covariates.
```


### Data processing for the toy example using sample dataset from American Gut Project:
The main grouping variable will be the indicator variable for the status of living with a dog. After the data processing, the indices of subjects will be available for each 'Not living with a dog (Group A)' vs. 'Living with a dog (Group B).' We need these indices for the estimation of group-specific $p \times p$ association matrices (and re-estimation of association matrices for pseudo-value calculations later).
```{r}
# Note: Again, we will use a toy example with the first 30 out of 138 taxa.
OTUtab = combinedamgut[ , 8:37]

# Clinical/demographic covariates (phenotypic data):
# Note: All of these covariates in phenodat below will be included in the regression 
#       when you use SOHPIE_DNA function later. Please make sure 
#       phenodat below include variables that will be analyzed only.
phenodat = combinedamgut[, 1:7] # first column is ID, so not using it.
```

```{r}
# Obtain indices of each grouping factor.
# In this example, a variable indicating the status of living with a dog was chosen (i.e. bin_dog).
# Accordingly, Groups A and B imply living without and with a dog, respectively.
newindex_grpA = which(combinedamgut$bin_dog == 0)
newindex_grpB = which(combinedamgut$bin_dog == 1)
```


### Fit a pseudo-value regression via SOHPIE_DNA() function:
Upon our data processing step above is complete, we can then fit a pseudo-value regression using `SOHPIE_DNA` function. An important note! Please provide the object name of each OTU table and clinical/demographic data (i.e. metadata) separately in the function. In addition, you must indicate the object names of the indices for each group of a binary indicator variable that is used as a main predictor variable (e.g. living with a dog vs. without a dog). Lastly, you must enter a trimming proportion `c`, which ranges from 0.5 to 1. 
```{r}
SOHPIEres <- SOHPIE_DNA(OTUdat = OTUtab, clindat = phenodat, 
                        groupA = newindex_grpA, groupB = newindex_grpB, c = 0.5)
```

### Additional features available in SOHPIE package:
Now, I would like to show you that SOHPIE has some convenient tools/functions after fitting a pseudo-value regression. There are functions that you can quickly extract names of taxa that are significantly differentially connected (DC; `DCtaxa_tab`), as well as p-values (`pval` and `pval_specific_var`), adjusted p-values (q-values; `qval` and `qval_specific_var`), coefficient estimates (`coeff` and `coeff_specific_var`), and standard errors (`stderrs` and `stderrs_specific_var`) of all variables that are considered in the regression or a specific variable.
```{r}
# qval() function will get you a table with q-values.
qval(SOHPIEres)
```
`qval_specific_var` function will be useful to retrieve the q-values of a specific variable, `bin_dog` in this example.  

```{r}
# Create an object to keep the table with q-values.
qvaltab <- qval(SOHPIEres)
# Retrieve a vector of q-values for a single variable of interest.
qval_specific_var(qvaltab = qvaltab, varname = "bin_dog")
```

`DCtaxa_tab` will return a list containing of (1) names and q-values of taxa that are significantly DC between two biological conditions and (2) names of DC taxa only. 
```{r}
# Please do NOT forget to provide the name of variable in DCtaxa_tab(groupvar = )
# and the level of significance (0.3 in this example).
DCtaxa_tab <- DCtaxa_tab(qvaltab = qvaltab, groupvar = "bin_dog", alpha = 0.3)
DCtaxa_tab
```


### Example II: Diet Exchange Study Data

```{r}
data(combineddietswap)

OTUtab = combineddietswap[ , 5:ncol(combineddietswap)]
phenodat = combineddietswap[ , 1:4] # first column is ID, so not using it.

```


```{r}
# Obtain indices for each groups 
# (i.e. African-Americans from Pittsburgh (AAM) vs. Africans from rural South Africa (AFR))
# at baseline (time = 1) and at 29-days (time = 6)

# Group A1 for AAM at baseline.
newindex_A1 = which(combineddietswap$timepoint == 1 & combineddietswap$nationality == "AAM")
# Group A6 for AAM at 29-days.
newindex_A6 = which(combineddietswap$timepoint == 6 & combineddietswap$nationality == "AAM")
# Group B1 for AFR at baseline.
newindex_B1 = which(combineddietswap$timepoint == 1 & combineddietswap$nationality == "AFR")
# Group A6 for AFR at 29-days.
newindex_B6 = which(combineddietswap$timepoint == 6 & combineddietswap$nationality == "AFR")
```

We are done with loading the data and obtaining indices 
for each group for each time point. We then move onto the analysis step!

The first step is to estimate and re-estimate association matrices for each of groups A1, A6, B1, and A6 shown above. Of note, a list output comprises `data.frame` objects as `assomat` and `reest.assomat`.
```{r}
est_asso_matA1 = asso_mat(OTUdat=OTUtab, group=newindex_A1)
est_asso_matA6 = asso_mat(OTUdat=OTUtab, group=newindex_A6)
est_asso_matB1 = asso_mat(OTUdat=OTUtab, group=newindex_B1)
est_asso_matB6 = asso_mat(OTUdat=OTUtab, group=newindex_B6)

## For each group, we take the difference of estimated 
## association matrices between time points (29-days minus baseline).
asso_mat_diffA61 = est_asso_matA6$assomat - est_asso_matA1$assomat
asso_mat_diffB61 = est_asso_matB6$assomat - est_asso_matB1$assomat

## Similarly, for each group, we take the difference of re-estimated 
## association matrices between time points (29-days minus baseline).
asso_mat_drop_diffA61 = mapply('-', est_asso_matA6$reest.assomat,
                               est_asso_matA1$reest.assomat, SIMPLIFY=FALSE)
asso_mat_drop_diffB61 = mapply('-', est_asso_matB6$reest.assomat,
                               est_asso_matB1$reest.assomat, SIMPLIFY=FALSE)
```

Then, for each group, we calculate changes in network centrality of each taxa between the association matrices estimated from the whole data and between the association matrices re-estimated from the leave-one-out sample.
```{r}
## For changes in network centrality between association matrices estimated from the whole data.
thetahat_grpA = thetahats(asso_mat_diffA61)
thetahat_grpB = thetahats(asso_mat_diffB61)
## For changes in network centrality between association matrices 
## re-estimated from the leave-one-out sample.
thetahat_drop_grpA = sapply(asso_mat_drop_diffA61, thetahats)
thetahat_drop_grpB = sapply(asso_mat_drop_diffB61, thetahats)
```

Next, we calculate jackknife pseudo-values.
```{r}
# Sample sizes for each group.
n_A <- length(newindex_A1) 
n_B <- length(newindex_B1)

# Jackknife pseudo-values.
thetatilde_grpA = thetatildefun(thetahat_grpA, thetahat_drop_grpA, n_A)
thetatilde_grpB = thetatildefun(thetahat_grpB, thetahat_drop_grpB, n_B)

thetatilde = rbind(thetatilde_grpA, thetatilde_grpB)
```

Fit a robust regression regressing covariates on pseudo-values. Please note that the metadata/phenotype data should contain a set of predictors to be fitted only, so choose them wisely. Additionally, a trimming proportion `c` of the least trimmed squares estimator should be entered between 0.5 and 1.
```{r}
fitmod = pseudoreg(pseudoval=thetatilde, clindat=phenodat, c=0.5)
```

Lastly, we can extract summary results from the fitted model from `fitmod` object above. A list of `data.frame` objects for coefficient estimates, p-values, and q-values will be available.

```{r}
summary.result = pseudoreg.summary(pseudo.reg.res=fitmod, 
                                   taxanames=colnames(OTUtab))

## In this study, the main grouping variable was nationality. 
# We can use qval_specific_var to see the q-values only.
# of nationality. Alternatively, we further use DCtaxa_tab to see
# the DC taxa with their p-values.
qval_specific_var(summary.result$q_values, varname = "nationalityAFR")
DCtaxa_tab(summary.result$q_values, groupvar = "nationalityAFR", alpha=0.05)

```
\newpage
## References
[1] Ahn S, Datta S. (2023). Differential Co-Abundance Network Analyses for Microbiome Data Adjusted for Clinical Covariates Using Jackknife Pseudo-Values. Under Review at $\textit{BMC Bioinformatics}$.

[2] McDonald D. et al. (2018). American Gut: an Open Platform for Citizen Science Microbiome Research. $\textit{mSystems}$. **3**(3), e00031–18

[3] O'Keefe SJ. et al. (2015). Fat, fibre and cancer risk in African Americans and rural Africans. $\textit{Nat Commun}$. **6**, 6342