---
title: "Using FLORAL for Microbiome Analysis"
output: 
  rmarkdown::html_vignette:
    md_extensions: [ 
      "-autolink_bare_uris" 
    ]
vignette: >
  %\VignetteIndexEntry{Using FLORAL for Microbiome Analysis}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  out.width = "100%"
)
```

```{r setup, warning=FALSE, message=FALSE}
library(FLORAL)
library(dplyr)
library(patchwork)
```

## Data
We will be using data from the `curatedMetagenomicData` package. For easier installation, we saved a flat copy of the data, but the steps below show how that was created. 

```r
if (! "BiocManager" %in% installed.packages()) install.packages("BiocManager")
if (! "curatedMetagenomicData" %in% installed.packages()) BiocManager::install("curatedMetagenomicData")
if (! "patchwork" %in% installed.packages()) install.packages("patchwork")
library(curatedMetagenomicData)

# Take a look at the summary of the studies available:
curatedMetagenomicData::sampleMetadata  |> group_by(.data$study_name, .data$study_condition) |> count() |> arrange(.data$study_name) 
#As an example, let us look at the `YachidaS_2019` study between healthy controls and colorectal cancer (CRC) patients.
curatedMetagenomicData::curatedMetagenomicData("YachidaS_2019") 
# note --  if you are behind a firewall, see the solutions to 500 errors here:
# https://support.bioconductor.org/p/132864/
rawdata <- curatedMetagenomicData::curatedMetagenomicData("2021-03-31.YachidaS_2019.relative_abundance", dryrun = FALSE, counts = TRUE) |>  mergeData()
x <- SummarizedExperiment::assays(rawdata)$relative_abundance %>% t()
y <- rawdata@colData$disease

save(list = c("x", "y"),  file = file.path("inst", "extdata", "YachidaS_2019.Rdata"))
```

```{r getData}
load(system.file("extdata", "YachidaS_2019.Rdata", package="FLORAL"))
```

## Running FLORAL

We extracted the data from the `TreeSummarizedExperiment` object to two objects: the taxa matrix `x` and the "outcomes" vector `y` of whether a patient is healthy or has colorectal cancer (CRC). Note that for binary outcomes, the input vector `y` needs to be formatted with entries equal to either `0` or `1`. In addition, we need to specify `family = "binomial"` in `FLORAL` to fit the logistic regression model. To print the progress bar as the algorithm runs, please use `progress = TRUE`. 

```{r floral}

x <- x[y %in% c("CRC","healthy"),]
x <- x[,colSums(x >= 100) >= nrow(x)*0.2] # filter low abundance taxa

colnames(x) <- sapply(colnames(x), function(x) strsplit(x,split="[|]")[[1]][length(strsplit(x,split="[|]")[[1]])])

y <- as.numeric(as.factor(y[y %in% c("CRC","healthy")]))-1
fit <- FLORAL(x = x, y = y, family="binomial", ncv=10, progress=TRUE)
```

## Interpreting the Model

FLORAL, like other methods that have an optimization step, has two "best" solutions for $\lambda$ available: one minimizing the mean squared error ($\lambda_\min$), and one maximizing the value of $\lambda$ withing 1 standard error of the minimum mean squared error ($\lambda_{\text{1se}}$). These are referred to as the `min` and `1se` solutions, respectively.

We can see the mean squared error (MSE) and the coefficients vs log($\lambda$) as follows:

```{r plots,fig.height=4,fig.width=10,dpi=300}
fit$pmse + fit$pcoef
```

In both plots, the vertical dashed line and dotted line represent $\lambda_\min$ and $\lambda_{\text{1se}}$, respectively. In the MSE plot, the bands represent plus minus one standard error of the MSE. In the coefficient plot, the colored lines represent individual taxa, where taxa with non-zero values at $\lambda_\min$ and $\lambda_{\text{1se}}$ are selected as predictive of the outcome. 

To view specific names of the selected taxa, please see `fit$selected.feature$min` or `fit$selected.feature$1se` vectors. To view all coefficient estimates, please see `fit$best.beta$min` or `fit$best.beta$1se`. Without looking into ratios, one can crudely interpret positive or negative association between a taxon and the outcome by the positive or negative sign of the coefficient estimates. However, we recommend referring to the two-step procedure discussed below for a more rigorous interpretation based on ratios, which is derived from the log-ratio model assumption. 

```{r viewTaxa}

head(fit$selected.feature$min)

head(sort(fit$best.beta$min))

```

## The Two-step Procedure

In the previous section, we checked the lasso estimates without identifying specific ratios that are predictive of the outcome (CRC in this case). By default, `FLORAL` performs a two-step selection procedure to use `glmnet` and `step` regression to further identify taxa pairs which form predictive log-ratios. To view those pairs, use `fit$step2.ratios$min` or `fit$step2.ratios$1se` for names of ratios and `fit$step2.ratios$min.idx` or `fit$step2.ratios$1se.idx` for the pairs of indices in the original input count matrix `x`. Note that one taxon can occur in multiple ratios.

```{r view2step}

head(fit$step2.ratios$`1se`)

fit$step2.ratios$`1se.idx`

```

To further interpret the positive or negative associations between the outcome, please refer to the output `step` regression tables, where the effect sizes of the ratios can be found. 

While the corresponding p-values are also available, we recommend only using the p-values as a criterion to rank the strength of the association. We do not recommend directly reporting the p-values for inference, because these p-values were obtained after running the first step lasso model without rigorous post-selective inference. However, it is still valid to claim these selected log-ratios are predictive of the outcome, as demonstrated by the improved 10-fold cross-validated prediction errors.

```{r viewTable}

fit$step2.tables$`1se`

```

## Generating taxa selection probabilities

It is encouraged to run k-fold cross-validation for several times to account for the random fold splits. `FLORAL` provides `mcv.FLORAL` functions to repeat cross-validations for `mcv` times and on `ncore` cores. The output summarizes taxa selection probabilities, average coefficients based on $\lambda_\min$ and $\lambda_{\text{1se}}$. Interpretable plots can be created if `plot = TRUE` is specified.

```{r mcv}

mcv.fit <- mcv.FLORAL(mcv=2,
                      ncore=1,
                      x = x, 
                      y = y, 
                      family = "binomial", 
                      ncv = 3,
                      progress=TRUE)

```

```{r mcvplots,fig.height=6,fig.width=10,dpi=300}

mcv.fit$p_min

#Other options are also available
#mcv.fit$p_min_ratio
#mcv.fit$p_1se
#mcv.fit$p_1se_ratio

```

## Elastic net

Beyond lasso model, `FLORAL` also supports elastic net models by specifying the tuning parameter `a` between 0 and 1. Lasso penalty will be used when `a=1` while ridge penalty will be used when `a=0`. 

The `a.FLORAL` function can help investigate the prediction performance for different choices of `a` and return a plot of the corresponding prediction metric trajectories against the choice of $\lambda$.

```{r a.floral,out.width = '50%',fig.height=4,fig.width=4,dpi=300}

a.fit <- a.FLORAL(a = c(0.1,1),
                  ncore = 1,
                  x = x, 
                  y = y, 
                  family = "binomial", 
                  ncv = 3,
                  progress=TRUE)

a.fit

```