---
title: "Revse Ecology Analysis on Microbiome"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Revse Ecology Analysis on Microbiome}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
library(knitr)
library(RevEcoR)
opts_chunk$set(fig.width=8, fig.height=5)
set.seed(60823316) 
```
  
  
## RevEcoR -- Reverse ecology linking organism(s) and their environments
  
###### Yang Cao  <yiluheihei@gmail.com>, Fei Li <pittacus@gmail.com>
  
###### BeiJing Institute of Radiation -- Biotechnology Laboratory
  
### Table of Contens
  
* [Background](#background)
  
* [Introduction](#intro)
  
* [Installation](#install)
  
* [Downloding the metabolic data](#metadata)
  
* [Reconstruction of organism metabolic network](#reconstruct)
  
* [Identify seed set of  metabolic network](#seedset)
  
* [Predict species interactions](#cooperation)
  
* [Comparing predicted interactions and co-occurrences](#microbiome)
  
* [sessionInfo](#session)
  
* [references](#ref)
  
<a id="background"></a>

### Background
  
Ecology is the scientific With the rapid and inexpensive next-generation 
sequencing technologies The structure of complex biological systems reflects 
not only their function but also into the habitates in which they envoled and 
are adapted to. Reverse Ecology- an emerging new frontier in Evolutionary 
Sysmtems biology at the interface of computational biology, genomics and 
environmental science, which uses population genomics to study ecology with no
a priori assumptions about the organism(s) under consideration. This term was 
suggested by [Li *et al.*](#reversecology) during a conference on ecological 
genomics in Christchurch, New Zealand. It facilitates the translation of high 
through genomic data into large scale ecological data, and utilizes 
system-based method to obtain novel insights of pooly characterized 
microorganisms and relationships between microorgasnims or their environments 
in a superorganism.Traditional approach, however, can only applied to a small 
scale and for ralatively well-studied systems.
  
<a id="intro"></a> 

### Introduction
  
This manual is a brief introduction to structure, funcitons and usage of 
**RevEcoR** package. The **RevEcoR** package implements the reverse ecology 
framework. It can be used for reconstruction of metabolic environmnets using 
a cross-species analysis, idenditifying the set of compounds and predicting the 
species interactions on a large scale with a graph-based algorithm mentioned in
[Borenstein *et al.*](#seedset). The reverse ecology framework which takes 
advances of system biology and genomic metabolic modeling, aims to predict the 
ecological traits of poorly studied microorganisms, their interactions with 
other microorganisms, and the ecology of microbial communities from system-level
analysis of complex biological networks. 

Two softwares ([NetSeed](http://elbo.gs.washington.edu/software_netseed.html) 
and [NetCooperate](http://elbo.gs.washington.edu/software_netcooperate.html)), 
have developed by **Borenstein's** lab for studying the ecology of microorganisms.
Howevever, neither of them supports metabolic network reconstruction of species,
and both are limited in small scale analysis (two species). The **RevEcoR**
package provides an interface to microbiome reverse ecology analysis on a large
scale. The main features of this package consists of several steps:
  
1. Downding metabolic networks data from KEGG database: all species metabolic 
network data could be downloaded from the KEGG PATHWAY datase with the KEGG 
REST API (application programming interface).
  
2. Reconstruction metabolic networks of all species: a directed graph whose 
nodes represent compounds and whose edges represent reactions linking 
substrates to products.
  
3. Identify seed set of a specific organism: each metabolic network was 
decomposed into its strongly conneted components (SCC) using [Kasaraju's 
algorithm](#Kasaraju). The SCC forms a directed graph whose nodes are the 
components and whose edges are the orginal edges in the graph that connect 
nodes in two different components. Then, detectting the seed set with this 
SCCs.
  
4. Host-microbe and microbe-microbe cooperation analysis
  
<a id="install"></a>

### Installation


  
**RevEcoR** is free available on CRAN. You can install the latest released 
version from CRAN as following:

```{r,eval=FALSE} 
install.packages("RevEcoR") 
```

or the latest development version from github. To install packages from GitHub,
you first need install the **devtools** package on your system with 
`install.packages("devtools")`. Note that devtools sometimes needs some 
extra non-R software on your system -- more specifically, an Rtools download for
Windows or Xcode for OS X. There's more information about devtools
[here](https://github.com/hadley/devtools).
  
```{r,eval=FALSE} 
if (!require(devtools) 
  install.packages("devtools") 
devtools::install_github("yiluheihei/RevEcoR") 
```
  

  
After installation, you can load **RevEcoR** into current workspace by typing 
or pasting the following codes:
  
```{r eval=TRUE} 
library(RevEcoR) 
```
  
<a id="metadata"></a>

### Downloding the metabolic data of a specific organism

Both the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Integrated Microbial
Genomes database (IMG) [7] collect complete high-quality genome sequences and
metagenome sequences with a comprehensive set of publicly available bacterial,
archaeal, eukaryotic, and phage genomes, as well as engineered, environmental
and host associated metagenome samples. All the sequences and functional
annotation profile can be download directly. 

The KEGG database provides a REST-style API compared with IMG database. 
This package provides function `getOrgMetabolicData` to download the 
specific organism metabolic data from KEGG database, and return a list where each 
element consists of three elements: reaction, substrate and product. Data
in IMG, we can only download it manually. In addition, users can annotate their
private genomic data with KO terms on IMG systems or in their local machine to
obtain the annotation profile. 
  
```{r eval=FALSE} 
## download sample metabolic data from remote KEGG database 
buc <- getOrgMetabolicData("buc") 
data(kegg_buc) 
head(buc) 
```

<a id="reconstruct"></a> 
### Reconstruction of organism metabolic network
  
The crucial premise of Reverse Ecology is that genomic information can be 
converted into ecological information. Metabolism linking  miroorgasnim  with 
the biochemical environmnet acrossing compound exchanges: import of exogenous 
compounds and impact the composition of its environment via secretion of other
compounds. Thus, metabolic network is the reflection of interation between 
organisms and their environment.
  
Graph-based representation of metabolic reactions where nodes represent 
compounds and edges represent reactions is a  common tool in analyzing and 
studing metabolic networks. A directed edge from node a to b indicates that 
compound a is a substrate in some reaction that produces compound b.
  
Once the metabolic data is obtained, `reconstructGsMN` could be used to
reconstruction the metabolic network of a specific organism.
  
  
```{r eval=TRUE, htmlcap="Figure 1 Reconstruction metabolic network of *Buchnera aphidicola APS*", fig.lp="Figure 1", fig.width=8, fig.height=8} 
## species in KEGG 
buc.net <- reconstructGsMN(kegg_buc, RefData = NULL) 
igraph::print.igraph(buc.net) 
igraph::plot.igraph(buc.net, vertex.label=NA, vertex.size=5, edge.arrow.size=0.1)
  
## ko annotation profile species detected in a human microbiome in IMG (not in KEGG) 
annodir <- system.file("extdata/koanno.tab",package = "RevEcoR") 
metabolic.data <- read.delim(annodir,stringsAsFactors=FALSE) 
##load the reference metabolic data 
data(RefDbcache) 
g2 <- reconstructGsMN(metabolic.data, RefData = RefDbcache) 
```

 
<a id="seedset"></a>

### Identify seed set of  metabolic network
  
As the interactions with the environment was reflected in the metabolic 
networks, these networks could be used not only to infer metabolic function 
but alse to obtain insights into the growth environments in which the species 
evolved.
  
Apparantly, organisms can survive in a wide range of enviromnets and may 
activate only a subset of the pathways in the network of each environment, 
using a different set of exogenously acquired compounds (termed seed set). The
seed set of the network is defined as the minimal set of compounds in the 
network that allows the synthesis of all other compouds, and can serve as a 
good proxy for its environment and can be conceived as the essential and 
effective biochemical environment.
  
`getSeedSets` was used to detect the seed set of a metabolic network which 
returns a seedset-class object. Futhermore, some methods was supported for 
seedset-class, e.g length, nonseeds. For more details on seedset-class, see 
`?seedset-class`.It can help us to get the compound that organisms are 
exogenously acquired compouds from the environment, representing the 
organism's nutritional profile. This algorithm is based on a fast method 
Kasaraju algorithm for SCC decomposition which is implementaed in `Kasaraju`. 
For more details, see `?getSeedSets` and `?KasarajuSCC`.
  
```{r eval=TRUE, htmlcap="Figure 2The node colored with red represents the species' seed set",fig.lp="Figure 2", fig.width=8, fig.height=8}
## seed set prediction
seed.set <- getSeedSets(buc.net, 0.2) 
show(seed.set) 
head(seed.set@seeds)
## The node colored with red represents the species' seed set
nodes  <- igraph::V(buc.net)$name
seeds  <- unlist(seed.set@seeds)
seed.index  <- match(seeds,nodes)
node.color <- rep("SkyBlue2",length(nodes))
node.color[seed.index]  <- "red"
igraph::plot.igraph(buc.net, 
          vertex.label=NA, vertex.size=5, edge.arrow.size=0.1,
          vertex.color = node.color)
```


<a id="cooperation"></a> 

### Predict species interactions
  
The topology of metabolic networks can provide insight not only into the 
metabolic process that accur within each species, but also into interactions 
between different species. Here we provides `calculateCoopreationIndex` using 
three cooperation index: [competition index, coplementarity index](#rule) measure the microbe-microbe co-occurrence pattern. More details, see 
`?calculateCoopreationIndex`.
  
```{r} 
# ptr metabolic network 
data(kegg_ptr) 
##ptr.net <- reconstructGsMN(getOrgMetabolicData("ptr")) 
ptr.net <- reconstructGsMN(kegg_ptr) 
# cooperation analysis between buc and ptr 
cooperation.index <- calculateCooperationIndex(buc.net,ptr.net) 
cooperation.index 
```

Blow we applied **RevEcoR** to predict interactions among several human oral microbiota species whose co-occurrence patterns have previously been described. Our seven sample oral species metabolic dataset was downloaded from the IMG server; it consists of the following [Genomes Online Database (GOD)](http://www.genomesonline.org) IMG identification numbers: Gc0016386, Gi07614, Gi00264, Gc00809, Gc00643, Gi03876, and Gi07289. The interactions among the seven species is calculated as follwing:

The metabolic networks of seven species was reconstructed first:

```{r, eval = FALSE, echo=TRUE}
##metabolic network reconstruction of these seven species
net  <- lapply(anno.species, reconstructGsMN)
```

Then `caculateCooperationIndex` is used to calculated the species interactions among the seven species. 

```{r, eval=FALSE, echo=TRUE}
## caculate interactions among vious species
interactions  <- calculateCooperationIndex(net, p = TRUE)
## competition index
$competition.index
          Aa        Ao        Fn        Pg        Sg        So        Va
Aa 1.0000000 0.4736842 0.3157895 0.2280702 0.4210526 0.4385965 0.2456140
Ao 0.4736842 1.0000000 0.3684211 0.3333333 0.4736842 0.4736842 0.2456140
Fn 0.5000000 0.5833333 1.0000000 0.4166667 0.5833333 0.5555556 0.4166667
Pg 0.4193548 0.6129032 0.4838710 1.0000000 0.6129032 0.5161290 0.3870968
Sg 0.5454545 0.6136364 0.4772727 0.4318182 1.0000000 0.9090909 0.3863636
So 0.5813953 0.6046512 0.4651163 0.3720930 0.9302326 1.0000000 0.3953488
Va 0.4827586 0.4827586 0.5172414 0.4137931 0.5862069 0.5862069 1.0000000
## p value of competition index
$competition.index.p
Aa    Ao    Fn    Pg    Sg    So    Va
Aa 0.000 0.001 0.001 0.001 0.001 0.001 0.001
Ao 0.001 0.000 0.001 0.001 0.001 0.001 0.001
Fn 0.001 0.001 0.000 0.001 0.001 0.001 0.001
Pg 0.001 0.001 0.001 0.000 0.001 0.001 0.001
Sg 0.001 0.001 0.001 0.001 0.000 0.001 0.001
So 0.001 0.001 0.001 0.001 0.001 0.000 0.001
Va 0.001 0.001 0.001 0.001 0.001 0.001 0.000
## complementarity index
$complementarity.index
 Aa        Ao        Fn         Pg        Sg         So        Va
Aa 0.0000000 0.1052632 0.1228070 0.07017544 0.0877193 0.08771930 0.1228070
Ao 0.1403509 0.0000000 0.1403509 0.07017544 0.1228070 0.12280702 0.1403509
Fn 0.1944444 0.1666667 0.0000000 0.16666667 0.1111111 0.11111111 0.1388889
Pg 0.2258065 0.2258065 0.1612903 0.00000000 0.1612903 0.19354839 0.2258065
Sg 0.2272727 0.1818182 0.1590909 0.09090909 0.0000000 0.04545455 0.1590909
So 0.1860465 0.1395349 0.1860465 0.09302326 0.0000000 0.00000000 0.1395349
Va 0.2068966 0.1724138 0.1379310 0.17241379 0.1379310 0.13793103 0.0000000
## p value of complementarity index
$complementarity.index.p
 Aa    Ao    Fn    Pg    Sg    So    Va
Aa 0.000 0.001 0.001 0.001 0.001 0.001 0.001
Ao 0.001 0.000 0.001 0.001 0.001 0.001 0.001
Fn 0.001 0.001 0.000 0.001 0.001 0.001 0.001
Pg 0.001 0.001 0.001 0.000 0.001 0.001 0.001
Sg 0.001 0.001 0.001 0.001 0.000 0.001 0.001
So 0.001 0.001 0.001 0.001 0.001 0.000 0.001
Va 0.001 0.001 0.001 0.001 0.001 0.001 0.000
```

We found that Streptococcus oralis (So) and Streptococcus gordonii (Sg) had the lowest complementarity index (0.04 and 0.00, respectively) and the highest competition index (0.91 and 0.93, respectively) among all pairs. Appanrently, the p value of all interactions are biologcial/statistical significance. This indicates that these two species are antagonistic, which is consistent with previous findings.


<a id="microbiome"></a> 

### Comparing predicted interactions and co-occurrences
  
To further evaluate the predicted interactions of RevEcoR on a large scale, we
integrated functions mentioned above to investigate species interactions in 
the gut microbiome. We focused on a list of 116 prevalent gut species, whose 
genome sequence is available in IMG database and sequence coverage is more 
than 1% in at least one metagenomic sample of [124 individuals](#gut). Genome 
annotation profiles of this 116 species was collected from IMG database and was 
used to calculated the interactions (competition and complementarity index) for 
all pairs of species.

#### 1. Predicting species interactions in gut microbiome
  
For each species, we download the list of genes mapped to the Kyoto 
Encyclopedia of Genes and Genomes orthologous groups (KOs) was downloaded with
a in-house R script. This data, which was used to reconstruct the metabolic 
network of each species, have been saved as *gut_microbiome.rda* in 
subdirectory *data* of RevEcoR. We can load it as the following code:
  
```{r} 
data(gut_microbiome) 
## summary(gut_microbiome) 
```

Then, it was used to reconstuct the metabolic network, predict the seed set, 
and finally predict the pairs of interactions between different species:
  
```{r, eval = FALSE, echo = TRUE}
gut.nets <- lapply(gut_microbiome,reconstructGsMN)
seed.sets <- lapply(gut.nets,getSeedSets)
## Since calculation is on large scale, species interactions prediction may take several hours
gut.interactions <- calculateCooperationIndex(gut.nets)
competition.index <- gut.interactions$competition.index
complementarity.index <- gut.interactions$complementarity.index
```

#### 2. Obtaining co-occurrence scores in gut microbiome

Specially, it will help us to predict whether species compete with one another 
tend to co-occur or to exclude by comparing the species interactions and 
co-occurrences. We obtained co-occurrence scores directly from 
[*Levy R's* research](#rule), and saved it as *occurence.tab* in 
subdirectory *inst/extdata* of RevEcoR . Co-occurrence score was calculated based on 
species abundances across all samples and measured by the Jaccard similarity
index.
 
Load the co-occurrence data:
 
```{r, eval = TRUE, echo = TRUE}
occurrence.score <- read.delim(system.file("extdata/occurrence.tab",
  package = "RevEcoR"),stringsAsFactors = FALSE, quote = "")
```
 
#### 3. Comparing the interactions and co-occurrences
 
The Jaccard index measures similarity between finite sample sets, and is defined
as the size of the intersection divided by the size of the union of the sample 
sets. Thus, co-occurrence scores are generally symmetric whereas competition 
index or complementarity index are not according to their definitions. 

A symmetric version of two interaction index was gnerated by replacing each 
element of the interaction indices with the mean of each value and that 
transpose value.

```{r, eval=FALSE,echo=TRUE}
competition.index <- (competition.index + t(competition.index))/2
complementarity.index <- (complementarity.index + t(complementarity.index))/2
```
Subsequently, the Spearman correlation between the co-occurrence scores and the
two interaction indices was calculated. A permutation-based Mantel test, which 
is commonly used in ecology, was used to determin the significance of this 
correlation.
 
```{r, eval=FALSE,echo=TRUE}
## upper triangles, which is used to calculate the correlation
competition.upper <- competition.index[upper.tri(competition.index)]
occurrence.upper <- occurrence.score[upper.tri(occurrence.score)]
complementarity.upper <- complementarity.index[upper.tri(complementarity.index)]

## calculate the spearman correlation betwwen co-occurrence scores and two 
## interactions indices
competition.cor <- cor(competition.upper,occurrence.upper,method="spearman")
complementarity.cor <- cor(complementarity.upper,occurrence.upper,method="spearman")

## permutation-based mantel test. Random permutation the co-occurance score 
## 10000 times, P value is the fraction of correlations as high as or higher 
## than the original
if (require(magrittr)){
  null.stat <- replicate(10000,
    sample(1:116) %>% occurrence.score[.,.] %>%
      .[upper.tri(.)]
  )
  competition.null <- cor(competition.upper,null.stat)
  complementarity.null <- cor(complementarity.upper,null.stat)
  length(which(competition.null >= competition.cor)) ## 0 p.competition < 0.00001
  length(which(complementarity.null <= complementarity.cor)) ## 0 p.complementarity< 0.00001
}
```
 
We found that competition index is significant positively correlated with co-occurrence (cor = 0.261, P < 10-4, Mantel correlation test), whereas the complementarity index is significant negatively correlated with co-occurrence (cor = -0.259, P < 10-4, Mantel correlation test). This suggests that competition is liked to be the key factor
to promote the assembly of gut microorganisms


<a id="session"></a> 

### sessionInfo
  
The version number of R and packages loaded for generating the vignette were:
  
```{r, eval=TRUE} 
sessionInfo() 
``` 

<a id="ref"></a>

### references 
  
* <a name="reversecology">Li Y F, Costello J C, Holloway A K, et al. "Reverse 
ecology" and the power of population genomics[J]. Evolution, 2008, 62(12): 
2984-2994.</a>
  
* <a name="seedset">Borenstein E, Kupiec M, Feldman M W, et al. Large-scale 
reconstruction and phylogenetic analysis of metabolic environments[J]. 
Proceedings of the National Academy of Sciences, 2008, 105(38): 
14482-14487.</a>
  
* <a name="Kasaraju">Tarjan R. Depth-first search and linear graph 
algorithms[J]. SIAM journal on computing, 1972, 1(2): 146-160.</a>
  
* <a name="rule">Levy R, Borenstein E. Metabolic modeling of species 
interaction in the human microbiome elucidates community-level assembly 
rules[J]. Proceedings of the National Academy of Sciences, 2013, 110(31): 
12804-12809.</a>
  
* <a name="bsi">Borenstein E, Feldman M W. Topological signatures of species 
interactions in metabolic networks[J]. Journal of Computational Biology, 2009,
16(2): 191-200.</a>

* <a name="gut">Qin J, Li R, Raes J, et al. A human gut microbial gene catalogue
established by metagenomic sequencing[J]. Nature, 2010, 464(7285): 59-65.</a>