4 - Landscape and genetic graph comparison with graph4lg

UMR ThéMA - UBFC-CNRS, UMR Biogéosciences - UBFC-CNRS, ARP-Astrance

Paul SAVARY

Introduction

The rationale of graph4lg package in R is to make easier the construction and analysis of genetic and landscape graphs in landscape genetic studies (hence the name graph4lg, meaning Graphs for Landscape Genetics). This package provides users with tools for:

Each one of the included tutorials focuses on one of these points. This fourth and last tutorial will then focus on landscape and genetic graph comparison. It will describe the package functions allowing users to:

Data used in this tutorial

The package already includes genetic and spatial simulated data sets allowing users to discover its different functionalities. The first data set (data_simul) was simulated with CDPOP (Landguth and Cushman 2010) on a simulated landscape. It consists of 1500 individuals from 50 populations genotyped at 20 microsatellite loci. Individuals dispersed less when the cost-distance between populations was large. A landscape graph was created with Graphab (Foltête, Clauzel, and Vuidel 2012) whose nodes were the 50 simulated populations and the links were weighted by cost-distance values between populations. The project created with Graphab was included into the package such that the landscape graphs and the cost-distance matrix can be easily imported into the R environment.

The second data set (data_ex) was simulated as the first one but included only 10 populations. It is used to generate quick examples.

Here, we also rely on a data set created only for the vignettes (data_tuto) and containing several objects:

We will also use Graphab projects already created for the vignettes.

In landscape genetics, the analysis of the link between landscape and genetic data can be performed at several scales. Indeed, node-, link-, neighbourhood- and boundary-based analyses are distinguished (Wagner and Fortin 2013). Similarly, we can compare landscape and genetic graphs at these different scales, in particular when they share the same nodes. We present how to implement these comparisons using graph4lg. To see how to create landscape and genetic graphs with graph4lg, we invite users to read the second and third tutorials included in this package.

Comparing node metrics

First, landscape and genetic graphs can be compared by comparing connectivity metrics measured at the level of a habitat patch (landscape graph node) with the genetic response of the population living and sampled in this habitat patch (genetic graph node) in terms of genetic diversity and differentiation from the other populations. When using this approach, two graphs are similar if the correlation coefficient between the metric values calculated for the same nodes is high. This correlation would be interpreted as an evidence for the influence of habitat connectivity on population genetic structure.

The function graph_node_compar computes these correlations to compare two graphs at the node level. It takes as arguments:

In order to include node attributes in the correlation, users can use compute them and associate them with graph nodes with add_nodes_attr function (described in tutorial 2).

In the following example, we will import a landscape graph made of the habitat patches occupied by the 50 populations used in a gene flow simulation with CDPOP (pts_pop_simul and data_simul_genind data set). A genetic graph will be created from the simulated genetic data set. We will then compare two metrics computed at the node-level in these graphs.

land_graph is the landscape graph made of the forest habitat patches occupied by the 50 populations. Its 1225 links are weighted by cost-distances between patches (mat_ld). It is a complete graph.

land_graph <- gen_graph_topo(mat_w = mat_ld,
                             mat_topo = mat_ld,
                             topo = "comp")

# Plot the histogram of its link weights
plot_w_hist(graph = land_graph)

Using the function plot_w_hist, we see that its link weights almost follow a normal distribution with values ranging from 0 to 10.000 cost-distance units.

We will compute the mean of the inverse weight of every link connected to its nodes. This metric miw is akin to the Flux metric computed in Graphab. It takes high values if a patch is well connected and close to other patches. We use the function compute_node_metric

miw_lg <- compute_node_metric(graph = land_graph, metrics = "miw")
head(miw_lg)

We include the values of this metric as a graph node attribute with the function add_nodes_attr:

land_graph <- add_nodes_attr(graph = land_graph,
                             input = "df",
                             data = miw_lg,
                             index = "ID")

We now create a genetic graph from the simulated data set. To that purpose, we first compute a genetic distance matrix using the DPS genetic distance.

mat_dps <- mat_gen_dist(x = data_simul_genind, dist = "DPS")

We use this distance matrix to create a complete genetic graph with gen_graph_topo:

gen_comp_graph <- gen_graph_topo(mat_w = mat_dps,
                                 mat_topo = mat_dps,
                                 topo = "comp")

We plot the distribution of the link weights.

plot_w_hist(graph = gen_comp_graph, 
            fill = "darkblue")

We will compute the mean of the inverse weights of links connected to each node in the genetic graph gen_comp_graph and include it as a node attribute.

miw_comp <- compute_node_metric(graph = gen_comp_graph, metrics = "miw")
gen_comp_graph <- add_nodes_attr(graph = gen_comp_graph,
                                 input = "df",
                                 data = miw_comp,
                                 index = "ID")

We can now assess the correlation between these two metrics using graph_node_compar:

graph_node_compar(x = land_graph, y = gen_comp_graph,
                  metrics = c("miw", "miw"), method = "spearman",
                  weight = TRUE, test = TRUE)
#> [[1]]
#> [1] "Metric from graph x: miw"
#> 
#> [[2]]
#> [1] "Metric from graph y: miw"
#> 
#> [[3]]
#> [1] "Method used: spearman's correlation coefficient"
#> 
#> [[4]]
#> [1] "Sample size: 50"
#> 
#> [[5]]
#> [1] "Correlation coefficient: 0.723793517406963"
#> 
#> [[6]]
#> [1] "p-value of the significance test: 2.86499806974383e-09"

The two metrics are highly correlated, meaning that if a habitat patch is well connected, the population occupying this patch is less different from the other populations from a genetic point of view than if the habitat patch is not connected well. This is quite an expected result given that gene flow was largely driven by cost-distances between populations during the genetic simulations.

Using complete graphs, visual representations are unclear because of link overlap. Building a Gabriel graph is a way to simplify graph topology. Gabriel graphs links will be computed from the Euclidean geographical distances between populations.


mat_geo <- mat_geo_dist(data = pts_pop_simul,
                        ID = "ID", x = "x", y = "y")
#> Coordinates were treated as projected coordinates. Check whether
#>               it is the case.
mat_geo <- reorder_mat(mat_geo, order = row.names(mat_dps))

gen_gab_graph <- gen_graph_topo(mat_w = mat_dps,
                                mat_topo = mat_geo,
                                topo = "gabriel")

# Associate the values of miw from the complete graph to this graph

gen_gab_graph <- add_nodes_attr(gen_gab_graph,
                                data = miw_comp,
                                index = "ID")

# Plot the graph with node sizes proportional to MIW

plot_graph_lg(graph = gen_gab_graph, 
              crds = pts_pop_simul,
              mode = "spatial",
              node_size = "miw",
              link_width = "inv_w")

Comparing graph modules

Finally, two graphs have similar topological and connectivity properties if their modules match, i.e. if two nodes classified together in the same module when the partition in modules is computed from one graph are also classified together in the same module when the partition is computed from the other graph.

The function graph_modul_compar compares the nodes partitions into modules. To do that, it computes the Adjusted Rand Index (ARI) (Hubert and Arabie 1985), a standardised index which counts the number of node pairs classified in the same module in both graphs. The function also performs the nodes partition into modules by using the modularity calculations available in igraph package. We can specify the algorithm used to compute the modularity, the link weighting, the number of modules, among others. It takes as arguments:

Two different weightings can be used to create the modules of the two graphs:

In the following example, we compare land_graph_thr and gen_gab_graph with the default parameters (algo='fast_greedy' algorithm, optimal number of modules, links weighted by inverse distances (node_inter="distance")).

graph_modul_compar(x = land_graph_thr, 
                   y = gen_gab_graph)
#> [1] "7 modules in graph 1 and 6 modules in graph 2"
#> [1] "Adjusted Rand Index: 0.648832401029973"
#> [1] 0.6488324

The ARI value is relatively high meaning that modules are somehow similar.

We can check this result visually. We first compute the modules in both graphs and include them as node attributes.

module_land <- compute_graph_modul(graph = land_graph_thr, 
                                   algo = "fast_greedy",
                                   node_inter = "distance")

land_graph_thr <- add_nodes_attr(graph = land_graph_thr,
                                 data = module_land,
                                 index = "ID")

module_gen <- compute_graph_modul(graph = gen_gab_graph, 
                                   algo = "fast_greedy",
                                   node_inter = "distance")

gen_gab_graph <- add_nodes_attr(graph = gen_gab_graph,
                                 data = module_gen,
                                 index = "ID")

We then plot both graphs, colouring their nodes depending on their module:

plot_graph_lg(graph = land_graph_thr,
              mode = "spatial",
              crds = pts_pop_simul,
              module = "module")

plot_graph_lg(graph = gen_gab_graph,
              mode = "spatial",
              crds = pts_pop_simul,
              module = "module")

We easily visualise divergence and convergence in both module partitions.

Conclusion

graph4lg makes available a large range of tools to perform landscape genetic analyses relying upon genetic and landscape graphs. We hope that next improvements will draw on results obtained using these tools in empirical as well as simulation studies.

References

Foltête, Jean-Christophe, Céline Clauzel, and Gilles Vuidel. 2012. “A Software Tool Dedicated to the Modelling of Landscape Networks.” Environmental Modelling & Software 38: 316–27.
Hubert, Lawrence, and Phipps Arabie. 1985. “Comparing Partitions.” Journal of Classification 2 (1): 193–218.
Landguth, Erin L, and SA Cushman. 2010. “CDPOP: A Spatially Explicit Cost Distance Population Genetics Program.” Molecular Ecology Resources 10 (1): 156–61.
Matthews, Brian W. 1975. “Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme.” Biochimica Et Biophysica Acta (BBA)-Protein Structure 405 (2): 442–51.
Wagner, Helene H, and Marie-Josée Fortin. 2013. “A Conceptual Framework for the Spatial Analysis of Landscape Genetic Data.” Conservation Genetics 14 (2): 253–61.