---
title: "BinMat: Processes Binary Data Obtained from Fragment Analysis"
author: "Clarke van Steenderen"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{BinMat}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
![](logo.png)
*Department of Zoology and Entomology*
*The Centre for Biological Control* (https://www.ru.ac.za/centreforbiologicalcontrol/)
*Rhodes University, Grahamstown, Eastern Cape, South Africa*
*e-mail:* vsteenderen@gmail.com
The idea behind this package was taken from my M.Sc. project
"*A genetic analysis of the species and intraspecific lineages of* Dactylopius *Costa (Hemiptera:Dactylopiidae)*"
See the publication in the [Biological Control journal](https://www.sciencedirect.com/science/article/pii/S1049964420306538), and the thesis on [Research Gate](https://www.researchgate.net/publication/339124038_A_genetic_analysis_of_the_species_and_intraspecific_lineages_of_Dactylopius_Costa_Hemiptera_Dactylopiidae). The GUI version of this package is available on the [R Shiny online server](https://clarkevansteenderen.shinyapps.io/BINMAT/), or it is accessible via GitHub by typing:
```{}
shiny::runGitHub("BinMat", "clarkevansteenderen")
```
into the console in R.
---
## **OVERVIEW**
Processing and visualising trends in the binary data obtained from fragment analysis methods in molecular biology can be a time-consuming and cumbersome process, and typically entails complex workflows.
The BinMat package automates the analysis pipeline on one platform, and was written to process binary matrices derived from dominant marker genetic analyses.
The program consolidates replicate sample pairs in a dataset into consensus reads, produces summary statistics (peaks and error rates), and allows the user to visualise their data as non-metric multidimensional scaling (nMDS) plots and hierarchical clustering trees.
---
## **UPLOADING DATA**
**Data type 1**
Binary data containing replicate pairs that need be consolidated should be in the following example format, uploaded as CSV file (use the read.csv() function):
| | Locus 1 | Locus 2 | Locus 3 | Locus 4 | Locus 5 |
|---------------------- |:------------: | :------------: |:------------: |:------------: |:------------: |
| Sample A replicate 1 | 0 | 0 | 1 | 1 | 1 |
| Sample A replicate 2 | 0 | 0 | 1 | 1 | 1 |
| Sample B replicate 1 | 1 | 1 | 0 | 0 | 0 |
| Sample B replicate 2 | 0 | 1 | 0 | 0 | 1 |
Note that replicate pairs must be directly underneath each other, and that each sample needs to have a unique name.
The following conditions are applied to binary data replicates:
1. A 0 and a 1 produces a "?"
2. A 0 and a 0 produces a "0"
3. A 1 and a 1 produces a "1"
The consolidated output for the above example would thus be:
| | Locus 1 | Locus 2 | Locus 3 | Locus 4 | Locus 5 |
|---------------------- |:------------: | :------------: |:------------: |:------------: |:------------: |
| Sample A replicate 1 & 2 | 0 | 0 | 1 | 1 | 1 |
| Sample B replicate 1 & 2 | ? | 1 | 0 | 0 | ? |
---
**Data type 2**
Binary data that has already been consolidated must have grouping information in the second column, as shown in the example below. This should also be in CSV format. This matrix can be used to create an nMDS plot coloured by group.
| | Group | Locus 1 | Locus 2 | Locus 3 | Locus 4 | Locus 5 |
|-----------|:----------: |:------------: | :------------: |:------------: |:------------: |:------------: |
| Sample A | Africa | ? | 0 | 1 | 1 | 1 |
| Sample B | Asia | 0 | 0 | 1 | 1 | ? |
| Sample C | Europe | 1 | ? | 0 | 0 | 0 |
| Sample D | USA | ? | 1 | 0 | ? | 1 |
---
## **FUNCTIONS**
| Function | Details |
|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| check.data() | Checks for unwanted values (other than 1, 0, and ?). |
| consolidate() | Reads in a binary matrix comprising replicate pairs and consolidates each pair into a consensus read. For each replicate pair at each locus, 1 & 1 -> 1 (shared presence), 0 & 0 -> 0 (shared absence), 0 & 1 -> ? (ambiguity). |
| errors() | Calculates the Jaccard and Euclidean error rates for the dataset. Jaccard's error does not take shared absences of bands as being biologically meaningful. JE = (f10 + f01)/(f10 + f01 + f11) and EE = (f10 + f01)/(f10 + f01 + f11 + f00). At each locus, f01 and f10 indicates a case where a 0 was present in one replicate, and a 1 in the other. f11 indicates the shared presence of a band in both replicates, and f00 indicates a shared absence. |
| group.names() | Returns group names in the uploaded consolidated binary data. This will help in knowing which colours are assigned to which group name. |
| nmds() | Creates an nMDS plot from a consolidated binary matrix with grouping information. Colours of plotted points need to be specified. |
| peak.remove() | Removes samples with a peak number less than a specified value. |
| peaks.consolidated() | Returns total, maximum, and minimum number of peaks in a consolidated binary matrix. |
| peaks.original() | Returns total, maximum, and minimum number of peaks in the binary matrix comprising replicates. |
| scree() | Creates a scree plot for the nMDS. This indicates the optimum number of dimensions to use to minimise the stress value. The stress value is indicated by a red dotted line at 0.15. Values equal to or below this are considered acceptable. |
| shepard() | Creates a Shepard plot for the nMDS. This indicates the 'goodness of fit' of the original distance matrix vs the ordination representation. A high R-squared value is favourable. |
| upgma() | Creates a UPGMA hierarchical clustering tree, with a specified number of bootstrap repetitions. |
## **METHODS**
**Euclidean error = (f01 + f10) / (f01 + f10 + f00 + f11)**
The Euclidean error rate includes the shared absence of a band (f00).
**Jaccard error = (f01 + f10) / (f01 + f10 + f11)**
The Jaccard error rate does not take shared absences of bands into account. The error rate will thus be inflated compared to the Euclidean.
In the formulae above, 'f' refers to the frequency of each combination (i.e. 'f01 + f10' means the sum of all the occurences of a zero and a one, and a one and a zero).
An error value is calculated for each replicate pair, and an average obtained representing the whole dataset.
The error rates for the example samples below would be:
**Sample X rep 1: 0 1 1 0**
**Sample X rep 2: 1 0 1 0**
Euclidean error = (1+1) / (1+1+1+1) = 2/4 = 0.5
Jaccard error = (1+1) / (1+1+1) = 2/3 = 0.67
**Sample Y rep 1: 1 1 1 0**
**Sample Y rep 2: 1 1 0 0**
Euclidean error = (1) / (1+1+2) = 1/4 = 0.25
Jaccard error = (1) / (1+2) = 1/3 = 0.33
---
Average Euclidean error = (0.5+0.25) / 2 = 0.38
Average Jaccard error = (0.67+0.33) / 2 = 0.5
---
## **WORKED EXAMPLE**
There are four sample data sets embedded in the package:
1. BinMatInput_reps
2. BinMatInput_ordination
3. bunias_orientalis
4. nymphaea
The first two are small made-up datasets, and illustrate how the BinMat functions are implemented:
```{r setup, fig.height = 5, fig.width = 5}
library(BinMat)
data1 = BinMatInput_reps
data2 = BinMatInput_ordination
# data1 contains all the replicate pairs that need to be consolidated into a consensus output
# data2 contains a consolidated binary matrix with grouping information in the second column
# Check the data for unwanted values
check.data(data1)
# Get information about peak numbers for all replicates
peaks.original(data1)
# Consolidate the replicate pairs in the matrix
cons = consolidate(data1)
# View the original matrix
data1
# View the consolidated output
cons
# Get the Jaccard and Euclidean error rates
errors(cons)
# Get information about the peak numbers in the consolidated matrix
peaks.consolidated(cons)
# Create a hierarchical clustering tree using the UPGMA method
clustTree = upgma(cons, size = 0.6)
# Find samples with peaks less than a specified threshold value, and return the new, filtered data set
filtered_data1 = peak.remove(cons, thresh = 6)
filtered_data1
# data2 contains an already-consolidated matrix with grouping information. This is used to create a scree, shepard, and an nMDS plot.
# # Find samples with peaks less than a specified threshold value, and return the new, filtered data set
filtered_data2 = peak.remove(data2, thresh = 7)
filtered_data2
# Get the names of the groups specified in the second column
group.names(data2)
# Create an object containing colours for each group
# Colours: Africa = red, Australia = blue, Europe = dark green
clrs = c("red", "blue", "darkgreen")
# Create a scree plot to check how the number of dimensions for an nMDS plot will affect the resulting stress values
scree(data2)
# Create a shepard plot showing the goodness of fit for the original data vs the ordination data
shepard(data2)
# Create an nMDS plot for the data. Default dimension is 2
nmds(data2, colours = clrs, labs = TRUE)
```
## REAL-WORLD WORKED EXAMPLE
The **bunias_orientalis** and **nymphaea** datasets are real-world AFLP and ISSR files, respectively, and have already been consolidated by BinMat. These produce nMDS plots, as shown below:
### *Bunias orientalis*
![](BuniasOrientalis.png)
```{r fig.height = 5, fig.width = 5}
bunias = bunias_orientalis
group.names(bunias)
nmds(bunias, labs = FALSE, include_ellipse = TRUE, legend_pos = "right")
```
### *Nymphaea*
![](nymphaea_lotus.png){width=50%}
```{r fig.height = 7, fig.width = 7}
nymph = nymphaea
group.names(nymph)
colrs = c("dodgerblue", "black", "red", "green3", "orange", "darkblue", "gold2", "darkgreen", "darkred", "grey", "darkgrey", "magenta", "darkorchid", "purple", "brown", "coral3", "turquoise", "deeppink", "lawngreen", "deepskyblue", "tomato")
nmds(nymph, labs = FALSE, include_ellipse = FALSE, colours = colrs, legend_pos = "right", pt_size = 2)
```