Bacterial colony identification with the Bruker MALDI Biotyper is a high-throughput method with the built-in tools, provided that the selected bacteria belong to the internal database.
Scientific projects where the number of unknown bacteria is expected to be high need reference-free methods to reduce the redundancy of isolated bacterial colonies, a process called dereplication.
Strejcek et al. (2018) proposed such a method: they process the spectra
and suggest similarity thresholds between spectra above which the spectra,
and therefore the measured bacterial colonies, can be considered identical
at a given taxonomic rank. Their processing procedure is implemented in
the {maldipickr} package and illustrated in the following vignette.
In addition, we provide functions to enable the dereplication of different batches of Bruker MALDI Biotyper runs and combine the results, in order to be able to delineate the clusters from a common similarity matrix.
More importantly, we provide a function to select a spectrum to be picked in each cluster, a process called cherry-picking, depending on external metadata and potential out-groups to be excluded from the current cherry-picking step.
Starting from the raw data imported from the Bruker MALDI Biotyper, the processing of the spectra is based on the original implementation and runs the following tasks:
The full procedure is illustrated in the example below. While in this
case, all the resulting processed spectra, peaks and final spectra
metadata are stored in-memory, the process_spectra()
function enables storing these files locally for scalable
high-throughput analyses.
# Get an example directory of six Bruker MALDI Biotyper spectra
directory_biotyper_spectra <- system.file(
"toy-species-spectra",
package = "maldipickr"
)
# Import the six spectra
spectra_list <- import_biotyper_spectra(directory_biotyper_spectra)
# Transform the spectra signals according to Strejcek et al. (2018)
processed <- process_spectra(spectra_list)
# Overview of the list architecture that is returned
# with the list of processed spectra, peaks identified and the
# metadata table
str(processed, max.level = 2)
#> List of 3
#> $ spectra :List of 6
#> ..$ species1_G2 :Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#> ..$ species2_E11:Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#> ..$ species2_E12:Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#> ..$ species3_F7 :Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#> ..$ species3_F8 :Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#> ..$ species3_F9 :Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#> $ peaks :List of 6
#> ..$ species1_G2 :Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#> ..$ species2_E11:Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#> ..$ species2_E12:Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#> ..$ species3_F7 :Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#> ..$ species3_F8 :Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#> ..$ species3_F9 :Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#> $ metadata: tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
# A detailed view of the metadata with the median signal-to-noise
# ratio (SNR) and the number of peaks
processed$metadata
#> # A tibble: 6 × 3
#> name SNR peaks
#> <chr> <dbl> <int>
#> 1 species1_G2 5.09 21
#> 2 species2_E11 5.54 22
#> 3 species2_E12 5.63 23
#> 4 species3_F7 4.89 26
#> 5 species3_F8 5.56 25
#> 6 species3_F9 5.40 25
During high-throughput analyses, multiple runs of the Bruker MALDI Biotyper are expected, resulting in several batches of spectra to be processed and compared. While the batches are processed independently, and could therefore be run in parallel, integrating them for comparison requires an additional step.
The merge_processed_spectra() function aggregates the processed spectra
and bins together the detected peaks, with a tolerance of 0.002 between
the average peak values in the bin (see MALDIquant::binPeaks), which
translates to a tolerance of 2000 ppm. This binning step results in
an n×p feature matrix (or intensity matrix), with n rows for n processed
spectra (peak-less spectra are discarded) and p columns for the p peak
masses.
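The relation between the relative tolerance and its value in ppm can be checked with a few lines of base R. This is only a simplified sketch: the helper `within_tolerance()` and the peak masses are hypothetical, and the check (relative deviation from the bin mean) is a reduced illustration, not the full binning algorithm of MALDIquant::binPeaks.

```r
# Relative tolerance used for the binning and its equivalent in ppm
tolerance <- 0.002
tolerance_ppm <- tolerance * 1e6 # 0.002 * 1e6 = 2000 ppm

# Hypothetical helper (not part of {maldipickr} or MALDIquant):
# TRUE if all masses deviate from their mean by at most the
# relative tolerance
within_tolerance <- function(masses, tolerance = 0.002) {
  bin_mean <- mean(masses)
  all(abs(masses - bin_mean) / bin_mean <= tolerance)
}

# Hypothetical peak masses (m/z) detected in two spectra
within_tolerance(c(5000, 5008)) # deviation of 4 / 5004, about 800 ppm
within_tolerance(c(5000, 5030)) # deviation of 15 / 5015, about 2991 ppm
```

In this sketch, the first pair of peaks would fall in the same bin while the second pair would not.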
By default, as in the Strejcek et al. (2018) procedure, the intensity
values for spectra with missing peaks are interpolated from the
processed spectra signal. The current function enables the analyst to
decide whether to interpolate the values or leave missing peaks as NA,
which would then be converted to a null intensity value.
# Get an example directory of six Bruker MALDI Biotyper spectra
directory_biotyper_spectra <- system.file(
"toy-species-spectra",
package = "maldipickr"
)
# Import the six spectra
spectra_list <- import_biotyper_spectra(directory_biotyper_spectra)
# Transform the spectra signals according to Strejcek et al. (2018)
processed <- process_spectra(spectra_list)
# Merge the spectra to produce the feature matrix
fm <- merge_processed_spectra(list(processed))
# The feature matrix has 6 spectra as rows and
# 35 peaks as columns
dim(fm)
#> [1] 6 35
# Notice the difference when the interpolation is turned off
fm_no_interpolation <- merge_processed_spectra(
list(processed),
interpolate_missing = FALSE
)
sum(fm == 0) # 0
#> [1] 0
sum(fm_no_interpolation == 0) # 68
#> [1] 68
# Multiple runs can be aggregated using list()
# Merge the spectra to produce the feature matrix
fm_all <- merge_processed_spectra(list(processed, processed, processed))
# The feature matrix has 3×6=18 spectra as rows and
# 35 peaks as columns
dim(fm_all)
#> [1] 18 35
# If a named list is used, the names are dropped and are not
# propagated to the matrix (example not run):
# fm_all <- merge_processed_spectra(
#   list("A" = processed, "B" = processed, "C" = processed))
# any(grepl("A|B|C", rownames(fm_all))) # FALSE
Once all the batches of spectra have been processed together, we can
use a distance metric to evaluate how close the spectra are to one
another. Strejcek
et al. (2018) recommend the cosine metric to
compare the spectra and they use the fast implementation in the {coop}
package.
While we do not provide specific functions to generate the similarity
matrix, we illustrate below how it can be easily computed. Note that the
feature matrix from merge_processed_spectra() has spectra as rows and
peak values as columns. To get a similarity matrix between spectra,
either the feature matrix must be transposed or a dedicated row-wise
function must be used.
# A. Compute the similarity matrix on the transposed feature matrix
# using Pearson correlation coefficient
sim_matrix <- stats::cor(t(fm), method = "pearson")
# B.1 Install the coop package
# install.packages("coop")
# B.2 Compute the similarity matrix on the rows of the feature matrix
sim_matrix <- coop::tcosine(fm)
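For readers without the {coop} package, the row-wise cosine similarity can also be sketched in base R. The small feature matrix `fm_toy` below is hypothetical (three spectra, four peaks) and only serves to show that the computation operates on rows, so no transposition is needed.

```r
# Hypothetical feature matrix: 3 spectra (rows) x 4 peaks (columns)
fm_toy <- matrix(
  c(
    1, 0, 2, 1,
    2, 0, 4, 2,
    0, 3, 0, 1
  ),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), paste0("peak", 1:4))
)

# Euclidean norm of each spectrum (row)
row_norms <- sqrt(rowSums(fm_toy^2))

# Cosine similarity between rows: dot products divided by
# the product of the corresponding norms
cosine_sim <- tcrossprod(fm_toy) / outer(row_norms, row_norms)
round(cosine_sim, 3)
```

Here `s2` is a scaled copy of `s1`, so their cosine similarity is one, whereas `s3` shares only one peak with the others and gets a low similarity.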
Once the similarity matrix is computed between all pairs of the studied spectra, the next step is to delineate clusters of spectra in order to dereplicate the measured bacterial colonies, that is, to find which colonies are nearly identical.
The delineate_with_similarity() function is agnostic of the similarity
metric used, provided that its upper bound is one and that a numeric
threshold relevant to the metric is given. We recommend the cosine metric
or the Pearson product-moment correlation coefficient.
Hierarchical clustering will then group spectra in the same cluster only if the similarity between the spectra is above (or equal to) the provided threshold. The default and recommended method is the complete linkage, also known as the farthest neighbor, to ensure that the within-group minimum similarity of each cluster respects the threshold.
Finally, a table summarizes, for each spectrum, the cluster number it was assigned to and the size of that cluster, which is the total number of spectra in the cluster.
# Toy similarity matrix between the six example spectra of
# three species. The cosine metric is used and a value of
# zero indicates dissimilar spectra and a value of one
# indicates identical spectra.
cosine_similarity <- matrix(
c(
1, 0.79, 0.77, 0.99, 0.98, 0.98,
0.79, 1, 0.98, 0.79, 0.8, 0.8,
0.77, 0.98, 1, 0.77, 0.77, 0.77,
0.99, 0.79, 0.77, 1, 1, 0.99,
0.98, 0.8, 0.77, 1, 1, 1,
0.98, 0.8, 0.77, 0.99, 1, 1
),
nrow = 6,
dimnames = list(
c(
"species1_G2", "species2_E11", "species2_E12",
"species3_F7", "species3_F8", "species3_F9"
),
c(
"species1_G2", "species2_E11", "species2_E12",
"species3_F7", "species3_F8", "species3_F9"
)
)
)
# Delineate clusters based on a 0.92 threshold applied
# to the similarity matrix
delineate_with_similarity(cosine_similarity, threshold = 0.92)
#> # A tibble: 6 × 3
#> name membership cluster_size
#> <chr> <int> <int>
#> 1 species1_G2 1 4
#> 2 species2_E11 2 2
#> 3 species2_E12 2 2
#> 4 species3_F7 1 4
#> 5 species3_F8 1 4
#> 6 species3_F9 1 4
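The thresholding logic can be sketched with base R only, assuming that similarities are converted to distances as one minus the similarity. This is an illustration of complete-linkage clustering cut at a threshold on a hypothetical three-spectra matrix, not necessarily the exact internals of delineate_with_similarity().

```r
# Hypothetical similarity matrix between three spectra
sim <- matrix(
  c(
    1, 0.79, 0.99,
    0.79, 1, 0.80,
    0.99, 0.80, 1
  ),
  nrow = 3,
  dimnames = list(c("a", "b", "c"), c("a", "b", "c"))
)
threshold <- 0.92

# Complete linkage on the distances (1 - similarity)
tree <- stats::hclust(stats::as.dist(1 - sim), method = "complete")

# Cutting at height 1 - threshold keeps together only the spectra
# whose pairwise similarities are all at or above the threshold
membership <- stats::cutree(tree, h = 1 - threshold)

# Size of the cluster of each spectrum
cluster_size <- as.integer(table(membership)[as.character(membership)])
data.frame(name = names(membership), membership, cluster_size)
```

With complete linkage, spectra `a` and `c` (similarity 0.99) end up in the same cluster, while `b` stays alone because its similarity to the others is below 0.92.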
Once the table of clusters is generated from the similarity matrix, a reference spectrum can be assigned to each cluster.
We choose to define high-quality spectra as representative spectra of the clusters using internal information. That is, representative spectra have, within their cluster, the highest median signal-to-noise ratio and then the highest number of detected peaks.
The function set_reference_spectra()
does not change the order of the cluster table but merely adds an
additional column is_reference
to indicate whether the
corresponding spectrum is representative of the cluster.
# Get an example directory of six Bruker MALDI Biotyper spectra
# Import the six spectra and
# Transform the spectra signals according to Strejcek et al. (2018)
processed <- system.file(
"toy-species-spectra",
package = "maldipickr"
) %>%
import_biotyper_spectra() %>%
process_spectra()
# Toy similarity matrix between the six example spectra of
# three species. The cosine metric is used and a value of
# zero indicates dissimilar spectra and a value of one
# indicates identical spectra.
cosine_similarity <- matrix(
c(
1, 0.79, 0.77, 0.99, 0.98, 0.98,
0.79, 1, 0.98, 0.79, 0.8, 0.8,
0.77, 0.98, 1, 0.77, 0.77, 0.77,
0.99, 0.79, 0.77, 1, 1, 0.99,
0.98, 0.8, 0.77, 1, 1, 1,
0.98, 0.8, 0.77, 0.99, 1, 1
),
nrow = 6,
dimnames = list(
c(
"species1_G2", "species2_E11", "species2_E12",
"species3_F7", "species3_F8", "species3_F9"
),
c(
"species1_G2", "species2_E11", "species2_E12",
"species3_F7", "species3_F8", "species3_F9"
)
)
)
# Delineate clusters based on a 0.92 threshold applied
# to the similarity matrix
clusters <- delineate_with_similarity(
cosine_similarity,
threshold = 0.92
)
# Set reference spectra with the toy example
set_reference_spectra(clusters, processed$metadata)
#> # A tibble: 6 × 6
#> name membership cluster_size SNR peaks is_reference
#> <chr> <int> <int> <dbl> <int> <lgl>
#> 1 species1_G2 1 4 5.09 21 FALSE
#> 2 species2_E11 2 2 5.54 22 FALSE
#> 3 species2_E12 2 2 5.63 23 TRUE
#> 4 species3_F7 1 4 4.89 26 FALSE
#> 5 species3_F8 1 4 5.56 25 TRUE
#> 6 species3_F9 1 4 5.40 25 FALSE
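The selection rule (highest median SNR first, then highest number of peaks, within each cluster) can be sketched in base R. The values below are copied from the table above; the ordering approach is an assumption made for illustration and is not claimed to be how set_reference_spectra() is implemented.

```r
# Cluster table with the quality metrics shown above
toy <- data.frame(
  name = c(
    "species1_G2", "species2_E11", "species2_E12",
    "species3_F7", "species3_F8", "species3_F9"
  ),
  membership = c(1L, 2L, 2L, 1L, 1L, 1L),
  SNR = c(5.09, 5.54, 5.63, 4.89, 5.56, 5.40),
  peaks = c(21L, 22L, 23L, 26L, 25L, 25L)
)

# Rank rows by cluster, then decreasing SNR, then decreasing peaks
ranked <- order(toy$membership, -toy$SNR, -toy$peaks)

# The first ranked row of each cluster is the representative spectrum
best <- ranked[!duplicated(toy$membership[ranked])]

# Flag the representative spectra without reordering the table
toy$is_reference <- seq_len(nrow(toy)) %in% best
toy[toy$is_reference, c("name", "membership")]
```

The sketch flags species3_F8 for cluster 1 and species2_E12 for cluster 2, in agreement with the is_reference column above.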
An alternative to the similarity matrix approach from the previous
section is to rely on the taxonomic identification of the spectra to
delineate clusters. To do so, we must use the Bruker MALDI Biotyper
report from the Compass software that summarizes the identification of
the microorganisms using its internal database. Once the report or
reports are imported (in R using read_biotyper_report()), the function
delineate_with_identification() will group spectra based on their
identifications.
report_unknown <- read_biotyper_report(
system.file("biotyper_unknown.csv", package = "maldipickr")
)
delineate_with_identification(report_unknown)
#> Generating clusters from single report
#> # A tibble: 4 × 3
#> name membership cluster_size
#> <chr> <int> <int>
#> 1 unknown_isolate_1 2 1
#> 2 unknown_isolate_2 3 1
#> 3 unknown_isolate_3 1 2
#> 4 unknown_isolate_4 1 2
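The general idea of grouping by identification can be sketched in base R, under the simplifying assumption that spectra sharing the same identification label form one cluster. The table and the species labels below are hypothetical, and the sketch does not cover how delineate_with_identification() numbers its clusters or handles unreliable identifications.

```r
# Hypothetical identifications for four isolates
report <- data.frame(
  name = c("isolate_1", "isolate_2", "isolate_3", "isolate_4"),
  species = c("Species A", "Species B", "Species C", "Species C")
)

# One cluster per distinct identification, numbered by first appearance
report$membership <- as.integer(
  factor(report$species, levels = unique(report$species))
)

# Size of the cluster of each spectrum
report$cluster_size <- as.integer(
  table(report$membership)[as.character(report$membership)]
)
report
```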
Clusters generated from taxonomic identifications cannot use the
function set_reference_spectra(), as the latter relies on peak
information that is not disclosed in the Biotyper report.
Therefore, users interested in cherry-picking spectra using taxonomic
identifications should use the pick_spectra() function described below
with the combination of the input and output tables of the
delineate_with_identification() function to pick, for instance, the
spectra with the highest log score (using criteria_column = "bruker_log").
Raw spectra can also be processed and clustered by another approach,
named SPeDE, developed by Dumolin et al. (2019). The resulting
dereplication step produces a comma-separated table. The example below
illustrates how to import this table into R to be consistent with the
dereplication table generated within the {maldipickr} package.
# Reformat the output from SPeDE table
# https://github.com/LM-UGent/SPeDE
import_spede_clusters(
system.file("spede.csv", package = "maldipickr")
)
#> # A tibble: 6 × 5
#> name membership cluster_size quality is_reference
#> <chr> <dbl> <int> <chr> <lgl>
#> 1 species1_G2 1 1 GREEN TRUE
#> 2 species2_E11 2 2 ORANGE FALSE
#> 3 species2_E12 2 2 GREEN TRUE
#> 4 species3_F7 3 1 GREEN TRUE
#> 5 species3_F8 4 2 ORANGE FALSE
#> 6 species3_F9 4 2 GREEN TRUE
When isolating bacteria from an environment, experimenters want to be
thorough but also work-, time- and cost-savvy. One approach is to reduce
the redundancy of the bacterial isolates by analyzing their MALDI-TOF
spectra from the Bruker Biotyper. All the steps previously described in
this vignette consisted of processing the spectra to be able to pick
only non-redundant spectra, using the pick_spectra()
function.
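The function, as illustrated in the examples below, can pick spectra using different types of inputs: the reference spectra flagged in the cluster table (the default; cluster table obtained with the delineate_with_similarity() or import_spede_clusters() functions; see example 1), or external metadata together with a criteria column (see example 2).
Spectra, and clusters, can also be excluded from the cherry-picking
decision, a procedure termed masking here. We distinguish two
types of mask that are implemented in the pick_spectra() function:
soft masking, where the flagged spectra are discarded from the selection
(see example 3), and hard masking, where any cluster containing a flagged
spectrum is discarded (see example 4).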
Advanced users can also directly provide a cluster table with a custom sort by cluster to accommodate complex designs.
Ultimately, the function delivers a table with as many rows as the
cluster table and an additional logical column named to_pick that
indicates whether the colony associated with the spectrum should be
picked (TRUE) or not picked (FALSE).
# 0. Load a toy example of a tibble of clusters created by
# the `delineate_with_similarity` function.
clusters <- readRDS(
system.file("clusters_tibble.RDS",
package = "maldipickr"
)
)
# 1. By default and if no other metadata are provided,
# the function picks the reference spectrum of each cluster.
#
# N.B: The spectra `name` and `to_pick` columns are moved to the left
# only for clarity using the `relocate()` function.
#
pick_spectra(clusters) %>%
dplyr::relocate(name, to_pick) # only for clarity
#> # A tibble: 6 × 7
#> name to_pick membership cluster_size SNR peaks is_reference
#> <chr> <lgl> <int> <int> <dbl> <dbl> <lgl>
#> 1 species1_G2 FALSE 1 4 5.09 21 FALSE
#> 2 species2_E11 FALSE 2 2 5.54 22 FALSE
#> 3 species2_E12 TRUE 2 2 5.63 23 TRUE
#> 4 species3_F7 FALSE 1 4 4.89 26 FALSE
#> 5 species3_F8 TRUE 1 4 5.56 25 TRUE
#> 6 species3_F9 FALSE 1 4 5.40 25 FALSE
# 2.1 Simulate OD600 values with uniform distribution
# for each of the colonies we measured with
# the Bruker MALDI Biotyper
set.seed(104)
metadata <- dplyr::transmute(
clusters,
name = name, OD600 = runif(n = nrow(clusters))
)
metadata
#> # A tibble: 6 × 2
#> name OD600
#> <chr> <dbl>
#> 1 species1_G2 0.364
#> 2 species2_E11 0.772
#> 3 species2_E12 0.735
#> 4 species3_F7 0.973
#> 5 species3_F8 0.740
#> 6 species3_F9 0.201
# 2.2 Pick the spectra based on the highest
# OD600 value per cluster
pick_spectra(clusters, metadata, "OD600") %>%
dplyr::relocate(name, to_pick) # only for clarity
#> # A tibble: 6 × 8
#> name to_pick membership cluster_size SNR peaks is_reference OD600
#> <chr> <lgl> <int> <int> <dbl> <dbl> <lgl> <dbl>
#> 1 species1_G2 FALSE 1 4 5.09 21 FALSE 0.364
#> 2 species2_E11 TRUE 2 2 5.54 22 FALSE 0.772
#> 3 species2_E12 FALSE 2 2 5.63 23 TRUE 0.735
#> 4 species3_F7 TRUE 1 4 4.89 26 FALSE 0.973
#> 5 species3_F8 FALSE 1 4 5.56 25 TRUE 0.740
#> 6 species3_F9 FALSE 1 4 5.40 25 FALSE 0.201
# 3.1 Say that the wells on the right side of the plate are
# used for negative controls and should not be picked.
metadata <- metadata %>% dplyr::mutate(
well = gsub(".*[A-Z]([0-9]{1,2}$)", "\\1", name) %>%
strtoi(),
is_edge = is_well_on_edge(
well_number = well, plate_layout = 96, edges = "right"
)
)
# 3.2 Pick the spectra after discarding (or soft masking)
# the spectra indicated by the `is_edge` column.
pick_spectra(clusters, metadata, "OD600",
soft_mask_column = "is_edge"
) %>%
dplyr::relocate(name, to_pick) # only for clarity
#> # A tibble: 6 × 10
#> name to_pick membership cluster_size SNR peaks is_reference OD600 well
#> <chr> <lgl> <int> <int> <dbl> <dbl> <lgl> <dbl> <int>
#> 1 species1… FALSE 1 4 5.09 21 FALSE 0.364 2
#> 2 species2… TRUE 2 2 5.54 22 FALSE 0.772 11
#> 3 species2… FALSE 2 2 5.63 23 TRUE 0.735 12
#> 4 species3… TRUE 1 4 4.89 26 FALSE 0.973 7
#> 5 species3… FALSE 1 4 5.56 25 TRUE 0.740 8
#> 6 species3… FALSE 1 4 5.40 25 FALSE 0.201 9
#> # ℹ 1 more variable: is_edge <lgl>
# 4.1 Say that some spectra were picked before
# (e.g., in the column F) in a previous experiment.
# We do not want to pick clusters with those spectra
# included to limit redundancy.
metadata <- metadata %>% dplyr::mutate(
picked_before = grepl("_F", name)
)
# 4.2 Pick the spectra from clusters without spectra
# labeled as `picked_before` (hard masking).
pick_spectra(clusters, metadata, "OD600",
hard_mask_column = "picked_before"
) %>%
dplyr::relocate(name, to_pick) # only for clarity
#> # A tibble: 6 × 11
#> name to_pick membership cluster_size SNR peaks is_reference OD600 well
#> <chr> <lgl> <int> <int> <dbl> <dbl> <lgl> <dbl> <int>
#> 1 species1… FALSE 1 4 5.09 21 FALSE 0.364 2
#> 2 species2… TRUE 2 2 5.54 22 FALSE 0.772 11
#> 3 species2… FALSE 2 2 5.63 23 TRUE 0.735 12
#> 4 species3… FALSE 1 4 4.89 26 FALSE 0.973 7
#> 5 species3… FALSE 1 4 5.56 25 TRUE 0.740 8
#> 6 species3… FALSE 1 4 5.40 25 FALSE 0.201 9
#> # ℹ 2 more variables: is_edge <lgl>, picked_before <lgl>
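The difference between soft and hard masking can be summarized with a base R sketch. The helper `pick_sketch()` is hypothetical and is not part of {maldipickr}; it assumes that the spectrum with the highest criteria value is picked in each cluster, that soft-masked spectra are ignored, and that clusters containing a hard-masked spectrum are skipped entirely. The membership and OD600 values are copied from the examples above, and the soft mask on species3_F7 is an arbitrary choice for illustration.

```r
# Toy cluster table with the OD600 values used above
toy <- data.frame(
  name = c(
    "species1_G2", "species2_E11", "species2_E12",
    "species3_F7", "species3_F8", "species3_F9"
  ),
  membership = c(1L, 2L, 2L, 1L, 1L, 1L),
  OD600 = c(0.364, 0.772, 0.735, 0.973, 0.740, 0.201)
)

# Hypothetical helper: returns a logical vector, one value per spectrum
pick_sketch <- function(tbl, criteria,
                        soft_mask = rep(FALSE, nrow(tbl)),
                        hard_mask = rep(FALSE, nrow(tbl))) {
  # Hard masking: exclude every cluster with a flagged spectrum
  hard_clusters <- unique(tbl$membership[hard_mask])
  # Soft masking: exclude only the flagged spectra themselves
  eligible <- !soft_mask & !(tbl$membership %in% hard_clusters)
  to_pick <- rep(FALSE, nrow(tbl))
  for (cl in unique(tbl$membership[eligible])) {
    idx <- which(eligible & tbl$membership == cl)
    # Pick the eligible spectrum with the highest criteria value
    to_pick[idx[which.max(tbl[[criteria]][idx])]] <- TRUE
  }
  to_pick
}

no_mask <- pick_sketch(toy, "OD600")
soft <- pick_sketch(toy, "OD600", soft_mask = toy$name == "species3_F7")
hard <- pick_sketch(toy, "OD600", hard_mask = grepl("_F", toy$name))
data.frame(name = toy$name, no_mask, soft, hard)
```

In this sketch, soft masking species3_F7 shifts the pick in cluster 1 to species3_F8, whereas hard masking the spectra from column F removes cluster 1 from the selection altogether, as in example 4.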