Getting started with wrTopDownFrag

Introduction

This package contains tools for the use in TopDown Proteomics. Proteomics is referred to the technique of idenifying all proteins in a given sample. Typically this is done by using mass spectrometry. This technqiue returns primarily ‘m/z’ (mass over charge) measures, in most cases this can be resolved to molecular masses.

Since mass spectrometry is rather well suited to identifying small molecules, it has become common practice to first digest proteins into smaller units (ie peptides), which can be easily identified by mass spetrometry. This technique is referred to as “bottom-up” proteomics. Additional high energy fragmentation of (proteins or) peptides directly within the mass spetrometer also helps very much improving the identification rate (MS-MS or MS2). However, with a large number of peptiodes that can be identified this way it remains challenging to assign correcltly all these peptides to proteins and variants thereof.

To overcome some of the drawbacks associated with the bottom-up approach, more recent developments of mass spectrometers allow the identification of full length proteins (from samples with few proteins). This approach is called “top-down proteomics” (see also Chen et al, 2018, Skinner et al, 2018 or Li et al, 2018). In this context high energy random fragmentation of proteins/peptides within the mass spectrometer plays an important role, too. This approach produces fragments conainting on of the original start/end-sites (terminal fragments) and, depending on the energy settings, furher internal fragments. Of course, larger parent proteins/peptides will give even more complex patterns of internal fragments. The pattern of resulting fragments for a given precursor protein/peptide allows better identification and provides further valuable information about the (3-dimensional) conformation of the initial proteins (see also Haverland et al, 2017).

This project got started to help analyzing internal fragments from FT-ICR mass-spectrometry. At the time of beginning none or only very limited tools were available for this task (this situation has been changing since 2019). Because there is already software available for transforming initial lists of m/z values into monoisotopic values, here the aim was to continue further to the identification of m/z peaks after a deconvolution step to assign the most likely peptide/proptein sequence. Please refer eg to Wikipedia: monoisotopic mass for details on molecular mass and deconvolution. Initial developments were performed based on data from FT-ICR mass-spectrometry, but the overal concept may be applied to any kind of mass-spectrometry data.

When developing this package particular attention was brought to the fact that entire proteins may very well still carry embedded ions in catalytic cores or other (rare) modifications. With the tools pesented here, the link between parent- and children fragments was not further taken into account since we did not expect 100 percent pure parent species entering the fragmentation step.

In summary, this package aims to provide tools for the identification of proteins from monoisotopic m/z lists, and in particular, to consider and identify all possible internal fragments resulting from fragmentation performed during mass-spectromery analysis. Please note that this version will soon be updated to further address issues with identification of predicted fragments.

To get started, we need to load the packages wrMisc and wrProteo, available from CRAN. And of course we need to charge this package.

library(wrMisc)
library(wrProteo)
library(wrTopDownFrag)
#> 
#> Attaching package: 'wrTopDownFrag'
#> The following object is masked from 'package:wrMisc':
#> 
#>     combinatIntTable

Further information about the package versions used for making this vignette can be found in the appendix ‘Session-info’.

For manipulating peptide/protein sequences we will use functions for working with one-letter code amino-acid sequences provided by package wrProteo.

Nomenclature

This describes how chemical modifications on amino-acids (like oxygenation) are abbreviated and what exact chemical modification it referrs to. In term of nomenclature we’ll stick to these abbreviations :

## common settings
str(AAfragSettings())
#> List of 4
#>  $ knownMods        :List of 9
#>   ..$ noMod   : chr ""
#>   ..$ Nterm   : chr [1:3] "a" "b" "c"
#>   ..$ Cterm   : chr [1:3] "x" "y" "z"
#>   ..$ NCterm  : chr [1:2] "d" "i"
#>   ..$ ny      : NULL
#>   ..$ intern  : chr [1:8] "a" "b" "c" "f" ...
#>   ..$ spcAA   : chr [1:9] "p" "h" "k" "o" ...
#>   ..$ spcNterm: chr [1:2] "g" "n"
#>   ..$ spcCterm: chr "u"
#>  $ specAAMod        :List of 9
#>   ..$ p: chr [1:3] "T" "S" "Y"
#>   ..$ h: chr [1:4] "S" "T" "E" "D"
#>   ..$ k: chr [1:4] "Q" "K" "R" "N"
#>   ..$ o: chr "M"
#>   ..$ r: chr "C"
#>   ..$ s: chr "S"
#>   ..$ m: chr [1:2] "R" "K"
#>   ..$ n: chr "Q"
#>   ..$ u: chr "R"
#>  $ modChem          : chr [1:22, 1:2] "" "a" "b" "c" ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:2] "ty" "mod"
#>  $ neutralLossOrGain: chr [1:7] "b" "d" "f" "h" ...

Here we can see that ‘a’,‘b’ and ‘c’-ions are grouped as ‘Nterm’ or that the phosporylation modification ‘p’ is taken into consideration at ‘S’,‘T’ or ‘Y’ amino-acid residues. The section $modChem describes/defines exactely how many molecules will be added or removed with the various modifications available in this package.

Molecular (mono-isotopic) masses of the atomes used here are taken the package wrProteo, intially they were taken from Unimod. They can be easily updated, if in the future, (mono-isotopic) molecular masses will be determined with higher precision (ie more digits).

Obtaining Molecular Mass For Chemical Structures (using wrProteo)

The package wrProteo is used to convert summed chemical formulas into molecular masses. The masse values returned account for non-charged molecules.

# Standard way to obtain the (monoisotopic) mass of water (H20) or a phosphorylation (PO3)
# Note that the number a molecule appears must be written in front of the molecule (no number means one occurance)
massDeFormula(c("2HO", "P3O"))
#>     2H1O     1P3O 
#> 18.01056 78.95851

# Undereith this runs (for H20):
2*.atomicMasses()["H","mono"] +.atomicMasses()["O","mono"]   # H2O
#> [1] 18.01056

Atomic masses can be calulated either as ‘average mass’ or ‘monoisotopic mass’ (Wikipedia: monoisotopic mass), the latter is commonly used in mass-spectrometry and will be used by default in this package.

Molecular Mass Of Peptides And Proteins

At this level we can compute the expected mass of uncharged proteins/peptides as defined by their size. Of couse, protein isomers (ie same total composition but different sequence) get the same mass.

# Let's define two small amino-acid sequences 
protP <- c(pepK="KPEPTIDRPEP", pepC="CEPEPTRT", pepC2="PECEPTRT")

# The sequence converted to mass  
convAASeq2mass(protP)   
#> KPEPTIDRPEP    CEPEPTRT    PECEPTRT 
#>   1277.6616    931.4069    931.4069

As mentioned, this package assumes that experimental values have already been deconvoluted, ie that only the mass of the mono-charged peak provided and isotopic patterns have been reduced to the main representative isotope. In line with this assumption, default predictions are mono-isotopic masses (but can be changed via the argument massTy).

Fragmenting Peptide/Protein -Sequences

With ‘Fragmentation’ techniques we refer to technqiues like shooting electrons or IR-waves allowing to break larger molecules into smaller molecules (of different composition).

In order to check if fragmentation yields random cleavage or raher directed distribution of cleavage sites, it is necessary to predict all possible cleavage sites. The complexity of this simple task increases about exponentially with protein size. In fact, the complexity increases so much, that only a few (longer) full length proteins can be treated at once within a reasonable amount of time. With large proteins (more than 600 aa length) this may consume considerable amounts of RAM ! When designing this package care has been taken to run infractructure intensive tasks as efficent as possible, but working with complex samples/proteomes it is still beyond technical limits.

Below a very simplified view on the theoretical number of terminal and internal fragments is shown. Fixed modifications do not change the number of expected fragments (as they only shift the mass). Note, that this simplification for the plot below does not include variable modifications (see also the next section).

marks <- data.frame(name=c("Ubiquitin\n76aa","Glutamate dehydrogenase 1\n501aa"), length=c(76,501))
layout(matrix(1:2,ncol=2))
plotNTheor(x=20:750, log="", mark=marks)
plotNTheor(x=20:750, log="xy", mark=marks)
mtext("log/log scale", cex=0.8, line=0.1)

Fragmenting a Protein Sequence

Random cleavege for collection of proteins can be obtained using the functions or

Please note, that a number of functions of this package return mass-values for theoretical non-charged peptides ! So when comparing with other ressources (like Prospector), please keep in mind that most functions (eg makeFragments() ) do not directly calculate/predict masses for ‘b-’ or ‘y-’ ions. Using atomicMasses() and/or convAASeq2mass() you can easily adjust initially predicted neutral masses to respective ions. Fragmentation procedures may give different type of ions, depending on equipent and specific settings. Thus, very important input from the user is needed to adjust initial neutral fragments into to expected types of ions.

When predicting actual ions you should also consider that many reactions may have internal loss of water (or other neutral loss reactions). Thus in such instances the mass of H2O should be subtracted, too.

protP <- c(pepK="KPEPTIDRPEP", pepC="CEPEPTRT")
## Basic output 
fragmentSeq(protP[1], minSize=3, internFragments=TRUE, pref="pepK")
#> fragmentSeq : 1 out of 45 fragments not unique
#>  full.pepK.1-11   Nter.pepK.1-3   Nter.pepK.1-4   Nter.pepK.1-5   Nter.pepK.1-6 
#>   "KPEPTIDRPEP"           "KPE"          "KPEP"         "KPEPT"        "KPEPTI" 
#>   Nter.pepK.1-7   Nter.pepK.1-8   Nter.pepK.1-9  Nter.pepK.1-10  Cter.pepK.2-11 
#>       "KPEPTID"      "KPEPTIDR"     "KPEPTIDRP"    "KPEPTIDRPE"    "PEPTIDRPEP" 
#>  Cter.pepK.3-11  Cter.pepK.4-11  Cter.pepK.5-11  Cter.pepK.6-11  Cter.pepK.7-11 
#>     "EPTIDRPEP"      "PTIDRPEP"       "TIDRPEP"        "IDRPEP"         "DRPEP" 
#>  Cter.pepK.8-11  Cter.pepK.9-11 inter.pepK.2-10  inter.pepK.2-4  inter.pepK.2-5 
#>          "RPEP"           "PEP"     "PEPTIDRPE"           "PEP"          "PEPT" 
#>  inter.pepK.2-6  inter.pepK.2-7  inter.pepK.2-8  inter.pepK.2-9 inter.pepK.3-10 
#>         "PEPTI"        "PEPTID"       "PEPTIDR"      "PEPTIDRP"      "EPTIDRPE" 
#> inter.pepK.4-10 inter.pepK.5-10 inter.pepK.6-10 inter.pepK.7-10 inter.pepK.8-10 
#>       "PTIDRPE"        "TIDRPE"         "IDRPE"          "DRPE"           "RPE" 
#>  inter.pepK.3-9  inter.pepK.3-5  inter.pepK.3-6  inter.pepK.3-7  inter.pepK.3-8 
#>       "EPTIDRP"           "EPT"          "EPTI"         "EPTID"        "EPTIDR" 
#>  inter.pepK.4-9  inter.pepK.5-9  inter.pepK.6-9  inter.pepK.7-9  inter.pepK.4-8 
#>        "PTIDRP"         "TIDRP"          "IDRP"           "DRP"         "PTIDR" 
#>  inter.pepK.4-6  inter.pepK.4-7  inter.pepK.5-8  inter.pepK.6-8  inter.pepK.5-7 
#>           "PTI"          "PTID"          "TIDR"           "IDR"           "TID"

As mentioned above, the functions fragmentSeq() or makeFragments() produce neutral peptide-mass predictions (and not at final ions).

## Elaborate output
protP2 <- cbind(na=names(protP), se=protP, ma=wrProteo::convAASeq2mass(protP, seqName=TRUE))
pepT1 <- makeFragments(protTab=protP2, minFragSize=3, maxFragSize=9, internFra=TRUE)

head(pepT1)
#>      no  seq    orig   origNa ty     seqNa                beg end precAA tailAA
#> [1,] "3" "pep"  "pepK" "pepK" "Nter" "1277.661565819.1-3" "1" "3" NA     "K"   
#> [2,] "4" "pep"  "pepC" "pepC" "Nter" "931.4069310368.1-3" "1" "3" NA     "C"   
#> [3,] "2" "pepC" "pepC" "pepC" "full" "931.4069310368.1-4" "1" "4" NA     NA    
#> [4,] "6" "epC"  "pepC" "pepC" "Cter" "931.4069310368.2-4" "2" "4" "p"    NA    
#> [5,] "1" "pepK" "pepK" "pepK" "full" "1277.661565819.1-4" "1" "4" NA     NA    
#> [6,] "5" "epK"  "pepK" "pepK" "Cter" "1277.661565819.2-4" "2" "4" "p"    NA    
#>      ambig          mass               modSpec
#> [1,] "duplSequence" "18.010016106391"  ""     
#> [2,] "duplSequence" "18.010016106391"  ""     
#> [3,] "isoMass"      "121.019200584191" ""     
#> [4,] "isoMass"      "121.019200584191" ""     
#> [5,] "isoMass"      "146.104979124091" ""     
#> [6,] "isoMass"      "146.104979124091" ""

dim(pepT1)
#> [1]  6 13

## The repartition between types of fragments :
table(pepT1[,"ty"])
#> 
#> Cter Nter full 
#>    2    2    2

## Types of ambiguities encountered
table(pepT1[,"ambig"])
#> 
#> duplSequence      isoMass 
#>            2            4

Even such a small example gives already 61 possible peptides (without counting modifications). Of course, it is quite common to obtain a high degree of ambiguities with short peptides since they are less likely to contain unqique sequences.

Example 1 : Single Peptide Fragmentation and Comparison To Prospector

Let’s look at the sequence HLVDEPQNLIK . This sequence matches to Bos taurus Albumin (B0JYQ0, B0JYQ0_BOVIN), aa402 - aa412 from Bos tauris Albumin Accession AAI51547.1 or (Albumin precursor) P02769.4 or NP_851335.1; or 3V03_A (aa378 - 388) .


Alb.seq <- "HLVDEPQNLIK" 
(Alb.mass <- wrProteo::convAASeq2mass("HLVDEPQNLIK", massTy="mono")) 
#> HLVDEPQNLIK 
#>    1304.709
## Now the mass for the MH+ ion
(Alb.MHp <- wrProteo::convAASeq2mass("HLVDEPQNLIK", massTy="mono") + wrProteo::.atomicMasses()["H","mono"] - wrProteo::.atomicMasses()["e","mono"]) 
#> HLVDEPQNLIK 
#>    1305.716

From Prospector we get for the mono-mass : 1305.7161

## delta prospector
Alb.MHp - 1305.7161
#>  HLVDEPQNLIK 
#> 2.681479e-05

We observe a delta mass to ProteinProspector tool of 2.681479e-05 .

Next, we’ll fragment this sequence

AlbFrag <- makeFragments(Alb.seq, minFragSize=2, maxFragSize=17, internFra=FALSE)
head(AlbFrag)
#>      no   seq    orig          origNa        ty     seqNa      beg  end  precAA
#> [1,] "19" "IK"   "HLVDEPQNLIK" "HLVDEPQNLIK" "Cter" "p1.10-11" "10" "11" "L"   
#> [2,] "2"  "HL"   "HLVDEPQNLIK" "HLVDEPQNLIK" "Nter" "p1.1-2"   "1"  "2"  NA    
#> [3,] "3"  "HLV"  "HLVDEPQNLIK" "HLVDEPQNLIK" "Nter" "p1.1-3"   "1"  "3"  NA    
#> [4,] "18" "LIK"  "HLVDEPQNLIK" "HLVDEPQNLIK" "Cter" "p1.9-11"  "9"  "11" "N"   
#> [5,] "4"  "HLVD" "HLVDEPQNLIK" "HLVDEPQNLIK" "Nter" "p1.1-4"   "1"  "4"  NA    
#> [6,] "17" "NLIK" "HLVDEPQNLIK" "HLVDEPQNLIK" "Cter" "p1.8-11"  "8"  "11" "Q"   
#>      tailAA ambig mass               modSpec
#> [1,] NA     NA    "259.189043104491" ""     
#> [2,] "V"    NA    "268.152991949191" ""     
#> [3,] "D"    NA    "367.221405865391" ""     
#> [4,] NA     NA    "372.273107084891" ""     
#> [5,] "E"    NA    "482.248348897391" ""     
#> [6,] NA     NA    "486.316034532091" ""
AlbFrag[which(AlbFrag[,"ty"]=="Nter"), c("seq","mass","beg","ty")]
#>       seq          mass               beg ty    
#>  [1,] "HL"         "268.152991949191" "1" "Nter"
#>  [2,] "HLV"        "367.221405865391" "1" "Nter"
#>  [3,] "HLVD"       "482.248348897391" "1" "Nter"
#>  [4,] "HLVDE"      "611.290941993591" "1" "Nter"
#>  [5,] "HLVDEP"     "708.343705845591" "1" "Nter"
#>  [6,] "HLVDEPQ"    "836.402283356991" "1" "Nter"
#>  [7,] "HLVDEPQN"   "950.445210804191" "1" "Nter"
#>  [8,] "HLVDEPQNL"  "1063.52927478459" "1" "Nter"
#>  [9,] "HLVDEPQNLI" "1176.61333876499" "1" "Nter"

and predict b & y ions (we have to transform neutral peptides into the corresponding ions)

Alb.b <- as.numeric(AlbFrag[which(AlbFrag[,"ty"]=="Nter"),c("mass")]) - 1*wrProteo::.atomicMasses()["O","mono"] - wrProteo::.atomicMasses()["H","mono"] - wrProteo::.atomicMasses()["e","mono"] 
Alb.y <- as.numeric(AlbFrag[which(AlbFrag[,"ty"]=="Cter"),c("mass")]) + wrProteo::.atomicMasses()["H","mono"] - 1*wrProteo::.atomicMasses()["e","mono"] 

Alb.a <- as.numeric(AlbFrag[which(AlbFrag[,"ty"]=="Nter"),c("mass")]) - wrProteo::.atomicMasses()["C","mono"] - 2*wrProteo::.atomicMasses()["O","mono"] - wrProteo::.atomicMasses()["H","mono"] - 1*wrProteo::.atomicMasses()["e","mono"] 
Alb.x <- as.numeric(AlbFrag[which(AlbFrag[,"ty"]=="Cter"),c("mass")]) + 1*wrProteo::.atomicMasses()["C","mono"] + 1*wrProteo::.atomicMasses()["O","mono"] - 1*wrProteo::.atomicMasses()["H","mono"] - 1*wrProteo::.atomicMasses()["e","mono"] 

AlbFrag.int <- makeFragments(Alb.seq, minFragSize=2, maxFragSize=17, internFra=TRUE)
dim(AlbFrag.int)
#> [1] 55 13

Comparison to Prospector

## Masses from Prospector for a,x,b- and y-ions
Alb.prs.a <- c(223.1553, 322.2238,  437.2507,   566.2933,   663.3461,   791.4046,   905.4476,   1018.5316,  1131.6157)
Alb.prs.x <- rev(c(1194.6365,   1081.5524, 982.4840, 867.4571, 738.4145, 641.3617, 513.3031, 399.2602, 286.1761)) #, 173.0921))

Alb.prs.b <- c(251.1503, 350.2187, 465.2456, 594.2882, 691.3410, 819.3995, 933.4425, 1046.5265, 1159.6106)
Alb.prs.y <- rev(c(1168.6572, 1055.5732, 956.5047, 841.4778, 712.4352, 615.3824, 487.3239, 373.2809, 260.1969))

cbind(wr.b=Alb.b, d.b=Alb.b - Alb.prs.b, wr.y=Alb.y, d.y= Alb.y - Alb.prs.y)
#>            wr.b           d.b      wr.y           d.y
#>  [1,]  251.1497 -0.0005962849  260.1963 -0.0005804433
#>  [2,]  350.2181 -0.0005823687  373.2804 -0.0005164629
#>  [3,]  465.2451 -0.0005393367  487.3233 -0.0005890157
#>  [4,]  594.2877 -0.0005462405  615.3819 -0.0005115043
#>  [5,]  691.3404 -0.0005823885  712.4347 -0.0005476523
#>  [6,]  819.3990 -0.0005048771  841.4772 -0.0005545561
#>  [7,]  933.4419 -0.0005774299  956.5042 -0.0005115241
#>  [8,] 1046.5260 -0.0005134495 1055.5726 -0.0005976079
#>  [9,] 1159.6101 -0.0005494691 1168.6567 -0.0005336275

It seems the b & y masses from Prospector do not account for the single positive charge (ie a single electron’s mass may not have been removed, since the delta-mass corresponds approximately to an eletron’s mass) .

Example 2 : A Detailed Toy Example For Predicting Fragments

Let us work on this short protein / peptide as example :

prot1 <- "RTVAAPSVFIFPPSDEQLKSG"

Let’s make this a MH+ ion

(prot1.MHp <- wrProteo::convAASeq2mass(prot1, massTy="mono") + wrProteo::.atomicMasses()["H","mono"] -wrProteo::.atomicMasses()["e","mono"])
#> RTVAAPSVFIFPPSDEQLKSG 
#>              2246.182
 
## Besides, lateron we'll also need the mass of water
(H2O.mass <- 2*wrProteo::.atomicMasses()["H","mono"]  + wrProteo::.atomicMasses()["O","mono"])
#> [1] 18.01056

We observe a delta mass of -4.099151e-05 Da when comparing to ProteinProspector (due to rounding at the last digit of Prospector).

Terminal Fragments

For (further) fragmentation you can get the following table for ourt protein “RTVAAPSVFIFPPSDEQLKSG” from the ProteinProspector tool . Let’s assume only b- and y-ions are produced.

##      b                         y
##     ---    1     R   21     ---
##  258.1561    2   T   20  2090.0804
##  357.2245    3   V   19  1989.0328
##  428.2616    4   A   18  1889.9644  <== use b4 as example
##  499.2987    5   A   17  1818.9272  <=    and corresp y17
##       ...  .    ...        ...

Please remember, when breaking proteins there is also loss of water (part of loss of neutrals). Thus, to obtain the mass of the b-ion ‘RTVA’ using wrTopDown we need to subtract mass.water and add mass.H+ to the peptide ‘RTVA’ (which resumes in subtracting 1 ‘H’, 1 ‘O’ and 1 electron).

(prot1.b4.MHp <- wrProteo::convAASeq2mass("RTVA", massTy="mono") - wrProteo::.atomicMasses()["H","mono"] - wrProteo::.atomicMasses()["O","mono"] - wrProteo::.atomicMasses()["e","mono"]) 
#>     RTVA 
#> 428.2616

Now we have a delta mass to Prospector of -6e-6 Da.

For the correpsonding b-ion :

(prot1.y18.MHp <- wrProteo::convAASeq2mass("APSVFIFPPSDEQLKSG", massTy="mono") + wrProteo::.atomicMasses()["H","mono"] - wrProteo::.atomicMasses()["e","mono"])  
#> APSVFIFPPSDEQLKSG 
#>          1818.927

Above, we have a delta mass to Prospector of 4.1e-5 Da.

Thus, the full ion = b-ion + y-ion - proton, for checking we’ll subtract (y18.MHp + y18.MHp -H20 + proton) from MHp

(prot1.MHp) - (prot1.b4.MHp + prot1.y18.MHp - wrProteo::.atomicMasses()["H","mono"] + wrProteo::.atomicMasses()["e","mono"])     
#> RTVAAPSVFIFPPSDEQLKSG 
#>         -4.547474e-13

We can see the error is marginally small (4e-13).

Next, let’s predict/produce all fragments using the function makeFragments() :

prot1Frag <- makeFragments(prot1, minFragSize=4, maxFragSize=17, internFra=FALSE)
prot1Frag[c(2,27),]            # our prevous b- & y-ion (as neutral peptide mass)
#>      no   seq                 orig              origNa                  ty    
#> [1,] "1"  "RTVA"              "2245.1742825563" "RTVAAPSVFIFPPSDEQLKSG" "Nter"
#> [2,] "15" "APSVFIFPPSDEQLKSG" "2245.1742825563" "RTVAAPSVFIFPPSDEQLKSG" "Cter"
#>      seqNa     beg end  precAA tailAA ambig mass               modSpec
#> [1,] "p1.1-4"  "1" "4"  NA     "."    NA    "445.264333312591" ""     
#> [2,] "p1.5-21" "5" "21" "5"    NA     NA    "1817.91941677019" ""

Since so far we have neutral peptide masses, we’ll convert the first N-ter fragment as b-ion : pept.mass - H20 + H+

as.numeric(prot1Frag[2,"mass"]) - H2O.mass + wrProteo::.atomicMasses()["H","mono"] - wrProteo::.atomicMasses()["e","mono"]  
#> [1] 428.261

Similarly, we’ll convert the corresponding C-ter fragment as y-ion : pept.mass + H+

as.numeric(prot1Frag[27,"mass"]) + wrProteo::.atomicMasses()["H","mono"] - wrProteo::.atomicMasses()["e","mono"]  
#> [1] 1818.927

Internal Fragments

From our initial protein/peptide prot1 (ie RTVAAPSVFIFPPSDEQLKSG) let’s look at the possible fragment “PSDEQL”:

From Prospector we get the following internal fragment(s) for “PSDEQL” (amongst many other internal fragments)

## Prospector internal fragments for "PSDEQL":
## Internal Sequence        b           a       b-NH3       b-H2O
##             ...        ...       ...       ...       ...
##          IFPPSD   657.3243  629.3293       ---  639.3137
##          PSDEQL   670.3042    642.3093  653.2777  652.2937

Now we predict all fragments -including internal fragments- for this protein/peptide.

prot1FragInt <- makeFragments(prot1, minFragSize=4, maxFragSize=17, internFra=TRUE)
dim(prot1FragInt)
#> [1] 161  13

Thus, we get 158 fragments predicted. Please note, that fragments from makeFragments() are given as neutral peptide mass (without loss of neutrals) !

Now we can check which one corresponds to the sequence “PSDEQL” (and store its index). Then, we display a few lines of fragments (with our target peptide at last position):

PSDEQL.idx <- which(prot1FragInt[,2] =="PSDEQL")                                   
prot1FragInt[PSDEQL.idx -(1:0),]
#>      no    seq      orig              origNa                  ty     
#> [1,] "149" "IFPPSD" "2245.1742825563" "RTVAAPSVFIFPPSDEQLKSG" "inter"
#> [2,] "104" "PSDEQL" "2245.1742825563" "RTVAAPSVFIFPPSDEQLKSG" "inter"
#>      seqNa      beg  end  precAA tailAA ambig mass               modSpec
#> [1,] "p1.10-15" "10" "15" "2"    NA     NA    "674.326993148891" ""     
#> [2,] "p1.13-18" "13" "18" "5"    NA     NA    "687.306985988291" ""

If we transform the (neutral) mass for “PSDEQL” as ‘b’-fragment (-H20 +H) we can match the mass of 670.30 given by Prospector (we have a delta of -4.6e-5 Da)

(PSDEQL.MH <- as.numeric(prot1FragInt[PSDEQL.idx,"mass"]) - H2O.mass + wrProteo::.atomicMasses()["H","mono"])
#> [1] 670.3042
(PSDEQL.MHp <- as.numeric(prot1FragInt[PSDEQL.idx,"mass"]) - H2O.mass + wrProteo::.atomicMasses()["H","mono"] - wrProteo::.atomicMasses()["e","mono"])
#> [1] 670.3037

c(Prospector=670.3042) - c(PSDEQL.MH, PSDEQL.MHp)
#> [1] -4.633409e-05  5.022458e-04

Please note, that the values from the table we obtained from Prospector lists in column ‘b’ values that rather to correspond to MH fragments and not to MH+ fragment-ions.

Fixed And Variable Modifications

In real-world biology protein modifcations are common. Conceptually one can distuinguish two cases : With ‘fixed modificatons’ it is presumed that all protein moleculs of a given species (ie sequence) do carry exactely the same modification(s). Alteratively one may suggest that only a portion of the molecules for a given protein-species carry a given modification.

Fixed modifications do not increase the search space, since a given value corresponding to the change of mass gets added or subtracted. In the case of varaible modifications this increases the search space since the modificed as well as the unmodified mass will be considered when comparig to experimental masses. In particular, when multiple amino-acids on the same protein may get modified alternatively this increases the search space. Then one may consider multiple modifications, for example 0 to 2 out of 5 Serine residues in the protein sequence may carry a phosporylation …

In order to reduce the apparent complexity, several conceptual compromises have been taken. ) The presence of charged amino-acids has been used to dismiss all fragments not containing a charged amino-acid, default has been set to positive charge (K,H and R). ) When identifying variable modifications, the non-modified isoform must be identified first to take the variable modification in account.

Basic Identification Including Fixed Modifications

Now we are ready to compare a list of experimental m/z values to a set of protein sequences …

Tolerance

In mass spectrometry it is common to use relative tolerance-limits when it comes to identification as ‘ppm’. This means that peaks not having any predicted peptides/ions in a predefined ppm-range will be omitted. When the search space gets very crowded, ie when many peptide-fragments are predicted, there is of course a considerable risk that some predicted fragments/ions are so close that multiple predicted peptides/ions have to be consiered in a a given ppm-range.

Identification Example

First let’s make a little toy example using a hypothetic protein/peptide sequence. The molecular masses below have been predicted using the Prospector tool from UCSF. We’ll add than a litle bit of random noise as a simultation for experimental data. The table below is not exhaustive for all potentially occurring fragments, as experimental data would not be expected to be complete either. As modifications to test some cases of loss of water (H20) and loss of ammonia (NH3) were prepared.

Please remember, that makeFragments() predicts neutral fragments. Thus, in case of terminal fragments you need to adjust masses accordingly. However, if internal fragments were generated via the same processus (eg assuming only b- and y- ions were generated) on both N- and C- terminus, you don’t need to add corrective factors besides adding an H+ (if you were working in positive mode) to the predicted mass.

# The toy peptide/protein sequnce
protP <- c(pepK="KPEPTIDRPEP", pepC="CEPEPTRT")

obsMass1 <- cbind(a=c(424.2554,525.3031,638.3872,753.4141,909.5152,1006.5680,1135.6106),
  b=c(452.2504,553.2980,666.3821,781.4090,937.5102,1034.5629,1163.6055),
  x=c(524.2463,639.2733,752.3573,853.4050,950.4578,1079.5004,1176.5531),
  y=c(498.2671,613.2940,726.3781,827.4258,924.4785,1053.5211,1150.5739),
  bdH=c(434.2398,535.2875,648.3715,763.3985,919.4996,1016.5524,1145.5949),
  ydH=c(480.2565,1132.5633,595.2835,708.3675,809.4152,906.4680,1035.5106),
  bi=c(498.2307,583.3198,583.3198,611.3148,680.3726,712.3624,712.3624),
  bidH=c(662.3620,694.3519,694.3519,791.4046,791.4046,791.4046,888.4574),
  bidN=c(663.3461,695.3359,695.3359,792.3886,792.3886,792.3886,889.4414),
  ai=c(652.3777,684.3675,684.3675,781.4203,781.4203,781.4203,878.4730) )
rownames(obsMass1) <- c("P","T","I","D","R","P","E")      # only for N-term (a & b)

## This example contains several iso-mass cases
length(obsMass1)
#> [1] 70
length(unique(as.numeric(obsMass1)))
#> [1] 59

Iso-peptides will show up with the same mass :

## have same mass   
##       480.2201   DRPE-H2O & y4-H2O
##       583.3198   PTIDR & TIDRP ;         565.3093    PTIDR-H2O & TIDRP-H2O 
##       809.4152   y7-H2O  & EPTIDRP  
##       684.3675   TIDRPE-CO & EPTIDR-CO ; 694.3519    EPTIDR-H2O & TIDRPE-H2O
##       712.3624   TIDRPE & EPTIDR

# Now we'll add some random noise (to mimick experimental data)
set.seed(2020)
obsMass2 <- as.numeric(obsMass1)
obsMass2 <- obsMass2 + runif(length(obsMass2), min=-2e-4, max=2e-4)

Now let’s use these numbers as experimental values and compare them to all theoretical values : For example, the peptide-sequences ‘PTIDR’ and ‘TIDRP’ may be derived from our first toy-sequence and contain exactely the same atoms and thus will have iso-masses. Without any further information it is impossible to know which one of them is/was truly present in a given sample. Thus, the corresponding mass will be called an ambiguous identification.

identP1 <- identifFixedModif(prot=protP[1], expMass=obsMass2, minFragSize=3, maxFragSize=7, modTy=list(basMod=c("b","y")))     # should find 10term +10inter
#> identifFixedModif : makeFragments : fragmentSeq : 1 out of 35 fragments not unique
#> identifFixedModif : Removing 11 out of 35 (initial) peptides not containing any charge-catching residues
#> identifFixedModif : 48 out of 70 experim masses in range of 24 predicted (max 828)
#> identifFixedModif : findCloseMatch : Note that some names do overlap (beware not to confuse...) !
#> identifFixedModif :  1st pass: compare 24 predicted (incl 7 ambiguous : isoMass
#>      with  48 input (measured) masses, found  18 groups of matches to experimental masses (total=22))
 
## This function returns a list   
str(identP1)  
#> List of 6
#>  $ massMatch  :List of 18
#>   ..$ 2 : Named num 0.00668
#>   .. ..- attr(*, "names")= chr "8"
#>   ..$ 3 : Named num 0.43
#>   .. ..- attr(*, "names")= chr "9"
#>   ..$ 4 : Named num -0.0681
#>   .. ..- attr(*, "names")= chr "10"
#>   ..$ 5 : Named num -0.0768
#>   .. ..- attr(*, "names")= chr "11"
#>   ..$ 9 : Named num 0.285
#>   .. ..- attr(*, "names")= chr "22"
#>   ..$ 19: Named num -0.317
#>   .. ..- attr(*, "names")= chr "43"
#>   ..$ 27: Named num [1:2] -0.2007 -0.0225
#>   .. ..- attr(*, "names")= chr [1:2] "44" "45"
#>   ..$ 30: Named num [1:2] -0.2007 -0.0225
#>   .. ..- attr(*, "names")= chr [1:2] "44" "45"
#>   ..$ 8 : Named num -0.182
#>   .. ..- attr(*, "names")= chr "23"
#>   ..$ 18: Named num -0.126
#>   .. ..- attr(*, "names")= chr "46"
#>   ..$ 26: Named num 0.0181
#>   .. ..- attr(*, "names")= chr "47"
#>   ..$ 7 : Named num -0.271
#>   .. ..- attr(*, "names")= chr "24"
#>   ..$ 17: Named num [1:2] -0.0707 -0.1253
#>   .. ..- attr(*, "names")= chr [1:2] "48" "49"
#>   ..$ 25: Named num [1:2] -0.0707 -0.1253
#>   .. ..- attr(*, "names")= chr [1:2] "48" "49"
#>   ..$ 6 : Named num -0.236
#>   .. ..- attr(*, "names")= chr "25"
#>   ..$ 15: Named num 0.13
#>   .. ..- attr(*, "names")= chr "40"
#>   ..$ 16: Named num 0.13
#>   .. ..- attr(*, "names")= chr "40"
#>   ..$ 21: Named num 0.13
#>   .. ..- attr(*, "names")= chr "40"
#>  $ preMa      : chr [1:24, 1:14] "1" "2" "3" "4" ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:24] "KPEPTIDRPEP.1-3.b" "KPEPTIDRPEP.1-4.b" "KPEPTIDRPEP.1-5.b" "KPEPTIDRPEP.1-6.b" ...
#>   .. ..$ : chr [1:14] "no" "seq" "orig" "origNa" ...
#>  $ pepTab     : chr [1:24, 1:13] "1" "2" "3" "4" ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:13] "no" "seq" "orig" "origNa" ...
#>  $ recalibFact: num 0
#>  $ recalibData: NULL
#>  $ docTi      : Named num [1:7] 1.75e+09 1.75e+09 1.75e+09 1.75e+09 1.75e+09 ...
#>   ..- attr(*, "names")= chr [1:7] "ini_identifFixedModif" "makeFragments" "countPotModifAAs" "addMassModif" ...
## $masMatch identifies each

The matches/identifications in more readable format :

identP1$preMa[ which(identP1$preMa[,"no"] %in% (names(identP1$massMatch))),c("seq","orig","ty","beg","end","precAA","finMass","mod")] 
#>                     seq       orig   ty      beg end  precAA finMass           
#> KPEPTIDRPEP.1-4.b   "KPEP"    "pepK" "Nter"  "1" "4"  NA     "452.250360270091"
#> KPEPTIDRPEP.1-5.b   "KPEPT"   "pepK" "Nter"  "1" "5"  NA     "553.298038744191"
#> KPEPTIDRPEP.1-6.b   "KPEPTI"  "pepK" "Nter"  "1" "6"  NA     "666.382102724591"
#> KPEPTIDRPEP.1-7.b   "KPEPTID" "pepK" "Nter"  "1" "7"  NA     "781.409045756591"
#> KPEPTIDRPEP.8-11.y  "RPEP"    "pepK" "Cter"  "8" "11" ""     "498.267072966791"
#> KPEPTIDRPEP.7-10.by "DRPE"    "pepK" "inter" "7" "10" ""     "498.230687460491"
#> KPEPTIDRPEP.5-9.by  "TIDRP"   "pepK" "inter" "5" "9"  "K"    "583.319836818791"
#> KPEPTIDRPEP.4-8.by  "PTIDR"   "pepK" "inter" "4" "8"  "p"    "583.319836818791"
#> KPEPTIDRPEP.7-11.y  "DRPEP"   "pepK" "Cter"  "7" "11" ""     "613.294015998791"
#> KPEPTIDRPEP.6-10.by "IDRPE"   "pepK" "inter" "6" "10" ""     "611.314751440891"
#> KPEPTIDRPEP.4-9.by  "PTIDRP"  "pepK" "inter" "4" "9"  "p"    "680.372600670791"
#> KPEPTIDRPEP.6-11.y  "IDRPEP"  "pepK" "Cter"  "6" "11" ""     "726.378079979191"
#> KPEPTIDRPEP.5-10.by "TIDRPE"  "pepK" "inter" "5" "10" "K"    "712.362429914991"
#> KPEPTIDRPEP.3-8.by  "EPTIDR"  "pepK" "inter" "3" "8"  "e"    "712.362429914991"
#> KPEPTIDRPEP.5-11.y  "TIDRPEP" "pepK" "Cter"  "5" "11" "K"    "827.425758453291"
#> KPEPTIDRPEP.2-8.by  "PEPTIDR" "pepK" "inter" "2" "8"  "p"    "809.415193766991"
#> KPEPTIDRPEP.4-10.by "PTIDRPE" "pepK" "inter" "4" "10" "p"    "809.415193766991"
#> KPEPTIDRPEP.3-9.by  "EPTIDRP" "pepK" "inter" "3" "9"  "e"    "809.415193766991"
#>                     mod 
#> KPEPTIDRPEP.1-4.b   "b" 
#> KPEPTIDRPEP.1-5.b   "b" 
#> KPEPTIDRPEP.1-6.b   "b" 
#> KPEPTIDRPEP.1-7.b   "b" 
#> KPEPTIDRPEP.8-11.y  "y" 
#> KPEPTIDRPEP.7-10.by "by"
#> KPEPTIDRPEP.5-9.by  "by"
#> KPEPTIDRPEP.4-8.by  "by"
#> KPEPTIDRPEP.7-11.y  "y" 
#> KPEPTIDRPEP.6-10.by "by"
#> KPEPTIDRPEP.4-9.by  "by"
#> KPEPTIDRPEP.6-11.y  "y" 
#> KPEPTIDRPEP.5-10.by "by"
#> KPEPTIDRPEP.3-8.by  "by"
#> KPEPTIDRPEP.5-11.y  "y" 
#> KPEPTIDRPEP.2-8.by  "by"
#> KPEPTIDRPEP.4-10.by "by"
#> KPEPTIDRPEP.3-9.by  "by"

Final Thoughts For Calculating Fragments

As mentioned, several functions (like makeFragments() ) calculate mass for uncharged fragments. The type (and resulting mass) of final ions depends on fragmentation conditions (and energy). In case of terminal ions, you should pay attention to adjusting to whatever type of ion you’re expecting and adjust masses accordingly.

When it comes to internal ions, you may not have to adjust masses besides adding an H+ (when detecting in positive mode and having deconvoluted to MH+), if the respective internal ions were generated via the same fragmention process (eg only b- & y- breaks).

If you really want to consider both b- & y- as well as a- & x- reactions the same time, you’ll have to predict all combinations thereof (since you might have different reaction-families on each end of the peptides) and this will increase the number of fragments to be considered considerably !

Variable modifications can be predicted, but they increase the complexity of final comparisons, thus it is suggested only to consider them if you are realy expecting such modifications. If you are convinced that in your reactions have frequent additional (variable) loss of NH3, you should expand the outcome of makeFragments() accordingly (however quickly increasing the complexity…).

However, due to real-world imprecision during the process of measuring m/z, here we used a 5ppm default tolerance.

This brings us to one of the reasons why fragmentation is so important in proteomics : Without fragmentation mass spectrometry of entire proteins may be easily in trouble to resolve most of naturally occuring iso-variants or even related proteins.

Due to fragmentation numerous overlapping fragments will/may occur and will finally allow reconstructing the most parts of original protein.

Thank you for you interest in this package. This package is still under development, new functions may be added to the next version.