Welcome to the world of glycan analysis! If you’ve ever tried to work with glycans computationally, you know the struggle: these tree-like molecules are notoriously difficult to represent and analyze compared to their linear cousins like proteins or DNA. That’s where glyrepr comes to the rescue.
Think of glyrepr as your glycan translator — it teaches your computer how to “speak glycan” fluently. Whether you’re dealing with compositions (what’s in the glycan) or structures (how it’s connected), this package has got you covered.
library(glyrepr)
Before we dive in, let’s establish our vocabulary. Don’t worry: it’s simpler than it sounds!
| Term | What It Means | Example |
|---|---|---|
| Composition | The “ingredients list” — how many of each sugar | Hex(5)HexNAc(2) |
| Structure | The “blueprint” — how sugars are connected | Man(a1-3)Man(b1-4)GlcNAc |
| Monosaccharide | A single sugar unit (the building blocks) | Gal, Man, Hex |
| Linkage | The “glue” between sugars | a1-3, b1-4 |
| Substitution | Chemical decorations on sugars | 6Ac, 3Me |
🔍 Pro tip: We distinguish between generic sugars (like mystery boxes labeled “Hex”) and concrete sugars (like specific boxes labeled “Galactose”).
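To make this vocabulary tangible, here is a toy base-R sketch that pulls the monosaccharides and linkages out of a structure string with regular expressions. It only illustrates the terms in the table: the regular expressions and variable names are assumptions made for this example, and nothing here reflects how glyrepr itself parses glycans.

```r
# Toy illustration of "monosaccharide" and "linkage" (not glyrepr's parser).
# The string is the O-glycan core 2 in IUPAC-condensed notation.
iupac <- "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-"

# Monosaccharides: runs of letters/digits immediately followed by "("
residues <- regmatches(
  iupac,
  gregexpr("[A-Za-z0-9]+(?=\\()", iupac, perl = TRUE)
)[[1]]

# Linkages: anomer (a, b or ?), a position, "-", and an optional second
# position, found right after an opening parenthesis
linkages <- regmatches(
  iupac,
  gregexpr("(?<=\\()[ab?][0-9?]-[0-9?]*", iupac, perl = TRUE)
)[[1]]

residues  # "Gal" "GlcNAc" "GalNAc"
linkages  # "b1-3" "b1-6" "a1-"
```

The trailing `a1-` is the linkage of the reducing-end sugar, whose partner is not part of the string.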
Let’s start with something straightforward: glycan compositions. Think of these as ingredient lists for your favorite recipes.
There are three ways to create compositions, each with its own superpower:
Method 1: The Direct Approach
# Just tell R what you have
glycan_composition(c(Hex = 5, HexNAc = 2), c(Gal = 1, GalNAc = 1))
#> <glycan_composition[2]>
#> [1] Hex(5)HexNAc(2)
#> [2] Gal(1)GalNAc(1)
Method 2: The Programmatic Way
# Perfect when you're processing data from files or databases
comp_list <- list(c(Hex = 5, HexNAc = 2), c(Gal = 1, GalNAc = 1))
as_glycan_composition(comp_list)
#> <glycan_composition[2]>
#> [1] Hex(5)HexNAc(2)
#> [2] Gal(1)GalNAc(1)
Method 3: The Parser
# Copy-paste from your mass spec software? No problem!
as_glycan_composition(c("Hex(5)HexNAc(2)", "Gal(1)GalNAc(1)"))
#> <glycan_composition[2]>
#> [1] Hex(5)HexNAc(2)
#> [2] Gal(1)GalNAc(1)
Here’s something cool: when you run these examples in your R console, you’ll see the concrete monosaccharides (like Gal and GalNAc) displayed in beautiful colors! These follow the SNFG standard, the universal “color code” for glycans. Think of it as the glycan rainbow 🌈.
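Since all three methods print the same result, a quick sanity check is to compare their string forms. The sketch below uses only functions shown in this article; the variable names are made up for the example, and comparing via as.character() is a choice made here, not the only way to test equality.

```r
library(glyrepr)

# Method 1: named vectors passed directly
direct <- glycan_composition(c(Hex = 5, HexNAc = 2), c(Gal = 1, GalNAc = 1))

# Method 2: a list of named vectors
from_list <- as_glycan_composition(
  list(c(Hex = 5, HexNAc = 2), c(Gal = 1, GalNAc = 1))
)

# Method 3: parsing composition strings
parsed <- as_glycan_composition(c("Hex(5)HexNAc(2)", "Gal(1)GalNAc(1)"))

# Compare the plain-text representations of the three results
identical(as.character(direct), as.character(from_list))
identical(as.character(direct), as.character(parsed))
```

If the three routes are equivalent, as the printed output suggests, both comparisons return TRUE.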
Now here’s where glyrepr shows its intelligence, with count_mono():
comp <- glycan_composition(
c(Hex = 5, HexNAc = 2), # generic sugars
c(Gal = 1, Man = 1, GalNAc = 1) # concrete sugars
)
# How many galactose residues? (NA for the generic composition: a Hex may or may not be Gal)
count_mono(comp, "Gal")
#> [1] NA 1
# How many hexose residues? (This includes Gal and Man!)
count_mono(comp, "Hex")
#> [1] 5 2
Notice how count_mono() is smart enough to know that galactose and mannose are both hexoses? That’s the power of understanding glycan hierarchies!
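The idea behind hierarchy-aware counting can be sketched in a few lines of base R. This is a conceptual illustration only: the lookup table `generic_of` and the helper `count_with_hierarchy()` are hypothetical names invented for this example, the table covers just a handful of sugars, and glyrepr’s own implementation may work quite differently.

```r
# Hypothetical lookup table: concrete monosaccharide -> generic class
generic_of <- c(
  Gal = "Hex", Man = "Hex", Glc = "Hex",
  GalNAc = "HexNAc", GlcNAc = "HexNAc"
)

# Count residues that either are `mono` itself or belong to the
# generic class `mono` according to the lookup table
count_with_hierarchy <- function(comp, mono) {
  classes <- unname(generic_of[names(comp)])
  hits <- names(comp) == mono | (!is.na(classes) & classes == mono)
  sum(comp[hits])
}

concrete <- c(Gal = 1, Man = 1, GalNAc = 1)
count_with_hierarchy(concrete, "Hex")     # 2 (Gal + Man)
count_with_hierarchy(concrete, "HexNAc")  # 1 (GalNAc)
count_with_hierarchy(concrete, "Gal")     # 1
```

The result for "Hex" matches the value count_mono() reported for the concrete composition above.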
Compositions are nice, but structures are where glyrepr truly shines. This is like going from knowing the ingredients to understanding the actual recipe and cooking method.
Let’s work with some real glycan structures. The strings below use the “IUPAC-condensed” text representation of glycans. They might look cryptic, but they’re actually quite readable once you get the hang of it. To learn about them, check out this article.
iupacs <- c(
"Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-", # The famous N-glycan core
"Gal(b1-3)GalNAc(a1-", # O-glycan core 1
"Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-", # O-glycan core 2
"Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-", # A branched mannose tree
"GlcNAc6Ac(b1-4)Glc3Me(a1-" # With some decorations
)
struc <- as_glycan_structure(iupacs)
struc
#> <glycan_structure[5]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [2] Gal(b1-3)GalNAc(a1-
#> [3] Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-
#> [4] Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-
#> [5] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> # Unique structures: 5
Here’s where glyrepr gets really clever. Notice that “# Unique structures: 5” message? This isn’t just informational: it’s the key to lightning-fast performance.
Let’s see this optimization in action:
# Create a big dataset with lots of repetition
large_struc <- rep(struc, 1000) # 5,000 structures total
large_struc
#> <glycan_structure[5000]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [2] Gal(b1-3)GalNAc(a1-
#> [3] Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-
#> [4] Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-
#> [5] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> [6] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [7] Gal(b1-3)GalNAc(a1-
#> [8] Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-
#> [9] Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-
#> [10] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> ... (4990 more not shown)
#> # Unique structures: 5
Still showing “# Unique structures: 5”! This means glyrepr is storing only 5 unique graphs internally, not 5,000. This is like having a smart library system that stores only one copy of each book, no matter how many people want to read it.
Let’s put this to the test:
library(tictoc)
tic("Converting 5 structures")
result_small <- convert_to_generic(struc)
toc()
#> Converting 5 structures: 0.012 sec elapsed
tic("Converting 5,000 structures")
result_large <- convert_to_generic(large_struc)
toc()
#> Converting 5,000 structures: 0.014 sec elapsed
Mind = blown! 🤯 The performance is nearly identical because glyrepr only processes each unique structure once, then cleverly expands the results.
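The “process each unique value once, then expand” pattern is easy to sketch in base R. The code below is a conceptual illustration on plain character vectors, not glyrepr’s internal code: `map_unique()`, `shout()` and `n_calls` are hypothetical names made up for this example.

```r
# Apply `f` once per unique value, then expand back to the full length
map_unique <- function(x, f) {
  uniq <- unique(x)
  idx <- match(x, uniq)  # position of each element among the unique values
  results <- vapply(uniq, f, character(1), USE.NAMES = FALSE)
  results[idx]
}

# An instrumented function so we can count how often it actually runs
n_calls <- 0
shout <- function(s) {
  n_calls <<- n_calls + 1
  toupper(s)
}

# 2,000 strings, but only 2 distinct values
x <- rep(c("Gal(b1-3)GalNAc(a1-", "Man(a1-"), times = 1000)
out <- map_unique(x, shout)

length(out)  # 2000
n_calls      # 2
```

The cost of the expensive step scales with the number of unique values; only the cheap indexing step scales with the full length of the vector.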
glyrepr comes with several handy tools for structure manipulation:
Strip away the connections:
remove_linkages(struc)
#> <glycan_structure[5]>
#> [1] Man(??-?)[Man(??-?)]Man(??-?)GlcNAc(??-?)GlcNAc(??-
#> [2] Gal(??-?)GalNAc(??-
#> [3] Gal(??-?)[GlcNAc(??-?)]GalNAc(??-
#> [4] Man(??-?)[Man(??-?)]Man(??-?)[Man(??-?)]Man(??-
#> [5] GlcNAc6Ac(??-?)Glc3Me(??-
#> # Unique structures: 5
Remove the decorations:
# Let's look at our decorated structure first
struc[5]
#> <glycan_structure[1]>
#> [1] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> # Unique structures: 1
# Now remove the decorations (6Ac and 3Me)
remove_substituents(struc[5])
#> <glycan_structure[1]>
#> [1] GlcNAc(b1-4)Glc(a1-
#> # Unique structures: 1
Ever wondered what’s actually in those complex structures? Easy:
comp <- as_glycan_composition(struc)
comp
#> <glycan_composition[5]>
#> [1] Man(3)GlcNAc(2)
#> [2] Gal(1)GalNAc(1)
#> [3] Gal(1)GlcNAc(1)GalNAc(1)
#> [4] Man(5)
#> [5] Glc(1)GlcNAc(1)Me(1)Ac(1)
Need to export your data or use it elsewhere?
# Get the original string representations
as.character(struc)
#> [1] "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-"
#> [2] "Gal(b1-3)GalNAc(a1-"
#> [3] "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-"
#> [4] "Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-"
#> [5] "GlcNAc6Ac(b1-4)Glc3Me(a1-"
as.character(comp)
#> [1] "Man(3)GlcNAc(2)" "Gal(1)GalNAc(1)"
#> [3] "Gal(1)GlcNAc(1)GalNAc(1)" "Man(5)"
#> [5] "Glc(1)GlcNAc(1)Me(1)Ac(1)"
glyrepr objects are first-class citizens in the tidyverse:
suppressPackageStartupMessages(library(tibble))
suppressPackageStartupMessages(library(dplyr))
df <- tibble(
id = seq_along(struc),
structures = struc,
names = c("N-glycan core", "Core 1", "Core 2", "Branched Man", "Decorated")
)
df %>%
mutate(n_man = count_mono(structures, "Man")) %>%
filter(n_man > 1)
#> # A tibble: 2 × 4
#> id structures names n_man
#> <int> <struct> <chr> <int>
#> 1 1 Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1- N-glycan core 3
#> 2 4 Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1- Branched Man 5
Congratulations! You’ve just learned the fundamentals of glycan representation in R. Here’s what you can explore next:
glymotif package for finding patterns in glycan structures
The glycoverse is your oyster! 🦪
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Asia/Shanghai
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.4 tibble_3.3.0 tictoc_1.2.1 purrr_1.1.0 glyrepr_0.7.4
#>
#> loaded via a namespace (and not attached):
#> [1] vctrs_0.6.5 cli_3.6.5 knitr_1.48 rlang_1.1.6
#> [5] xfun_0.46 highr_0.11 stringi_1.8.7 generics_0.1.4
#> [9] jsonlite_1.8.8 glue_1.8.0 backports_1.5.0 htmltools_0.5.8.1
#> [13] rstackdeque_1.1.1 sass_0.4.9 rmarkdown_2.27 evaluate_1.0.3
#> [17] jquerylib_0.1.4 fastmap_1.2.0 yaml_2.3.10 lifecycle_1.0.4
#> [21] stringr_1.5.2 compiler_4.4.1 igraph_2.1.4 pkgconfig_2.0.3
#> [25] digest_0.6.37 R6_2.6.1 utf8_1.2.6 tidyselect_1.2.1
#> [29] pillar_1.11.0 magrittr_2.0.4 bslib_0.8.0 checkmate_2.3.3
#> [33] tools_4.4.1 cachem_1.1.0