DeSciDe Vignette

DeSciDe (Deciphering Scientific Discoveries)

DeSciDe is a package designed to streamline omics data analysis. Many methods exist for generating and characterizing gene lists; however, the selection of genes for further investigation is still heavily influenced by prior knowledge. Practitioners often study well-characterized genes, which reinforces bias in the literature. This package aims to aid in identifying both well-studied, high-confidence hits and novel hits that may be overlooked for lack of literature precedence.

This package takes a curated list of genes from a user’s omics dataset and a list of cellular stimuli or cellular contexts pertaining to the experiment at hand. The gene list is searched in the STRING database, and informative metrics are calculated and used to rank the genes by network connectivity. PubMed is then searched for the co-occurrence of each gene and search term combination to establish the literature precedence of each gene in the context of the search terms provided. The PubMed results are used to rank the gene list by the number of publications associated with the search terms.
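
To make the "each gene and search term combination" step concrete, here is a minimal sketch in base R that enumerates the pairs and builds one query string per pair. The `GENE AND "TERM"` query format is an assumption chosen for illustration; the exact query DeSciDe sends to PubMed is not shown in this vignette.

```r
# Illustrative subset of the inputs used in this vignette.
genes <- c("BRD4", "CHD4")
terms <- c("Acidic Patch", "Chromatin", "Nucleosome")

# One row per gene/term pair (2 genes x 3 terms = 6 rows).
combos <- expand.grid(gene = genes, term = terms, stringsAsFactors = FALSE)

# Hypothetical query format: gene symbol AND quoted search term.
combos$query <- sprintf('%s AND "%s"', combos$gene, combos$term)

combos
```

Each row corresponds to one publication count, which is why the results table has one column per search term.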

The two ranks for each gene are then plotted on a scatter plot to visualize the relationship between each gene’s literature precedence and network connectivity. This visual aid can be used to identify two groups: highly connected, well-studied genes, which serve as high-confidence hits and cluster around the origin, and highly connected, low-precedence genes, which serve as novel hits and cluster in the top left of the graph. The highly connected, low-precedence genes are known to interact with the other genes in the list but have not been studied in the same experimental context, making them novel targets for follow-up studies.
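
The idea of combining two ranks into categories can be sketched in base R as follows. This is an illustration only: the gene names and counts are invented placeholders, ranking uses raw totals, and the cutoff rule (top or bottom 20% of each ranking) is a hypothetical stand-in for whatever rule the package applies internally. The package's default `rank_method = "weighted"` may order genes differently.

```r
# Toy input: invented counts for placeholder genes (not real results).
toy <- data.frame(
  gene = paste0("GENE_", LETTERS[1:10]),
  pubmed_total = c(500, 320, 210, 150, 90, 40, 12, 5, 1, 0),
  degree = c(12, 3, 9, 1, 7, 2, 5, 0, 11, 10),
  stringsAsFactors = FALSE
)

# Rank 1 = most publications / most connections.
toy$pubmed_rank <- rank(-toy$pubmed_total, ties.method = "min")
toy$connectivity_rank <- rank(-toy$degree, ties.method = "min")

# Hypothetical rule: the top 20% of a ranking is "high", the bottom 20% is "low".
threshold_percentage <- 20
n <- nrow(toy)
cutoff <- ceiling(n * threshold_percentage / 100)

high_conn <- toy$connectivity_rank <= cutoff
high_prec <- toy$pubmed_rank <= cutoff
low_prec <- toy$pubmed_rank > n - cutoff

toy$category <- "Other"
toy$category[high_conn & high_prec] <- "High connectivity, high precedence"
toy$category[high_conn & low_prec] <- "High connectivity, low precedence"

toy[order(toy$connectivity_rank),
    c("gene", "pubmed_rank", "connectivity_rank", "category")]
```

In this toy set, `GENE_A` ranks near the top on both measures, while `GENE_I` is highly connected but has almost no publications, which is the pattern described above for novel hits.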

Additional graphical outputs are generated to visualize the STRING network and PubMed results. The package also lets the user adjust various steps of the analysis to fit their needs, such as searching STRING for all connections or only physical ones, adjusting the threshold used to classify genes, and exporting figures and data tables for use in publications. Below, we show how to use DeSciDe to study an example data set.

Example Usage of DeSciDe

For the following examples we will use a list of 40 genes and 3 search terms. We will call these lists “genes” and “terms”. Here we import the lists from CSV files; however, the user can create these lists manually or import them in whatever way suits them best.

# Import genes list and terms list from CSV
genes <- read.csv("genes.csv", header = FALSE)[[1]]
terms <- read.csv("terms.csv", header = FALSE)[[1]]
#> Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
#> incomplete final line found by readTableHeader on 'terms.csv'
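
As an alternative to reading CSV files, the two inputs can be created directly as character vectors. The sketch below uses only the first few genes from the example list to keep it short.

```r
# Create the inputs by hand instead of importing them.
genes <- c("BRD4", "CHD4", "SUZ12", "CDC6")
terms <- c("Acidic Patch", "Chromatin", "Nucleosome")

# Both inputs should be plain character vectors.
is.character(genes)
is.character(terms)
```

Building the vectors directly also avoids the `incomplete final line` warning shown above, which R emits when a file does not end with a newline. The warning is harmless here.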

We now have the gene and term lists to pass to DeSciDe:

genes
#>  [1] "BRD4"      "CHD4"      "SUZ12"     "CDC6"      "CHD3"      "ORC1"     
#>  [7] "FUS"       "MCM4"      "AURKB"     "MRPS14"    "RPL26L1"   "RPS13"    
#> [13] "RPS23"     "BRD2"      "CHD5"      "KDM2A"     "SUV420H1"  "SUV420H2" 
#> [19] "KAT7"      "SETD8"     "MSH3"      "ERCC3"     "MARCKS"    "CDCA2"    
#> [25] "CDCA5"     "CDCA7L"    "ATAD2B"    "NUDT21"    "KIF22"     "RPS25"    
#> [31] "VDAC3"     "CPSF7"     "DIEXF"     "POLR1E"    "XIRP2"     "ZFP91"    
#> [37] "ARHGAP11A" "CCDC137"   "KLHL15"    "OR10G3"
terms
#> [1] "Acidic Patch" "Chromatin"    "Nucleosome"

We can now run DeSciDe on these lists in its simplest form. This will produce our figures and our data table of results.

results <- descide(genes_list = genes, terms_list = terms)
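
The same call can be customized with the arguments listed in the function reference later in this vignette. The sketch below is hedged on a few points: the package name `DeSciDe` in `requireNamespace()` and the value `"physical"` for `network_type` are assumptions based on the description above, and the output folder is a hypothetical location. The call is guarded because it needs the package installed and network access.

```r
# Small illustrative inputs; substitute your own vectors.
genes <- c("BRD4", "CHD4", "SUZ12")
terms <- c("Acidic Patch", "Chromatin", "Nucleosome")

# Hypothetical output folder, created up front in case the package
# does not create it.
out_dir <- file.path(tempdir(), "descide_output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

descide_args <- list(
  genes_list = genes,
  terms_list = terms,
  network_type = "physical",   # assumed value for physical interactions only
  threshold_percentage = 10,   # stricter than the default of 20
  export = TRUE,
  file_directory = out_dir,
  export_format = "csv"
)

# Run only in an interactive session with the package available.
if (interactive() && requireNamespace("DeSciDe", quietly = TRUE)) {
  results <- do.call(DeSciDe::descide, descide_args)
}
```

Collecting the arguments in a list makes it easy to record the exact settings used alongside the exported files.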

Results Expected from DeSciDe

Table
#> # A tibble: 6 × 14
#>   Gene     `Acidic Patch` Chromatin Nucleosome Total PubMed_Rank Degree
#>   <chr>             <int>     <int>      <int> <dbl>       <int>  <dbl>
#> 1 BRD4                  2       718         29   749           1     14
#> 2 SUV420H1              2        39          8    49           2      4
#> 3 CHD4                  1       202        131   334           3      8
#> 4 KDM2A                 1        49          5    55           4      6
#> 5 SUZ12                 0       237         13   250           5      5
#> 6 CDC6                  0       191          6   197           6     10
#> # ℹ 7 more variables: Clustering_Coefficient_Percent <dbl>,
#> #   Clustering_Coefficient_Fraction <chr>, Connected_Component_id <dbl>,
#> #   Nodes_in_Connected_Component <dbl>,
#> #   total_number_of_connected_components <dbl>, Connectivity_Rank <int>,
#> #   Category <chr>

Heatmap

STRING Network

Clustering

Connectivity vs. Precedence

All Functions Available for DeSciDe

Each step of DeSciDe can be run individually if you wish. Below we briefly list each function and all of its arguments. For more information, use the help function in RStudio to open the R documentation for each function (e.g., ?descide or ?plot_connectivity_precedence).

Function to run entire pipeline

descide(
  genes_list,
  terms_list,
  rank_method = "weighted",
  species = 9606,
  network_type = "full",
  score_threshold = 400,
  threshold_percentage = 20,
  export = FALSE,
  file_directory = NULL,
  export_format = "csv"
)

Function to plot heatmap

plot_heatmap(pubmed_search_results, file_directory = NULL, export = FALSE)

Function to search STRING

search_string_db(
  genes_list,
  species = 9606,
  network_type = "full",
  score_threshold = 400
)
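
As a hedged example of adjusting these arguments: `species = 9606` is the NCBI taxonomy identifier for human, and STRING combined scores range from 0 to 1000, with 400 conventionally treated as medium confidence and 700 as high confidence. The sketch below assumes `score_threshold` is applied as that combined-score cutoff and that the package is installed under the name `DeSciDe`. Because the structure of the returned object is not described here, the sketch inspects it with `str()` rather than assuming element names.

```r
genes <- c("BRD4", "CHD4", "SUZ12", "CDC6")

string_args <- list(
  genes_list = genes,
  species = 9606,          # human
  network_type = "full",
  score_threshold = 700    # assumed to be the STRING combined-score cutoff
)

# Needs the package and network access, so run only interactively.
if (interactive() && requireNamespace("DeSciDe", quietly = TRUE)) {
  string_out <- do.call(DeSciDe::search_string_db, string_args)
  str(string_out, max.level = 1)  # inspect what the function returns
}
```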

Function to plot STRING network

plot_string_network(
  string_db,
  string_ids,
  file_directory = NULL,
  export = FALSE
)

Function to plot STRING clustering metrics

plot_clustering(string_results, file_directory = NULL, export = FALSE)

Function to create summary file and classify genes based on ranks

combine_summary(
  pubmed_search_results,
  string_results,
  file_directory = NULL,
  export_format = "csv",
  export = FALSE,
  threshold_percentage = 20
)

Function to plot connectivity vs. precedence

plot_connectivity_precedence(
  combined_summary,
  file_directory = NULL,
  export = FALSE
)