scicomptools Vignette

Overview

The scicomptools package houses self-contained tools created by the NCEAS Scientific Computing Team. These functions are written based on the needs of various synthesis working groups, so each function operates more or less independently of the others. We hope these tools will be valuable to users both within and beyond the context for which they were designed!

This vignette describes some of the main functions of scicomptools using examples.

library(scicomptools)

Extract Summary Statistics Tables

Our most requested task, by a wide margin, is to help streamline analytical workflows by stripping the summary statistics (e.g., test statistic, P value, degrees of freedom, etc.) from whichever statistical test a working group has decided to use. As we’ve developed extraction workflows for various tests, we have centralized them in the stat_extract function.

Currently, stat_extract accepts model fit objects from lmerTest::lmer, stats::lm, stats::nls, stats::t.test, nlme::lme, ecodist::MRM, and RRPP::trajectory.analysis. All model fits should be passed to the mod_fit argument. If the model is a trajectory analysis, the traj_angle argument must be supplied as well (either “deg” or “rad”, depending on whether the multivariate angle should be expressed in degrees or radians). Once the models are fit, they can be passed to stat_extract to extract the typically-desired summary values.

Note: stat_extract determines the model type automatically so this need not be specified.

# Fit models of several accepted types
## Student's t-Test
mod_t <- t.test(x = 1:10, y = 7:20)

## Nonlinear Least Squares
mod_nls <- nls(density ~ SSlogis(log(conc), Asym, xmid, scal), data = DNase)

# Extract the relevant information
## t-Test
scicomptools::stat_extract(mod_fit = mod_t)
#>   Estimate       DF  T_Value      P_Value
#> 1      5.5 21.98221 -5.43493 1.855282e-05
#> 2     13.5 21.98221 -5.43493 1.855282e-05

## NLS
scicomptools::stat_extract(mod_fit = mod_nls)
#>   Term Estimate  Std_Error  T_Value      P_Value
#> 1 Asym 2.485319 0.06287043 39.53081 1.519040e-88
#> 2 xmid 1.518117 0.06396583 23.73325 2.704648e-56
#> 3 scal 1.098307 0.02442013 44.97549 2.203937e-97
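
The same workflow applies to the other supported model types. As a minimal sketch (the stats::lm fit on the built-in cars dataset is ours for illustration; output omitted), a linear regression is extracted identically:

## Linear Regression
mod_lm <- lm(dist ~ speed, data = cars)

## Extract with the same argument; the model type is again detected automatically
scicomptools::stat_extract(mod_fit = mod_lm)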

As more tests are used by working groups, we plan to add them to the set of model objects that stat_extract supports. Feel free to post a GitHub issue if you have a model type in mind that you’d like stat_extract to support.

Create Google Drive Table of Contents

Google Drive is by far the most common platform that the working groups we support leverage to store data. The googledrive package makes integrating script inputs and outputs with specified Drive folders easy but the internal folder structure of these Google Drives often becomes convoluted as projects evolve and priorities shift.

We wrote the drive_toc function to identify all folders in a given Drive folder (either the top-level link or any sub-folder link will work) and create a diagram of this hierarchy to make it simpler for groups to visualize their Drive’s “table of contents”.

This function has a url argument which requires a Drive ID (i.e., a link passed through googledrive::as_id). There is also an ignore_names argument that lets users specify folder names to exclude from the table of contents; this is useful when many sub-folders contain a “deprecated” or “archive” folder, as including such repeated folders would greatly clutter the full table of contents. Finally, this process can take a few seconds for complicated Drives, so by default a message reports which folder the function is currently identifying sub-folders in. The quiet argument defaults to FALSE so that progress is reported, but it can be set to TRUE if you prefer that no messages be returned.

scicomptools::drive_toc(url = googledrive::as_id("https://drive.google.com/drive/u/0/folders/folder-specific-characters"),
                        ignore_names = c("Archive", "Deprecated", "Backups"),
                        quiet = FALSE)

Read in Multiple Excel Sheets

Working groups often store their data in Microsoft Excel files with multiple sheets (often one sheet per data type or per collection location). We wrote read_xl_sheets to import every sheet in a user-specified MS Excel file and store each sheet in a list. If every sheet has the same columns, you can then easily combine them into a single flat data frame (rather than the somewhat laborious process of reading in each sheet separately).

read_xl_sheets only has a single argument (file_name) which accepts the name of (and path to) the MS Excel file to read.

# Read in sheets
sheet_list <- scicomptools::read_xl_sheets(file_name = system.file("extdata", "faux_data.xlsx", 
                                                     package = "scicomptools"))

# Show structure
dplyr::glimpse(sheet_list)
#> List of 2
#>  $ my_data: tibble [5 × 5] (S3: tbl_df/tbl/data.frame)
#>   ..$ col_1: chr [1:5] "a" "b" "c" "d" ...
#>   ..$ col_2: chr [1:5] "A" "A" "A" "B" ...
#>   ..$ col_3: num [1:5] 2 43 2 2 2
#>   ..$ col_4: num [1:5] 234 435 32 45 4
#>   ..$ col_5: num [1:5] 236 478 34 47 6
#>  $ data_2 : tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
#>   ..$ col_2      : chr [1:2] "A" "B"
#>   ..$ temperature: num [1:2] 10 20
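
Because read_xl_sheets returns a named list, flattening is a one-liner when the sheets do share columns. A minimal sketch (the two sheets above have different columns, so this assumes sheets that match):

## Combine the list elements into a single flat data frame
## (the .id column records which sheet each row came from)
flat_df <- dplyr::bind_rows(sheet_list, .id = "sheet")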

We also wrote a companion function named read_xl_format that extracts the specific formatting of all cells in the sheets of an Excel file. This is useful if fill color or text formatting is used to denote information that is not redundant with the columns themselves (i.e., information that is lost when an Excel sheet is read into R). read_xl_format uses the same syntax as read_xl_sheets to maximize interoperability.

# Read in *format of* sheets
form_list <- scicomptools::read_xl_format(file_name = system.file("extdata", "faux_data.xlsx", 
                                                     package = "scicomptools"))

# Show structure of that
dplyr::glimpse(form_list)
#> Rows: 36
#> Columns: 13
#> $ sheet         <chr> "my_data", "my_data", "my_data", "my_data", "my_data", "…
#> $ address       <chr> "A1", "B1", "C1", "D1", "E1", "A2", "B2", "C2", "D2", "E…
#> $ row           <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4,…
#> $ col           <int> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4,…
#> $ cell_contents <chr> "col_1", "col_2", "col_3", "col_4", "col_5", "a", "A", "…
#> $ comment       <chr> NA, NA, NA, NA, "Nick Lyon:Hello world", NA, NA, NA, NA,…
#> $ formula       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "SUM(C2:D2)", NA, NA…
#> $ bold          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ italic        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ underline     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ font_size     <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, …
#> $ font_color    <chr> "FF000000", "FF000000", "FF000000", "FF000000", "FF00000…
#> $ cell_color    <chr> NA, NA, NA, NA, NA, NA, NA, "FFFFFFFF", NA, NA, NA, NA, …
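
Because read_xl_format returns an ordinary data frame, the formatting information can be filtered like any other data. For instance, a sketch of isolating only the cells with a fill color (using the cell_color column shown above):

## Keep only cells where a fill color was applied
colored_cells <- dplyr::filter(form_list, !is.na(cell_color))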

Handling Two Working Directories

Many of the working groups that we support work both on local computers and on remote servers. We leverage Git and GitHub to pass code updates back and forth between the two environments, but updating the working directory every time such a pivot is made is cumbersome.

We have thus written wd_loc for this purpose! wd_loc allows users to specify two working directory paths and toggle between them using a logical argument.

The local argument accepts a logical value. If local = TRUE, the file path specified in the local_path argument is returned as a character string; if local = FALSE, the file path specified in the remote_path argument is returned instead. local_path defaults to base::getwd() so it need not be specified if you are using RStudio Projects.

You can run wd_loc inside of base::setwd if desired, though we recommend assigning the file path to a “path” object and invoking that object whenever files must be imported or exported.

scicomptools::wd_loc(local = TRUE,
                     local_path = getwd(),
                     remote_path = file.path("path on server"))
#> [1] "/Users/.../scicomptools/vignettes"
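
Following that recommendation, a sketch of the assign-then-invoke pattern (the CSV file name here is purely hypothetical):

# Store the appropriate path once...
path <- scicomptools::wd_loc(local = TRUE,
                             local_path = getwd(),
                             remote_path = file.path("path on server"))

# ...then invoke it whenever importing or exporting (hypothetical file)
my_data <- read.csv(file = file.path(path, "my_data.csv"))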

Checking Tokens

Some operations require passing a user’s personal access token (a.k.a. “PAT”) to RStudio. These workflows can fail unexpectedly if the token is improperly set or has expired, so it is valuable to double-check whether the token is still embedded.

The token_check function checks whether a token is attached and messages whether or not one is found. Currently this function supports checks for Qualtrics and GitHub tokens but this can be expanded if need be (post a GitHub issue if you have another token in mind!). Additionally, if secret = TRUE (the default) the success message simply identifies whether a token is found. If secret = FALSE the success message prints the token string in the console.

scicomptools::token_check(api = "github", secret = TRUE)
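
The Qualtrics check should follow the same pattern (a sketch, assuming the api keyword is simply "qualtrics"):

scicomptools::token_check(api = "qualtrics", secret = TRUE)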