Introduction to SimEngine

Overview

SimEngine is an open-source R package for structuring, maintaining, running, and debugging statistical simulations on both local and cluster-based computing environments.

Getting started

The goal of many statistical simulations is to compare the behavior of two or more statistical methods; we use this framework to demonstrate the SimEngine workflow. Most statistical simulations of this type include three basic phases: (1) generate data, (2) run one or more methods using the generated data, and (3) compare the performance of the methods.

To briefly illustrate how these phases are implemented using , we use a simple example of estimating the rate parameter \(\lambda\) of a \(\text{Poisson}(\lambda)\) distribution. To anchor the simulation in a real-world situation, one can imagine that a sample of size \(n\) from this Poisson distribution models the number of patients admitted daily to a hospital over the course of \(n\) consecutive days. Suppose that the data consist of \(n\) independent and identically distributed observations \(X_1, X_2, \ldots, X_n\) drawn from a Poisson(\(\lambda\)) distribution. Since the \(\lambda\) parameter of the Poisson distribution is equal to both the mean and the variance, one may ask whether the sample mean (denoted \(\hat{\lambda}_{M,n}\)) or the sample variance (denoted \(\hat{\lambda}_{V,n}\)) is a better estimator of \(\lambda\).

1) Load the package and create a simulation object

After loading the package, the first step is to create a simulation object (an R object of class sim_obj) using the new_sim() function. The simulation object contains all data, functions, and results related to the simulation.

library(SimEngine)
#> Loading required package: magrittr
#> Welcome to SimEngine! Full package documentation can be found at:
#>  https://avi-kenny.github.io/SimEngine
sim <- new_sim()

2) Code a function to generate data

Many simulations involve a function that creates a dataset designed to mimic a real-world data-generating mechanism. Here, we write and test a simple function to generate a sample of n observations from a Poisson distribution with \(\lambda = 20\).

create_data <- function(n) {
  return(rpois(n=n, lambda=20))
}

create_data(n=10)
#>  [1] 22 16 21 21 15 17 20 24 17 23

3) Code the methods (or other functions)

With SimEngine, any functions declared (or loaded via source()) are automatically stored in the simulation object when the simulation runs. In this example, we test the sample mean and sample variance estimators of the \(\lambda\) parameter. For simplicity, we write this as a single function and use the type argument to specify which estimator to use.

est_lambda <- function(dat, type) {
  if (type=="M") { return(mean(dat)) }
  if (type=="V") { return(var(dat)) }
}
dat <- create_data(n=1000)
est_lambda(dat=dat, type="M")
#> [1] 19.976
est_lambda(dat=dat, type="V")
#> [1] 20.63206

4) Set the simulation levels

Often, we wish to run the same simulation multiple times. We refer to each run as a simulation replicate. We may wish to vary certain features of the simulation between replicates. In this example, perhaps we choose to vary the sample size and the estimator used to estimate \(\lambda\). We refer to the features that vary as simulation levels; in the example below, the simulation levels are the sample size (n) and the estimator (estimator). We refer to the values that each simulation level can take on as level values; in the example below, the n level values are 10, 100, and 1000, and the estimator level values are "M" (for “sample mean”) and "V" (for “sample variance”). By default, SimEngine runs one simulation replicate for each combination of level values — in this case, six combinations — although the user will typically want to increase this; 1,000 or 10,000 replicates per combination is typical.

sim %<>% set_levels(
  estimator = c("M", "V"),
  n = c(10, 100, 1000)
)

Note that we make extensive use of the pipe operators (%>% and %<>%) from the magrittr package; if you have never used pipes, see the magrittr documentation.

5) Create a simulation script

The simulation script is a user-written function that assembles the pieces above (generating data, analyzing the data, and returning results) to code the flow of a single simulation replicate. Within a script, the current simulation level values can be referenced using the special variable L. For instance, in the running example, when the first simulation replicate is running, L$estimator will equal "M" and L$n will equal 10. In the next replicate, L$estimator will equal "M" and L$n will equal 100, and so on, until all level value combinations are run. The simulation script will automatically have access to any functions or objects that have been declared in the global environment.

sim %<>% set_script(function() {
  dat <- create_data(n=L$n)
  lambda_hat <- est_lambda(dat=dat, type=L$estimator)
  return (list("lambda_hat"=lambda_hat))
})

The simulation script should always return a list containing one or more key-value pairs, where the keys are syntactically valid names. The values may be simple data types (numbers, character strings, or boolean values) or more complex data types (lists, dataframes, model objects, etc.); see the Advanced Usage documentation for how to handle complex data types. Note that in this example, the estimators could have been coded instead as two different functions and then called from within the script using the use_method() function.

6) Set the simulation configuration

The set_config() function controls options related to the entire simulation, such as the number of simulation replicates to run for each level value combination and the parallelization type, if desired (see the Parallelization documentation). Packages needed for the simulation should be specified using the packages argument of set_config() (rather than using library() or require()). We set num_sim to 100, and so SimEngine will run a total of 600 simulation replicates (100 for each of the six level value combinations).

sim %<>% set_config(
  num_sim = 100,
  packages = c("ggplot2", "stringr")
)
#> 
#> Attaching package: 'ggplot2'
#> The following object is masked from 'package:SimEngine':
#> 
#>     vars

7) Run the simulation

All 600 replicates are run at once and results are stored in the simulation object.

sim %<>% run()
#> Done. No errors or warnings detected.

8) View and summarize results

Once the simulation replicates have finished running, the summarize() function can be used to calculate common summary statistics, such as bias, variance, mean squared error (MSE), and confidence interval coverage.

sim %>% summarize(
  list(stat="bias", name="bias_lambda", estimate="lambda_hat", truth=20),
  list(stat="mse", name="mse_lambda", estimate="lambda_hat", truth=20)
)
#>   level_id estimator    n n_reps bias_lambda  mse_lambda
#> 1        1         M   10    100   0.1510000  1.94630000
#> 2        2         V   10    100  -0.4021111 74.12680617
#> 3        3         M  100    100   0.1160000  0.17006800
#> 4        4         V  100    100  -0.1113414  9.69723645
#> 5        5         M 1000    100   0.0160700  0.01579209
#> 6        6         V 1000    100   0.1373756  0.85837283

In this example, we see that the MSE of the sample variance is much higher than that of the sample mean and that MSE decreases with increasing sample size for both estimators, as expected. From the n_reps column, we see that 100 replicates were successfully run for each level value combination. Results for individual simulation replicates can also be directly accessed via the sim$results dataframe.

head(sim$results)
#>   sim_uid level_id rep_id estimator  n      runtime lambda_hat
#> 1       1        1      1         M 10 0.0003800392       20.1
#> 2       7        1      2         M 10 0.0001640320       18.3
#> 3       8        1      3         M 10 0.0001699924       20.5
#> 4       9        1      4         M 10 0.0001580715       21.4
#> 5      10        1      5         M 10 0.0001571178       18.6
#> 6      11        1      6         M 10 0.0001580715       19.5

Above, the sim_uid uniquely identifies a single simulation replicate and the level_id uniquely identifies a level value combination. The rep_id is unique within a given level value combination and identifies the index of that replicate within the level value combination. The runtime column shows the runtime of each replicate (in seconds).

9) Update a simulation

After running a simulation, a user may want to update it by adding additional level values or replicates; this can be done with the update_sim() function. Prior to running update_sim(), the functions set_levels() and/or set_config() are used to declare the updates that should be performed. For example, the following code sets the total number of replicates to 200 (i.e., adding 100 replicates to those that have already been run) for each level value combination, and adds one additional level value for n.

sim %<>% set_config(num_sim = 200)
sim %<>% set_levels(
  estimator = c("M", "V"),
  n = c(10, 100, 1000, 10000)
)

After the levels and/or configuration are updated, update_sim() is called.

sim %<>% update_sim()
#> Done. No errors or warnings detected.

Another call to summarize() shows that the additional replicates were successfully:

sim %>% summarize(
  list(stat="bias", name="bias_lambda", estimate="lambda_hat", truth=20),
  list(stat="mse", name="mse_lambda", estimate="lambda_hat", truth=20)
)
#>   level_id estimator     n n_reps  bias_lambda   mse_lambda
#> 1        1         M    10    200  0.163000000  2.147800000
#> 2        2         V    10    200  0.146555556 77.451554321
#> 3        3         M   100    200  0.040450000  0.190544500
#> 4        4         V   100    200 -0.003796970  9.689442951
#> 5        5         M  1000    200  0.012795000  0.016100085
#> 6        6         V  1000    200  0.083129349  0.728864992
#> 7        7         M 10000    200  0.004467500  0.002133542
#> 8        8         V 10000    200 -0.007833964  0.078401278