03 Caching SpaDES simulations

Eliot J. B. McIntire

November 09 2023

As part of a reproducible workflow, caching of function calls is a critical component. Down the road, it is likely that an entire workflow from raw data to publication, decision support, report writing, presentation building, etc., could be built and be reproducible anywhere, on demand. The reproducible::Cache function is built to work with any R function. However, it becomes very powerful in a SpaDES context because we can build large, powerful applications that are transparent and tied to the raw data that may be many conceptual steps upstream in the workflow. To do this, we have built several customizations within the SpaDES package. Important to this is dealing correctly with the simList, which is an object that has a slot that is an environment. But more important are the various tools that can be used at higher levels, i.e., not just for “standard” functions.

1 Caching as part of SpaDES

In a SpaDES context, the simList-specific features of Cache are available at several levels, each of which can be used as part of a reproducible workflow. Each level can be used to a modeller’s advantage, and all can be, and often are, used concurrently.

1.1 At the spades level

An entire call to spades can be cached. This eliminates any stochasticity in the model, as the output will simply be the cached version of the simList. This is likely most useful in situations where reproducibility is more important than “new” stochasticity (e.g., building decision support systems, apps, final version of a manuscript).

library(terra)
library(reproducible)
library(SpaDES.core)

mySim <- simInit(
  times = list(start = 0.0, end = 3.0),
  params = list(
    .globals = list(stackName = "landscape", burnStats = "testStats"),
    randomLandscapes = list(.plotInitialTime = NA),
    fireSpread = list(.plotInitialTime = NA)
  ),
  modules = list("randomLandscapes", "fireSpread"),
  paths = list(modulePath = getSampleModules(tempdir()))
)

This functionality can be achieved within a spades call.

# compare caching ... run once to create cache
system.time({
  outSim <- spades(Copy(mySim), cache = TRUE, notOlderThan = Sys.time())
})
## Nov09 07:43:41 simInit Using setDTthreads(1). To change: 'options(spades.DTthreads = X)'.
## Nov09 07:43:41 chckpn:init total elpsd: 0.056 secs | 0 checkpoint init 0
## Nov09 07:43:41 save  :init total elpsd: 0.058 secs | 0 save init 0
## Nov09 07:43:41 prgrss:init total elpsd: 0.059 secs | 0 progress init 0
## Nov09 07:43:41 load  :init total elpsd: 0.06 secs | 0 load init 0
## Nov09 07:43:41 rndmLn:init total elpsd: 0.061 secs | 0 randomLandscapes init
## Nov09 07:43:41 rndmLn:init New objects created:
## Nov09 07:43:41 rndmLn:init 1:            landscape
## Nov09 07:43:41 frSprd:init total elpsd: 0.12 secs | 0 fireSpread init 1
## Nov09 07:43:41 frSprd:init New objects created:
## Nov09 07:43:41 frSprd:init 1:            testStats
## Nov09 07:43:41 frSprd:burn total elpsd: 0.14 secs | 1 fireSpread burn 5
## Nov09 07:43:41 frSprd:stats total elpsd: 0.16 secs | 1 fireSpread stats 5
## Nov09 07:43:41 frSprd:burn total elpsd: 0.16 secs | 2 fireSpread burn 5
## Nov09 07:43:41 frSprd:stats total elpsd: 0.18 secs | 2 fireSpread stats 5
## Nov09 07:43:41 frSprd:burn total elpsd: 0.18 secs | 3 fireSpread burn 5
## Nov09 07:43:41 frSprd:stats total elpsd: 0.2 secs | 3 fireSpread stats 5
## simList saved in
## SpaDES.core:::.pkgEnv$.sim
## It will be deleted at next spades() call.
## Saving large object (fn: spades, cacheId: 16831f12a867a5ab) to Cache: 83.6 Mb
##  Done!
##    user  system elapsed 
##   1.680   0.069   1.765

Note that any visualizations (here turned off with .plotInitialTime = NA above) will happen the first time through, but not when the result is retrieved from the cache.

# faster 2nd time
system.time({
  outSimCached <- spades(Copy(mySim), cache = TRUE)
})
##   ...(Object to retrieve (fn: spades, 16831f12a867a5ab.rds))
##      loaded cached result from previous spades call
##    user  system elapsed 
##   0.267   0.000   0.276
all.equal(outSim, outSimCached) 
##  [1] "Names: 3 string mismatches"                                       
##  [2] "Length mismatch: comparison on first 4 components"                
##  [3] "Component 2: Modes: numeric, NULL"                                
##  [4] "Component 2: Lengths: 4, 0"                                       
##  [5] "Component 2: target is numeric, current is NULL"                  
##  [6] "Component 3: target is NULL, current is PackedSpatRaster"         
##  [7] "Component 4: Modes: S4, numeric"                                  
##  [8] "Component 4: Lengths: 1, 3"                                       
##  [9] "Component 4: Attributes: < Modes: list, NULL >"                   
## [10] "Component 4: Attributes: < Lengths: 4, 0 >"                       
## [11] "Component 4: Attributes: < names for target but not for current >"
## [12] "Component 4: Attributes: < current is not list-like >"

1.2 Module-level caching

If the parameter .useCache in the module’s metadata is set to TRUE, then every event in the module will be cached. That means that every time that module is called from within a spades() call, Cache will be called. Only the objects inside the simList that correspond to the inputObjects or the outputObjects from the module metadata will be assessed for caching. For general use, module-level caching would be mostly useful for modules that have no stochasticity, such as data-preparation modules, GIS modules etc.

In this example, we will use the cache on the randomLandscapes module. This means that each subsequent call to spades will result in identical outputs from the randomLandscapes module (only!). This would be useful when only one random landscape is needed simply for trying something out, or putting into production code (e.g., publication, decision support, etc.).

# Module-level
params(mySim)$randomLandscapes$.useCache <- TRUE
system.time({
  randomSim <- spades(Copy(mySim), .plotInitialTime = NA,
                      notOlderThan = Sys.time(), debug = TRUE)
})
## Nov09 07:43:43 simInit Using setDTthreads(1). To change: 'options(spades.DTthreads = X)'.
## Nov09 07:43:43 chckpn:init eventTime moduleName eventType eventPriority
## Nov09 07:43:43 chckpn:init 0         checkpoint init      0
## Nov09 07:43:43 save  :init 0         save       init      0
## Nov09 07:43:43 prgrss:init 0         progress   init      0
## Nov09 07:43:43 load  :init 0         load       init      0
## Nov09 07:43:43 rndmLn:init 0         randomLandscapes init      1
## Nov09 07:43:44 rndmLn:init Saving large object (fn: doEvent.randomLandscapes, cacheId: 0bd85a65c211b611) to Cache: 83.5 Mb
##  Done!
## Nov09 07:43:45 rndmLn:init New objects created:
## Nov09 07:43:45 rndmLn:init 1:            landscape
## Nov09 07:43:45 frSprd:init 0         fireSpread       init      1
## Nov09 07:43:45 frSprd:init New objects created:
## Nov09 07:43:45 frSprd:init 1:            testStats
## Nov09 07:43:45 frSprd:burn 1         fireSpread       burn      5
## Nov09 07:43:45 frSprd:stats 1         fireSpread       stats     5
## Nov09 07:43:45 frSprd:burn 2         fireSpread       burn      5
## Nov09 07:43:45 frSprd:stats 2         fireSpread       stats     5
## Nov09 07:43:45 frSprd:burn 3         fireSpread       burn      5
## Nov09 07:43:45 frSprd:stats 3         fireSpread       stats     5
## simList saved in
## SpaDES.core:::.pkgEnv$.sim
## It will be deleted at next spades() call.
##    user  system elapsed 
##   1.580   0.029   1.617
# faster the second time
system.time({
  randomSimCached <- spades(Copy(mySim), .plotInitialTime = NA, debug = TRUE)
})
## Nov09 07:43:45 simInit Using setDTthreads(1). To change: 'options(spades.DTthreads = X)'.
## Nov09 07:43:45 chckpn:init eventTime moduleName eventType eventPriority
## Nov09 07:43:45 chckpn:init 0         checkpoint init      0
## Nov09 07:43:45 save  :init 0         save       init      0
## Nov09 07:43:45 prgrss:init 0         progress   init      0
## Nov09 07:43:45 load  :init 0         load       init      0
## Nov09 07:43:45 rndmLn:init 0         randomLandscapes init      1
## Nov09 07:43:45 rndmLn:init   ...(Object to retrieve (fn: doEvent.randomLandscapes, 0bd85a65c211b611.rds))
## Nov09 07:43:45 rndmLn:init      loaded cached copy of randomLandscapes module
## Nov09 07:43:45 rndmLn:init         (and added a memoised copy)
## Nov09 07:43:45 rndmLn:init New objects created:
## Nov09 07:43:45 rndmLn:init 1:            landscape
## Nov09 07:43:45 frSprd:init 0         fireSpread       init      1
## Nov09 07:43:45 frSprd:init New objects created:
## Nov09 07:43:45 frSprd:init 1:            testStats
## Nov09 07:43:45 frSprd:burn 1         fireSpread       burn      5
## Nov09 07:43:45 frSprd:stats 1         fireSpread       stats     5
## Nov09 07:43:45 frSprd:burn 2         fireSpread       burn      5
## Nov09 07:43:45 frSprd:stats 2         fireSpread       stats     5
## Nov09 07:43:45 frSprd:burn 3         fireSpread       burn      5
## Nov09 07:43:46 frSprd:stats 3         fireSpread       stats     5
## simList saved in
## SpaDES.core:::.pkgEnv$.sim
## It will be deleted at next spades() call.
##    user  system elapsed 
##   0.411   0.000   0.418

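Test whether only the layers produced in randomLandscapes are identical, and not those from fireSpread. Note that the output below reports FALSE for every layer, most likely because identical() is applied here to SpatRaster layers, which are pointers (see the note in the function-level example below), so the comparison is not made on cell values.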

layers <- list("DEM", "forestAge", "habitatQuality", "percentPine", "Fires")
same <- lapply(layers, function(l)
  identical(randomSim$landscape[[l]], randomSimCached$landscape[[l]]))
names(same) <- layers
print(same) # Fires is not same because all non-init events in fireSpread are not cached
## $DEM
## [1] FALSE
## 
## $forestAge
## [1] FALSE
## 
## $habitatQuality
## [1] FALSE
## 
## $percentPine
## [1] FALSE
## 
## $Fires
## [1] FALSE
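
Since the comparison above is made on SpatRaster objects, a value-based comparison may be more informative. The following is a minimal sketch, not part of SpaDES: sameValues is a hypothetical helper, and the two small rasters are made up for illustration.

```r
library(terra)

# Two separately created rasters holding the same cell values
r1 <- rast(nrows = 10, ncols = 10, vals = 1:100)
r2 <- rast(nrows = 10, ncols = 10, vals = 1:100)

# Hypothetical helper (an assumption, not a SpaDES function):
# compare the cell values rather than the SpatRaster objects themselves
sameValues <- function(a, b) {
  isTRUE(all.equal(values(a), values(b)))
}

sameValues(r1, r2)

# Applied to the simulation objects above, this could look like:
# sameValues(randomSim$landscape[["DEM"]], randomSimCached$landscape[["DEM"]])
```

With a helper like this, the layers created by the cached randomLandscapes module would be expected to match, while Fires would not.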

1.3 Event-level caching

If the parameter .useCache in the module’s metadata is set to a character or character vector, then that or those event(s), identified by their name, will be cached. That means that every time the event is called from within a spades call, Cache will be called. Only the objects inside the simList that correspond to the inputObjects or the outputObjects as defined in the module metadata will be assessed for caching inputs or outputs, respectively. The fact that all and only the named inputObjects and outputObjects are cached and returned may be inefficient (i.e., it may cache more objects than are necessary) for individual events.

Similar to module-level caching, event-level caching would be mostly useful for events that have no stochasticity, such as data-preparation events, GIS events etc. Here, we don’t change the module-level caching for randomLandscapes, but we add to it a cache for only the “init” event for fireSpread.

params(mySim)$fireSpread$.useCache <- "init"
system.time({
  randomSim <- spades(Copy(mySim), .plotInitialTime = NA,
                      notOlderThan = Sys.time(), debug = TRUE)
})
## Nov09 07:43:46 simInit Using setDTthreads(1). To change: 'options(spades.DTthreads = X)'.
## Nov09 07:43:46 chckpn:init eventTime moduleName eventType eventPriority
## Nov09 07:43:46 chckpn:init 0         checkpoint init      0
## Nov09 07:43:46 save  :init 0         save       init      0
## Nov09 07:43:46 prgrss:init 0         progress   init      0
## Nov09 07:43:46 load  :init 0         load       init      0
## Nov09 07:43:46 rndmLn:init 0         randomLandscapes init      1
## Nov09 07:43:47 rndmLn:init Saving large object (fn: doEvent.randomLandscapes, cacheId: 0bd85a65c211b611) to Cache: 83.5 Mb
##  Done!
## Nov09 07:43:47 rndmLn:init New objects created:
## Nov09 07:43:47 rndmLn:init 1:            landscape
## Nov09 07:43:47 frSprd:init 0         fireSpread       init      1
## Nov09 07:43:48 frSprd:init Saving large object (fn: doEvent.fireSpread, cacheId: 2964a28c18867ac4) to Cache: 83.6 Mb
##  Done!
## Nov09 07:43:49 frSprd:init New objects created:
## Nov09 07:43:49 frSprd:init 1:            testStats
## Nov09 07:43:49 frSprd:burn 1         fireSpread       burn      5
## Nov09 07:43:49 frSprd:stats 1         fireSpread       stats     5
## Nov09 07:43:49 frSprd:burn 2         fireSpread       burn      5
## Nov09 07:43:49 frSprd:stats 2         fireSpread       stats     5
## Nov09 07:43:49 frSprd:burn 3         fireSpread       burn      5
## Nov09 07:43:49 frSprd:stats 3         fireSpread       stats     5
## simList saved in
## SpaDES.core:::.pkgEnv$.sim
## It will be deleted at next spades() call.
##    user  system elapsed 
##   2.980   0.029   3.030
# faster the second time
system.time({
  randomSimCached <- spades(Copy(mySim), .plotInitialTime = NA, debug = TRUE)
})
## Nov09 07:43:49 simInit Using setDTthreads(1). To change: 'options(spades.DTthreads = X)'.
## Nov09 07:43:49 chckpn:init eventTime moduleName eventType eventPriority
## Nov09 07:43:49 chckpn:init 0         checkpoint init      0
## Nov09 07:43:49 save  :init 0         save       init      0
## Nov09 07:43:49 prgrss:init 0         progress   init      0
## Nov09 07:43:49 load  :init 0         load       init      0
## Nov09 07:43:49 rndmLn:init 0         randomLandscapes init      1
## Nov09 07:43:49 rndmLn:init   ...(Object to retrieve (fn: doEvent.randomLandscapes, 0bd85a65c211b611.rds))
## Nov09 07:43:49 rndmLn:init      loaded cached copy of randomLandscapes module
## Nov09 07:43:49 rndmLn:init         (and added a memoised copy)
## Nov09 07:43:49 rndmLn:init New objects created:
## Nov09 07:43:49 rndmLn:init 1:            landscape
## Nov09 07:43:49 frSprd:init 0         fireSpread       init      1
## Nov09 07:43:49 frSprd:init   ...(Object to retrieve (fn: doEvent.fireSpread, 2964a28c18867ac4.rds))
## Nov09 07:43:50 frSprd:init      loaded cached copy of init event in fireSpread module.
## Nov09 07:43:50 frSprd:init New objects created:
## Nov09 07:43:50 frSprd:init 1:            testStats
## Nov09 07:43:50 frSprd:burn 1         fireSpread       burn      5
## Nov09 07:43:50 frSprd:stats 1         fireSpread       stats     5
## Nov09 07:43:50 frSprd:burn 2         fireSpread       burn      5
## Nov09 07:43:50 frSprd:stats 2         fireSpread       stats     5
## Nov09 07:43:50 frSprd:burn 3         fireSpread       burn      5
## Nov09 07:43:50 frSprd:stats 3         fireSpread       stats     5
## simList saved in
## SpaDES.core:::.pkgEnv$.sim
## It will be deleted at next spades() call.
##    user  system elapsed 
##   0.607   0.000   0.619
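
Because .useCache accepts a character vector, several events can be named at once. The sketch below only builds the parameter list that would be passed to simInit(params = ...). The choice of "init" and "stats" is purely illustrative (both names appear in the fireSpread log above), and whether caching a given event is appropriate depends on the module.

```r
# Parameter list as it might be passed to simInit(params = parameters)
parameters <- list(
  randomLandscapes = list(.useCache = TRUE),          # cache the whole module
  fireSpread = list(.useCache = c("init", "stats"))   # cache only the named events
)
str(parameters)
```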

1.4 Function-level caching

Any function can be cached using: Cache(FUN = functionName, ...).

This will be a slight change to a function call, such as: projectRaster(raster, crs = crs(newRaster)) to Cache(projectRaster, raster, crs = crs(newRaster)).

ras <- terra::rast(terra::ext(0, 1e3, 0, 1e3), res = 1, vals = 1)
system.time({
  map <- Cache(SpaDES.tools::neutralLandscapeMap(ras),
               cachePath = cachePath(mySim),
               userTags = "neutralLandscapeMap",
               notOlderThan = Sys.time())
})
## Warning: In (SpaDES.tools::neutralLandscapeMap(ras))(): nlm_mpd changes the
## dimensions of the RasterLayer if even ncols/nrows are choosen.
##    user  system elapsed 
##   0.442   0.039   0.489
# faster the second time
system.time({
  mapCached <- Cache(SpaDES.tools::neutralLandscapeMap(ras),
                     cachePath = cachePath(mySim),
                     userTags = "neutralLandscapeMap")
})
##   ...(Object to retrieve (fn: SpaDES.tools::neutralLandscapeMap, 94d035af43fc613d.rds))
##      loaded cached result from previous SpaDES.tools::neutralLandscapeMap call
##    user  system elapsed 
##   0.063   0.010   0.081
all.equal(map[], mapCached[]) # note --> can't use all.equal on SpatRaster -- they are pointers 
## [1] TRUE
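
The same pattern applies to functions that have nothing to do with SpaDES. Below is a minimal sketch, assuming a made-up slowSquare function and a temporary cache folder; neither comes from the packages used above.

```r
library(reproducible)

# Temporary cache location, for illustration only
cacheDir <- file.path(tempdir(), "functionLevelCache")

# Hypothetical slow function standing in for an expensive operation
slowSquare <- function(x) {
  Sys.sleep(0.5)
  x^2
}

# The first call runs the function and stores the result
system.time(first <- Cache(slowSquare, 4, cachePath = cacheDir))

# The second call with identical arguments retrieves the stored result
system.time(second <- Cache(slowSquare, 4, cachePath = cacheDir))

# Cache may add attributes to its return value, so compare bare values
all.equal(as.numeric(first), as.numeric(second))
```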

1.5 Working with the Cache manually

Since the cache is simply a DBI database table, all DBI functions will work as is. In addition, there are several helpers in the reproducible package, including showCache, keepCache and clearCache, and the more advanced createCache, loadFromCache, rmFromCache, and saveToCache that may be useful. Also, one can access cached items manually (rather than simply rerunning the same Cache function again).

cacheDB <- showCache(mySim, userTags = "neutralLandscapeMap")
## Cache size:
##   Total (including Rasters): 1.9 Mb
##   Selected objects (not including Rasters): 1.9 Mb
## get the RasterLayer that was produced with neutralLandscapeMap()
map <- loadFromCache(cacheId = cacheDB$cacheId, cachePath = cachePath(mySim))
##      loaded cached result from previous  call
clearPlot()
Plot(map)
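
The helpers can also be tried on a throwaway cache, independent of any simList. This is a sketch under assumptions: the folder name and the "demoTag" tag are made up for illustration.

```r
library(reproducible)

# Temporary cache location, for illustration only
cacheDir <- file.path(tempdir(), "manualCache")

# Store one result, labelled with a user tag
out <- Cache(rnorm, 5, cachePath = cacheDir, userTags = "demoTag")

# List the entries that carry that tag; there is one row per tag of each entry
entries <- showCache(cacheDir, userTags = "demoTag")
unique(entries$cacheId)

# Remove only the entries with that tag, without prompting
clearCache(cacheDir, userTags = "demoTag", ask = FALSE)
```

Selecting by userTags keeps the rest of a shared cache untouched, which is why tagging at the time of the Cache call tends to pay off later.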

2 Reproducible Workflow

In general, we feel that a liberal use of Cache will make a reusable and reproducible workflow. shiny apps can be built to take advantage of Cache. Indeed, much of the difficulty in managing data sets and saving them for future use can be handled by caching.

2.1 Nested Caching

simInit() --> many .inputObjects calls

spades() call --> many module calls --> many event calls --> many function calls

Let’s say we start to introduce caching to this structure. We start from the innermost functions for which we could imagine caching would be useful. Let’s say there are some GIS operations, like raster::projectRaster, which operates on an input raster. We can Cache the projectRaster call to make this much faster, since it will always give the same result for a given input raster.

If we look back at our structure above, we see that we still have LOTS of places that are not cached. That means that the spades() call will still spawn many module calls, and many event calls, just to get to the one Cache(projectRaster) call that is cached. This function will likely be called many times. Caching it is good, but Cache itself takes some time. So, even if Cache(projectRaster) takes only 0.02 seconds, calling it hundreds of times (say, 200) means about 4 seconds. If we are doing this for many functions, then this will be too slow for some purposes.
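
The per-call overhead can be estimated directly. The sketch below is an illustration under assumptions: addOne is a made-up trivial function, the cache folder is temporary, and the measured time will vary by machine and cache backend.

```r
library(reproducible)

# Temporary cache location, for illustration only
cacheDir <- file.path(tempdir(), "overheadDemo")

# Hypothetical trivial function, so that nearly all time spent is Cache overhead
addOne <- function(x) x + 1

# Prime the cache so that the timed calls are all retrievals
invisible(Cache(addOne, 1, cachePath = cacheDir))

nCalls <- 50
elapsed <- system.time(
  for (i in seq_len(nCalls)) {
    suppressMessages(Cache(addOne, 1, cachePath = cacheDir))
  }
)[["elapsed"]]

perCall <- elapsed / nCalls
perCall        # approximate seconds per cached retrieval
perCall * 200  # projected overhead if the call were made 200 times
```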

We can start putting Cache all up the sequence of calls. Unfortunately, the way we use Cache at each of these levels is a bit different, so we need a slightly different approach for each.

2.1.0.1 Cache the spades call

spades(cache = TRUE)

This will cache the spades call, causing stochasticity/randomness to be frozen.
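
The same freezing of randomness can be seen with a plain random number generator. This is a minimal sketch using rnorm in place of a full spades call; the cache folder is a temporary location chosen for illustration.

```r
library(reproducible)

# Temporary cache location, for illustration only
cacheDir <- file.path(tempdir(), "frozenRandomness")

# First call draws random numbers and stores them
draw1 <- Cache(rnorm, 3, cachePath = cacheDir)

# Second call retrieves the stored draws instead of sampling again
draw2 <- Cache(rnorm, 3, cachePath = cacheDir)

# notOlderThan = Sys.time() forces re-evaluation, so new numbers are drawn
draw3 <- Cache(rnorm, 3, cachePath = cacheDir, notOlderThan = Sys.time())

# Cache may add attributes to its return value, so compare bare values
all.equal(as.numeric(draw1), as.numeric(draw2))
isTRUE(all.equal(as.numeric(draw1), as.numeric(draw3)))
```

The first comparison should report the two cached draws as equal, while the forced rerun is expected to differ.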

2.1.0.2 Cache a whole module

Pass .useCache = TRUE as a parameter to the module during simInit.

Some modules are inherently non-random, such as GIS modules or parameter-fitting statistical modules. We expect these to produce identical results each time, so we can safely cache the entire module.

parameters = list(
  FireModule = list(.useCache = TRUE)
)
mySim <- simInit(..., params = parameters)
mySimOut <- spades(mySim)

The messaging should indicate the caching is happening on every event in that module.

Note: This option REQUIRES that the metadata in inputs and outputs be exactly correct, i.e., all inputObjects and outputObjects must be correctly identified and listed in the defineModule metadata.

If the module is cached and there are errors when it is run, the cause is almost guaranteed to be incorrectly specified inputObjects and outputObjects.
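
As a rough illustration of what “correctly identified and listed” means, the metadata entries might be built as below. This is a sketch under assumptions: the object names are borrowed from the example simulation above, while the classes and descriptions are guesses made for illustration, not the actual metadata of the sample modules.

```r
library(SpaDES.core)

# Objects the module reads from the simList (class and description are assumptions)
inputs <- bindrows(
  expectsInput(objectName = "landscape", objectClass = "SpatRaster",
               desc = "raster layers used by the module", sourceURL = NA)
)

# Objects the module creates or modifies (class and description are assumptions)
outputs <- bindrows(
  createsOutput(objectName = "testStats", objectClass = "numeric",
                desc = "summary statistics produced by the module")
)

inputs$objectName
outputs$objectName
```

In a real module, these two tables would be supplied as the inputObjects and outputObjects elements of the defineModule metadata. Any object that an event uses or changes but that is missing from them would not be assessed or returned by the cache.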

2.1.0.3 Cache individual functions

Cache(<functionName>, <other arguments>)

This will allow fine scale control of individual function calls.

2.2 Data-to-decisions

Once nested Caching is used all the way up to the experiment (see SpaDES.experiment package) level and even further up (e.g., if there is a shiny module), then even very complex models can be put into a complete workflow.

The current vision for SpaDES is that it will allow this type of “data to decisions” complete workflow that allows for deep, robust models, across disciplines, with easily accessible front ends, that are quick and responsive to users, yet can handle data changes, module changes, etc.