---
title: "mcprogress"
author: "Myles Lewis"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
    toc_depth: 2
    number_sections: true
vignette: >
  %\VignetteIndexEntry{mcprogress}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE
)
library(parallel)
```

# Installation

Install from CRAN

```{r eval = FALSE}
install.packages("mcprogress")
```

Install from GitHub

```{r eval = FALSE}
devtools::install_github("myles-lewis/mcprogress")
```

# Examples

## *mclapply* with progress bar

This package adds a progress bar to `mclapply()` using `echo` to output to the
console in RStudio or Linux environments. Simply replace your original call to
`mclapply()` with `pmclapply()`.

```{r eval=FALSE}
library(mcprogress)

# toy example
res <- pmclapply(letters[1:20], function(i) {
  Sys.sleep(0.2 + runif(1) * 0.1)
  setNames(rnorm(5), paste0(i, 1:5))
}, mc.cores = 2, title = "Working")
```

```{r eval=FALSE}
Working |================================                     | 60% eta 3.1 secs
```

`pmclapply()` can be used in an identical manner to `mclapply()`. It works best
when the length of `X` is substantially greater than the number of cores. As
processes are spawned in a block and most code for each process completes at
roughly the same time, processes move along in blocks whose size is determined
by `mc.cores`. To track progress, `pmclapply` tracks only every nth process,
where n = `mc.cores`. For example, with 4 cores, `pmclapply` reports progress
when the 4th, 8th, 12th, 16th etc. process has completed.

```{r, out.width='80%', echo=FALSE}
knitr::include_graphics("mcp1.png")
```

ETA is approximate. To minimise overhead, it is only updated with each change in
progress (i.e. each time a block of processes completes). It is not updated by
an interrupt between these changes.

## Tracking subprogress

In some scenarios, however, the length of `X` is comparable to the number of
cores and each process may take a long time.
For example, machine learning applied to each of 8 cross-validation folds on an
8-core machine will open 8 processes from the outset. Each process will often
complete at roughly the same time. In this case `pmclapply` is much less
informative as it only shows completion at the end of 1 round of processes, so
it will go from 0% straight to 100%.

For this scenario, we recommend `mcProgressBar()`, which allows more
fine-grained reporting of subprogress from within a block of parallel processes.
The diagram below illustrates a computation in which 10 processes are completed
across 8 cores, with subprogress divided into 5 intervals.

```{r, out.width='65%', echo=FALSE}
knitr::include_graphics("mcp2.png")
```

Technically only 1 process can be tracked. If `cores` is set to 4 and `subval`
is invoked, then the 1st, 5th, 9th, 13th etc. process is tracked. Subprogress of
this process is computed as a fraction of the number of blocks of processes
required.

In the next example, we build a custom function showing how to use
`mcProgressBar()`. It includes a call to `mclapply` wrapped around another
nested function which can report subprogress.

```{r eval=FALSE}
library(parallel)

my_fun <- function(x, cores) {
  start <- Sys.time()
  mcProgressBar(0, title = "my_fun")  # initialise progress bar
  res <- mclapply(seq_along(x), function(i) {
    # inner loop of calculation
    y <- 1:4
    inner <- lapply(seq_along(y), function(j) {
      Sys.sleep(0.2 + runif(1) * 0.1)
      # report subprogress as the fraction of the inner loop completed
      mcProgressBar(val = i, len = length(x), cores, subval = j / length(y),
                    title = "my_fun", start = start)
      rnorm(4)
    })
    inner
  }, mc.cores = cores)
  closeProgress(start, title = "my_fun")  # finalise the progress bar
  res
}

output <- my_fun(letters[1:4], cores = 2)
```

Alternatively, even if the function called inside `mclapply` does not contain a
for loop or equivalent, progress can still be reported manually after chunks of
computation.

```{r eval=FALSE}
## Example of long function
longfun <- function(x, cores) {
  start <- Sys.time()
  mcProgressBar(0, title = "longfun")  # initialise progress bar
  res <- mclapply(seq_along(x), function(i) {
    # long sequential calculation in parallel with 3 major steps applied to x[i]
    Sys.sleep(0.5)
    mcProgressBar(val = i, len = length(x), cores, subval = 0.33,
                  title = "longfun", start = start)  # 33% complete
    Sys.sleep(0.5)
    mcProgressBar(val = i, len = length(x), cores, subval = 0.66,
                  title = "longfun", start = start)  # 66% complete
    Sys.sleep(0.5)
    mcProgressBar(val = i, len = length(x), cores, subval = 1,
                  title = "longfun", start = start)  # 100% complete
    return(rnorm(4))
  }, mc.cores = cores)
  closeProgress(start, title = "longfun")  # finalise the progress bar
  res
}

output <- longfun(letters[1:4], cores = 2)
```

## foreach

The `mcProgressBar` function can be used with the `foreach` package and the
`doMC` package multicore backend to show a progress bar.

```{r eval=FALSE}
# Example from doMC vignette
library(doMC)
library(foreach)
registerDoMC(4)

x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000

{
  start <- Sys.time()
  r <- foreach(i = seq_len(trials), .combine = cbind) %dopar% {
    ind <- sample(100, 100, replace = TRUE)
    result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
    mcProgressBar(i, trials, cores = getDoParWorkers(), start = start)
    coefficients(result1)
  }
  closeProgress(start)
}

# Equivalent using pmclapply
r <- pmclapply(seq_len(trials), function(i) {
  ind <- sample(100, 100, replace = TRUE)
  result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
  coefficients(result1)
}, mc.cores = 4)
```

# Printing from parallel code

The package also includes functions to safely print messages (including error
messages) from within parallelised code. These can be very useful for debugging
parallel R code.

```{r eval=FALSE}
res <- mclapply(1:5, function(i) {
  Sys.sleep(runif(1) / 10)
  message_parallel("Process ", i, " done")
  rnorm(1)
})
## Process 1 done
## Process 3 done
## Process 2 done
## Process 5 done
## Process 4 done
```

If errors occur during parallel processing, `mclapply` generates an
uninformative warning: "all scheduled cores encountered errors in user code".
One option is to set `mc.cores = 1`. This will often reveal the error message,
but can be slow if the computation is long and the error occurs only halfway
through.

```{r eval=FALSE}
out <- mclapply(1:5, function(i) {
  rnorm(-1)
}, mc.cores = 2)  # change mc.cores = 1 to reveal actual error message
## Warning in mclapply(1:5, function(i) {: all scheduled cores encountered errors
## in user code
```

The function `catchError()` wraps an expression in `try()`. The code is executed
and, if an error is produced, the error message is printed to the console so
that it is more visible. If no error is generated, the usual value of the
expression is returned. This allows you to write your code as usual. It is most
easily used with the pipe `|>`. Additional arguments can be provided to track
values so that the programmer can more easily find out when the error occurs.

```{r eval=FALSE}
out <- mclapply(1:5, function(i) {
  j <- 4 + i
  rnorm(-1) |> catchError(i, j)
}, mc.cores = 2)
## Error in rnorm(-1) : invalid arguments
## i=1, j=5
## Error in rnorm(-1) : invalid arguments
## i=2, j=6
```

The function `mcstop()` allows programmers to generate visible error messages
from within parallel code.

```{r eval=FALSE}
res <- mclapply(1:5, function(i) {
  Sys.sleep(runif(1) / 10)
  if (i == 5) mcstop("My error message")
  rnorm(1)
})
## My error message
```
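
The idea of wrapping an expression in `try()` and reporting tracked values, as described for `catchError()` above, can be sketched in base R. This is only an illustration under stated assumptions: `try_report()` is a hypothetical helper and not part of mcprogress, it uses `message()` rather than the package's console output mechanism, and it is run serially with `lapply()` so that it behaves the same on any platform.

```r
# Hypothetical helper for illustration only; NOT part of mcprogress.
# Evaluates `expr` inside try(). On error, prints the error message and any
# tracked values supplied as named arguments, then returns the try-error
# object. Without an error, the value of `expr` is returned unchanged.
try_report <- function(expr, ...) {
  out <- try(expr, silent = TRUE)
  if (inherits(out, "try-error")) {
    tracked <- list(...)
    info <- paste0(names(tracked), "=", unlist(tracked), collapse = ", ")
    message(trimws(as.character(out)),
            if (length(tracked)) paste0("\n", info))
  }
  out
}

# Serial stand-in for a parallel loop: the 2nd element fails on purpose
res <- lapply(1:3, function(i) {
  j <- 4 + i
  try_report(if (i == 2) stop("bad input") else i * 10, i = i, j = j)
})

# Failed elements carry class "try-error", so they can be located afterwards
failed <- vapply(res, inherits, logical(1), what = "try-error")
which(failed)
```

Elements that fail inside `mclapply()` are likewise returned as `"try-error"` objects, so the same `inherits()` check can be used to find them after a parallel run.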