Introduction to accumulate

Mark P.J. van der Loo

Package version 0.9.3.

Use citation('accumulate') to cite the package.

Introduction

Accumulate is a package for grouped aggregation, where the groups can be dynamically collapsed into larger groups. When this collapsing takes place and how collapsing takes place is user-defined.

Installation

The latest CRAN release can be installed as follows.

install.packages("accumulate")

Next, the package can be loaded. You can use packageVersion (from base R) to check which version you have installed.

> library(accumulate)
> # check the package version
> packageVersion("accumulate")
[1] ‘0.9.3

A first example

We will use a built-in dataset as example.

> data(producers)
> head(producers)
   sbi size industrial trade other other_income  total
1 3410    8     151722  2135     0        -1775 152082
2 2840    7      50816    NA   158          949  59876
3 2752    5       4336    NA     0           36   4959
4 3120    6      18508    NA     0           80  20682
5 2524    7      21071     0     0          442  21513
6 3410    6      24220  1069     0          239  25528

This synthetic dataset contains information on various sources of turnover from producers, that are labeled with an economic activity classification (sbi) and a size class (0-9).

We wish to find a group mean by sbi x size. However, we demand that the group has at least five records, otherwise we combine the size classes of a single sbi group. This can be done as follows.

> a <- accumulate(producers
+               , collapse = sbi*size ~ sbi
+               , test = min_records(5)
+               , fun  = mean, na.rm=TRUE)
> head(round(a))
   sbi size level industrial trade other other_income  total
1 3410    8     1     364397  2859    33          353 546117
2 2840    7     0      23160   823    49          329  25812
3 2752    5    NA         NA    NA    NA           NA     NA
4 3120    6     0      20710   504   112          200  21702
5 2524    7     0      27954  1268    55          456  30468
6 3410    6     1     364397  2859    33          353 546117

The accumulate function does the following:

Explicitly, for this example we see that for (sbi,size)==(2752,5) no satisfactory group of records was found under the current collapsing scheme. Therefore the level variable equals NA and all aggregated variables are missing as well. For (sbi,size)==(2840,7) there are sufficient records, and since level=0 no collapsing was necessary. For the group (sbi,size)=(3410,8) there were not enough records to compute a mean, but taking all records in sbi==3410 gave enough records. This is signified by level=1, meaning that one collapsing step has taken place (from sbi x size to sbi).

Let us see how we specified this call to accumulate

Observe that the accumulate function is similar to R’s built-in aggregate function (this is by design). There is a second function called cumulate that has an interface that is similar to dplyr::summarise.

> a <- cumulate(producers, collapse = sbi*size ~ sbi
+       , test = function(d) nrow(d) >= 5
+       , mu_industrial = mean(industrial, na.rm=TRUE)
+       , sd_industrial = sd(industrial, na.rm=TRUE))
> head(round(a))
   sbi size level mu_industrial sd_industrial
1 3410    8     1        364397        535446
2 2840    7     0         23160         13937
3 2752    5    NA            NA            NA
4 3120    6     0         20710         21151
5 2524    7     0         27954         15089
6 3410    6     1        364397        535446

Notice that here, we wrote our own test function.

Exercises

  1. How many combinations of (sbi, size) could not be computed, even when collapsing to sbi? (You need to run the code and investigate the output).
  2. Compute the trimmed mean of all numeric variables where you trim 5% of each side the distribution. See ?mean on how to compute trimmed means.

The formula interface for specifying collapsing schemes

A collapsing scheme can be defined in a data frame or with a formula of the form

target grouping ~ collapse1 + collapse2 + ... + collapseN

Here, the target grouping is a variable or product of variables. Each collapse term is also a variable or product of variables. Each subsequent term defines the next collapsing step. Let us show the idea with a more involved example.

The sbi variable in the producers dataset encodes a hierarchical classification where longer digit sequences indicate higher level of detail. Hence we can collapse to lower levels of detail by deleting digits at the end. Let us enrich the producers dataset with extra grouping levels.

> producers$sbi3 <- substr(producers$sbi,1,3)
> producers$sbi2 <- substr(producers$sbi,1,2)
> head(producers,3)
   sbi size industrial trade other other_income  total sbi3 sbi2
1 3410    8     151722  2135     0        -1775 152082  341   34
2 2840    7      50816    NA   158          949  59876  284   28
3 2752    5       4336    NA     0           36   4959  275   27

We can now use a more involved collapsing scheme as follows.

> a <- accumulate(producers, collapse = sbi*size ~ sbi + sbi3 + sbi2
+                , test = min_records(5), fun = mean, na.rm=TRUE)
> head(round(a))
   sbi size level industrial trade other other_income  total
1 3410    8     1     364397  2859    33          353 546117
2 2840    7     0      23160   823    49          329  25812
3 2752    5     2      19526    39    52          151  20603
4 3120    6     0      20710   504   112          200  21702
5 2524    7     0      27954  1268    55          456  30468
6 3410    6     1     364397  2859    33          353 546117

For (sbi,size) == (2752,5) we have 2 levels of collapsing. In other words, for that aggregate, all records in sbi3 == 275 were used.

Exercises

  1. Compute standard deviation for trade and total using the cumulate function under the same collapsing scheme as defined above.
  2. What is the maximum collapsing level in the collapsing scheme above?
  3. Find out how many combinations of (sbi,size) have been collapsed to level 0, 1, 2, or 3. Tabulate them.
  4. Define a collapsing scheme that ends with a single-digit sbi code and compute the means of all variables.

The data frame interface for defining collapsing schemes

Collapsing schemes can be represented in data frames that have the form

[target group, parent of target group, parent of parent of target group,...].

The package comes with a helper function that creates such a scheme from hierarchical classifications that are encoded as digits.

For the sbi example we can do the following to derive a collapsing scheme.

> sbi <- unique(producers$sbi)
> csh <- csh_from_digits(sbi)
> names(csh)[1] <- "sbi"
> head(csh)
   sbi   A1  A2 A3 A4
1 3410 3410 341 34  3
2 2840 2840 284 28  2
3 2752 2752 275 27  2
4 3120 3120 312 31  3
5 2524 2524 252 25  2
6 2875 2875 287 28  2

Here, the column sbi denotes the original (maximally) 5-digit codes, A1 the 4-digit codes, and so on. It is important that the name of the first column matches a column in the data to be agregated. Both cumlate and accumulate accept such a data frame as an argument. Here is an example with cumulate.

> a <- cumulate(producers, collapse = csh, test = function(d) nrow(d) >= 5
+        , mu_total = mean(total, na.rm=TRUE)
+        , sd_total = sd(total, na.rm=TRUE))
> head(a)
   sbi level  mu_total  sd_total
1 3410     0 546117.22 844001.47
2 2840     0  31265.28  35053.37
3 2752     2  20603.08  31286.51
4 3120     0  26548.61  26784.60
5 2524     0  23434.68  18022.32
7 2875     0  15962.24   9640.14

In this representation is is not possible to use multiple grouping variables, unless you combine multiple grouping variables into a single one, for example by pasting them together.

The advantage of this representation is that it allows users to externally define a (manually edited) collapsing scheme.

Exercises

  1. Use csh to compute the median of all numerical variables of the producers dataset with accumulate (hint: you need to remove the size variable).

Convenience functions to define tests

There are several options to define test on groups of records:

  1. Use one of the built-in functions to specify common test conditions: min_records(), min_complete(), or frac_complete().
  2. Use a ruleset defined with the validate package, with the from_validator() function.
  3. Write your own custom test function.

Let us look at a small example for each case. For comparison we will always test that there are a minimum of five records.

> # load the data again to loose columns 'sbi2' and 'sbi3' and work
> # with the original data.
> data(producers)
> # 1. using a helper function
> a <- accumulate(producers, collapse = sbi*size ~ sbi
+                , test = min_records(5)
+                , fun  = mean, na.rm=TRUE)
> # 2. using a 'validator' object
> rules <- validate::validator(nrow(.) >= 5)
> a <- accumulate(producers, collapse = sbi*size ~ sbi
+                , test = from_validator(rules)
+                , fun  = mean, na.rm=TRUE)
> # 3. using a custom function
> a <- accumulate(producers, collapse=sbi*size ~ sbi
+                , test = function(d) nrow(d) >= 5
+                , fun  = mean, na.rm=TRUE)

Complex aggregates

An aggregate may be something more complex than a scalar. The accumulate package also supports complex aggregates such as linear models.

> a <- cumulate(producers, collapse = sbi*size ~ sbi
+                        , test = min_complete(5, c("other_income","trade"))
+                        , model = lm(other_income ~ trade)
+                        , mean_other = mean(other_income, na.rm=TRUE))
> head(a)
   sbi size level     model mean_other
1 3410    8    NA <logical>         NA
2 2840    7     1      <lm>   249.3333
3 2752    5    NA <logical>         NA
4 3120    6     0      <lm>   199.8889
5 2524    7     0      <lm>   456.2500
6 3410    6    NA <logical>         NA

Here, we demand that there are at least five records available for estimating the model.

The linear models are stored in a list of type object_list. Subsets or individual elements can be accessed as usual with data frames.

> a$model[[1]]
[1] NA
> a$model[[2]]

Call:
lm(formula = other_income ~ trade)

Coefficients:
(Intercept)        trade  
  221.59429      0.06937  

Smoke-testing your test function

If you write your own test function from scratch, it is easy to overlook some edge cases like the occurrence of missing data, a column that is completely NA, or receiving zero records. The function smoke_test() accepts a data set and a test function and runs the test function on several common edge cases based on the dataset. It does not check whether the test function works as expected, but it checks that the output is TRUE or FALSE in all cases and reports errors, warnings and mesages if they occur.

As an example we construct a test function that checks whether one of the variables has sufficient non-zero values.

> my_test <- function(d) sum(other != 0) > 3
> smoke_test(producers, my_test)

Test with full dataset raised issues.
   ERR: object 'other' not found

Oops, we forgot to refer to the data set. Let’s try it again.

> my_test <- function(d) sum(d$other != 0) > 3
> smoke_test(producers, my_test)

Test with full dataset raised issues.
   NA detected in output (must be TRUE or FALSE)
Test with first record and other is NA raised issues.
   NA detected in output (must be TRUE or FALSE)
Test with first record and all values NA raised issues.
   NA detected in output (must be TRUE or FALSE)
Test with full dataset and sbi is NA for all records raised issues.
   NA detected in output (must be TRUE or FALSE)
Test with full dataset and size is NA for all records raised issues.
   NA detected in output (must be TRUE or FALSE)
Test with full dataset and industrial is NA for all records raised issues.
   NA detected in output (must be TRUE or FALSE)
Test with full dataset and trade is NA for all records raised issues.
   NA detected in output (must be TRUE or FALSE)
Test with full dataset and other is NA for all records raised issues.
   NA detected in output (must be TRUE or FALSE)
Test with full dataset and other_income is NA for all records raised issues.
   NA detected in output (must be TRUE or FALSE)
Test with full dataset and total is NA for all records raised issues.
   NA detected in output (must be TRUE or FALSE)

Our function is not robust against occurrence of NA. Here’s a third attempt.

> my_test <- function(d) sum(d$other != 0,na.rm=TRUE) > 3
> smoke_test(producers, my_test)

Exercises

  1. Compute the mean of all variables using sbi*size ~ sbi1 + sbi2 as collapsing scheme. Make sure there are at least 10 records in each group.
  2. Compute the mean of the ratio between industrial and total, but demand that there are not more than 20% zeros in other. Use csh as collapsing scheme.