accumulate
Package version 0.9.3.
Use citation('accumulate') to cite the package.
Accumulate is a package for grouped aggregation, where the groups can be dynamically collapsed into larger groups. When and how this collapsing takes place is user-defined.
The latest CRAN release can be installed as follows.
install.packages("accumulate")
Next, the package can be loaded. You can use packageVersion (from base R) to check which version you have installed.
We will use a built-in dataset as example.
> data(producers)
> head(producers)
sbi size industrial trade other other_income total
1 3410 8 151722 2135 0 -1775 152082
2 2840 7 50816 NA 158 949 59876
3 2752 5 4336 NA 0 36 4959
4 3120 6 18508 NA 0 80 20682
5 2524 7 21071 0 0 442 21513
6 3410 6 24220 1069 0 239 25528
This synthetic dataset contains information on various sources of turnover from producers, which are labeled with an economic activity classification (sbi) and a size class (0-9).
We wish to find a group mean by sbi x size. However, we demand that each group has at least five records; otherwise we combine the size classes of a single sbi group. This can be done as follows.
> a <- accumulate(producers
+ , collapse = sbi*size ~ sbi
+ , test = min_records(5)
+ , fun = mean, na.rm=TRUE)
> head(round(a))
sbi size level industrial trade other other_income total
1 3410 8 1 364397 2859 33 353 546117
2 2840 7 0 23160 823 49 329 25812
3 2752 5 NA NA NA NA NA NA
4 3120 6 0 20710 504 112 200 21702
5 2524 7 0 27954 1268 55 456 30468
6 3410 6 1 364397 2859 33 353 546117
The accumulate function does the following:
1. For each combination of sbi and size occurring in the data, it checks whether test is satisfied. Here, it tests whether there are at least five records.
2. If the test is satisfied, the mean is computed for each variable and the output variable level is set to 0 (no collapsing took place).
3. If the test is not satisfied, it uses only sbi as grouping variable for the current combination of sbi and size. Then, if there are enough records, the mean is computed for each variable and the output variable level is set to 1 (first level of collapsing has been used).
4. If the test is still not satisfied, no aggregate can be computed and all output variables are NA for the current sbi and size combination.
Explicitly, for this example we see that for (sbi,size)==(2752,5) no satisfactory group of records was found under the current collapsing scheme. Therefore the level variable equals NA and all aggregated variables are missing as well. For (sbi,size)==(2840,7) there are sufficient records, so level=0 and no collapsing was necessary. For the group (sbi,size)==(3410,8) there were not enough records to compute a mean, but taking all records in sbi==3410 gave enough records. This is signified by level=1, meaning that one collapsing step has taken place (from sbi x size to sbi).
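The steps above can be sketched in base R. This is only an illustration of the collapsing logic under stated assumptions, not the package's implementation: the data frame toy and the function collapse_mean are hypothetical, and only a single variable (total) is aggregated.

```r
# Toy data (hypothetical values, not the 'producers' dataset).
toy <- data.frame(
  sbi   = c(rep(1111, 6), rep(2222, 3), rep(3333, 5)),
  size  = c(rep(5, 5), 6, 5, 5, 6, rep(7, 3), 8, 8),
  total = c(10, 20, 30, 40, 50, 60, 5, 15, 25, 1, 2, 3, 4, 5)
)

# Hypothetical helper: try the target group first (level 0), then fall
# back to all records with the same sbi (level 1). If neither candidate
# has at least 'min_n' records, return NA for level and aggregate.
collapse_mean <- function(data, sbi, size, min_n = 5) {
  candidates <- list(
    data[data$sbi == sbi & data$size == size, , drop = FALSE],  # level 0
    data[data$sbi == sbi, , drop = FALSE]                        # level 1
  )
  for (i in seq_along(candidates)) {
    d <- candidates[[i]]
    if (nrow(d) >= min_n) {
      return(data.frame(sbi = sbi, size = size, level = i - 1L,
                        total = mean(d$total, na.rm = TRUE)))
    }
  }
  data.frame(sbi = sbi, size = size, level = NA_integer_, total = NA_real_)
}

# One output row per (sbi, size) combination that occurs in the data.
targets <- unique(toy[c("sbi", "size")])
result <- do.call(rbind, Map(function(s, z) collapse_mean(toy, s, z),
                             targets$sbi, targets$size))
result
```

In this toy data, the combinations with sbi 2222 end up with level NA because even the whole sbi group has fewer than five records, mirroring the (2752,5) case above.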
Let us see how we specified this call to accumulate:
- The collapse argument is a formula of the form target groups ~ collapsing scheme. The output is always at the level of the target groups. The collapsing scheme determines which records are used to compute a value for the target groups if the test is not satisfied.
- The test argument is a function that should accept any subset of records of producers and return TRUE or FALSE. In this case we used the convenience function min_records(5) provided by accumulate. The function min_records() creates a testing function that we can pass as the test argument.
- The fun argument is the aggregation function that will be applied to each group.
Observe that the accumulate function is similar to R’s built-in aggregate function (this is by design). There is a second function called cumulate that has an interface that is similar to dplyr::summarise.
> a <- cumulate(producers, collapse = sbi*size ~ sbi
+ , test = function(d) nrow(d) >= 5
+ , mu_industrial = mean(industrial, na.rm=TRUE)
+ , sd_industrial = sd(industrial, na.rm=TRUE))
> head(round(a))
sbi size level mu_industrial sd_industrial
1 3410 8 1 364397 535446
2 2840 7 0 23160 13937
3 2752 5 NA NA NA
4 3120 6 0 20710 21151
5 2524 7 0 27954 15089
6 3410 6 1 364397 535446
Notice that here, we wrote our own test function.
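The hand-written test function(d) nrow(d) >= 5 performs the same check as min_records(5). A function that creates test functions can be written as a closure. The sketch below uses a hypothetical name, min_records_sketch, and is not the package's source code. It uses the built-in mtcars data only to have something to test on.

```r
# Hypothetical stand-in for a test-function factory such as min_records().
# It returns a function that takes a data frame and yields TRUE or FALSE.
min_records_sketch <- function(n) {
  function(d) nrow(d) >= n
}

test5 <- min_records_sketch(5)

small <- test5(head(mtcars, 3))  # 3 records: fewer than 5
large <- test5(mtcars)           # 32 records: at least 5
c(small = small, large = large)
```

Writing the test as a plain function of one data frame argument keeps it easy to try out on subsets before passing it to cumulate or accumulate.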
Exercises:
1. Which combinations of (sbi, size) could not be computed, even when collapsing to sbi? (You need to run the code and investigate the output.)
2. Compute trimmed means of the variables. See ?mean on how to compute trimmed means.
A collapsing scheme can be defined in a data frame or with a formula of the form
target grouping ~ collapse1 + collapse2 + ... + collapseN
Here, the target grouping is a variable or product of variables. Each collapse term is also a variable or product of variables. Each subsequent term defines the next collapsing step. Let us show the idea with a more involved example.
The sbi variable in the producers dataset encodes a hierarchical classification where longer digit sequences indicate a higher level of detail. Hence we can collapse to lower levels of detail by deleting digits at the end. Let us enrich the producers dataset with extra grouping levels.
> producers$sbi3 <- substr(producers$sbi,1,3)
> producers$sbi2 <- substr(producers$sbi,1,2)
> head(producers,3)
sbi size industrial trade other other_income total sbi3 sbi2
1 3410 8 151722 2135 0 -1775 152082 341 34
2 2840 7 50816 NA 158 949 59876 284 28
3 2752 5 4336 NA 0 36 4959 275 27
We can now use a more involved collapsing scheme as follows.
> a <- accumulate(producers, collapse = sbi*size ~ sbi + sbi3 + sbi2
+ , test = min_records(5), fun = mean, na.rm=TRUE)
> head(round(a))
sbi size level industrial trade other other_income total
1 3410 8 1 364397 2859 33 353 546117
2 2840 7 0 23160 823 49 329 25812
3 2752 5 2 19526 39 52 151 20603
4 3120 6 0 20710 504 112 200 21702
5 2524 7 0 27954 1268 55 456 30468
6 3410 6 1 364397 2859 33 353 546117
For (sbi,size) == (2752,5) we have 2 levels of collapsing. In other words, for that aggregate, all records in sbi3 == 275 were used.
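The value of level thus indexes the terms of the collapsing formula: level 0 is the target grouping and level k is the k-th term on the right-hand side. The base R sketch below only inspects the formula to make this mapping explicit. It makes no claim about how the package processes the formula internally, and the variable names are chosen for illustration.

```r
# The collapsing scheme used above, stored as an ordinary R formula.
scheme <- sbi*size ~ sbi + sbi3 + sbi2

# Variables of the target grouping (left-hand side of the formula).
target <- all.vars(scheme[[2]])

# Collapsing steps (right-hand side terms), in the order they are written.
steps <- attr(terms(scheme), "term.labels")

# Map each level number to the grouping used at that level.
groupings <- c(paste(target, collapse = " x "), steps)
names(groupings) <- seq_along(groupings) - 1L

groupings[["2"]]  # grouping used when level equals 2
```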
Exercises:
1. Compute aggregates of trade and total using the cumulate function under the same collapsing scheme as defined above.
2. Find out how many combinations of (sbi,size) have been collapsed to level 0, 1, 2, or 3. Tabulate them.
3. [...] sbi code and compute the means of all variables.
Collapsing schemes can be represented in data frames that have the form
[target group, parent of target group, parent of parent of target group,...].
The package comes with a helper function that creates such a scheme from hierarchical classifications that are encoded as digits.
For the sbi example we can do the following to derive a collapsing scheme.
> sbi <- unique(producers$sbi)
> csh <- csh_from_digits(sbi)
> names(csh)[1] <- "sbi"
> head(csh)
sbi A1 A2 A3 A4
1 3410 3410 341 34 3
2 2840 2840 284 28 2
3 2752 2752 275 27 2
4 3120 3120 312 31 3
5 2524 2524 252 25 2
6 2875 2875 287 28 2
Here, the column sbi denotes the original (maximally) 5-digit codes, A1 the 4-digit codes, and so on. It is important that the name of the first column matches a column in the data to be aggregated. Both cumulate and accumulate accept such a data frame as an argument. Here is an example with cumulate.
> a <- cumulate(producers, collapse = csh, test = function(d) nrow(d) >= 5
+ , mu_total = mean(total, na.rm=TRUE)
+ , sd_total = sd(total, na.rm=TRUE))
> head(a)
sbi level mu_total sd_total
1 3410 0 546117.22 844001.47
2 2840 0 31265.28 35053.37
3 2752 2 20603.08 31286.51
4 3120 0 26548.61 26784.60
5 2524 0 23434.68 18022.32
7 2875 0 15962.24 9640.14
In this representation it is not possible to use multiple grouping variables, unless you combine multiple grouping variables into a single one, for example by pasting them together.
The advantage of this representation is that it allows users to externally define a (manually edited) collapsing scheme.
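As a sketch of the pasting approach, the base R code below combines two grouping variables into a single key and builds a scheme of the form [target group, parent, parent of parent, ...] by hand. The toy data, the column name sbi_size and the choice of parent columns are assumptions made for illustration. Whether a hand-built data frame like this needs further adjustments before it can be passed as collapse has not been checked here.

```r
# Toy data (hypothetical records; sbi stored as character for substr()).
toy <- data.frame(sbi = c("3410", "3410", "2840"), size = c(8, 6, 7),
                  stringsAsFactors = FALSE)

# Combine the two grouping variables into one key column.
toy$sbi_size <- paste(toy$sbi, toy$size, sep = "_")

# Build the scheme: the first column holds the target groups and must be
# named after a column in the data; later columns hold ever coarser parents.
keys <- unique(toy$sbi_size)
sbi_part <- sub("_.*$", "", keys)   # strip the size part again
scheme <- data.frame(
  sbi_size = keys,
  sbi      = sbi_part,
  sbi3     = substr(sbi_part, 1, 3),
  sbi2     = substr(sbi_part, 1, 2),
  stringsAsFactors = FALSE
)
scheme
```

Because the scheme is an ordinary data frame, it can also be written to a file, edited by hand, and read back in.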
Exercises:
1. Use csh to compute the median of all numerical variables of the producers dataset with accumulate (hint: you need to remove the size variable).
There are several options to define tests on groups of records:
1. Use one of the built-in helper functions for common test conditions: min_records(), min_complete(), or frac_complete().
2. Use a ruleset defined with the validate package, via the from_validator() function.
3. Write your own custom test function.
Let us look at a small example for each case. For comparison we will always test that there are a minimum of five records.
> # load the data again to lose columns 'sbi2' and 'sbi3' and work
> # with the original data.
> data(producers)
> # 1. using a helper function
> a <- accumulate(producers, collapse = sbi*size ~ sbi
+ , test = min_records(5)
+ , fun = mean, na.rm=TRUE)
> # 2. using a 'validator' object
> rules <- validate::validator(nrow(.) >= 5)
> a <- accumulate(producers, collapse = sbi*size ~ sbi
+ , test = from_validator(rules)
+ , fun = mean, na.rm=TRUE)
> # 3. using a custom function
> a <- accumulate(producers, collapse=sbi*size ~ sbi
+ , test = function(d) nrow(d) >= 5
+ , fun = mean, na.rm=TRUE)
An aggregate may be something more complex than a scalar. The accumulate package also supports complex aggregates such as linear models.
> a <- cumulate(producers, collapse = sbi*size ~ sbi
+ , test = min_complete(5, c("other_income","trade"))
+ , model = lm(other_income ~ trade)
+ , mean_other = mean(other_income, na.rm=TRUE))
> head(a)
sbi size level model mean_other
1 3410 8 NA <logical> NA
2 2840 7 1 <lm> 249.3333
3 2752 5 NA <logical> NA
4 3120 6 0 <lm> 199.8889
5 2524 7 0 <lm> 456.2500
6 3410 6 NA <logical> NA
Here, we demand that there are at least five records available for estimating the model.
The linear models are stored in a list of type object_list. Subsets or individual elements can be accessed as usual with data frames.
> a$model[[1]]
[1] NA
> a$model[[2]]
Call:
lm(formula = other_income ~ trade)
Coefficients:
(Intercept) trade
221.59429 0.06937
If you write your own test function from scratch, it is easy to overlook some edge cases like the occurrence of missing data, a column that is completely NA, or receiving zero records. The function smoke_test() accepts a data set and a test function and runs the test function on several common edge cases based on the dataset. It does not check whether the test function works as expected, but it checks that the output is TRUE or FALSE in all cases and reports errors, warnings and messages if they occur.
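To make the idea concrete, here is a minimal base R sketch of this kind of check. The function check_edge_cases is hypothetical and much simpler than smoke_test(): it only covers four cases and reports, per case, whether the test returned a single TRUE or FALSE. The built-in mtcars data serves as example input.

```r
# Hypothetical, simplified edge-case checker (not the package's smoke_test).
check_edge_cases <- function(data, test) {
  # Copy of the data in which every value is missing.
  all_na <- data
  all_na[] <- lapply(all_na, function(x) { x[] <- NA; x })

  cases <- list(
    "full dataset"  = data,
    "zero records"  = data[0, , drop = FALSE],
    "first record"  = data[1, , drop = FALSE],
    "all values NA" = all_na
  )

  # A case passes when the test runs without error and returns TRUE or FALSE.
  vapply(cases, function(d) {
    out <- tryCatch(test(d), error = function(e) NULL)
    isTRUE(out) || isFALSE(out)
  }, logical(1))
}

# A test function that is not robust against missing values.
fragile <- function(d) sum(d$mpg > 20) > 3
res <- check_edge_cases(mtcars, fragile)
res
```

A case that yields NA or an error is flagged as a failure, which is the sort of issue that is easy to miss when a test is only tried on the full dataset.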
As an example we construct a test function that checks whether one of the variables has sufficient non-zero values.
> my_test <- function(d) sum(other != 0) > 3
> smoke_test(producers, my_test)
Test with full dataset raised issues.
ERR: object 'other' not found
Oops, we forgot to refer to the data set. Let’s try it again.
> my_test <- function(d) sum(d$other != 0) > 3
> smoke_test(producers, my_test)
Test with full dataset raised issues.
NA detected in output (must be TRUE or FALSE)
Test with first record and other is NA raised issues.
NA detected in output (must be TRUE or FALSE)
Test with first record and all values NA raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and sbi is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and size is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and industrial is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and trade is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and other is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and other_income is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and total is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Our function is not robust against the occurrence of NA. Here’s a third attempt.
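A minimal sketch of such an attempt, assuming the fix is to ignore missing values when counting the non-zero entries. The toy data frame is hypothetical and only serves to exercise the function on the edge cases that failed before.

```r
# Count non-zero values of 'other', ignoring missing values, so that the
# result is always TRUE or FALSE and never NA.
my_test <- function(d) sum(d$other != 0, na.rm = TRUE) > 3

# Hypothetical toy data with a zero and a missing value.
toy <- data.frame(other = c(0, 5, NA, 7, 2, 9))

ok_full  <- my_test(toy)                     # four non-zero, non-missing values
ok_empty <- my_test(toy[0, , drop = FALSE])  # zero records
toy$other <- NA
ok_na    <- my_test(toy)                     # column completely NA
c(ok_full, ok_empty, ok_na)
```

Running smoke_test(producers, my_test) again is the way to confirm that the issues reported above no longer occur.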
Exercises:
1. Use sbi*size ~ sbi1 + sbi2 as collapsing scheme. Make sure there are at least 10 records in each group.
2. Compute aggregates of industrial and total, but demand that there are not more than 20% zeros in other. Use csh as collapsing scheme.