The groupr package is designed to make certain forms of data manipulation easier by representing the underlying data in richer ways. In particular, the standard grouping function dplyr::group_by is extended to include groups that can be marked "inapplicable" at certain values of the grouping variable. The hope is that code that can recognize these kinds of groups will be simpler to write and easier to understand. The package also provides functions for some tasks, like pivoting, that are especially well suited to this idea.
In dplyr, groups are denoted with a grouping column that contains unique values for every group. For example, we can group mtcars by the variable vs:
library(dplyr, warn.conflicts = FALSE)
group_by(mtcars, vs)
#> # A tibble: 32 × 11
#> # Groups: vs [2]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
The result is a dataset with two groups defined by vs == 1 and vs == 0.
In groupr, we can optionally mark one of the two groups as inapplicable:
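A call along the following lines, using the group_by2() interface that appears in the later examples, produces this grouping (which of the two vs values is marked inapplicable here is an assumption):
group_by2(mtcars, vs = 0) # assumed call: mark the vs == 0 group as inapplicable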
#> # A tibble: 32 × 11
#> # Row indices: vs (I: 1) [2]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
What this marking means depends on what comes next; the different possibilities will become clear in the context of the actual data cleaning operations below.
Pivoting can be thought of as a simple rearrangement of groups. Consider the iris dataset:
as_tibble(iris)
#> # A tibble: 150 × 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
We could pivot to "longer" format by collapsing the different measurements into a single column. Equivalently, we can consider the columns Sepal.Length, Sepal.Width, ... to describe groups of data, not distinct variables. In other words, the four columns together form one collection of data, and each column is a subgroup of that collection.
To pivot, we transfer this “column grouping” to the standard dplyr
“row grouping.” To do this we just take our groups out of the different
columns and merge them into one. The result is a consolidated column of
data (value), along with a standard (row) grouping variable (type).
iris2 <- group_by2(iris, Species) %>%
  colgrp("value", "type")
pivot_grps(iris2, rows = "type")
#> # A tibble: 600 × 3
#> # Row indices: Species, type [12]
#> Species type value
#> <fct> <chr> <dbl>
#> 1 setosa Sepal.Length 5.1
#> 2 setosa Sepal.Length 4.9
#> 3 setosa Sepal.Length 4.7
#> 4 setosa Sepal.Length 4.6
#> 5 setosa Sepal.Length 5
#> 6 setosa Sepal.Length 5.4
#> 7 setosa Sepal.Length 4.6
#> 8 setosa Sepal.Length 5
#> 9 setosa Sepal.Length 4.4
#> 10 setosa Sepal.Length 4.9
#> # … with 590 more rows
So, pivoting to longer is the same as converting column groupings to row groupings, and pivoting to wider just does the inverse.
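To illustrate the inverse, a sketch of pivoting the long iris result back out to a column grouping might look like the following, assuming pivot_grps() accepts the type index through its cols argument in the same way as the example further below:
iris_long <- pivot_grps(iris2, rows = "type")
# Convert the "type" row grouping back into a column grouping (pivot wider)
pivot_grps(iris_long, cols = "type")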
Consider this example dataset:
df <- tibble(
  grp = c(1, 1, 1, 1, 2),
  subgrp = c(1, 2, 3, 4, NA),
  val = c(3.1, 2.8, 4.0, 3.8, 10.2)
)
df
#> # A tibble: 5 × 3
#> grp subgrp val
#> <dbl> <dbl> <dbl>
#> 1 1 1 3.1
#> 2 1 2 2.8
#> 3 1 3 4
#> 4 1 4 3.8
#> 5 2 NA 10.2
Imagine we want to convert the row grouping defined by grp into a column grouping. Without inapplicable groups we get this:
regular_df <- group_by2(df, grp, subgrp)
pivot_grps(regular_df, cols = "grp")
#> # A tibble: 5 × 2
#> # Row indices: subgrp [5]
#> # Col index: grp
#> subgrp val$`1` $`2`
#> <dbl> <dbl> <dbl>
#> 1 1 3.1 NA
#> 2 2 2.8 NA
#> 3 3 4 NA
#> 4 4 3.8 NA
#> 5 NA NA 10.2
It looks a bit off. What if we wanted val$`2` to be 10.2 for all values of subgrp? In other words, what if val = 10.2 describes the entire second group?
This is an example of an operation that is very challenging to write with standard pivoting functions, but trivial with inapplicable groups. Simply group like this before pivoting:
igrp_df <- group_by2(df, grp, subgrp = NA)
pivot_grps(igrp_df, cols = "grp")
#> # A tibble: 4 × 2
#> # Row indices: subgrp [4]
#> # Col index: grp
#> subgrp val$`1` $`2`
#> <dbl> <dbl> <dbl>
#> 1 1 3.1 10.2
#> 2 2 2.8 10.2
#> 3 3 4 10.2
#> 4 4 3.8 10.2
In this case we have a hierarchical grouping: each value of grp is allowed to contain multiple subgroup values, but it may also be described by a single value that applies to all of its subgroups.
Note also how the only difference is in the grouping structure. The operation itself remains concise and easy to understand.
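For comparison, a rough equivalent with standard tidyverse tools is sketched below; the column names and the explicit fill-and-filter post-processing are choices made here, not part of groupr:
library(tidyr)

df %>%
  pivot_wider(names_from = grp, values_from = val, names_prefix = "val_") %>%
  fill(val_2, .direction = "downup") %>% # spread the group-level value to every row
  filter(!is.na(subgrp)) # drop the placeholder row
The inapplicable-group version expresses the same intent directly in the grouping structure, rather than as post-processing.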
It is common to need a calculation that applies to only a subset of the data. For example, if you have group A and group B, you may want to calculate a mean for group A but leave it missing for all the rows in group B. Depending on the calculation, this can be tough to express.
In mtcars, if we want the mean of hp for all rows where vs == 1, the easiest way is something like the following:
mtcars %>%
  group_by2(vs = 0) %>%
  mutate(hp_mean_vs1 = mean(hp))
Mutations are not currently provided in groupr but will be added in the future.
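For comparison, roughly the same computation can be written today with plain dplyr, though the subset logic has to be spelled out by hand inside mutate(); this is just one sketch among several possible approaches:
mtcars %>%
  group_by(vs) %>%
  mutate(hp_mean_vs1 = if (first(vs) == 1) mean(hp) else NA_real_) %>%
  ungroup()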