Select variables

Dan Chaltiel

2024-11-01

This vignette will review the different ways of selecting variables to describe with crosstable. For more general informations about crosstable, see vignette("crosstable") (link). For more information and tips on tidyselect, see tidyselect syntax

Whole table

The simplest case is when you want to describe the whole table, as you need no further argument. If you really want to be more explicit, you also can use tidyselect::everything(). All tidyselect helpers are re-exported by dplyr so we only want to load this latter package.

Here are the 10 first lines of the iris2 dataset:

library(crosstable)
ct = crosstable(iris2, everything()) #or simply `crosstable(iris2)`
ct %>% 
  as_flextable(keep_id=TRUE)

.id

label

variable

value

Sepal.Length

Length of Sepal

Min / Max

4.3 / 7.9

Med [IQR]

5.8 [5.1;6.4]

Mean (std)

5.8 (0.8)

N (NA)

150 (0)

Sepal.Width

Width of Sepal

Min / Max

2.0 / 4.4

Med [IQR]

3.0 [2.8;3.3]

Mean (std)

3.1 (0.4)

N (NA)

150 (0)

Petal.Length

Length of Petal

Min / Max

1.0 / 6.9

Med [IQR]

4.3 [1.6;5.1]

Mean (std)

3.8 (1.8)

N (NA)

150 (0)

Petal.Width

Width of Petal

Min / Max

0.1 / 2.5

Med [IQR]

1.3 [0.3;1.8]

Mean (std)

1.2 (0.8)

N (NA)

150 (0)

Species

Specie

setosa

50 (33.33%)

versicolor

50 (33.33%)

virginica

50 (33.33%)

Select by column name

Name

Just like with dplyr::select(), you can use names with or without quotes to select variables you want to describe. Use c() to select several columns:

crosstable(mtcars2, c(mpg, "qsec"), by=vs) %>% 
  as_flextable(keep_id=TRUE)

.id

label

variable

Engine

straight

vshaped

mpg

Miles/(US) gallon

Min / Max

17.8 / 33.9

10.4 / 26.0

Med [IQR]

22.8 [21.4;29.6]

15.6 [14.8;19.1]

Mean (std)

24.6 (5.4)

16.6 (3.9)

N (NA)

14 (0)

18 (0)

qsec

1/4 mile time

Min / Max

16.9 / 22.9

14.5 / 18.0

Med [IQR]

19.2 [18.6;20.0]

17.0 [16.0;17.4]

Mean (std)

19.3 (1.4)

16.7 (1.1)

N (NA)

14 (0)

18 (0)

External vector

However, it is better to use all_of or any_of when you take your column names from an external vector. Otherwise, there would be an ambiguity as you might have wanted to select a column named like this vector.

qsec = c("mpg", "cyl") #wouldn't that be the most evil variable name ever?
crosstable(mtcars2, any_of(qsec), by=vs) %>% 
  as_flextable(keep_id=TRUE)

.id

label

variable

Engine

straight

vshaped

mpg

Miles/(US) gallon

Min / Max

17.8 / 33.9

10.4 / 26.0

Med [IQR]

22.8 [21.4;29.6]

15.6 [14.8;19.1]

Mean (std)

24.6 (5.4)

16.6 (3.9)

N (NA)

14 (0)

18 (0)

cyl

Number of cylinders

4

10 (90.91%)

1 (9.09%)

6

4 (57.14%)

3 (42.86%)

8

0 (0%)

14 (100.00%)

Negation

You can use negation to keep all but some columns:

crosstable(mtcars2, c(-mpg, -cyl, -1), by=vs) %>% head(8) %>% #-c(mpg, cyl, 1) would also work
  as_flextable(keep_id=TRUE)

.id

label

variable

Engine

straight

vshaped

disp

Displacement (cu.in.)

Min / Max

71.1 / 258.0

120.3 / 472.0

Med [IQR]

120.5 [83.0;162.4]

311.0 [275.8;360.0]

Mean (std)

132.5 (56.9)

307.1 (106.8)

N (NA)

14 (0)

18 (0)

hp

Gross horsepower

Min / Max

52.0 / 123.0

91.0 / 335.0

Med [IQR]

96.0 [66.0;109.8]

180.0 [156.2;226.2]

Mean (std)

91.4 (24.4)

189.7 (60.3)

N (NA)

14 (0)

18 (0)

Indice

This can be useful sometimes, for instance when you want to quickly describe the 3 first columns.

crosstable(mtcars2, 2:4, by=vs) %>% 
  as_flextable(keep_id=TRUE)

.id

label

variable

Engine

straight

vshaped

mpg

Miles/(US) gallon

Min / Max

17.8 / 33.9

10.4 / 26.0

Med [IQR]

22.8 [21.4;29.6]

15.6 [14.8;19.1]

Mean (std)

24.6 (5.4)

16.6 (3.9)

N (NA)

14 (0)

18 (0)

cyl

Number of cylinders

4

10 (90.91%)

1 (9.09%)

6

4 (57.14%)

3 (42.86%)

8

0 (0%)

14 (100.00%)

disp

Displacement (cu.in.)

Min / Max

71.1 / 258.0

120.3 / 472.0

Med [IQR]

120.5 [83.0;162.4]

311.0 [275.8;360.0]

Mean (std)

132.5 (56.9)

307.1 (106.8)

N (NA)

14 (0)

18 (0)

You can also use negation (-(1:3)), concatenation (c(1,2,3)), or both (crosstable(mtcars2, 1:4, -2, by=vs)).

Select with tidyselect helpers

Along with everything(), tidyselect provides a large choice of helpers. You can browse ?tidyselect::select_helpers for a complete list.

Note that all have the useful ignore.case argument which is often very convenient.

The main ones are re-exported by crosstable: starts_with(), ends_with(), contains() and matches(). Here are some examples:

crosstable(mtcars2, starts_with("d")) %>% 
  as_flextable(keep_id=TRUE)

.id

label

variable

value

disp

Displacement (cu.in.)

Min / Max

71.1 / 472.0

Med [IQR]

196.3 [120.8;326.0]

Mean (std)

230.7 (123.9)

N (NA)

32 (0)

drat

Rear axle ratio

Min / Max

2.8 / 4.9

Med [IQR]

3.7 [3.1;3.9]

Mean (std)

3.6 (0.5)

N (NA)

32 (0)

crosstable(mtcars2, c(ends_with("g"), contains("yl"))) %>% 
  as_flextable(keep_id=TRUE)

.id

label

variable

value

mpg

Miles/(US) gallon

Min / Max

10.4 / 33.9

Med [IQR]

19.2 [15.4;22.8]

Mean (std)

20.1 (6.0)

N (NA)

32 (0)

cyl

Number of cylinders

4

11 (34.38%)

6

7 (21.88%)

8

14 (43.75%)

#to all regex haters: the following call selects all columns which name 
#starts with "d" or "g", followed by exactly 3 characters
crosstable(mtcars2, matches("^d|g.{3}$")) %>% 
  as_flextable(keep_id=TRUE)

.id

label

variable

value

disp

Displacement (cu.in.)

Min / Max

71.1 / 472.0

Med [IQR]

196.3 [120.8;326.0]

Mean (std)

230.7 (123.9)

N (NA)

32 (0)

drat

Rear axle ratio

Min / Max

2.8 / 4.9

Med [IQR]

3.7 [3.1;3.9]

Mean (std)

3.6 (0.5)

N (NA)

32 (0)

gear

Number of forward gears

3

15 (46.88%)

4

12 (37.50%)

5

5 (15.62%)

Select with predicate functions

Sometimes, you want to select columns if they meet a set of specifications, for instance of type or of value. You can then use predicate functions: functions that return a single logical value. If the function is named, it is a good practice to wrap it in where().

For instance, you might want to keep only character variables:

crosstable(mtcars2, c(where(is.character), where(is.factor), -model)) %>% 
  as_flextable(keep_id=TRUE)

.id

label

variable

value

vs

Engine

straight

14 (43.75%)

vshaped

18 (56.25%)

am

Transmission

auto

19 (59.38%)

manual

13 (40.62%)

cyl

Number of cylinders

4

11 (34.38%)

6

7 (21.88%)

8

14 (43.75%)

gear

Number of forward gears

3

15 (46.88%)

4

12 (37.50%)

5

5 (15.62%)

Using anonymous functions, you can even use more complicated patterns. For instance, you might want only numeric variables which mean is higher than 100:

crosstable(mtcars2, where(function(x) is.numeric(x) && mean(x)>100)) %>% 
  as_flextable(keep_id=TRUE)

.id

label

variable

value

disp

Displacement (cu.in.)

Min / Max

71.1 / 472.0

Med [IQR]

196.3 [120.8;326.0]

Mean (std)

230.7 (123.9)

N (NA)

32 (0)

hp

Gross horsepower

Min / Max

52.0 / 335.0

Med [IQR]

123.0 [96.5;180.0]

Mean (std)

146.7 (68.6)

N (NA)

32 (0)

Of note, crosstable support lambda-functions, so you could instead write crosstable(mtcars2, where(~is.numeric(.x) && mean(.x)>100)) for the exact same result but a tidier code.

The only logical constraint is that the function in where() should return a single logical value. Use &&, ||, and parenthesis to combine functions in complex patterns.

Select with a formula

If you want to mutate some variables in real-time, you can use the formula interface. The left-hand-side are the variables to describe, while the right-hand-side is the by variable (which can be set to NULL, 0 or 1 for “no variable”).

crosstable(mtcars2, mpg+cyl ~ vs) %>% 
  as_flextable(keep_id=TRUE)

.id

label

variable

Engine

straight

vshaped

mpg

Miles/(US) gallon

Min / Max

17.8 / 33.9

10.4 / 26.0

Med [IQR]

22.8 [21.4;29.6]

15.6 [14.8;19.1]

Mean (std)

24.6 (5.4)

16.6 (3.9)

N (NA)

14 (0)

18 (0)

cyl

Number of cylinders

4

10 (90.91%)

1 (9.09%)

6

4 (57.14%)

3 (42.86%)

8

0 (0%)

14 (100.00%)

This permits very complex and interesting patterns, using functions in situ and operations using with the I function. Labels are inherited and make little sense though.

crosstable(mtcars2, sqrt(mpg) + I(qsec^2) ~ ifelse(mpg>20,"mpg>20","mpg<20"),
           label=FALSE) %>% 
  as_flextable()

label

variable

ifelse(mpg > 20, "mpg>20", "mpg<20")

mpg<20

mpg>20

sqrt(mpg)

Min / Max

3.2 / 4.4

4.6 / 5.8

Med [IQR]

4.0 [3.8;4.2]

4.9 [4.6;5.4]

Mean (std)

4.0 (0.4)

5.0 (0.4)

N (NA)

18 (0)

14 (0)

I(qsec^2)

Min / Max

210.2 / 408.8

270.9 / 524.4

Med [IQR]

301.0 [259.3;321.8]

351.8 [303.0;391.8]

Mean (std)

294.3 (50.4)

356.8 (66.7)

N (NA)

18 (0)

14 (0)

Note that you cannot use tidyselect helpers in formulas and that you cannot use formula declared in an object.

Ultimate example

Lets play a little with all of this :-)

I want all numeric variables that do not start by “d” or “w”, but I still want drat in the end.

crosstable(mtcars2, c(where(is.numeric), -matches("^d|w"), drat), label=FALSE) %>% 
  as_flextable()

label

variable

value

mpg

Min / Max

10.4 / 33.9

Med [IQR]

19.2 [15.4;22.8]

Mean (std)

20.1 (6.0)

N (NA)

32 (0)

hp

Min / Max

52.0 / 335.0

Med [IQR]

123.0 [96.5;180.0]

Mean (std)

146.7 (68.6)

N (NA)

32 (0)

qsec

Min / Max

14.5 / 22.9

Med [IQR]

17.7 [16.9;18.9]

Mean (std)

17.8 (1.8)

N (NA)

32 (0)

carb

Min / Max

1.0 / 8.0

Med [IQR]

2.0 [2.0;4.0]

Mean (std)

2.8 (1.6)

N (NA)

32 (0)

drat

Min / Max

2.8 / 4.9

Med [IQR]

3.7 [3.1;3.9]

Mean (std)

3.6 (0.5)

N (NA)

32 (0)