Bivariate analysis of continuous and/or categorical variables

2024-02-22

Tidycomm includes five functions for bivariate explorative data analysis:

We will again use sample data from the Worlds of Journalism 2012-16 study for demonstration purposes:

WoJ
#> # A tibble: 1,200 × 15
#>    country   reach employment temp_contract autonomy_selection autonomy_emphasis
#>    <fct>     <fct> <chr>      <fct>                      <dbl>             <dbl>
#>  1 Germany   Nati… Full-time  Permanent                      5                 4
#>  2 Germany   Nati… Full-time  Permanent                      3                 4
#>  3 Switzerl… Regi… Full-time  Permanent                      4                 4
#>  4 Switzerl… Local Part-time  Permanent                      4                 5
#>  5 Austria   Nati… Part-time  Permanent                      4                 4
#>  6 Switzerl… Local Freelancer <NA>                           4                 4
#>  7 Germany   Local Full-time  Permanent                      4                 4
#>  8 Denmark   Nati… Full-time  Permanent                      3                 3
#>  9 Switzerl… Local Full-time  Permanent                      5                 5
#> 10 Denmark   Nati… Full-time  Permanent                      2                 4
#> # ℹ 1,190 more rows
#> # ℹ 9 more variables: ethics_1 <dbl>, ethics_2 <dbl>, ethics_3 <dbl>,
#> #   ethics_4 <dbl>, work_experience <dbl>, trust_parliament <dbl>,
#> #   trust_government <dbl>, trust_parties <dbl>, trust_politicians <dbl>

Compute contingency tables and Chi-square tests

crosstab() outputs a contingency table for one independent (column) variable and one or more dependent (row) variables:

WoJ %>% 
  crosstab(reach, employment)
#> # A tibble: 3 × 5
#>   employment Local Regional National Transnational
#> * <chr>      <dbl>    <dbl>    <dbl>         <dbl>
#> 1 Freelancer    23       36      104             9
#> 2 Full-time    111      287      438            66
#> 3 Part-time     15       32       75             4

Additional options include add_total (adds a row-wise Total column if set to TRUE) and percentages (outputs column-wise percentages instead of absolute values if set to TRUE):

WoJ %>% 
  crosstab(reach, employment, add_total = TRUE, percentages = TRUE)
#> # A tibble: 3 × 6
#>   employment Local Regional National Transnational Total
#> * <chr>      <dbl>    <dbl>    <dbl>         <dbl> <dbl>
#> 1 Freelancer 0.154   0.101     0.169        0.114  0.143
#> 2 Full-time  0.745   0.808     0.710        0.835  0.752
#> 3 Part-time  0.101   0.0901    0.122        0.0506 0.105

Setting chi_square = TRUE computes a \(\chi^2\) test including Cramer’s \(V\) and outputs the results in a console message:

WoJ %>% 
  crosstab(reach, employment, chi_square = TRUE)
#> # A tibble: 3 × 5
#>   employment Local Regional National Transnational
#> * <chr>      <dbl>    <dbl>    <dbl>         <dbl>
#> 1 Freelancer    23       36      104             9
#> 2 Full-time    111      287      438            66
#> 3 Part-time     15       32       75             4
#> # Chi-square = 16.005, df = 6, p = 0.014, V = 0.082

Finally, passing multiple row variables will treat all unique value combinations as a single variable for percentage and Chi-square computations:

WoJ %>% 
  crosstab(reach, employment, country, percentages = TRUE)
#> # A tibble: 15 × 6
#>    employment country       Local Regional National Transnational
#>  * <chr>      <fct>         <dbl>    <dbl>    <dbl>         <dbl>
#>  1 Freelancer Austria     0.0134   0.0113   0.0162         0     
#>  2 Freelancer Denmark     0.0537   0.0197   0.112          0.0127
#>  3 Freelancer Germany     0.0470   0.0507   0.00648        0     
#>  4 Freelancer Switzerland 0.0403   0.00845  0.00162        0     
#>  5 Freelancer UK          0        0.0113   0.0324         0.101 
#>  6 Full-time  Austria     0.0403   0.180    0.152          0.0127
#>  7 Full-time  Denmark     0.168    0.192    0.295          0     
#>  8 Full-time  Germany     0.268    0.172    0.0616         0     
#>  9 Full-time  Switzerland 0.168    0.197    0.0875         0.0633
#> 10 Full-time  UK          0.101    0.0676   0.113          0.759 
#> 11 Part-time  Austria     0        0.0225   0.0292         0     
#> 12 Part-time  Denmark     0.00671  0.0113   0.0178         0     
#> 13 Part-time  Germany     0        0.00282  0.00648        0     
#> 14 Part-time  Switzerland 0.0872   0.0479   0.0632         0     
#> 15 Part-time  UK          0.00671  0.00563  0.00486        0.0506

You can also visualize the output from crosstab():

WoJ %>% 
  crosstab(reach, employment, percentages = TRUE) %>% 
  visualize()

Note that the percentages = TRUE argument determines whether the bars add up to 100% and thus cover the whole width or whether they do not:

WoJ %>% 
  crosstab(reach, employment) %>% 
  visualize()

Compute t-Tests

Use t_test() to quickly compute t-Tests for a group variable and one or more test variables. Output includes test statistics, descriptive statistics and Cohen’s \(d\) effect size estimates:

WoJ %>% 
  t_test(temp_contract, autonomy_selection, autonomy_emphasis)
#> # A tibble: 2 × 12
#>   Variable M_Permanent SD_Permanent M_Temporary SD_Temporary Delta_M     t    df
#> * <chr>      <num:.3!>    <num:.3!>   <num:.3!>    <num:.3!> <num:.> <num> <dbl>
#> 1 autonom…       3.910        0.755       3.698        0.932   0.212 1.627    56
#> 2 autonom…       4.124        0.768       3.887        0.870   0.237 2.171   995
#> # ℹ 4 more variables: p <num:.3!>, d <num:.3!>, Levene_p <dbl>, var_equal <chr>

Passing no test variables will compute t-Tests for all numerical variables in the data:

WoJ %>% 
  t_test(temp_contract)
#> # A tibble: 11 × 12
#>    Variable     M_Permanent SD_Permanent M_Temporary SD_Temporary Delta_M      t
#>  * <chr>          <num:.3!>    <num:.3!>   <num:.3!>    <num:.3!> <num:.> <num:>
#>  1 autonomy_se…       3.910        0.755       3.698        0.932   0.212  1.627
#>  2 autonomy_em…       4.124        0.768       3.887        0.870   0.237  2.171
#>  3 ethics_1           1.568        0.850       1.981        0.990  -0.414 -3.415
#>  4 ethics_2           3.241        1.263       3.509        1.234  -0.269 -1.510
#>  5 ethics_3           2.369        1.121       2.283        0.928   0.086  0.549
#>  6 ethics_4           2.534        1.239       2.566        1.217  -0.032 -0.185
#>  7 work_experi…      17.707       10.540      11.283       11.821   6.424  4.288
#>  8 trust_parli…       3.073        0.797       3.019        0.772   0.054  0.480
#>  9 trust_gover…       2.870        0.847       2.642        0.811   0.229  1.918
#> 10 trust_parti…       2.430        0.724       2.358        0.736   0.072  0.703
#> 11 trust_polit…       2.533        0.707       2.396        0.689   0.136  1.369
#> # ℹ 5 more variables: df <dbl>, p <num:.3!>, d <num:.3!>, Levene_p <dbl>,
#> #   var_equal <chr>

If passing a group variable with more than two unique levels, t_test() will produce a warning and default to the first two unique values. You can manually define the levels by setting the levels argument:

WoJ %>% 
  t_test(employment, autonomy_selection, autonomy_emphasis)
#> Warning: employment has more than 2 levels, defaulting to first two (Full-time
#> and Part-time). Consider filtering your data or setting levels with the levels
#> argument
#> # A tibble: 2 × 12
#>   Variable     `M_Full-time` `SD_Full-time` `M_Part-time` `SD_Part-time` Delta_M
#> * <chr>            <num:.3!>      <num:.3!>     <num:.3!>      <num:.3!> <num:.>
#> 1 autonomy_se…         3.903          0.782         3.825          0.633   0.078
#> 2 autonomy_em…         4.118          0.781         4.016          0.759   0.102
#> # ℹ 6 more variables: t <num:.3!>, df <dbl>, p <num:.3!>, d <num:.3!>,
#> #   Levene_p <dbl>, var_equal <chr>

WoJ %>% 
  t_test(employment, autonomy_selection, autonomy_emphasis, levels = c("Full-time", "Freelancer"))
#> # A tibble: 2 × 12
#>   Variable `M_Full-time` `SD_Full-time` M_Freelancer SD_Freelancer Delta_M     t
#> * <chr>        <num:.3!>      <num:.3!>    <num:.3!>     <num:.3!> <num:.> <num>
#> 1 autonom…         3.903          0.782        3.765         0.993   0.139 1.724
#> 2 autonom…         4.118          0.781        3.901         0.852   0.217 3.287
#> # ℹ 5 more variables: df <dbl>, p <num:.3!>, d <num:.3!>, Levene_p <dbl>,
#> #   var_equal <chr>

Additional options include:

Previously, the (now deprecated) option of var.equal was also available. This has been overthrown, however, as t_test() now by default tests for equal variance (using a Levene test) to decide whether to use pooled variance or to use the Welch approximation to the degrees of freedom.

t_test() also provides a one-sample t-Test if you provide a mu argument:

WoJ %>% 
  t_test(autonomy_emphasis, mu = 3.9)
#> # A tibble: 1 × 9
#>   Variable              M    SD CI_95_LL CI_95_UL    Mu     t    df        p
#> * <chr>             <dbl> <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 autonomy_emphasis  4.08 0.793     4.03     4.12   3.9  7.68  1194 3.23e-14

Of course, also the result from t-Tests can be visualized easily as such:

WoJ %>% 
  t_test(temp_contract, autonomy_selection, autonomy_emphasis) %>% 
  visualize()

Compute one-way ANOVAs

unianova() will compute one-way ANOVAs for one group variable and one or more test variables. Output includes test statistics, \(\eta^2\) effect size estimates, and \(\omega^2\), if Welch’s approximation is used to account for unequal variances.

WoJ %>% 
  unianova(employment, autonomy_selection, autonomy_emphasis)
#> # A tibble: 2 × 9
#>   Variable            F df_num df_denom     p omega_squared eta_squared Levene_p
#> * <chr>           <num>  <dbl>    <dbl> <num>     <num:.3!>   <num:.3!>    <dbl>
#> 1 autonomy_selec… 2.012      2      251 0.136         0.002      NA        0    
#> 2 autonomy_empha… 5.861      2     1192 0.003        NA           0.010    0.175
#> # ℹ 1 more variable: var_equal <chr>

Descriptives can be added by setting descriptives = TRUE. If no test variables are passed, all numerical variables in the data will be used:

WoJ %>% 
  unianova(employment, descriptives = TRUE)
#> # A tibble: 11 × 15
#>    Variable                  F df_num df_denom     p omega_squared `M_Full-time`
#>  * <chr>              <num:.3>  <dbl>    <dbl> <num>     <num:.3!>         <dbl>
#>  1 autonomy_selection    2.012      2      251 0.136         0.002          3.90
#>  2 autonomy_emphasis     5.861      2     1192 0.003        NA              4.12
#>  3 ethics_1              2.171      2     1197 0.115        NA              1.62
#>  4 ethics_2              2.204      2     1197 0.111        NA              3.24
#>  5 ethics_3              5.823      2      253 0.003         0.007          2.39
#>  6 ethics_4              3.453      2     1197 0.032        NA              2.58
#>  7 work_experience       3.739      2      240 0.025         0.006         17.5 
#>  8 trust_parliament      1.527      2     1197 0.218        NA              3.06
#>  9 trust_government     12.864      2     1197 0.000        NA              2.82
#> 10 trust_parties         0.842      2     1197 0.431        NA              2.42
#> 11 trust_politicians     0.328      2     1197 0.721        NA              2.52
#> # ℹ 8 more variables: `SD_Full-time` <dbl>, `M_Part-time` <dbl>,
#> #   `SD_Part-time` <dbl>, M_Freelancer <dbl>, SD_Freelancer <dbl>,
#> #   eta_squared <num:.3!>, Levene_p <dbl>, var_equal <chr>

You can also compute Tukey’s HSD post-hoc tests by setting post_hoc = TRUE. Results will be added as a tibble in a list column post_hoc.

WoJ %>% 
  unianova(employment, autonomy_selection, autonomy_emphasis, post_hoc = TRUE)
#> # A tibble: 2 × 10
#>   Variable            F df_num df_denom     p omega_squared post_hoc eta_squared
#> * <chr>           <num>  <dbl>    <dbl> <num>     <num:.3!> <list>     <num:.3!>
#> 1 autonomy_selec… 2.012      2      251 0.136         0.002 <df>          NA    
#> 2 autonomy_empha… 5.861      2     1192 0.003        NA     <df>           0.010
#> # ℹ 2 more variables: Levene_p <dbl>, var_equal <chr>

These can then be unnested with tidyr::unnest():

WoJ %>% 
  unianova(employment, autonomy_selection, autonomy_emphasis, post_hoc = TRUE) %>% 
  dplyr::select(Variable, post_hoc) %>% 
  tidyr::unnest(post_hoc)
#> # A tibble: 6 × 11
#>   Variable      Group_Var contrast Delta_M conf_lower conf_upper       p       d
#>   <chr>         <chr>     <chr>      <dbl>      <dbl>      <dbl>   <dbl>   <dbl>
#> 1 autonomy_sel… employme… Full-ti… -0.0780     -0.225     0.0688 0.422   -0.110 
#> 2 autonomy_sel… employme… Full-ti… -0.139      -0.329     0.0512 0.199   -0.155 
#> 3 autonomy_sel… employme… Part-ti… -0.0607     -0.284     0.163  0.798   -0.0729
#> 4 autonomy_emp… employme… Full-ti… -0.102      -0.278     0.0741 0.362   -0.133 
#> 5 autonomy_emp… employme… Full-ti… -0.217      -0.372    -0.0629 0.00284 -0.266 
#> 6 autonomy_emp… employme… Part-ti… -0.115      -0.333     0.102  0.428   -0.143 
#> # ℹ 3 more variables: se <dbl>, t <dbl>, df <dbl>

Visualize one-way ANOVAs the way you visualize almost everything in tidycomm:

WoJ %>% 
  unianova(employment, autonomy_selection, autonomy_emphasis) %>% 
  visualize()

Compute correlation tables and matrices

correlate() will compute correlations for all combinations of the passed variables:

WoJ %>% 
  correlate(work_experience, autonomy_selection, autonomy_emphasis)
#> # A tibble: 3 × 6
#>   x                  y                      r    df         p     n
#> * <chr>              <chr>              <dbl> <int>     <dbl> <int>
#> 1 work_experience    autonomy_selection 0.161  1182 2.71e-  8  1184
#> 2 work_experience    autonomy_emphasis  0.155  1180 8.87e-  8  1182
#> 3 autonomy_selection autonomy_emphasis  0.644  1192 4.83e-141  1194

If no variables passed, correlations for all combinations of numerical variables will be computed:

WoJ %>% 
  correlate()
#> # A tibble: 55 × 6
#>    x                  y                        r    df         p     n
#>  * <chr>              <chr>                <dbl> <int>     <dbl> <int>
#>  1 autonomy_selection autonomy_emphasis  0.644    1192 4.83e-141  1194
#>  2 autonomy_selection ethics_1          -0.0766   1195 7.98e-  3  1197
#>  3 autonomy_selection ethics_2          -0.0274   1195 3.43e-  1  1197
#>  4 autonomy_selection ethics_3          -0.0257   1195 3.73e-  1  1197
#>  5 autonomy_selection ethics_4          -0.0781   1195 6.89e-  3  1197
#>  6 autonomy_selection work_experience    0.161    1182 2.71e-  8  1184
#>  7 autonomy_selection trust_parliament  -0.00840  1195 7.72e-  1  1197
#>  8 autonomy_selection trust_government   0.0414   1195 1.53e-  1  1197
#>  9 autonomy_selection trust_parties      0.0269   1195 3.52e-  1  1197
#> 10 autonomy_selection trust_politicians  0.0109   1195 7.07e-  1  1197
#> # ℹ 45 more rows

Specify a focus variable using the with parameter to correlate all other variables with this focus variable.

WoJ %>% 
  correlate(autonomy_selection, autonomy_emphasis, with = work_experience)
#> # A tibble: 2 × 6
#>   x               y                      r    df            p     n
#> * <chr>           <chr>              <dbl> <int>        <dbl> <int>
#> 1 work_experience autonomy_selection 0.161  1182 0.0000000271  1184
#> 2 work_experience autonomy_emphasis  0.155  1180 0.0000000887  1182

Run a partial correlation by designating three variables along with the partial parameter.

WoJ %>% 
  correlate(autonomy_selection, autonomy_emphasis, partial = work_experience)
#> # A tibble: 1 × 7
#>   x                  y                 z                 r    df         p     n
#> * <chr>              <chr>             <chr>         <dbl> <dbl>     <dbl> <int>
#> 1 autonomy_selection autonomy_emphasis work_experie… 0.637  1178 3.07e-135  1181

Visualize correlations by passing the results on to the visualize() function:

WoJ %>% 
  correlate(work_experience, autonomy_selection) %>% 
  visualize()

If you provide more than two variables, you automatically get a correlogram (the same you would get if you convert correlations to a correlation matrix):

WoJ %>% 
  correlate(work_experience, autonomy_selection, autonomy_emphasis) %>% 
  visualize()
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2

By default, Pearson’s product-moment correlations coefficients (\(r\)) will be computed. Set method to "kendall" to obtain Kendall’s \(\tau\) or to "spearman" to obtain Spearman’s \(\rho\) instead.

To obtain a correlation matrix, pass the output of correlate() to to_correlation_matrix():

WoJ %>% 
  correlate(work_experience, autonomy_selection, autonomy_emphasis) %>% 
  to_correlation_matrix()
#> # A tibble: 3 × 4
#>   r                  work_experience autonomy_selection autonomy_emphasis
#> * <chr>                        <dbl>              <dbl>             <dbl>
#> 1 work_experience              1                  0.161             0.155
#> 2 autonomy_selection           0.161              1                 0.644
#> 3 autonomy_emphasis            0.155              0.644             1

Compute linear regressions

regress() will create a linear regression on one dependent variable with a flexible number of independent variables. Independent variables can thereby be continuous, dichotomous, and factorial (in which case each factor level will be translated into a dichotomous dummy variable version):

WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government)
#> # A tibble: 3 × 6
#>   Variable              B  StdErr    beta     t         p
#> * <chr>             <dbl>   <dbl>   <dbl> <dbl>     <dbl>
#> 1 (Intercept)      3.52   0.0906  NA      38.8  3.02e-213
#> 2 work_experience  0.0121 0.00211  0.164   5.72 1.35e-  8
#> 3 trust_government 0.0501 0.0271   0.0531  1.85 6.49e-  2
#> # F(2, 1181) = 17.400584, p = 0.000000, R-square = 0.028624

The function automatically adds standardized beta values to the expected linear-regression output. You can also opt in to calculate up to three precondition checks:

WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government,
          check_independenterrors = TRUE,
          check_multicollinearity = TRUE,
          check_homoscedasticity = TRUE)
#> # A tibble: 3 × 8
#>   Variable              B  StdErr    beta     t         p   VIF tolerance
#> * <chr>             <dbl>   <dbl>   <dbl> <dbl>     <dbl> <dbl>     <dbl>
#> 1 (Intercept)      3.52   0.0906  NA      38.8  3.02e-213 NA       NA    
#> 2 work_experience  0.0121 0.00211  0.164   5.72 1.35e-  8  1.01     0.995
#> 3 trust_government 0.0501 0.0271   0.0531  1.85 6.49e-  2  1.01     0.995
#> # F(2, 1181) = 17.400584, p = 0.000000, R-square = 0.028624
#> - Check for independent errors: Durbin-Watson = 1.928431 (p = 0.228000)
#> - Check for homoscedasticity: Breusch-Pagan = 0.181605 (p = 0.669997)
#> - Check for multicollinearity: VIF/tolerance added to output

For linear regressions, a number of visualizations are possible. The default one is the visualization of the result(s), is that the dependent variable is correlated with each of the independent variables separately and a linear model is presented in these:

WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize()

Alternatively you can visualize precondition-check-assisting depictions. Correlograms among independent variables, for example:

WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize(which = "correlogram")

Next up, visualize a residuals-versus-fitted plot to determine distributions:

WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize(which = "resfit")

Or use a (normal) probability-probability plot to check for multicollinearity:

WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize(which = "pp")

The (normal) quantile-quantile plot also helps checking for multicollinearity but focuses more on outliers:

WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize(which = "qq")

Next up, the scale-location (sometimes also called spread-location) plot checks whether residuals are spread equally to help check for homoscedasticity:

WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize(which = "scaloc")

Finally, visualize the residuals-versus-leverage plot to check for influential outliers affecting the final model more than the rest of the data:

WoJ %>% 
  regress(autonomy_selection, work_experience, trust_government) %>% 
  visualize(which = "reslev")