Marginal standardization

The basics

Marginal standardization, approach = "margstd_delta" and approach = "margstd_boot", makes use of the good convergence properties of the logit link, which is also guaranteed to result in probabilities within (0; 1).

After fitting a logistic model, predicted probabilities for each observation are obtained for the levels of the exposure variable. Risk ratios and risk differences are calculated by contrasting the predicted probabilities as ratios or differences. Standard errors and confidence intervals are obtained via the delta method or via bootstrapping the entire procedure.

We use the same example data as in the Get Started vignette.

library(risks)  # provides riskratio(), riskdiff(), postestimation functions
library(dplyr)  # For data handling
library(broom)  # For tidy() model summaries
data(breastcancer)

We fit the same risk difference model as in the Get Started vignette, this time explicitly specifying approach = "margstd_delta":

fit_margstd <- riskdiff(formula = death ~ stage + receptor, 
                        data = breastcancer, 
                        approach = "margstd_delta")
summary(fit_margstd)
#> 
#> Risk difference model, fitted via marginal standardization of a logistic model with delta method (margstd_delta).
#> Call:
#> stats::glm(formula = death ~ stage + receptor, family = binomial(link = "logit"), 
#>     data = breastcancer, start = "(no starting values)")
#> 
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#>                                         
#> 
#> Coefficients: (3 not defined because of singularities)
#>                Estimate Std. Error z value Pr(>|z|)    
#> stageStage I    0.00000    0.00000     NaN      NaN    
#> stageStage II   0.16303    0.05964   2.734  0.00626 ** 
#> stageStage III  0.57097    0.09962   5.732 9.95e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 228.15  on 191  degrees of freedom
#> Residual deviance: 185.88  on 188  degrees of freedom
#> AIC: 193.88
#> 
#> Number of Fisher Scoring iterations: 4
#> 
#> Confidence intervals for coefficients: (delta method)
#>                     2.5 %    97.5 %
#> stageStage I   0.00000000 0.0000000
#> stageStage II  0.04614515 0.2799187
#> stageStage III 0.37571719 0.7662158

Consistent with earlier results, we observed that women with stage III breast cancer have a 57 percentage points higher risk of death, adjusting for hormone receptor status.

Note that coefficients and standard errors are only estimated for the exposure variable. Model fit characteristics and predicted values stem directly from the underlying logistic model.

Choice of exposure variable

Standardization can only be done over one exposure variable, and thus no coefficients are estimated for other variables in the model.

Requesting estimates for a different exposure variable, using variable = "receptor":

fit_margstd2 <- riskdiff(formula = death ~ stage + receptor, 
                         data = breastcancer, 
                         approach = "margstd_delta", variable = "receptor")
summary(fit_margstd2)
#> 
#> Risk difference model, fitted via marginal standardization of a logistic model with delta method (margstd_delta).
#> Call:
#> stats::glm(formula = death ~ stage + receptor, family = binomial(link = "logit"), 
#>     data = breastcancer, start = "(no starting values)")
#> 
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#>                                         
#> 
#> Coefficients: (3 not defined because of singularities)
#>              Estimate Std. Error z value Pr(>|z|)  
#> receptorHigh  0.00000    0.00000     NaN      NaN  
#> receptorLow   0.16163    0.07454   2.169   0.0301 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 228.15  on 191  degrees of freedom
#> Residual deviance: 185.88  on 188  degrees of freedom
#> AIC: 193.88
#> 
#> Number of Fisher Scoring iterations: 4
#> 
#> Confidence intervals for coefficients: (delta method)
#>                   2.5 %    97.5 %
#> receptorHigh 0.00000000 0.0000000
#> receptorLow  0.01554788 0.3077201

Bootstrap confidence intervals

Marginal standardization cannot rely on analytical standard error formulae. One approach ("margstd_delta") is the delta method. Alternatively, parametric bootstrapping ("margstd_boot" with the default, bootci = "bca") can be used: given the initial maximum-likelihood estimates of the coefficients and the observed predictors, the outcome vector is predicted from a binomial distribution. The model is fit again using the predicted outcomes and observed predictors. Repeating this process generates the empirical distribution of the coefficients.

Confidence intervals for coefficients are calculated based on BCa confidence intervals for parametric bootstrapping (Efron B, Narasimhan B. The Automatic Construction of Bootstrap Confidence Intervals. J Comput Graph Stat 2020;29(3):608-619).

For comparison purposes, alternative approaches to bootstrapping and confidence interval estimation can requested in tidy(), summary.margstd(), and confint():

When using tidy() to access model results, the parameter bootverbose = TRUE will add bootstrapping parameters and properties to the tibble:

fit_margstd3 <- riskdiff(formula = death ~ stage + receptor, 
                         data = breastcancer, 
                         approach = "margstd_boot")
summary(fit_margstd3)
#> 
#> Risk difference model, fitted via marginal standardization of a logistic model with bootstrapping (margstd_boot).
#> Call:
#> stats::glm(formula = death ~ stage + receptor, family = binomial(link = "logit"), 
#>     data = breastcancer, start = "(no starting values)")
#> 
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#>                                         
#> 
#> Coefficients: (3 not defined because of singularities)
#>                Estimate Std. Error z value Pr(>|z|)    
#> stageStage I    0.00000    0.00000     NaN      NaN    
#> stageStage II   0.16303    0.06133   2.658  0.00785 ** 
#> stageStage III  0.57097    0.09823   5.813 6.15e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 228.15  on 191  degrees of freedom
#> Residual deviance: 185.88  on 188  degrees of freedom
#> AIC: 193.88
#> 
#> Number of Fisher Scoring iterations: 4
#> 
#> Confidence intervals for coefficients: (1000 BCa bootstrap repeats)
#>                      2.5%     97.5%
#> stageStage I           NA        NA
#> stageStage II  0.02144726 0.2737863
#> stageStage III 0.37966981 0.7484015

tidy(fit_margstd3, bootverbose = TRUE) %>%
  select(-statistic, -p.value)  # allow the other columns to get printed instead
#> # A tibble: 3 × 10
#>   term       estimate std.error conf.low conf.high bootrepeats bootci jacksd.low
#>   <chr>      <dbl[1d>     <dbl>    <dbl>     <dbl>       <dbl> <chr>       <dbl>
#> 1 stageStag…    0        0       NA         NA            1000 bca      NA      
#> 2 stageStag…    0.163    0.0587   0.0233     0.258        1000 bca       0.00942
#> 3 stageStag…    0.571    0.100    0.357      0.753        1000 bca       0.0146 
#> # ℹ 2 more variables: jacksd.high <dbl>, model <chr>

Bootstrap repeats

The default are (currently) only 1000 bootstrap repeats to reduce initial computation time. If the calculation of BCa confidence intervals fails with an error, and for final, precise estimates, the number of repeats should be increased to >>1000. Use the option bootrepeats = in summary(), tidy(), or confint().

For reproducibility, a seed should be set (e.g., set.seed(123)).