Estimating controlled direct effects

2024-10-01

This vignette demonstrates how to estimate the controlled direct effect of some treatment, net the effect of some mediator using sequential g-estimation approach as described by Acharya, Blackwell, and Sen (2016). With standard regression and matching approaches, estimating the controlled direct effect is difficult because it often requires adjusting for covariates that causally affect the mediator and are causally affected by the treatment. Including such variables in a regression model can lead to post-treatment bias when estimating the direct effect of treatment. A large class of models and estimation approaches have been developed to avoid this problem, but here we focus on one called sequential g-estimation. This approach uses a standard linear model to estimate the effect of the mediator, and then removes that effect from effect from the dependent variable to estimate the direct effect of treatment without post-treatment bias.

Quantity of interest

Here we briefly introduce some of the statistical theory behind estimating direct effects. The main quantity of interest is the average controlled direct effect or ACDE. This can be illustrated from the following causal structure (Acharya, Blackwell, and Sen 2016, fig. 3):

DAG with direct effects structure
DAG with direct effects structure

The average controlled direct effect defined for a given treatment (\(A_i = a\) vs. \(A_i = a^\prime\)) and a given value of the mediator (\(M_i = m\)) is \[ACDE(a, a^\prime, m) = E[Y_i(a, m) - Y_i(a^\prime, m)]\] and is the total of the dashed lines in the figure above. Thus, we hold the mediator constant when comparing the outcome under each treatment. The ACDE is useful for discriminating between causal mechanisms, because the average total effect of treatment, \(\tau(a, a^\prime) \equiv E[Y_i(a) - Y_i(a^\prime)]\), can be decomposed as the sum of three quantities: (1) the ACDE, (2) the average natural indirect effect, and (3) a reference interaction that measures of how much the direct effect of \(A_i\) depends on a particular \(M_i = m\). Thus, if the ACDE is non-zero, it is evidence that the effect of \(A_i\) is not entirely explained by \(M_i\). We illustrate this with an empirical example. For more information on this, see Acharya, Blackwell, and Sen (2016).

Implementation

This vignette introduces a new way to design, specify, and implement various estimators for controlled direct effects using a coding style that we refer to as the “assembly line.” Users can specify the various components of the estimators sequentially and easily swap out implementation details such as what exact estimators are used for the estimation of each nuisance function or what arguments to pass to these functions. A unified structure and output across different research designs and estimators should make comparing estimators very easy. Finally, this implementation is built around the idea of cross-fitting for nuisance function estimation.

Let’s see how this works in practice. We first load a subset of the jobcorps data from the package:

data("jobcorps")
jc <- jobcorps[
  1:200,
  c("treat", "female", "age_cat", "work2year2q", "pemplq4", "emplq4", "exhealth30")
]

This is the Job Corps experiment data used in Huber (2014) to estimate the effect of a randomized job training program on self-assessed health holding constant the intermediate outcome of employment. We use a subset of the variables from their study, but the basic variable we use here:

In context of our CDE estimators, we refer to be \(A_i\) (the treatment) and \(M_i\) (the mediator) as treatment variables because we are interested in causal effects that contrast different combinations of these two variables.

The first step in our assembly line is to choose the type of CDE estimator (or really estimation strategy) that we want to pursue. These choices include IPW (cde_ipw()), regression imputation (cde_reg_impute()), and augmented IPW (cde_aipw()) among others. Some of these different functions will take different arguments to specify different aspects of the estimator. The AIPW estimator is useful to demonstrate because it uses many of the features of the assembly line approach.

my_aipw <- cde_aipw() |>
  set_treatment(treat, ~ female + age_cat) |>
  treat_model(engine = "logit") |>
  outreg_model(engine = "lm") |>
  set_treatment(work2year2q, ~ emplq4 + pemplq4) |>
  treat_model(engine = "logit") |>
  outreg_model(engine = "lm") |>
  estimate(exhealth30 ~ treat + work2year2q, data = jobcorps)

The call here consists of several steps connected by R’s pipe operator. The first call cde_aipw() simply initializes the estimator as taking the AIPW form, meaning that it will use both propensity score models and outcome regressions to estimate the causal effects.

Following this we have two blocks of 3 lines each that look similar with a few differences. These “treatment blocks” specify the causal variables of interest in their causal ordering. In set_treatment() we set the name of the treatment variable and give a formula describing the pre-treatment covariates for this treatment. This formula, unless overridden by the next two functions, will be used in the fitting of the propensity score and outcome regression models. treat_model() allows the user to specify how to fit the propensity score model for this treatment with a choice of engine and optional arguments to pass to each engine. outreg_model() does the same for the outcome regression. We repeat this specification for the mediator.

The estimate() function finally implements the specifications given in the previous commands. Users use a formula to set the outcome and which of the treatment variables we want effects for and indicate the data frame where all of these variables can be found. By default, the nuisance functions (the propensity score model and the outcome regression) are fit using cross-fitting.

We can either use the summary() or tidy() functions to get a summary of the different effects being estimated:

summary(my_aipw)
## 
##  aipw CDE Estimator
## ---------------------------
## Causal variable: treat 
## 
## Treatment model: treat ~ female + age_cat 
## Treatment engine: logit 
## 
## Outcome model: ~female + age_cat 
## Outcome engine: lm 
## ---------------------------
## Causal variable: work2year2q 
## 
## Treatment model: work2year2q ~ emplq4 + pemplq4 + female + age_cat 
## Treatment engine: logit 
## 
## Outcome model: ~emplq4 + pemplq4 + female + age_cat 
## Outcome engine: lm 
## ---------------------------
## Cross-fit: TRUE 
## Number of folds: 5 
## 
## Estimated Effects:
##                                  Estimate Std. Error  t value Pr(>|t|)
## treat [(1, 0) vs. (0, 0)]        0.030412    0.02259  1.34644  0.17824
## treat [(1, 1) vs. (0, 1)]        0.029877    0.01396  2.14080  0.03233
## work2year2q [(0, 1) vs. (0, 0)]  0.001241    0.02077  0.05974  0.95237
## work2year2q [(1, 1) vs. (1, 0)] -0.001672    0.01655 -0.10101  0.91955
##                                  CI Lower CI Upper   DF
## treat [(1, 0) vs. (0, 0)]       -0.013871  0.07469 3818
## treat [(1, 1) vs. (0, 1)]        0.002518  0.05724 6207
## work2year2q [(0, 1) vs. (0, 0)] -0.039475  0.04196 3991
## work2year2q [(1, 1) vs. (1, 0)] -0.034120  0.03078 6034
tidy(my_aipw)
##                                            term     estimate  std.error
## treat_1_0             treat [(1, 0) vs. (0, 0)]  0.030411569 0.02258657
## treat_1_1             treat [(1, 1) vs. (0, 1)]  0.029876867 0.01395590
## work2year2q_0_1 work2year2q [(0, 1) vs. (0, 0)]  0.001240634 0.02076711
## work2year2q_1_1 work2year2q [(1, 1) vs. (1, 0)] -0.001671883 0.01655227
##                   statistic    p.value     conf.low  conf.high   df
## treat_1_0        1.34644455 0.17823912 -0.013871340 0.07469448 3818
## treat_1_1        2.14080471 0.03232863  0.002518461 0.05723527 6207
## work2year2q_0_1  0.05974031 0.95236546 -0.039474510 0.04195578 3991
## work2year2q_1_1 -0.10100627 0.91954883 -0.034120251 0.03077648 6034

Because the estimate() function included both treat and work2year2q in the formula, the output includes both the controlled direct effects of the treatment and the conditional average treatment effect of the mediator. Furthermore, by default, the output contains estimates for an effect for each combination of other treatment variables. This can be changed by setting the eval_vals argument in the relevant set_treatment() call.

The modular structure of the call allow users to easily swap out parts of the models. For example, we could easily alter the above call to use lasso versions of the regression and logit model.

my_aipw_lasso <- cde_aipw() |>
  set_treatment(treat, ~ female + age_cat) |>
  treat_model(engine = "rlasso_logit") |>
  outreg_model(engine = "rlasso") |>
  set_treatment(work2year2q, ~ emplq4 + pemplq4) |>
  treat_model(engine = "rlasso_logit") |>
  outreg_model(engine = "rlasso") |>
  estimate(exhealth30 ~ treat + work2year2q, data = jobcorps)

Here we use a “rigorous” version of the lasso from the hdm package to speed up computation. There are a number of different engines for different outcome and treatment types including:

References

Acharya, Avidit, Matthew Blackwell, and Maya Sen. 2016. “Explaining Causal Findings Without Bias: Detecting and Assessing Direct Effects.” American Political Science Review 110 (3): 512–29. https://doi.org/10.1017/S0003055416000216.
Huber, Martin. 2014. “Identifying Causal Mechanisms (Primarily) Based on Inverse Probability Weighting.” Journal of Applied Econometrics 29 (6): 920–43.