Get Started

Introduction

This article is an introduction to double/debiased machine learning with short-stacking in R using the ddml package. Topics discussed below include:

  1. Estimation with a single machine learner
  2. Estimation with multiple machine learners & short-stacking
  3. Estimation using different types of short-stacking

See Articles for discussions of more advanced topics.

Estimation with a Single Machine Learner

For illustration, we apply ddml to the included random subsample of 5,000 observations from the data of Angrist & Evans (1998). The data contains information on mothers' labor supply and their children, as well as demographic variables. See ?AE98 for details.

# Load ddml and set seed
library(ddml)
set.seed(35611)

# Construct variables from the included Angrist & Evans (1998) data
y <- AE98[, "worked"]
D <- AE98[, "morekids"]
Z <- AE98[, "samesex"]
X <- AE98[, c("age", "agefst", "black", "hisp", "othrace", "educ")]
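
A quick look at the raw data and the constructed control matrix can help verify the setup; everything below uses only objects defined above (output omitted here):

# Inspect the data and the dimensions of the control matrix
head(AE98)
dim(X)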

ddml_late estimates the local average treatment effect (LATE) using double/debiased machine learning (see ?ddml_late). The high-dimensional nuisance parameters arising in the estimation of the LATE are conditional expectation functions of the control variables \(X\). In particular, we require first-step estimates of the reduced forms \(E[Y|Z=z, X]\) and \(E[D|Z=z, X]\) for \(z=0,1\), as well as \(E[Z|X]\). In the absence of functional form assumptions, these conditional expectations need to be estimated nonparametrically.
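
For intuition, and up to the details of cross-fitting, these nuisance estimates enter the LATE estimator in the familiar doubly robust form sketched below (see ?ddml_late and the references therein for the exact expression implemented by the package):

\[
\hat{\theta} = \frac{\frac{1}{n}\sum_{i=1}^{n}\Big[\hat{E}[Y|Z=1,X_i] - \hat{E}[Y|Z=0,X_i] + \frac{Z_i\big(Y_i - \hat{E}[Y|Z=1,X_i]\big)}{\hat{E}[Z|X_i]} - \frac{(1-Z_i)\big(Y_i - \hat{E}[Y|Z=0,X_i]\big)}{1-\hat{E}[Z|X_i]}\Big]}{\frac{1}{n}\sum_{i=1}^{n}\Big[\hat{E}[D|Z=1,X_i] - \hat{E}[D|Z=0,X_i] + \frac{Z_i\big(D_i - \hat{E}[D|Z=1,X_i]\big)}{\hat{E}[Z|X_i]} - \frac{(1-Z_i)\big(D_i - \hat{E}[D|Z=0,X_i]\big)}{1-\hat{E}[Z|X_i]}\Big]},
\]

where each conditional expectation is replaced by its cross-fitted first-step estimate.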

Here, we consider gradient boosting from the popular xgboost package to estimate the nuisance parameters. The function mdl_xgboost is a wrapper for xgboost that allows users to pass all parameters of the original function. See ?mdl_xgboost for details, and take a look at vignette("new_ml_wrapper") to learn how to write a wrapper for a different machine learner yourself; a minimal sketch of the wrapper pattern follows below.
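
As a loose illustration of that pattern (the authoritative interface is described in vignette("new_ml_wrapper")), a wrapper is a function that takes the outcome y and the matrix of predictors X, fits a model, and returns an object for which a corresponding predict method is defined. The names mdl_my_glm and predict.mdl_my_glm below are purely illustrative:

# Minimal wrapper sketch around stats::glm (illustrative only; see
#     vignette("new_ml_wrapper") for the authoritative interface).
mdl_my_glm <- function(y, X, ...) {
  df <- data.frame(y = y, X)                 # convert matrices to a data.frame for glm
  fit <- stats::glm(y ~ ., data = df, ...)   # fit the model
  class(fit) <- c("mdl_my_glm", class(fit))  # prepend class for method dispatch
  fit
}

# Corresponding predict method: return fitted values for a new matrix of controls.
predict.mdl_my_glm <- function(object, newdata, ...) {
  class(object) <- class(object)[-1]         # drop wrapper class, dispatch to predict.glm
  predict(object, newdata = data.frame(newdata), type = "response")
}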

# Specify a single learner
learners_single <- list(what = mdl_xgboost,
                        args = list(nrounds = 10,
                                    max_depth = 1))

Double/debiased machine learning relies on cross-fitting to avoid large bias from overfitting when estimating the nuisance parameters. The argument sample_folds = 3 implies that 2/3 of the sample (about 3,333 observations) is used to train the machine learner in each cross-fitting fold.

# Estimate the local average treatment effect using xgboost.
late_fit <- ddml_late(y, D, Z, X,
                      learners = learners_single,
                      sample_folds = 3,
                      silent = TRUE)
summary(late_fit)
#> LATE estimation results: 
#>  
#>     Estimate Std. Error   t value  Pr(>|t|)
#>   -0.2680106  0.2449627 -1.094087 0.2739166

(Note that estimation here is based on a random subsample of 5,000 observations. The results therefore cannot readily be compared to those in Angrist & Evans (1998).)

Estimation with Multiple Machine Learners & Short-Stacking

Since the statistical properties of machine learners depend heavily on the underlying (unknown!) structure of the data, adaptive combination of multiple machine learners can increase robustness. In the snippet below, we specify three base learners, which ddml_late then combines via short-stacking:

# Specify multiple base learners: ols, lasso (glmnet), and gradient boosting (xgboost)
learners_multiple <- list(list(fun = ols),
                          list(fun = mdl_glmnet),
                          list(fun = mdl_xgboost,
                               args = list(nrounds = 10,
                                           max_depth = 1)))

Short-stacking is a computationally convenient variant of stacking, which was originally introduced by Wolpert (1992). Stacking constructs linear combinations of base learners that minimize the out-of-sample mean squared error of a particular reduced form (e.g., \(E[Z|X]\)). Short-stacking reuses the out-of-sample predictions that naturally arise from cross-fitting in the computation of double/debiased machine learning estimates, which substantially reduces the computational burden (see vignette("stacking")).

In finite samples, regularizing the linear combination of base learners constructed via (short-)stacking can improve statistical properties. The type of regularization can be specified via the ensemble_type argument. Below, ddml_late estimates the nuisance parameters via linear combinations of the three base learners, with coefficients constrained to be non-negative and sum to one.
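
Heuristically, for a given reduced form (say \(E[Z|X]\)), the nnls1 weights come from regressing the target of that reduced form (here \(Z\)) on the base learners' cross-fitted out-of-sample predictions, subject to the constraints just described; a sketch of the idea (see vignette("stacking") for the exact construction):

\[
\hat{w} = \arg\min_{w_k \geq 0,\ \sum_k w_k = 1}\ \sum_{i=1}^{n} \Big( Z_i - \sum_{k} w_k \hat{f}_k^{(-i)}(X_i) \Big)^2,
\]

where \(\hat{f}_k^{(-i)}(X_i)\) denotes the out-of-fold prediction of base learner \(k\) for observation \(i\).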

# Estimate the local average treatment effect using short-stacking with base
#     learners ols, lasso, and xgboost.
late_fit <- ddml_late(y, D, Z, X,
                      learners = learners_multiple,
                      ensemble_type = 'nnls1',
                      shortstack = TRUE,
                      sample_folds = 3,
                      silent = TRUE)
summary(late_fit)
#> LATE estimation results: 
#>  
#>       Estimate Std. Error    t value  Pr(>|t|)
#> nnls1 -0.19173  0.2005266 -0.9561322 0.3390054

It is often insightful to see which base learners contribute the most to the final reduced form estimates. The snippet below shows the short-stacking weights for the reduced forms \(E[Y|Z=0,X]\) and \(E[Y|Z=1,X]\):

cat("Stacking weights for E[Y|Z=0, X]: \n")
#> Stacking weights for E[Y|Z=0, X]:
t(late_fit$weights$y_X_Z0)
#>           [,1]      [,2]      [,3]
#> nnls1 0.228205 0.5782117 0.1935833

cat("Stacking weights for E[Y|Z=1, X]: \n")
#> Stacking weights for E[Y|Z=1, X]:
t(late_fit$weights$y_X_Z1)
#>            [,1]      [,2]      [,3]
#> nnls1 0.2389935 0.6920264 0.0689801

Estimation using Different Types of Short-Stacking

ddml supports multiple schemes for combining base learners. Since each scheme relies on the out-of-sample predictions of the base learners, it is computationally cheap to compute several of them simultaneously. The snippet below estimates the LATE using four different ensemble types:

# Estimate the local average treatment effect using short-stacking with base
#     learners ols, lasso, and xgboost, and four different ensemble types.
late_fit <- ddml_late(y, D, Z, X,
                      learners = learners_multiple,
                      ensemble_type = c('nnls', 'singlebest',
                                        'ols', 'average'),
                      shortstack = TRUE,
                      sample_folds = 3,
                      silent = TRUE)
summary(late_fit)
#> LATE estimation results: 
#>  
#>              Estimate Std. Error   t value  Pr(>|t|)
#> nnls       -0.2432971  0.2120355 -1.147436 0.2512017
#> singlebest -0.2518913  0.1967114 -1.280512 0.2003652
#> ols        -0.2417658  0.2122186 -1.139230 0.2546073
#> average    -0.2117828  0.1958048 -1.081601 0.2794296
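
The point estimates for each ensemble type can also be accessed directly from the returned object, alongside the stacking weights shown earlier; a minimal sketch, with the component name taken from our reading of ?ddml_late (output omitted here):

# Access the LATE point estimates for each ensemble type directly
late_fit$late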

References

Angrist J, Evans W (1998). “Children and Their Parents’ Labor Supply: Evidence from Exogenous Variation in Family Size.” American Economic Review, 88(3), 450-477.

Wolpert D H (1992). “Stacked generalization.” Neural Networks, 5(2), 241-259.