An Introduction to the distfreereg Package

Introduction

This vignette introduces the basic functionality of the distfreereg package. The package has two main functions: distfreereg() and compare(). The former is introduced below; compare() is introduced here.

The function distfreereg() implements the distribution-free regression testing procedure introduced by Khmaladze (2021). Its purpose is to test whether or not a specified function \(Y=f(X;\theta)\) is the conditional mean, \({\rm E}(Y\mathrel|X)\). Specifically, the test is of the following type: \[\begin{equation} H_0{:}\ \exists\theta\in\Theta\,|\,{\rm E}(Y\mathrel|X)=f(X;\theta) \quad\hbox{against}\quad H_1{:}\ \forall\theta\in\Theta\,|\,{\rm E}(Y\mathrel|X)\neq f(X;\theta). \label{eqn:hypo1} \end{equation}\]

This vignette starts by discussing the testing of a linear model created using lm(). Following this is an analogous discussion of testing a non-linear model created using nls(). Both scenarios can be replicated using the formulas directly. In the most general implementation, the mean function is specified by an R function. To use distfreereg() with a mean function implemented in software other than R, see the section on the default method.

Testing a Linear Model

Suppose that a linear model object m is created using lm(), and we want to test whether or not a linear model is appropriate; specifically, whether or not the true mean function follows a specified form that is linear in its parameters.

In some cases, a residual plot suffices to show that the specified form is wrong. Below, the true mean function is \({\rm E}(y|x)=x^{2.1}\), but the formula used to build the model is linear in \(x\). The residuals plot clearly indicates model misspecification.

set.seed(20240304)
n <- 3e2
form_lm <- y ~ x
data_lm <- data.frame(x = runif(n, min = 0, max = 3))
data_lm$y <- data_lm$x^2.1 + rnorm(n, sd = 0.3)
m_1 <- lm(form_lm, data = data_lm)
plot(m_1, which = 1)

If, however, we use a form that is quadratic in \(x\), the residuals plot is suggestive, but it leaves us without a clear conclusion. How certain can we be that the specified form is wrong?

form_lm_2 <- y ~ I(x^2)
m_2 <- lm(form_lm_2, data = data_lm)
plot(m_2, which = 1)

In this case, there appears to be evidence of a mis-specified mean, but we need a formal test to quantify our (un)certainty. The function distfreereg() implements such a test.

set.seed(20240304)
(dfr_lm_2 <- distfreereg(test_mean = m_2))
## 
## Number of observations: 300 
## Monte Carlo simulations: 10000 
## 
## Estimated parameter values:
## (Intercept)     I(x^2) 
##  -7.718e-02  1.101e+00 
## 
## Observed statistics:
## Stat      Value  Pr(>Value)       MCSE 
##   KS  1.237e+00   7.270e-02  2.596e-03 
##  CvM  4.506e-01   5.260e-02  2.232e-03 
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.

In fact, tests based on two statistics are included by default: a Kolmogorov–Smirnov (KS) statistic and a Cramér–von Mises (CvM) statistic, defined as \(\max_i|w_i|\) and \({1\over n}\sum_iw_i^2\), respectively, where \(w\mathrel{\raise 0.4pt\hbox{$:$}\hskip-2pt=}(w_1,\ldots,w_n)\) is the partial sum process calculated by distfreereg(). The preceding output shows a table summarizing the tests’ results, including p-values and Monte Carlo standard errors thereof. Both of the p-values are marginal, supplying a measure of our intuitive conclusion based on the residuals plot.

Testing a Non-Linear Model

The procedure just illustrated applies to objects created with nls(), as well. Here, the data generating function has the form being tested, so we should not expect the null to be rejected. Note that the weights argument in nls() is set equal to the inverse of the variance vector.

set.seed(20240304)
n <- 3e2
sds <- runif(n, min = 0.5, max = 5)
data_nls <- data.frame(x = rnorm(n), y = rnorm(n))
data_nls$z <- exp(3*data_nls$x) - 2*data_nls$y^2 + rnorm(n, sd = sds)
form_nls <- z ~ exp(a*x) - b*y^2
m_2 <- nls(form_nls, data = data_nls, weights = 1/sds^2, start = c(a = 1, b = 1))

Testing this form of the mean structure is done just as with an lm object:

set.seed(20240304)
(dfr_nls <- distfreereg(test_mean = m_2))
## 
## Number of observations: 300 
## Monte Carlo simulations: 10000 
## 
## Estimated parameter values:
##         a          b 
## 3.000e+00  1.957e+00 
## 
## Observed statistics:
## Stat      Value  Pr(>Value)       MCSE 
##   KS  7.237e-01   6.151e-01  4.866e-03 
##  CvM  1.052e-01   5.547e-01  4.970e-03 
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.

Both tests fail to reject the null.

Means Specified by Formulas

If no lm or nls object has been created yet, the formula and data can be supplied directly to distfreereg(). Below, we recreate the examples from above without first creating the lm and nls objects.

The most notable practical difference in implementing the tests in this context is that three arguments are required rather than just one:

  1. test_mean: a formula object specifying the mean function;

  2. data: a data frame containing the data;

  3. method: specifies whether to use lm() or nls(). Defaults to “lm”.

Formulas with lm()

The following call recreates the setup behind dfr_lm_2 above.

set.seed(20240304)
(dfr_form_lm_2 <- distfreereg(test_mean = form_lm_2, data = data_lm))
## 
## Number of observations: 300 
## Monte Carlo simulations: 10000 
## 
## Estimated parameter values:
## (Intercept)     I(x^2) 
##  -7.718e-02  1.101e+00 
## 
## Observed statistics:
## Stat      Value  Pr(>Value)       MCSE 
##   KS  1.237e+00   7.270e-02  2.596e-03 
##  CvM  4.506e-01   5.260e-02  2.232e-03 
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.

The key pieces of the output are identical to those of dfr_lm:

identical(dfr_lm_2$observed_stats, dfr_form_lm_2$observed_stats)
## [1] TRUE
identical(dfr_lm_2$p, dfr_form_lm_2$p)
## [1] TRUE

Formulas with nls()

Two additional arguments are used here:

  1. method: a character string specifying the method to use to fit the model. Its default value is “lm”, so we must specify “nls” here.

  2. covariance: a list containing a named object specifying the error covariance structure. This ultimately supplies the weights values to nls(). This can be omitted if weights are not being used. (See below for more details.)

  3. theta_init: a numeric vector specifying the starting values for parameter estimation. (This is optional, since nls() will use default starting values with a warning if not supplied. See nls() documentation for further details.)

The following call recreates the setup behind dfr_nls. The covariance structure is specified by, P, the precision matrix. This produces results identical, in the sense of identical(), to those from the example from above. Using any other option (e.g., Sigma = diag(sds^2)) would result in numerically equivalent results in the sense of all.equal().

set.seed(20240304)
(dfr_form_nls <- distfreereg(test_mean = form_nls, data = data_nls, method = "nls",
                             covariance = list(P = diag(1/sds^2)),
                             theta_init = c(a = 1, b = 1)))
## 
## Number of observations: 300 
## Monte Carlo simulations: 10000 
## 
## Estimated parameter values:
##         a          b 
## 3.000e+00  1.957e+00 
## 
## Observed statistics:
## Stat      Value  Pr(>Value)       MCSE 
##   KS  7.237e-01   6.151e-01  4.866e-03 
##  CvM  1.052e-01   5.547e-01  4.970e-03 
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.

The key pieces of the output are identical to those of dfr_nls:

identical(dfr_nls$observed_stats, dfr_form_nls$observed_stats)
## [1] TRUE
identical(dfr_nls$p, dfr_form_nls$p)
## [1] TRUE

Means Specified by Functions

The most general case that can be handled entirely within R is that in which the mean regression function is specified by an R function.

Suppose that we want to replicate the situation in the nls examples above, but we want to account for errors that have a non-diagonal covariance matrix. (The weights argument of nls() limits that function’s applicability to diagonal covariance matrices.) The example below illustrates how to do this.

The function being tested is specified by test_mean, which is a function with two arguments, X and theta. Here, X represents the entire matrix of covariates. Therefore, to replicate the references to x and y in this example, we use X[,1] and X[,2], respectively. The arguments X and Y supply the covariate matrix and the response vector, respectively, in lieu of the data argument.

set.seed(20240304)
n <- 3e2
true_mean <- function(X, theta) exp(theta[1]*X[,1]) - theta[2]*X[,2]^2
test_mean <- true_mean
theta <- c(3,-2)
Sigma <- rWishart(1, df = n, Sigma = diag(n))[,,1]
X <- matrix(rnorm(2*n), nrow = n)
Y <- distfreereg:::f2ftheta(true_mean, X)(theta) +
  distfreereg:::rmvnorm(n = n, reps = 1, SqrtSigma = distfreereg:::matsqrt(Sigma))

(dfr_1 <- distfreereg(test_mean = test_mean, Y = Y, X = X,
                      covariance = list(Sigma = Sigma),
                      theta_init = rep(1, length(theta))))
## 
## Number of observations: 300 
## Monte Carlo simulations: 10000 
## 
## Estimated parameter values:
##    theta1      theta2 
## 3.000e+00  -2.037e+00 
## 
## Observed statistics:
## Stat      Value  Pr(>Value)       MCSE 
##   KS  1.016e+00   2.226e-01  4.160e-03 
##  CvM  1.330e-01   4.419e-01  4.966e-03 
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.

The Default Method: Using the Algorithm with External Functions

All of the examples above illustrate how distfreereg() can be used when the mean function being tested is defined in R. If the mean function is not defined in R, distfreereg() can still be used as long as certain objects can be imported into R. Specifically, distfreereg() needs the model’s fitted values and the Jacobian of the mean function, evaluated at the estimated parameter vector for each observation’s covariate values.

To illustrate this, we extract certain elements from the example above, and pretend that these were created using external software, and then imported into R.

J <- dfr_1$J
fitted_values <- fitted(dfr_1)

These can be supplied to distfreereg() to implement the test. Note that test_mean is set to NULL, and since no parameter estimation is done here, no value for theta_init is specified.

distfreereg(test_mean = NULL, Y = Y, X = X, fitted_values = fitted_values,
            J = J, covariance = list(Sigma = Sigma))
## 
## Number of observations: 300 
## Monte Carlo simulations: 10000 
## 
## Observed statistics:
## Stat      Value  Pr(>Value)       MCSE 
##   KS  1.016e+00   2.241e-01  4.170e-03 
##  CvM  1.330e-01   4.488e-01  4.974e-03 
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.

Summary and Diagnostic Plots

Below is an introduction to the three types of plots available for distfreereg objects. See the plotting vignette for more details regarding plot customization.

A Summary Plot

A summary of the results of the test can be plotted easily.

plot(dfr_1)

The density of the simulated statistic is plotted, as is a vertical line showing the observed statistic. The upper tail is shaded, and the p-value (the area of that shaded region) is shown. Finally, a 95% confidence band of the density function is plotted, and the region is shaded in grey.

Diagnostic Plots

Two diagnostic plots are also available. The first produces a time-series-like plot of the residuals in the order specified by res_order in the distfreereg object.

plot(dfr_1, which = "residuals")

The second produces a plot of the empirical partial sum process.

plot(dfr_1, which = "epsp")

Specifying the Covariance Structure

The covariance structure of the errors, conditional on the covariates, must be specified. For the lm and nls methods, the default behavior is to extract the covariance structure from the supplied model object. In general, the covariance structure is specified using the covariance argument. The value of covariance is a list with at least one named element that specifies one of four matrices:

  1. Sigma, the covariance matrix;

  2. SqrtSigma, the square root of Sigma;

  3. P, the precision matrix (that is, the inverse of Sigma); and

  4. Q, the square root of P.

Internally, the algorithm only needs Q, so some efficiency can be gained by supplying this directly if it is known. Supplying more than one of the four matrices is not forbidden, but is strongly discouraged. In this case, no verification is done that the supplied matrices are consistent with each other, and Q will be calculated using the most convenient supplied element.

Each of the four named elements can be one of three types of object:

  1. A positive-definite matrix, the most general way to specify a covariance structure;

  2. a numeric vector of positive values whose length is the sample size; and

  3. a length-1 numeric vector specifying a positive number.

Specifying a numeric vector is theoretically equivalent to specifying a diagonal matrix with the given vector along the diagonal. Specifying a single number is theoretically equivalent to specifying a diagonal matrix with that value along the diagonal. Using the simplest possible expression in each case is recommended for conceptual and computational simplicity.

The following code shows that supplying Q produces observed statistics identical to those in the previous example.

Q <- distfreereg:::matsqrt(distfreereg:::matinv(Sigma, tol = .Machine$double.eps))
dfr_2 <- distfreereg(Y = Y, X = X, test_mean = test_mean,
                     covariance = list(Q = Q),
                     theta_init = rep(1, length(theta)))
identical(dfr_1$observed_stats, dfr_2$observed_stats)
## [1] TRUE

Methods

Several common generic functions have distfreereg methods, including coef(), predict(), and update(). See the documentation for the complete list.

Adding New Statistics

The stat argument of distfreereg() allows the user to specify other statistics by specifying any function whose input is a numeric vector and whose output is a numeric vector of length one.

new_stat <- function(x) sum(abs(x))
update(dfr_1, stat = "new_stat")
## 
## Number of observations: 300 
## Monte Carlo simulations: 10000 
## 
## Estimated parameter values:
##    theta1      theta2 
## 3.000e+00  -2.037e+00 
## 
## Observed statistics:
##     Stat      Value  Pr(>Value)       MCSE 
## new_stat  8.072e+01   5.384e-01  4.985e-03 
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.

If necessary, the default statistics can be selected using “KS” and “CvM” as explicit values in the character vector supplied to stat.

update(dfr_1, stat = c("new_stat", "KS", "CvM"))
## 
## Number of observations: 300 
## Monte Carlo simulations: 10000 
## 
## Estimated parameter values:
##    theta1      theta2 
## 3.000e+00  -2.037e+00 
## 
## Observed statistics:
##     Stat      Value  Pr(>Value)       MCSE 
## new_stat  8.072e+01   5.364e-01  4.987e-03 
##       KS  1.016e+00   2.208e-01  4.148e-03 
##      CvM  1.330e-01   4.447e-01  4.969e-03 
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.

Checks are done to verify that the functions provided calculate legitimate statistics; that is, that they take a numeric vector as input and output a single number.

bad_stat <- function(x) x[1,2]
try(update(dfr_1, stat = c("KS", "CvM", "bad_stat")))
## Error in value[[3L]](cond) : 
##   Unable to evaluate stat[i](rnorm(100)): Error in x[1, 2]: incorrect number of dimensions
try(update(dfr_1, stat = c("KS", "CvM", "undefined_stat")))
## Error in value[[3L]](cond) : 
##   Unable to evaluate stat[i](rnorm(100)): Error in get(stat[i]): object 'undefined_stat' not found

Non-Normal Errors

One of the strengths of the test implemented by distfreereg() is that it makes only weak assumptions about the distribution of the errors. Specifically, it only requires that the error covariance matrix be finite and that the error distribution be strongly mixing (also known as \(\alpha\)-mixing). Below is an example illustrating that we get the expected results when the error distribution is the \(t\) distribution. Note that the \(t\) distribution with three degrees of freedom has variance 3.

set.seed(20241003)
n <- 1e2
func <- function(X, theta) theta[1] + theta[2]*X[,1] + 0.5*X[,1]^2
theta <- c(2,5)
X <- matrix(rexp(n))
Y <- theta[1] + theta[2]*X[,1] + 0.5*X[,1]^2 + rt(n, df = 3)

dfr <- distfreereg(Y = Y, X = X, test_mean = func, theta_init = c(1,1),
                   covariance = list(Sigma = 3))
dfr
## 
## Number of observations: 100 
## Monte Carlo simulations: 10000 
## 
## Estimated parameter values:
##    theta1     theta2 
## 1.805e+00  4.934e+00 
## 
## Observed statistics:
## Stat      Value  Pr(>Value)       MCSE 
##   KS  6.079e-01   7.623e-01  4.257e-03 
##  CvM  7.765e-02   7.062e-01  4.555e-03 
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.
alt_func <- function(X, theta) theta[1] + theta[2]*X[,1]
update(dfr, test_mean = alt_func)
## 
## Number of observations: 100 
## Monte Carlo simulations: 10000 
## 
## Estimated parameter values:
##    theta1     theta2 
## 1.252e+00  6.400e+00 
## 
## Observed statistics:
## Stat      Value  Pr(>Value)       MCSE 
##   KS  1.466e+00   1.990e-02  1.397e-03 
##  CvM  6.932e-01   1.280e-02  1.124e-03 
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.

References

Khmaladze, E. V. (2021), “Distribution-free testing in linear and parametric regression,” Annals of the Institute of Statistical Mathematics, Springer Science; Business Media LLC, 73, 1063–1087. https://doi.org/10.1007/s10463-021-00786-3.