The distfreereg Package

This vignette introduces the basic functionality of the distfreereg package. The package has two main functions: distfreereg() and compare(). The former is introduced below; compare() is introduced in a separate vignette.
The function distfreereg() implements the distribution-free regression testing procedure introduced by Khmaladze (2021). Its purpose is to test whether a specified parametric function \(f(X;\theta)\) is the conditional mean \(\mathrm{E}(Y\mid X)\). Specifically, the test is of the following type:
\[
H_0\colon\ \exists\,\theta\in\Theta \text{ such that } \mathrm{E}(Y\mid X)=f(X;\theta)
\quad\text{against}\quad
H_1\colon\ \forall\,\theta\in\Theta,\ \mathrm{E}(Y\mid X)\neq f(X;\theta).
\]
This vignette starts by discussing the testing of a linear model created using lm(). Following this is an analogous discussion of testing a non-linear model created using nls(). Both scenarios can be replicated by supplying the formulas directly. In the most general implementation, the mean function is specified by an R function. To use distfreereg() with a mean function implemented in software other than R, see the section on the default method.
Suppose that a linear model object is created using lm(), and we want to test whether a linear model is appropriate; specifically, whether the true mean function follows a specified form that is linear in its parameters. In some cases, a residual plot suffices to show that the specified form is wrong. Below, the true mean function is \(\mathrm{E}(y\mid x)=x^{2.1}\), but the formula used to build the model is linear in \(x\). The residual plot clearly indicates model misspecification.
set.seed(20240304)
n <- 3e2
form_lm <- y ~ x
data_lm <- data.frame(x = runif(n, min = 0, max = 3))
data_lm$y <- data_lm$x^2.1 + rnorm(n, sd = 0.3)  # true mean is x^2.1, not linear
m_1 <- lm(form_lm, data = data_lm)               # fit a model linear in x
plot(m_1, which = 1)                             # residuals vs. fitted values
If, however, we use a form that is quadratic in \(x\), the residual plot is suggestive, but it leaves us without a clear conclusion. How certain can we be that the specified form is wrong?
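A minimal sketch of such a fit (the object name m_2 and the exact call are assumptions; the quadratic formula is inferred from the coefficient names in the output below):

m_2 <- lm(y ~ I(x^2), data = data_lm)  # quadratic in x
plot(m_2, which = 1)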
In this case, there appears to be evidence of a misspecified mean, but we need a formal test to quantify our (un)certainty. The function distfreereg() implements such a test.
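A call along the following lines produces the output shown below (a sketch; the object name dfr_lm_2 is taken from a later reference in this vignette, and passing the fitted lm object as test_mean follows the one-argument usage described later):

set.seed(20240304)
(dfr_lm_2 <- distfreereg(test_mean = m_2))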
##
## Number of observations: 300
## Monte Carlo simulations: 10000
##
## Estimated parameter values:
## (Intercept) I(x^2)
## -7.718e-02 1.101e+00
##
## Observed statistics:
## Stat Value Pr(>Value) MCSE
## KS 1.237e+00 7.270e-02 2.596e-03
## CvM 4.506e-01 5.260e-02 2.232e-03
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.
In fact, tests based on two statistics are included by default: a Kolmogorov–Smirnov (KS) statistic and a Cramér–von Mises (CvM) statistic, defined as \(\max_i|w_i|\) and \(\frac{1}{n}\sum_i w_i^2\), respectively, where \(w:=(w_1,\ldots,w_n)\) is the partial sum process calculated by distfreereg(). The preceding output shows a table summarizing the tests' results, including p-values and their Monte Carlo standard errors. Both p-values are marginal, quantifying the intuitive conclusion drawn from the residual plot.
The procedure just illustrated applies to objects created with nls() as well. Here, the data-generating function has the form being tested, so we should not expect the null to be rejected. Note that the weights argument in nls() is set equal to the inverse of the variance vector.
set.seed(20240304)
n <- 3e2
sds <- runif(n, min = 0.5, max = 5)  # heteroskedastic error standard deviations
data_nls <- data.frame(x = rnorm(n), y = rnorm(n))
data_nls$z <- exp(3*data_nls$x) - 2*data_nls$y^2 + rnorm(n, sd = sds)
form_nls <- z ~ exp(a*x) - b*y^2     # same form as the data-generating function
m_2 <- nls(form_nls, data = data_nls, weights = 1/sds^2, start = c(a = 1, b = 1))
Testing this form of the mean structure is done just as with an lm object:
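A sketch of the corresponding call (the object name dfr_nls is taken from a later reference; the exact chunk is an assumption):

set.seed(20240304)
(dfr_nls <- distfreereg(test_mean = m_2))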
##
## Number of observations: 300
## Monte Carlo simulations: 10000
##
## Estimated parameter values:
## a b
## 3.000e+00 1.957e+00
##
## Observed statistics:
## Stat Value Pr(>Value) MCSE
## KS 7.237e-01 6.151e-01 4.866e-03
## CvM 1.052e-01 5.547e-01 4.970e-03
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.
Both tests fail to reject the null.
If no lm or nls object has been created yet, the formula and data can be supplied directly to distfreereg(). Below, we recreate the examples from above without first creating the lm and nls objects. The most notable practical difference in implementing the tests in this context is that three arguments are required rather than just one:
- test_mean: a formula object specifying the mean function;
- data: a data frame containing the data;
- method: a character string specifying whether to use lm() or nls(). Defaults to "lm".
lm()
The following call recreates the setup behind dfr_lm_2 above.
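A sketch of that call (the object name dfr_form_lm parallels dfr_form_nls below; the formula is inferred from the coefficient names in the output):

set.seed(20240304)
(dfr_form_lm <- distfreereg(test_mean = y ~ I(x^2), data = data_lm))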
##
## Number of observations: 300
## Monte Carlo simulations: 10000
##
## Estimated parameter values:
## (Intercept) I(x^2)
## -7.718e-02 1.101e+00
##
## Observed statistics:
## Stat Value Pr(>Value) MCSE
## KS 1.237e+00 7.270e-02 2.596e-03
## CvM 4.506e-01 5.260e-02 2.232e-03
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.
The key pieces of the output are identical to those of dfr_lm_2:
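A sketch of checks that could produce the results below (the observed_stats element name appears later in this vignette, and coef() is among the generics listed later):

identical(dfr_form_lm$observed_stats, dfr_lm_2$observed_stats)
identical(coef(dfr_form_lm), coef(dfr_lm_2))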
## [1] TRUE
## [1] TRUE
nls()
Two additional arguments are used here, along with a non-default value for method:

- method: a character string specifying the method to use to fit the model. Its default value is "lm", so we must specify "nls" here.
- covariance: a list containing a named object specifying the error covariance structure. This ultimately supplies the weights values to nls(). It can be omitted if weights are not being used. (See below for more details.)
- theta_init: a numeric vector specifying the starting values for parameter estimation. (This is optional, since nls() will use default starting values with a warning if it is not supplied. See the nls() documentation for further details.)
The following call recreates the setup behind dfr_nls. The covariance structure is specified by P, the precision matrix. This produces results identical, in the sense of identical(), to those from the example above. Using any other option (e.g., Sigma = diag(sds^2)) would produce numerically equivalent results in the sense of all.equal().
set.seed(20240304)
(dfr_form_nls <- distfreereg(test_mean = form_nls, data = data_nls, method = "nls",
covariance = list(P = diag(1/sds^2)),
theta_init = c(a = 1, b = 1)))
##
## Number of observations: 300
## Monte Carlo simulations: 10000
##
## Estimated parameter values:
## a b
## 3.000e+00 1.957e+00
##
## Observed statistics:
## Stat Value Pr(>Value) MCSE
## KS 7.237e-01 6.151e-01 4.866e-03
## CvM 1.052e-01 5.547e-01 4.970e-03
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.
The key pieces of the output are identical to those of dfr_nls:
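Again, a sketch of checks consistent with the results below:

identical(dfr_form_nls$observed_stats, dfr_nls$observed_stats)
identical(coef(dfr_form_nls), coef(dfr_nls))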
## [1] TRUE
## [1] TRUE
The most general case that can be handled entirely within R is that in which the mean regression function is specified by an R function.
Suppose that we want to replicate the situation in the nls examples above, but we want to account for errors that have a non-diagonal covariance matrix. (The weights argument of nls() limits that function's applicability to diagonal covariance matrices.) The example below illustrates how to do this.
The function being tested is specified by test_mean, which is a function with two arguments, X and theta. Here, X represents the entire matrix of covariates; therefore, to replicate the references to x and y in this example, we use X[,1] and X[,2], respectively. The arguments X and Y supply the covariate matrix and the response vector, respectively, in lieu of the data argument.
set.seed(20240304)
n <- 3e2
true_mean <- function(X, theta) exp(theta[1]*X[,1]) - theta[2]*X[,2]^2
test_mean <- true_mean                              # the null hypothesis is true
theta <- c(3,-2)
Sigma <- rWishart(1, df = n, Sigma = diag(n))[,,1]  # random non-diagonal covariance
X <- matrix(rnorm(2*n), nrow = n)
Y <- distfreereg:::f2ftheta(true_mean, X)(theta) +
  distfreereg:::rmvnorm(n = n, reps = 1, SqrtSigma = distfreereg:::matsqrt(Sigma))
(dfr_1 <- distfreereg(test_mean = test_mean, Y = Y, X = X,
                      covariance = list(Sigma = Sigma),
                      theta_init = rep(1, length(theta))))
##
## Number of observations: 300
## Monte Carlo simulations: 10000
##
## Estimated parameter values:
## theta1 theta2
## 3.000e+00 -2.037e+00
##
## Observed statistics:
## Stat Value Pr(>Value) MCSE
## KS 1.016e+00 2.226e-01 4.160e-03
## CvM 1.330e-01 4.419e-01 4.966e-03
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.
All of the examples above illustrate how distfreereg() can be used when the mean function being tested is defined in R. If the mean function is not defined in R, distfreereg() can still be used as long as certain objects can be imported into R. Specifically, distfreereg() needs the model's fitted values and the Jacobian of the mean function, evaluated at the estimated parameter vector for each observation's covariate values.
To illustrate this, we extract the relevant elements from the example above and pretend that they were created using external software and then imported into R.
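A sketch of the extraction, assuming the distfreereg object stores these quantities under the element names fitted_values and J (the names mirror the argument names used below, but are assumptions):

fitted_values <- dfr_1$fitted_values  # fitted values; element name assumed
J <- dfr_1$J                          # Jacobian matrix; element name assumed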
These can be supplied to distfreereg() to implement the test. Note that test_mean is set to NULL, and, since no parameter estimation is done here, no value for theta_init is specified.
distfreereg(test_mean = NULL, Y = Y, X = X, fitted_values = fitted_values,
J = J, covariance = list(Sigma = Sigma))
##
## Number of observations: 300
## Monte Carlo simulations: 10000
##
## Observed statistics:
## Stat Value Pr(>Value) MCSE
## KS 1.016e+00 2.241e-01 4.170e-03
## CvM 1.330e-01 4.488e-01 4.974e-03
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.
Below is an introduction to the three types of plots available for distfreereg objects. See the plotting vignette for more details regarding plot customization.
A summary of the results of the test can be plotted easily.
The density of the simulated statistic is plotted, as is a vertical line showing the observed statistic. The upper tail is shaded, and the p-value (the area of that shaded region) is shown. Finally, a 95% confidence band of the density function is plotted, and the region is shaded in grey.
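For example, a call like the following displays this summary (a sketch, assuming the density summary is the plot method's default):

plot(dfr_1)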
Two diagnostic plots are also available. The first produces a time-series-like plot of the residuals in the order specified by res_order in the distfreereg object.
The second produces a plot of the empirical partial sum process.
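A sketch of how these might be requested, assuming the plot method selects plots through a which argument with these option names (see the plotting vignette for the authoritative interface):

plot(dfr_1, which = "residuals")  # ordered residuals; option name assumed
plot(dfr_1, which = "epsp")       # empirical partial sum process; option name assumed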
The covariance structure of the errors, conditional on the covariates, must be specified. For the lm and nls methods, the default behavior is to extract the covariance structure from the supplied model object. In general, the covariance structure is specified using the covariance argument. The value of covariance is a list with at least one named element that specifies one of four matrices:

- Sigma, the covariance matrix;
- SqrtSigma, the square root of Sigma;
- P, the precision matrix (that is, the inverse of Sigma); and
- Q, the square root of P.
Internally, the algorithm needs only Q, so some efficiency can be gained by supplying it directly when it is known. Supplying more than one of the four matrices is not forbidden, but it is strongly discouraged: no verification is done that the supplied matrices are consistent with each other, and Q will be calculated from the most convenient supplied element.
Each of the four named elements can be one of three types of object:

- a positive-definite matrix, the most general way to specify a covariance structure;
- a numeric vector of positive values whose length is the sample size; and
- a length-1 numeric vector specifying a positive number.
Specifying a numeric vector is theoretically equivalent to specifying a diagonal matrix with the given vector along the diagonal. Specifying a single number is theoretically equivalent to specifying a diagonal matrix with that value along the diagonal. Using the simplest possible expression in each case is recommended for conceptual and computational simplicity.
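For instance, with i.i.d. errors of variance 3 and sample size n, the following values of covariance describe the same structure, listed from simplest to most expensive (a sketch):

cov_scalar <- list(Sigma = 3)                  # a single number
cov_vector <- list(Sigma = rep(3, n))          # a vector giving the diagonal
cov_matrix <- list(Sigma = diag(3, nrow = n))  # the full diagonal matrix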
The following code shows that supplying Q produces observed statistics identical to those in the previous example.
Q <- distfreereg:::matsqrt(distfreereg:::matinv(Sigma, tol = .Machine$double.eps))
dfr_2 <- distfreereg(Y = Y, X = X, test_mean = test_mean,
covariance = list(Q = Q),
theta_init = rep(1, length(theta)))
identical(dfr_1$observed_stats, dfr_2$observed_stats)
## [1] TRUE
Several common generic functions have distfreereg methods, including coef(), predict(), and update(). See the documentation for the complete list.
The stat argument of distfreereg() allows the user to specify other statistics by supplying any function whose input is a numeric vector and whose output is a numeric vector of length one.
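The chunk behind the following output is reconstructed here as a sketch: the name new_stat matches the output, but the function body and the use of update() are assumptions.

new_stat <- function(x) sum(abs(x))  # an arbitrary example statistic; body assumed
update(dfr_1, stat = "new_stat")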
##
## Number of observations: 300
## Monte Carlo simulations: 10000
##
## Estimated parameter values:
## theta1 theta2
## 3.000e+00 -2.037e+00
##
## Observed statistics:
## Stat Value Pr(>Value) MCSE
## new_stat 8.072e+01 5.384e-01 4.985e-03
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.
If necessary, the default statistics can be selected using "KS" and "CvM" as explicit values in the character vector supplied to stat.
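A sketch of such a call, continuing the previous example:

update(dfr_1, stat = c("new_stat", "KS", "CvM"))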
##
## Number of observations: 300
## Monte Carlo simulations: 10000
##
## Estimated parameter values:
## theta1 theta2
## 3.000e+00 -2.037e+00
##
## Observed statistics:
## Stat Value Pr(>Value) MCSE
## new_stat 8.072e+01 5.364e-01 4.987e-03
## KS 1.016e+00 2.208e-01 4.148e-03
## CvM 1.330e-01 4.447e-01 4.969e-03
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.
Checks are done to verify that the functions provided calculate legitimate statistics; that is, that they take a numeric vector as input and output a single number.
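Statistics that fail these checks trigger errors like those below; the definitions here are assumptions inferred from the error messages:

bad_stat <- function(x) x[1,2]               # fails: treats its vector input as a matrix
try(update(dfr_1, stat = "bad_stat"))
try(update(dfr_1, stat = "undefined_stat"))  # fails: no such function is defined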
## Error in value[[3L]](cond) :
## Unable to evaluate stat[i](rnorm(100)): Error in x[1, 2]: incorrect number of dimensions
## Error in value[[3L]](cond) :
## Unable to evaluate stat[i](rnorm(100)): Error in get(stat[i]): object 'undefined_stat' not found
One of the strengths of the test implemented by distfreereg() is that it makes only weak assumptions about the distribution of the errors. Specifically, it requires only that the error covariance matrix be finite and that the error distribution be strongly mixing (also known as \(\alpha\)-mixing). Below is an example illustrating that we get the expected results when the errors follow a \(t\) distribution. Note that the \(t\) distribution with three degrees of freedom has variance 3.
set.seed(20241003)
n <- 1e2
func <- function(X, theta) theta[1] + theta[2]*X[,1] + 0.5*X[,1]^2
theta <- c(2,5)
X <- matrix(rexp(n))
Y <- theta[1] + theta[2]*X[,1] + 0.5*X[,1]^2 + rt(n, df = 3)  # t errors with variance 3
dfr <- distfreereg(Y = Y, X = X, test_mean = func, theta_init = c(1,1),
                   covariance = list(Sigma = 3))
dfr
dfr
##
## Number of observations: 100
## Monte Carlo simulations: 10000
##
## Estimated parameter values:
## theta1 theta2
## 1.805e+00 4.934e+00
##
## Observed statistics:
## Stat Value Pr(>Value) MCSE
## KS 6.079e-01 7.623e-01 4.257e-03
## CvM 7.765e-02 7.062e-01 4.555e-03
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.
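As expected under a true null, both tests fail to reject. For contrast, testing a misspecified mean function, one that omits the quadratic term, should lead to rejection. A sketch of such a test (the exact chunk behind the output below is not shown, so the function and call here are assumptions consistent with the estimates that follow):

func_wrong <- function(X, theta) theta[1] + theta[2]*X[,1]  # quadratic term omitted
distfreereg(Y = Y, X = X, test_mean = func_wrong, theta_init = c(1,1),
            covariance = list(Sigma = 3))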
##
## Number of observations: 100
## Monte Carlo simulations: 10000
##
## Estimated parameter values:
## theta1 theta2
## 1.252e+00 6.400e+00
##
## Observed statistics:
## Stat Value Pr(>Value) MCSE
## KS 1.466e+00 1.990e-02 1.397e-03
## CvM 6.932e-01 1.280e-02 1.124e-03
## ---
## `MCSE' is the Monte Carlo standard error of the estimated p-value.