An essential part of the testing procedure implemented by
distfreereg()
is the estimation of the model’s parameters.
When using the lm
or nls
methods, or the
corresponding formula
methods, parameter estimation is
handled using the modeling functions themselves. Below is the setup used
as needed thereafter.
set.seed(20240926)
n <- 3e2
true_mean <- function(X, theta) exp(theta[1]*X[,1]) - theta[2]*X[,2]^2
test_mean <- true_mean
theta <- c(3,-2)
Sigma <- rWishart(1, df = n, Sigma = diag(n))[,,1]
X <- matrix(rnorm(2*n), nrow = n)
Y <- distfreereg:::f2ftheta(true_mean, X)(theta) +
distfreereg:::rmvnorm(n = n, reps = 1, SqrtSigma = distfreereg:::matsqrt(Sigma))
dfr_1 <- distfreereg(test_mean = test_mean, Y = Y, X = X,
covariance = list(Sigma = Sigma),
theta_init = rep(1, length(theta)))
For the function
method, parameters are estimated using
the optim()
function by default. The default optimization
method is BFGS
, which works well for many cases. The method
can be changed using the control
argument of
distfreereg()
. For example, if we know beforehand that
\(\theta=(\theta_1,\theta_2)\)
satisfies \(2\leq\theta_1\leq4\) and
\(-3\leq\theta_2\leq-1\),
set.seed(20240926)
dfr_2 <- update(dfr_1, control = list(
optimization_args = list(method = "L-BFGS-B", lower = c(2,-3), upper = c(4,-1))))
identical(dfr_1$theta_hat, dfr_2$theta_hat)
## [1] FALSE
## [1] "Mean relative difference: 0.0001865232"
The results are not identical, but are nearly so.
To estimate the precision with which the parameters have been
estimated, vcov()
has a distfreereg
method.
When test_mean
is a formula, an lm
object, or
an nls
object, the model is supplied to vcov()
for appropriate method dispatch. When test_mean
is a
function, then the method described in Vaart (2007) is applied.
## theta1 theta2
## theta1 2.226767e-07 2.892374e-05
## theta2 2.892374e-05 6.410426e-03
An analogous method exists for confint()
, which uses
this covariance matrix to calculate confidence intervals at a stated
confidence level:
## $ci
## 5 % 95 %
## theta1 2.999730 3.001282
## theta2 -2.131274 -1.867883
##
## $vcov
## theta1 theta2
## theta1 2.226767e-07 2.892374e-05
## theta2 2.892374e-05 6.410426e-03
Note that this method returns the result of vcov()
as
well as the confidence intervals, since the calculations in
vcov()
can be computationally expensive, and since the
covariance matrix is used to calculate the confidence intervals, it is
included in this output to avoid requiring a separate call to
vcov()
in case its results are also desired.
By default, distfreereg.function()
uses
optim()
to estimate the model’s parameters. Suppose,
however, we wanted to use nlminb()
instead. (This function
is selected to illustrate the use of another optimization function only,
and is not an implicit recommendation.) We can use
distfreereg()
’s control
argument to do
this.
For a given optimization function, we need three things:
The name of the argument that specifies the function to be minimized.
The name of the argument that specifies the starting parameter values.
The name of the element in the returned value that contains the parameter estimates.
In the case of nlminb()
, a review of the help page
indicates that these are “objective
”, “start
”,
and “par
”, respectively. We supply these three names, along
with the function nlminb
itself, to the
control
argument as follows.
set.seed(20240926)
dfr_3 <- update(dfr_1, control = list(optimization_fun = nlminb,
fun_to_optimize_arg = "objective",
theta_init_arg = "start",
theta_hat_name = "par"))
Once again, these are not identical, but are nearly so.
## [1] FALSE
## [1] "Mean relative difference: 4.784141e-05"
Detailed output from optim()
, helpful for diagnosing
some parameter estimation problems, is found in the
optimization_output
element of the output:
## $par
## theta1 theta2
## 3.000507 -1.999340
##
## $objective
## [1] 308.6614
##
## $convergence
## [1] 0
##
## $iterations
## [1] 16
##
## $evaluations
## function gradient
## 27 37
##
## $message
## [1] "relative convergence (4)"
Before optimizing, distfreereg.function()
verifies that
the mean function can be evaluated at theta_init
. This will
detect some problems but does not guarantee compatibility. This should
be kept in mind when an opaque error message arises. In the first
example below, the starting vector is too short. This error is caught by
the initial validation. In the second example, though, the initial
validation is passed, but a later step produces an error.
## Error in validate_numeric(x = f_out, len = n, message = "Test function output failed numeric validation: ") :
## Test function output failed numeric validation: f_out cannot have NA values
## Error in value[[3L]](cond) :
## Unable to invert square root of J^tJ: Error in value[[3L]](cond): Unable to invert mat: Error in solve.default(mat, tol = tol): Lapack routine dgesv: system is exactly singular: U[3,3] = 0
The theory behind the test implemented by distfreereg()
requires that the parameter estimates be approximately normally
distributed around the true parameter value. With regularity assumptions
on the test mean function, this is true in theory, but it must also be
sufficiently close to true in practice. This requires good optimization
algorithms. In the example below, something goes awry.
In this example, we use a different error generating function from
the default, specifically one that uses a block-diagonal covariance
matrix and generates errors corresponding to each block using a
multivariate \(t\) distribution.
Further, we use the default method for optim()
, namely
Nelder–Mead.
First, define the function and true parameter vector.
Next, define the error distribution function. This first specifies
the dimension matdim
of the blocks. The function
block_t()
generates the errors by repeated calls to
mvtnorm::rmvt()
with the appropriate scale matrix to obtain
the correct covariance structure. (Recall that the errors for each
repetition done by compare()
are stored in the columns of
the error matrix.)
matdim <- 10
block_t <- function(n, reps, blocks, df){
output <- matrix(NA, nrow = n, ncol = reps)
for(i in seq_len(reps))
output[,i] <- as.vector(sapply(blocks, function(x) mvtnorm::rmvt(1, df = df, sigma = x*(df-2)/df)))
return(output)
}
Finally, define a function that extracts the parameter estimates from
a distfreereg
object, and a function that generates
covariates, a covariance matrix, and then runs
compare()
.
get_theta_hat <- function(dfr) dfr$theta_hat
create_cdfr <- function(n){
X <- as.matrix(rexp(n, rate = 1))
Sig_list <- lapply(1:(n/matdim), function(x) rWishart(1, df = matdim, diag(matdim))[,,1])
Sig <- as.matrix(Matrix::bdiag(Sig_list))
return(compare(theta = theta,
true_mean = true_func, test_mean = true_func,
true_X = X, X = X,
true_covariance = list(Sigma = Sig), covariance = list(Sigma = Sig),
theta_init = c(1,1),
err_dist_fun = block_t,
err_dist_args = list(blocks = Sig_list, df = 4),
reps = 1e3,
prog = Inf,
manual = get_theta_hat,
control = list(optimization_args = list(method = "Nelder-Mead")))
)
}
Now, we set the seed, run the simulation, and extract the saved parameter estimates.
set.seed(14)
cdfr_seed14 <- create_cdfr(400)
theta_seed14_1 <- sapply(cdfr_seed14$manual, function(x) x[[1]])
theta_seed14_2 <- sapply(cdfr_seed14$manual, function(x) x[[2]])
Since we are using the same function for true_mean
and
test_mean
, the cumulative distribution functions of the
observed and simulated statistics should be nearly the same. The
following plot is therefore alarming.
We can see that there is a problem with the parameter estimates, namely that their densities are not approximately normally distributed about the true parameter values.
The problem does seem to be the use of Nelder–Mead here, since the
default method selected by distfreereg()
, namely BFGS, does
not result in such distributions in similar cases. This is, of course,
no guarantee that BFGS will alway be better, but it does appear to be a
more sensible default for this application.
The keen-eyed reader will note two possible objections to the
analysis above: perhaps the sample size is insufficient, and perhaps the
input of set.seed()
was carefully selected to obtain these
results.
The first of these objections can be addressed by showing the results with a different seed, where the plots suggest no problems.
set.seed(1)
cdfr_seed1 <- create_cdfr(400)
theta_seed1_1 <- sapply(cdfr_seed1$manual, function(x) x[[1]])
theta_seed1_2 <- sapply(cdfr_seed1$manual, function(x) x[[2]])
plot(cdfr_seed1)
A formal test that the samples have come from the same distribution is also reassuring.
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: cdfr_seed1[["observed_stats"]][["KS"]] and cdfr_seed1[["mcsim_stats"]][["KS"]]
## D = 0.0398, p-value = 0.1122
## alternative hypothesis: two-sided
The second possible objection, namely that the seed was carefully
selected to produce these results, is only half true. It was selected,
but not particularly carefully. Considering integers from 1 to 15 as
inputs to set.seed()
, at least 7, 9, 10, and 13 produce
poor results. The phenomenon is not difficult to reproduce.