Fixes a major bug that caused responses to be used as covariates in the random forests. Thanks to @flystar233 for reporting, see #78. You can expect different and better imputations.
Out-of-sample application is now possible! Thanks to @jeandigitale for pushing the idea in #58.
This means you can run imp <- missRanger(..., keep_forests = TRUE) and then apply its models to new data via predict(imp, newdata). The “missRanger” object can be saved and loaded as a binary file, e.g., via saveRDS()/readRDS(), for later use.
Note that out-of-sample imputation works best for rows in newdata with only one missing value (counting only missing values in variables used as covariates in random forests). We call this the “easy case”. In the “hard case”, even multiple iterations (set by iter) can lead to unsatisfactory results.
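As a rough sketch of the workflow described above: the data set (iris with artificially added missing values), the object names iris_na, new_rows, and path, and the small num.trees = 50 are illustrative assumptions, not part of the release notes.

```r
library(missRanger)

# Training data with 10% missing values per column (illustrative)
iris_na <- generateNA(iris, p = 0.1, seed = 1)

# Fit the imputation models and keep the forests for later use
imp <- missRanger(
  iris_na, num.trees = 50, keep_forests = TRUE, verbose = 0, seed = 1
)

# Store the fitted object as a binary file and load it again
path <- tempfile(fileext = ".rds")
saveRDS(imp, path)
imp_loaded <- readRDS(path)

# New rows with some missing values, imputed out-of-sample
new_rows <- generateNA(iris[1:10, ], p = 0.1, seed = 2)
filled <- predict(imp_loaded, newdata = new_rows)
head(filled)
```

Rows of new_rows with a single missing value correspond to the “easy case”; rows with several missing covariates may need more iterations via iter.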
The out-of-sample algorithm works as follows:
pmm() is more picky: xtrain and xtest must both be either numeric, logical, or factor (with identical levels).
data_raw.
Renamed visit_seq to to_impute.
ranger() arguments are now explicit arguments in missRanger() to improve tab-completion experience:
If keep_forests = TRUE, the argument data_only is set to FALSE by default.
pmm.k.
The verbose argument is passed to ranger() as well.
New argument data_only = TRUE to control whether only the imputed data should be returned (default), or an object of class “missRanger”. This object contains the imputed data and information like OOB prediction errors, fixing #28. The value FALSE will later become the default in {missRanger 3.0.0}. This will be announced via a deprecation cycle.
New argument keep_forests = FALSE. Should the random forests of the best iteration (the one that generated the final imputed data) be added to the “missRanger” object? Note that this will use a lot of memory. Only relevant if data_only = FALSE. This solves #54.
missRanger() now works with syntactically wrong variable names like “1bad:variable”. This solves an old issue that recently resurfaced in this new issue.
missRanger() now works with any number of features, as long as the formula is left at its default (. ~ .). This solves this issue.
ranger() is now called via the x/y interface, not the formula interface anymore.
Switched from importFrom to :: code style.
Maintenance release,
mtry = function(m) max(1, m %/% 3). Keep in mind that missRanger() might use a growing set of covariables in the first iteration of the process, so passing mtry = 2 might result in an error.
This is a summary of all changes since version 1.x.x.
missRanger now also imputes and uses logical variables, character variables and further variables of mode numeric like dates and times.
Added a formula interface to specify which variables to impute (those on the left hand side) and which variables to use for imputation (those on the right hand side). Here are some (pseudo) examples:
. ~ . (default): Use all variables to impute all variables. Note that only those with missing values will be imputed. Variables without missings will only be used to impute others.
. ~ . - ID: Use all variables except ID to impute all missing values.
Species ~ Sepal.Width: Use Sepal.Width to impute Species. Only works if Sepal.Width does not contain missing values. (Add it to the left hand side if it does.)
Species + Sepal.Length ~ Species + Petal.Length: Use Species and Petal.Length to impute Species and Sepal.Length. Only works if Petal.Length does not contain missing values because it does not appear on the left hand side and is therefore not imputed itself.
. ~ 1: Univariate imputation for all relevant columns (as nothing is selected on the right hand side).
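To make two of the formulas above concrete, here is a minimal sketch. It assumes iris with missing values added only to Sepal.Length and Species, plus a hypothetical ID column; the object names and num.trees = 50 are illustrative choices.

```r
library(missRanger)

# Missing values only in Sepal.Length (column 1) and Species (column 5)
iris_na <- generateNA(iris, p = c(0.2, 0, 0, 0, 0.2), seed = 1)
iris_na$ID <- seq_len(nrow(iris_na))  # hypothetical identifier column

# Use all variables except ID to impute all missing values
imp_no_id <- missRanger(
  iris_na, . ~ . - ID, num.trees = 50, verbose = 0, seed = 1
)

# Impute Species and Sepal.Length using Species and Petal.Length;
# Petal.Length is complete, so it may appear on the right hand side only
imp_subset <- missRanger(
  iris_na,
  Species + Sepal.Length ~ Species + Petal.Length,
  num.trees = 50, verbose = 0, seed = 1
)

colSums(is.na(imp_no_id))
colSums(is.na(imp_subset))
```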
The first argument of generateNA is now called x instead of data, for consistency with imputeUnivariate.
imputeUnivariate now also works for data frames and matrices.
In PMM mode, missRanger relies on OOB predictions. The smaller the value of num.trees, the higher the risk of missing OOB predictions, which caused an error in PMM. Now, pmm allows for missing values in xtrain or ytrain. Thus, the algorithm will even work with num.trees = 1. This will be useful to impute large data sets with PMM.
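A small sketch of PMM with a single tree per forest, assuming iris with artificially added missing values; pmm.k = 3 and the object names are illustrative choices, not recommendations.

```r
library(missRanger)

# 10% missing values per column (illustrative)
iris_na <- generateNA(iris, p = 0.1, seed = 1)

# Predictive mean matching with only one tree per forest:
# many OOB predictions are missing, which pmm now tolerates
imp <- missRanger(iris_na, pmm.k = 3, num.trees = 1, verbose = 0, seed = 1)

colSums(is.na(imp))
```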
The function imputeUnivariate has received a seed argument.
The function imputeUnivariate has received a v argument, specifying columns to impute.
The function generateNA now offers the possibility to use different proportions of missings for each column.
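The generateNA and imputeUnivariate changes listed above could be combined as in the following sketch; the proportions, the selected columns, and the object names are arbitrary choices for illustration.

```r
library(missRanger)

# One proportion of missing values per column, in column order
iris_na <- generateNA(iris, p = c(0, 0.1, 0.2, 0.3, 0.5), seed = 1)
colSums(is.na(iris_na))

# Univariate imputation restricted to two columns via v,
# made reproducible via seed
out <- imputeUnivariate(iris_na, v = c("Species", "Petal.Width"), seed = 1)
colSums(is.na(out))
```

Columns not listed in v, such as Petal.Length here, keep their missing values.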
If verbose is not 0, then missRanger will show which variables will be imputed in which order and which variables will be used for imputation.
returnOOB now effectively controls whether out-of-bag errors are attached as attribute “oob” to the resulting data frame. Previously, they were always attached.
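A minimal sketch of reading the attribute, assuming iris with artificially added missing values; the object names and num.trees = 50 are illustrative.

```r
library(missRanger)

iris_na <- generateNA(iris, p = 0.1, seed = 1)

# With returnOOB = TRUE, OOB errors are attached as attribute "oob"
imp_oob <- missRanger(
  iris_na, num.trees = 50, returnOOB = TRUE, verbose = 0, seed = 1
)
attr(imp_oob, "oob")

# With the default returnOOB = FALSE, no such attribute is attached
imp_plain <- missRanger(iris_na, num.trees = 50, verbose = 0, seed = 1)
is.null(attr(imp_plain, "oob"))
```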