The Rborist package

Introduction

The Rborist package implements the Random Forest (TM) algorithm, with particular emphasis on high performance. The package is an R-language spinoff of the Arborist project, a multi-language effort targeting a variety of decision-tree methods. Look and feel owes a large debt to Liaw’s original randomForest package.

High performance

The interpretation of the phrase “high performance” will vary among users. We claim that the Rborist is a high-performance package primarily because it either takes, or has the potential to take, advantage of the acceleration offered by commodity parallel hardware. We also expect performance to scale in accordance with algorithmic complexity, decaying gracefully as resource limitations are approached.

Particular attention has been paid to minimizing data movement and, especially, toward maximizing data locality. We believe that this has been a key contributing factor to performance and scalability and will continue to play a major role in efforts to extend support more broadly.

The current implementation is limited to in-memory execution on multicore and multi-socket hardware. We envision several improvements in the near to medium term.

Training and validation

Simple example

The simplest way to invoke the package is to pass it a design matrix and response vector of conforming dimensions. For an appropriately typed design x and response y, it suffices to call:

  rs <- Rborist(x, y)

The design can be either a data.frame, a numeric matrix or an integer matrix. In the case of a frame, the columns (predictors) must individually be either numeric- or factor-valued. Integer matrices are coerced internally to their numeric counterparts. The response type may be either numeric, yielding a regression forest, or categorical, yielding a classification forest.
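
For concreteness, the following sketch (data and variable names illustrative only) trains a classification forest on a frame mixing numeric and factor predictors:

       # Toy data: one numeric and one factor predictor, factor response.
       x <- data.frame(num = rnorm(200),
                       cat = factor(sample(c("a", "b", "c"), 200, replace = TRUE)))
       y <- factor(sample(c("yes", "no"), 200, replace = TRUE))
       rs <- Rborist(x, y)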

The return value (here rs) is of class Rborist. The full layout of an Rborist object is described by the help() files. A very useful example is the validation object, which summarizes testing on the out-of-bag samples.
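
The top-level layout can also be inspected interactively using an ordinary base-R idiom, for example:

       str(rs, max.level = 1)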

Validation

Regression

In regression training, the validation object’s members include the mean absolute error, mean squared error and R-squared statistic. Continuing from the previous codelet, these are obtainable as follows:

  rs$validation$mae
  rs$validation$mse
  rs$validation$rsq

These statistics are derived from the original training response (y, above) and the derived out-of-bag response. The out-of-bag response itself can also be obtained from the validation object:

   rs$validation$yPred

validation also contains member qPred, for use with quantile regression. This member is described below, in the section on quantile prediction.

Classification

In classification training, the validation object also presents the out-of-bag response in member yPred. Its other members, however, are specialized for classification:

  • The misprediction rate is reported for each classification category by field misprediction.

  • A confusion matrix is reported by field confusion.

  • The out-of-bag error rate is given by oobError.

  • The census field is a matrix giving, for each row, the number of times each response category is predicted for the row.

  • The prob field reports a normalized version of census and can be interpreted as the probability of predicting a given category at a given row.
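
Continuing the running example, these fields are accessed from the validation object in the same manner as before:

       rs$validation$misprediction
       rs$validation$confusion
       rs$validation$oobError
       rs$validation$census
       rs$validation$prob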

In addition to validation, an Rborist object contains several other members. Most of these are used to coordinate subsequent operations, such as prediction or feature contribution, and do not directly convey summary information. An exception is the training member, which summarizes training statistics not directly related to validation. training currently has a single member, info:

       rs$training$info

For each predictor the info vector reports the sum, over all tree nodes in the forest for which the predictor defines a splitting condition, of the information value precipitating the respective split. Although these values depend on the information metric employed, they do provide a relative measure of each predictor’s importance in the model.
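
For example, predictors can be ranked by this measure with an ordinary R idiom; a minimal sketch, assuming info is a named numeric vector with one entry per predictor:

       sort(rs$training$info, decreasing = TRUE)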

Validation can be suppressed as follows:

       rs <- Rborist(x, y, noValidate = TRUE)

This option is primarily for the use of package maintainers, but may prove useful, for example, in instances when model fit is not a main concern. The validation field will then have value NULL.

Quantile prediction

When predicting over a regression forest, Rborist provides quantiles simply for the asking. Leaf information, by default, is rich enough to make their computation quite easy. Quantiles can be requested in either of two ways. The simplest is to set option quantiles to TRUE:

       rs <- Rborist(x, y, quantiles = TRUE)

The default quantile vector consists of quartiles, which are given by the qPred member mentioned above:

       rs$validation$qPred

Quantile prediction also estimates the quantile position at which the predicted mean, yPred, lies. This estimate is reported by member qEst, as follows:

       rs$validation$qEst

Explicitly specifying the quantile vector, quantVec, yields quantiles of any desired type. Deciles, for example, can be requested as follows:

       rs <- Rborist(x, y, quantVec = seq(0.1, 1.0, by=0.1))

The algorithm employed to compute the quantiles is approximate, as binning is employed to improve performance.

If, on the other hand, quantiles are not desired and will not subsequently be requested after training, space can be saved by representing leaves in a sparser form:

       rs <- Rborist(x, y, thinLeaves=TRUE)

Tree and forest size

The simplest way to affect forest size is to specify the number of trees. The following codelet requests 100 trees:

       rs <- Rborist(x, y, nTree=100)

This also affects training time, which is expected to scale linearly with the number of trees.

Training of individual trees can be constrained according to several parameters. The minNode option places a lower bound on node sizes, i.e., the number of distinct samples subsumed by each node. A value of 1, for example, allows splitting to proceed until purity. A larger value results in a smaller tree, as well as faster training. To ensure that splitting does not proceed below a node size of 20, for example:

       rs <- Rborist(x, y, minNode=20)

Another way to control tree size is to specify its maximal depth. The option nLevel sets an upper bound on the number of levels over which trees can be trained. The following codelet causes tree construction to halt after the root is created, resulting in a forest of singleton trees.

       rs <- Rborist(x, y, nLevel = 1)

As with several other size-based constraints, constraining level depth also results in faster execution.

Option maxLeaf limits the number of leaves (terminals) present in a tree. Unlike the other size constraints described so far, maxLeaf is enforced after training, so as not to bias tree shape. Although this pruning operation is quite fast, limiting the number of leaves therefore does not speed training.

       rs <- Rborist(x, y, maxLeaf = 13)

Option minInfo sets a splitting threshold based on relative information gain. A splitting candidate is rejected if the ratio of information content between a node and its potential successors lies below the threshold.

       rs <- Rborist(x, y, minInfo = 0.1)

This option should be applied with care and should probably be avoided at low predictor count.

Performance and storage economy can in some cases both be enhanced by abstracting away repeated instances of the same predictor value. This is the idea behind sparse representations and, in particular, one-hot encodings, in which repeated values of zero are represented implicitly. The Arborist employs a similar strategy, but on a per-predictor basis, representing high-frequency observations of a given predictor implicitly. A plurality threshold above which to compress repeated values is specified by the autoCompress option:

       rs <- Rborist(x, y, autoCompress = 0.4)

As autoCompress specifies a plurality threshold, only a single set of repeated values undergoes compression for a given predictor. Resolution of ties, in particular, is implementation-dependent. A threshold frequency of 1.0, the maximum, ensures that no compression takes place, while a threshold frequency of 0.0, the minimum, ensures some value is compressed for each predictor, regardless of frequency.

As a complement to the thinLeaves option for training, the Streamline command can be applied following training to reduce a forest’s memory footprint. Streamline clears fields used by validation, quantile regression and feature contribution, and so should not be applied if any of these operations is expected to be performed subsequently.

       rs <- Rborist(x, y)
       ...
       rs <- Streamline(rs)

Sampling options

Several options affect the behavior of the various sampling mechanisms used by the package.

Option nSamp dictates the number of bagged samples defining the root of each tree. A smaller sample count may result in smaller trees, as fewer opportunities arise to split.

Option withRepl specifies whether bag sampling is to be done with replacement.
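
For example, the following illustrative call (the sample count of 500 is arbitrary) draws 500 samples per tree, without replacement:

       rs <- Rborist(x, y, nSamp = 500, withRepl = FALSE)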

Vector rowWeight weights the probability of bagging a given row. The following invocation gives each row identical weight, which is also the default behavior:

       rs <- Rborist(x, y, rowWeight = rep(1/nrow(x), nrow(x)))

Vector predWeight weights the probability of selecting a given predictor as a splitting candidate. For example, this invocation selects the first predictor half as often as each of the remaining predictors:

       rs <- Rborist(x, y, predWeight = c(0.5, rep(1.0, ncol(x)-1)))

Option predProb is the uniform probability of selecting a predictor for splitting. That is, the value applies uniformly to all predictors. Hence the number of predictors tried at a given split will have a binomial distribution.

Option predFixed is the actual number of predictors to test for a given split, and so calls for sampling predictors without replacement. predFixed and predProb cannot both be specified within a single training operation.
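
The following illustrative calls (values arbitrary) contrast the two options:

       rs <- Rborist(x, y, predProb = 0.3)  # each predictor a candidate with probability 0.3
       rs <- Rborist(x, y, predFixed = 5)   # exactly 5 candidate predictors per split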

For regression training, vector regMono specifies a (signed) rejection probability to enforce monotonicity with respect to a given predictor. Negative values specify rejection of nondecreasing splitting candidates with a given probability, while positive values stipulate a rejection probability for nonincreasing predictors. A value of zero indicates that no monotonicity constraint is enforced. Values assigned to factor predictors are ignored.

The following example rejects nonincreasing predictors as splitting candidates with probability one:

       rs <- Rborist(x, y, regMono = rep(1.0, ncol(x)))

Other training options

Classification training can be fine-tuned by weighting individual categories. The option classWeight permits weights to be specified, by category, for the objective function used to split. This may be useful, for example, when the training response is unbalanced.

The following example employs a non-normalized weighting vector to give the first category twice as much weight as the others. Note that the category encoding reflected by ‘levels()’ is not portable:

       lv <- levels(y)
       rs <- Rborist(x, y, classWeight = c(2.0, rep(1.0, length(lv) - 1)))

By default, when a numerical predictor is chosen to split a node, the Arborist assigns its split value as that corresponding to the mean rank, with respect to the full set of observations on the predictor, between the two split boundary points. That is, the splitting criterion attempts to reflect the ECDF of the entire sample. This contrasts with other implementations, which typically select either the midpoint value or one of the two boundary values. In particular, depending upon how the observations are distributed, the midrank can correspond to a value arbitrarily close to either of the two boundaries.

The vector splitQuant allows a (sub)quantile value to be interpolated, so that the split value can be manipulated more finely with respect to the two endpoints. For example, the following codelet expresses the default behavior, which is to select the middle rank (i.e., 0.5 quantile) for all numerical predictors (if any):

       rs <- Rborist(x, y, splitQuant = rep(0.5, ncol(x)))

Similarly, this example chooses the left endpoint for all relevant predictors:

       rs <- Rborist(x, y, splitQuant = rep(0.0, ncol(x)))

Preformatting

The Arborist represents predictors internally using a format streamlined for subsequent training. A “preformatted” representation of the training data can be generated directly and trained separately as follows:

       pf <- Preformat(x)
       rs <- Rborist(pf, y)

Separate preformatting can result in a slight performance improvement in workflows with iterated training, such as under caret. This is simply because the sorting and packing performed at the start of training can be cached rather than recomputed at each iteration.

A better motivation for preformatting arises when the training data can be represented sparsely. Suppose, for example, that B is a large data set consisting chiefly of predictors with highly repetitive observation values. As the Arborist is able to identify and compress repetitive observations, storage might be conserved by first preformatting B, then deleting it and finally training on the preformatted representation:

       pf <- Preformat(B)
       rm(B)
       rs <- Rborist(pf, y)

Performance

Unlike many implementations, which employ sorting to order observations within a node, the Arborist employs a presort followed by linear updates. The updates are a form of stable partition, which we refer to as restaging. Restaging reduces the algorithmic complexity from the oft-cited \(\mathcal{O}(n \log^2 n)\), where n represents the number of training samples, to \(\mathcal{O}(n \log n)\).

Restaging is not without its problems, however. As currently implemented, a double buffer caches outcome data corresponding to every cell in the design, so the restaging containers can grow quite large. While some users may find this an acceptable price for scalability, others may find the storage requirements too high for their application. Scalability, moreover, may not be an important concern at low sample or row count.

For high-dimensional problems, the Ranger package may provide a suitable alternative. Similarly, for some problems, the package Xgboost offers excellent storage characteristics. In the meantime, we envision several improvements to the Arborist’s storage parsimony, to appear in upcoming versions.