The rsample package provides a number of resampling methods which are broadly applicable to a wide variety of modeling applications. This vignette walks through the most popular methods in the package, with brief descriptions of how they can be applied. For a more in-depth overview of resampling, check out the matching chapters in Tidy Modeling with R and Feature Engineering and Selection.
Let’s go ahead and load rsample now, along with dplyr for the pipe operator %>%:
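A minimal sketch of those loading calls, assuming both packages are already installed:

```r
# Attach rsample for the resampling functions used throughout
library(rsample)
# Attach dplyr, which re-exports the pipe operator %>%
library(dplyr)
```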
We’ll also load in a few data sets from the modeldata package. First, the Ames housing data, containing the sale prices of homes in Ames, Iowa:
data(ames, package = "modeldata")
head(ames, 2)
#> # A tibble: 2 × 74
#> MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
#> <fct> <fct> <dbl> <int> <fct> <fct> <fct>
#> 1 One_Story_1946_and_New… Resident… 141 31770 Pave No_A… Slightly…
#> 2 One_Story_1946_and_New… Resident… 80 11622 Pave No_A… Regular
#> # ℹ 67 more variables: Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
#> # Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
#> # Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
#> # Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
#> # Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#> # Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>,
#> # Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>, …
Secondly, data on Chicago transit ridership numbers:
data(Chicago, package = "modeldata")
head(Chicago, 2)
#> # A tibble: 2 × 50
#> ridership Austin Quincy_Wells Belmont Archer_35th Oak_Park Western Clark_Lake
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 15.7 1.46 8.37 4.60 2.01 1.42 3.32 15.6
#> 2 15.8 1.50 8.35 4.72 2.09 1.43 3.34 15.7
#> # ℹ 42 more variables: Clinton <dbl>, Merchandise_Mart <dbl>,
#> # Irving_Park <dbl>, Washington_Wells <dbl>, Harlem <dbl>, Monroe <dbl>,
#> # Polk <dbl>, Ashland <dbl>, Kedzie <dbl>, Addison <dbl>,
#> # Jefferson_Park <dbl>, Montrose <dbl>, California <dbl>, temp_min <dbl>,
#> # temp <dbl>, temp_max <dbl>, temp_change <dbl>, dew <dbl>, humidity <dbl>,
#> # pressure <dbl>, pressure_change <dbl>, wind <dbl>, wind_max <dbl>,
#> # gust <dbl>, gust_max <dbl>, percip <dbl>, percip_max <dbl>, …
In addition to these data sets from the modeldata package, we’ll also make use of the Orange data set in base R, containing repeated measurements of 5 orange trees over time:
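A quick look at that data set might be sketched as follows; Orange ships with base R, so no extra package is needed:

```r
# Orange is part of the built-in datasets package
data(Orange)

# Peek at the first rows: one row per tree per measurement occasion
head(Orange, 2)

# Number of distinct trees measured
length(unique(Orange$Tree))
```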
And last but not least, we’ll set a seed so our results are reproducible:
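A minimal sketch of seeding the random number generator. The value 123 is an arbitrary choice for illustration, not something the text above specifies:

```r
# Any fixed integer works; 123 is an arbitrary illustrative choice
set.seed(123)
first_draw <- runif(1)

# Resetting the same seed reproduces the same random draw
set.seed(123)
second_draw <- runif(1)

identical(first_draw, second_draw)
```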
By far and away, the most common use for rsample is to generate simple random resamples of your data. The rsample package includes a number of functions specifically for this purpose.
To split your data into two sets – often referred to as the “training” and “testing” sets – rsample provides the initial_split() function:
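As a minimal sketch, assuming the Ames data from modeldata, a default split could look like this:

```r
library(rsample)
data(ames, package = "modeldata")

set.seed(123)  # arbitrary seed, for reproducibility only

# With no further arguments, initial_split() uses its default prop of 3/4
split_default <- initial_split(ames)
split_default
```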
The output of this is an rsplit object with each observation assigned to one of the two sets. You can control the proportion of data assigned to the “training” set through the prop argument:
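For instance, a sketch that sends roughly 80% of the rows to the training set (the value 0.8 is only an illustrative choice):

```r
library(rsample)
data(ames, package = "modeldata")

set.seed(123)  # arbitrary seed, for reproducibility only

# prop is the share of rows assigned to the training set
split_80 <- initial_split(ames, prop = 0.8)
split_80

# Observed share of rows that landed in the training set
nrow(training(split_80)) / nrow(ames)
```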
To get the actual data assigned to either set, use the training() and testing() functions:
resample <- initial_split(ames, prop = 0.6)
head(training(resample), 2)
#> # A tibble: 2 × 74
#> MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
#> <fct> <fct> <dbl> <int> <fct> <fct> <fct>
#> 1 One_Story_1946_and_New… Resident… 110 14333 Pave No_A… Regular
#> 2 One_Story_1946_and_New… Resident… 65 8450 Pave No_A… Regular
#> # ℹ 67 more variables: Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
#> # Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
#> # Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
#> # Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
#> # Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#> # Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>,
#> # Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>, …
head(testing(resample), 2)
#> # A tibble: 2 × 74
#> MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
#> <fct> <fct> <dbl> <int> <fct> <fct> <fct>
#> 1 One_Story_1946_and_New… Resident… 141 31770 Pave No_A… Slightly…
#> 2 One_Story_1946_and_New… Resident… 80 11622 Pave No_A… Regular
#> # ℹ 67 more variables: Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
#> # Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
#> # Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
#> # Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
#> # Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#> # Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>,
#> # Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>, …
You should only evaluate models against your test set once, when you’ve completely finished tuning and training your models. To estimate the performance of model candidates, you typically split your training data into one part used for model fitting and one part used for measuring performance. To distinguish these sets from the training and test sets, we refer to them as the analysis and assessment sets, respectively. Typically, you split your training data into analysis and assessment sets multiple times to get stable estimates of model performance.
Perhaps the most common cross-validation method is V-fold cross-validation. Also known as “k-fold cross-validation”, this method creates V resamples by splitting your data into V groups (also known as “folds”) of roughly equal size. The analysis set of each resample is made up of V-1 folds, with the remaining fold being used as the assessment set. This way, each observation in your data is used in exactly one assessment set.
To use V-fold cross-validation in rsample, use the vfold_cv() function:
vfold_cv(ames, v = 2)
#> # 2-fold cross-validation
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1465/1465]> Fold1
#> 2 <split [1465/1465]> Fold2
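The claim that each observation lands in exactly one assessment set can be checked directly. The sketch below assumes five folds (an arbitrary choice) and uses rsample's complement() helper to pull the assessment row indices out of each split:

```r
library(rsample)
data(ames, package = "modeldata")

set.seed(123)  # arbitrary seed, for reproducibility only
folds <- vfold_cv(ames, v = 5)

# complement() returns the row indices of a split's assessment set
assessment_rows <- unlist(lapply(folds$splits, complement))

# Across all folds, every row index should appear exactly once
each_row_once <- identical(
  sort(as.integer(assessment_rows)),
  seq_len(nrow(ames))
)
each_row_once
```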
One downside to V-fold cross-validation is that it tends to produce “noisy”, or high-variance, estimates when compared to other resampling methods. To try and reduce that variance, it’s often helpful to perform what’s known as repeated cross-validation, effectively running the V-fold resampling procedure multiple times for your data. To perform repeated V-fold cross-validation in rsample, you can use the repeats argument inside of vfold_cv():
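A minimal sketch of repeated cross-validation, assuming two folds repeated twice (both values chosen only to keep the output short):

```r
library(rsample)
data(ames, package = "modeldata")

set.seed(123)  # arbitrary seed, for reproducibility only

# Two repeats of 2-fold cross-validation give 2 * 2 = 4 resamples
repeated_folds <- vfold_cv(ames, v = 2, repeats = 2)
repeated_folds

# The id column identifies the repeat, id2 the fold within that repeat
names(repeated_folds)
```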
An alternative to V-fold cross-validation is Monte-Carlo cross-validation. Where V-fold assigns each observation in your data to one (and exactly one) assessment set, Monte-Carlo cross-validation takes a random subset of your data for each assessment set, meaning each observation can be used in 0, 1, or many assessment sets. The analysis set is then made up of all the observations that weren’t selected. Because each assessment set is sampled independently, you can repeat this as many times as you want.
To use Monte-Carlo cross-validation in rsample, use the mc_cv() function:
mc_cv(ames, prop = 0.8, times = 2)
#> # Monte Carlo cross-validation (0.8/0.2) with 2 resamples
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [2344/586]> Resample1
#> 2 <split [2344/586]> Resample2
Similar to initial_split(), you can control the proportion of your data assigned to the analysis fold using prop. You can also control the number of resamples you create using the times argument.
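To illustrate how an observation can fall into zero, one, or several assessment sets, the following sketch counts how often each row is held out. It assumes ten resamples (an arbitrary value) and uses complement() to get the assessment indices:

```r
library(rsample)
data(ames, package = "modeldata")

set.seed(123)  # arbitrary seed, for reproducibility only
mc <- mc_cv(ames, prop = 0.8, times = 10)

# Row indices of every assessment set, pooled over the ten resamples
assessment_rows <- unlist(lapply(mc$splits, complement))

# How many times each row of ames was used for assessment
times_assessed <- tabulate(assessment_rows, nbins = nrow(ames))

# Distribution of those counts: some rows are never held out, others repeatedly
table(times_assessed)
```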
Monte-Carlo cross-validation tends to produce more biased estimates than V-fold. As such, when computationally feasible we typically recommend using five or so repeats of 10-fold cross-validation for model assessment.
The last primary technique in rsample for creating resamples from the training data is bootstrap resampling. A “bootstrap sample” is a sample of your data set, the same size as your data set, taken with replacement so that a single observation might be sampled multiple times. The assessment set is then made up of all the observations that weren’t selected for the analysis set. Generally, bootstrap resampling produces pessimistic estimates of model accuracy.
You can create bootstrap resamples in rsample using the bootstraps() function. While you can’t control the proportion of data in each set (the analysis set of a bootstrap resample is always the same size as the training data), the function otherwise works exactly like mc_cv():
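A minimal sketch, assuming two bootstrap resamples of the Ames data, that also checks the size of the resulting sets:

```r
library(rsample)
data(ames, package = "modeldata")

set.seed(123)  # arbitrary seed, for reproducibility only
boots <- bootstraps(ames, times = 2)
boots

first_boot <- boots$splits[[1]]

# The analysis set is drawn with replacement and matches the original size
nrow(analysis(first_boot)) == nrow(ames)

# The assessment set holds the rows that were never drawn
nrow(assessment(first_boot))
```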
If your data is vast enough for a reliable performance estimate from just one assessment set, you can do a three-way split of your data into a training, validation and test set right at the start. (The validation set has the role of the single assessment set.) Instead of using initial_split() to create a binary split, you can use initial_validation_split() to create that three-way split:
three_way_split <- initial_validation_split(ames, prop = c(0.6, 0.2))
three_way_split
#> <Training/Validation/Testing/Total>
#> <1758/586/586/2930>
The prop argument here has two elements, specifying the proportion of the data assigned to the training and the validation set.
To create an rset object for tuning, validation_set() bundles together the training and validation set, ready for use with the tune package.
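As a sketch, assuming a version of rsample recent enough to provide the validation split functions, the bundling step could look like this:

```r
library(rsample)
data(ames, package = "modeldata")

set.seed(123)  # arbitrary seed, for reproducibility only
three_way_split <- initial_validation_split(ames, prop = c(0.6, 0.2))

# Wrap the training and validation sets in a one-row rset
val_set <- validation_set(three_way_split)
val_set

# The single split uses the training set for analysis
# and the validation set for assessment
val_split <- val_set$splits[[1]]
nrow(analysis(val_split))
nrow(assessment(val_split))
```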
If your data is heavily imbalanced (that is, if the distribution of an important continuous variable is skewed, or some classes of a categorical variable are much more common than others), simple random resampling may accidentally skew your data even further by allocating “rare” observations disproportionately into the analysis or assessment fold. In these situations, it can be useful to instead use stratified resampling to ensure the analysis and assessment folds have a similar distribution to your overall data.
All of the functions discussed so far support stratified resampling through their strata argument. This argument takes a single column identifier and uses it to stratify the resampling procedure:
vfold_cv(ames, v = 2, strata = Sale_Price)
#> # 2-fold cross-validation using stratification
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1464/1466]> Fold1
#> 2 <split [1466/1464]> Fold2
By default, rsample will cut continuous variables into four bins, and ensure that each bin is proportionally represented in each set. If desired, this behavior can be changed using the breaks argument:
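A minimal sketch of a finer stratification; the value of 10 bins is an arbitrary illustrative choice:

```r
library(rsample)
data(ames, package = "modeldata")

set.seed(123)  # arbitrary seed, for reproducibility only

# Cut Sale_Price into 10 bins instead of the default 4 before stratifying
strat_folds <- vfold_cv(ames, v = 2, strata = Sale_Price, breaks = 10)
strat_folds
```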
Often, some observations in your data will be “more related” to each other than would be probable under random chance, for instance because they represent repeated measurements of the same subject or were all collected at a single location. In these situations, you often want to assign all related observations to either the analysis or assessment fold as a group, to avoid having assessment data that’s closely related to the data used to fit a model.
All of the functions discussed so far have a “grouped resampling” variation to handle these situations. These functions all start with the group_ prefix, and use the argument group to specify which column should be used to group observations. Other than respecting these groups, these functions all work like their ungrouped variants:
resample <- group_initial_split(Orange, group = Tree)
unique(training(resample)$Tree)
#> [1] 1 2 3 4
#> Levels: 3 < 1 < 5 < 2 < 4
unique(testing(resample)$Tree)
#> [1] 5
#> Levels: 3 < 1 < 5 < 2 < 4
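The same guarantee holds for the cross-validation variant. The sketch below assumes five folds over the five trees and checks that no tree appears in both the analysis and assessment set of any resample:

```r
library(rsample)
data(Orange)

set.seed(123)  # arbitrary seed, for reproducibility only

# With five trees and v = 5, each fold holds out exactly one tree
tree_folds <- group_vfold_cv(Orange, group = Tree, v = 5)

# For each split, count the trees shared between analysis and assessment
shared_trees <- vapply(
  tree_folds$splits,
  function(s) {
    length(intersect(
      as.character(analysis(s)$Tree),
      as.character(assessment(s)$Tree)
    ))
  },
  integer(1)
)

# Zero everywhere means groups were never split across the two sets
shared_trees
```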
It’s important to note that, while functions like group_mc_cv() still let you specify what proportion of your data should be in the analysis set (and group_bootstraps() still attempts to create analysis sets the same size as your original data), rsample won’t “split” groups in order to exactly meet that proportion. These functions start out by assigning one group at random to each set (or, for group_vfold_cv(), to each fold) and then assign each of the remaining groups, in a random order, to whichever set brings the relative sizes of each set closest to the target proportion. That means that resamples are randomized, and you can safely use repeated cross-validation just as you would with ungrouped resampling, but it also means you can wind up with very differently sized analysis and assessment sets than anticipated if your groups are unbalanced:
set.seed(1)
group_bootstraps(ames, Neighborhood, times = 2)
#> # Group bootstrap sampling
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [2939/907]> Bootstrap1
#> 2 <split [2958/635]> Bootstrap2
While most of the grouped resampling functions are always focused on balancing the proportion of data in the analysis set, by default group_vfold_cv() will attempt to balance the number of groups assigned to each fold. If instead you’d like to balance the number of observations in each fold (meaning your assessment sets will be of similar sizes, but smaller groups will be more likely to be assigned to the same folds than would happen under random chance), you can use the argument balance = "observations":
group_vfold_cv(ames, Neighborhood, balance = "observations", v = 2)
#> # Group 2-fold cross-validation
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1475/1455]> Resample1
#> 2 <split [1455/1475]> Resample2
If you’re working with spatial data, your observations will often be more related to their neighbors than to the rest of the data set; as Tobler’s first law of geography puts it, “everything is related to everything else, but near things are more related than distant things.” However, you often won’t have a pre-defined “location” variable that you can use to group related observations. The spatialsample package provides functions for spatial cross-validation using rsample syntax and classes, and is often useful for these situations.
When working with time-based data, it usually doesn’t make sense to randomly resample your data: random resampling will likely result in your analysis set having observations from later than your assessment set, which isn’t a realistic way to assess model performance.
As such, rsample provides a few different functions to make sure that all data in your assessment sets are after that in the analysis set.
First off, two variants on initial_split() and initial_validation_split(), initial_time_split() and initial_validation_time_split(), will assign the first rows of your data to the training set (with the number of rows assigned determined by prop):
initial_time_split(Chicago)
#> <Training/Testing/Total>
#> <4273/1425/5698>
initial_validation_time_split(Chicago)
#> <Training/Validation/Testing/Total>
#> <3418/1140/1140/5698>
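Because these functions split on row position, they assume the data frame is already ordered in time. The sketch below sorts explicitly by the date column first (a precaution added here, not something the functions do for you) and then checks that no training date falls after the earliest testing date:

```r
library(rsample)
library(dplyr)
data(Chicago, package = "modeldata")

# Sort by date so that row order and time order agree
chicago_sorted <- arrange(Chicago, date)

# The first 75% of rows go to training, the rest to testing
time_split <- initial_time_split(chicago_sorted, prop = 0.75)

# The latest training date should not be later than the earliest testing date
ordered_in_time <- max(training(time_split)$date) <= min(testing(time_split)$date)
ordered_in_time
```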
There are also several functions in rsample to help you construct multiple analysis and assessment sets from time-based data. For instance, the sliding_window() function will create “windows” of your data, moving down through the rows of the data frame:
sliding_window(Chicago) %>%
head(2)
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1/1]> Slice0001
#> 2 <split [1/1]> Slice0002
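The defaults give one-row windows, which is rarely what you want in practice. As a sketch, the lookback and assess_stop arguments can widen both sets; the week-long sizes here are arbitrary illustrative values:

```r
library(rsample)
data(Chicago, package = "modeldata")

# lookback = 6 adds the six previous rows to the current row (7 analysis rows);
# assess_stop = 7 makes the assessment set cover the following 7 rows
weekly_windows <- sliding_window(Chicago, lookback = 6, assess_stop = 7)

first_window <- weekly_windows$splits[[1]]
nrow(analysis(first_window))
nrow(assessment(first_window))
```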
If you want to create sliding windows of your data based on a specific variable, you can use the sliding_index() function:
sliding_index(Chicago, date) %>%
head(2)
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1/1]> Slice0001
#> 2 <split [1/1]> Slice0002
And if you want to set the size of windows based on units of time, for instance to have each window contain a year of data, you can use sliding_period():
sliding_period(Chicago, date, "year") %>%
head(2)
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [344/365]> Slice01
#> 2 <split [365/365]> Slice02
All of these functions produce analysis sets of the same size, with the start and end of the analysis set “sliding” down your data frame. If you’d rather have your analysis set get progressively larger, so that you’re predicting new data based upon a growing set of older observations, you can use the rolling_origin() function:
rolling_origin(Chicago) %>%
head(2)
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [5/1]> Slice0001
#> 2 <split [6/1]> Slice0002
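The growing analysis set is easier to see with larger sizes. This sketch assumes an initial analysis set of 365 rows, 30-row assessment sets, and skip = 29 so that each resample starts 30 rows after the previous one (all three values are illustrative):

```r
library(rsample)
data(Chicago, package = "modeldata")

growing <- rolling_origin(
  Chicago,
  initial = 365,     # rows in the first analysis set
  assess = 30,       # rows in every assessment set
  skip = 29,         # advance 30 rows between resamples
  cumulative = TRUE  # keep all earlier rows in the analysis set
)

# Analysis set sizes for the first three resamples
analysis_sizes <- vapply(
  growing$splits[1:3],
  function(s) nrow(analysis(s)),
  integer(1)
)
analysis_sizes
```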
Note that all of these time-based resampling functions are deterministic: unlike the rest of the package, running these functions repeatedly under different random seeds will always return the same results.