“Quick and Easy” ML
- MUCH SIMPLER USER INTERFACE than caret, mlr3, tidymodels, etc.
- Easy for learners, powerful/convenient for experts
- Ideal for teaching!
  - Numerous built-in real datasets
  - Includes tutorials on major ML methods
- Special features for those experienced in ML
  - Advanced functions for feature selection and model development
  - Advanced ML algorithms, including some novel/unusual ones
  - Advanced plotting utilities
The letters ‘qe’ in the package name stand for “quick and easy,” reflecting the package's goal of convenience. We bring together a variety of machine learning (ML) tools from standard R packages, providing wrappers with a uniform, extremely simple interface.
For instance, consider the mlb1 data included in the package, consisting of data on professional baseball players. As usual in R, we load the data with data(mlb1), after attaching the package with library(qeML).
Here is what the data looks like:
> head(mlb1)
        Position Height Weight   Age
1        Catcher     74    180 22.99
2        Catcher     74    215 34.69
3        Catcher     72    210 30.78
4  First_Baseman     72    210 35.43
5  First_Baseman     73    188 35.71
6 Second_Baseman     69    176 29.39
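The qe-series function calls are of a very simple form: the name of the qe function, then the data frame, then the quoted name of the column to be predicted.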
For instance, say we wish to predict player weights. For the random forests ML algorithm, we would make a single simple call to qeRF().
For gradient boosting, the call to qeGBoost() would be similar, and so on. IT COULDN’T BE EASIER! No setup, no predefinitions, etc.; just make a simple call.
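The random forests and gradient boosting calls described above can be sketched as follows. This assumes the qeML package is installed and attached; the object names rfOut and gbOut are chosen here only for illustration.

```r
# Assumes qeML is installed, e.g. via install.packages('qeML')
library(qeML)
data(mlb1)

# Same two-argument form for both algorithms: the data frame, then the
# quoted name of the column to predict
rfOut <- qeRF(mlb1, 'Weight')      # random forests
gbOut <- qeGBoost(mlb1, 'Weight')  # gradient boosting
```

Switching algorithms changes only the function name; the data and the name of the predicted column stay the same.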
Default values are used in the above calls, but nondefault values can be specified through additional arguments.
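A sketch of a call with nondefault values follows. The argument names nTree and minNodeSize, and the values chosen, are assumptions made for illustration; the authoritative argument list is in ?qeRF.

```r
library(qeML)
data(mlb1)

# nTree (number of trees) and minNodeSize (minimum observations per leaf)
# are assumed argument names; confirm them with ?qeRF before relying on this
rf200 <- qeRF(mlb1, 'Weight', nTree = 200, minNodeSize = 10)
```

Any argument not named in the call keeps its default.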
Each qe-series function is paired with a predict method, e.g. to predict player weight:
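A sketch of such a prediction, assuming qeML is attached. The model is fit with holdout=NULL so that all rows are used for training, and the new case is supplied as a one-row data frame whose column names match the predictor columns of mlb1. The names newx and predWt are illustrative.

```r
library(qeML)
data(mlb1)

# Fit on the full data set; holdout=NULL means no holdout set is reserved
z <- qeRF(mlb1, 'Weight', holdout = NULL)

# One new case: every predictor column is present, the predicted column is not
newx <- data.frame(Position = 'Catcher', Height = 73, Age = 28)

predWt <- predict(z, newx)
predWt
```

Random forests are randomized, so the exact value varies slightly from run to run.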
A catcher of height 73 and age 28 would be predicted to have weight about 204.
Categorical variables can be predicted too. Where possible, class probabilities are computed in addition to class. Let’s predict player position from the physical characteristics:
> w <- qeGBoost(mlb1,'Position')
> predict(w,data.frame(Height=73,Weight=185,Age=28))
$predClasses
[1] "Relief_Pitcher"

$probs
        Catcher First_Baseman Outfielder Relief_Pitcher Second_Baseman
[1,] 0.02396515    0.03167778  0.2369061      0.2830575      0.1421796
      Shortstop Starting_Pitcher Third_Baseman
[1,]  0.0592867        0.1824601    0.04046717
A player of height 73, weight 185 and age 28 would be predicted to be a relief pitcher, with probability 0.28. The second most-likely position would be outfielder, and so on.
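The ranking of positions can be read off the probabilities with base R alone. The sketch below copies the values from the $probs output above into a named vector (the name probs is illustrative) and sorts them.

```r
# Class probabilities copied from the $probs output shown above
probs <- c(Catcher = 0.02396515, First_Baseman = 0.03167778,
           Outfielder = 0.2369061, Relief_Pitcher = 0.2830575,
           Second_Baseman = 0.1421796, Shortstop = 0.0592867,
           Starting_Pitcher = 0.1824601, Third_Baseman = 0.04046717)

# Order the positions from most to least likely
ranked <- sort(probs, decreasing = TRUE)
names(ranked)[1:2]   # "Relief_Pitcher" "Outfielder"
```

With a live prediction object, the same sort() call could presumably be applied to the single row of its probs component.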
By default, the qe functions reserve a holdout set on which to assess model accuracy. The remaining data form the training set. After a model is fit to the training set, we use it to predict the holdout data, so as to assess the predictive power of our model. (To specify no holdout, set holdout=NULL in the call.)
> z <- qeRF(mlb1,'Weight')
holdout set has 101 rows
> z$testAcc
[1] 14.45285
> z$baseAcc
[1] 17.22356
The mean absolute prediction error (MAPE) on the holdout data was about 14.5 pounds. On the other hand, if we had simply predicted every player using the overall mean weight, the MAPE would be about 17.2. So, using height, age and player position for our prediction did improve things.
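To make the two figures concrete, here is a base R sketch of how a holdout MAPE and its mean-only baseline can be computed. It is an illustration of the idea, not qeML's internal code: it uses lm() on the built-in mtcars data, an arbitrary holdout size of 8 rows, and the training-set mean as the baseline predictor, all of which are assumptions made here.

```r
# Base R only; mtcars and lm() stand in for mlb1 and the qe functions
set.seed(1)
holdIdx <- sample(nrow(mtcars), 8)   # rows reserved as the holdout set
trn <- mtcars[-holdIdx, ]            # training set: the remaining rows
tst <- mtcars[holdIdx, ]

fit <- lm(mpg ~ wt + hp, data = trn)
preds <- predict(fit, tst)

# Mean absolute prediction error of the fitted model on the holdout rows
testMAPE <- mean(abs(tst$mpg - preds))

# Baseline: predict every holdout row by the mean of the training responses
baseMAPE <- mean(abs(tst$mpg - mean(trn$mpg)))

c(model = testMAPE, baseline = baseMAPE)
```

A model is adding predictive value when the first number is clearly smaller than the second.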
Of course, since the holdout set is chosen at random, the above accuracy numbers are random as well. To gauge the predictive power of a model over many holdout sets, one can use replicMeans(), which is available in qeML via automatic loading of the regtools package, say with 100 holdout sets.
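A sketch of such a call, assuming qeML is attached. replicMeans() takes the number of replications and the expression to evaluate, given as a quoted string; the name repOut is illustrative.

```r
library(qeML)   # also makes regtools' replicMeans() available
data(mlb1)

# Evaluate the quoted expression 100 times, each with a fresh random
# holdout set, and average the resulting testAcc values
repOut <- replicMeans(100, "qeRF(mlb1,'Weight')$testAcc")
repOut
```

Each replication fits a new random forest, so the call can take a while on larger data sets.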
So the true MAPE for this model on new data is estimated to be 13.6. The standard error is also output, to gauge whether 100 replicates is enough.
The package includes tutorials for those with no background in machine learning, as well as tutorials on advanced topics. A few examples (showing how they are invoked):
- vignette('ML_Overview'): for those with no prior ML background
- vignette('Overfitting'): plugging “overfitting” into Google yielded 49,400,000 results, but what is overfitting REALLY about? Read here!
- vignette('Feature_Selection'): we often need to pare down our set of predictor variables, both to save computation and to prevent overfitting; how can this be done, especially in qeML?
- vignette('PCA_and_UMAP'): this vignette first takes a closer, more practical look at Principal Components Analysis, then gives an overview of UMAP, a relatively new nonlinear alternative to PCA
For a list of the package's functions, type vignette('Function_list').