The qeML Package: “Quick and Easy” Machine Learning

“Easy for learners, powerful for advanced users”

What this package is about

“Quick and Easy” ML
- MUCH SIMPLER USER INTERFACE than caret, mlr3, tidymodels, etc.
- easy for learners, powerful/convenient for experts
Ideal for teaching!
- numerous built-in real datasets
- includes tutorials on major ML methods
Special features for those experienced in ML
- advanced functions for feature selection and model development
- advanced ML algorithms, including some novel/unusual ones
- advanced plotting utilities

Easy model fit: first examples

The letters ‘qe’ in the package title stand for “quick and easy,” alluding to the convenience goal of the package. We bring together a variety of machine learning (ML) tools from standard R packages, providing wrappers with a uniform, extremely simple interface.

For instance, consider the mlb1 dataset included in the package, which describes professional baseball players. As usual in R, we load the data:

> data(mlb1)

Here is what the data looks like:

> head(mlb1)
        Position Height Weight   Age
1        Catcher     74    180 22.99
2        Catcher     74    215 34.69
3        Catcher     72    210 30.78
4  First_Baseman     72    210 35.43
5  First_Baseman     73    188 35.71
6 Second_Baseman     69    176 29.39

The qe-series function calls are of the very simple form

qe_function_name(dataset,variable_to_predict)

For instance, say we wish to predict player weights. For the random forests ML algorithm, we would make the simple call

qeRF(mlb1,'Weight')

For gradient boosting, the call would be similar,

qeGBoost(mlb1,'Weight')

and so on. IT COULDN’T BE EASIER! No setup, predefinitions etc.; just make a simple call.
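Because every qe-series function takes the same first two arguments, several methods can be tried in a single loop. The following is a minimal sketch, assuming qeML and its dependencies are installed. qeLin and qeKNN are two further qe-series functions (linear model and k-nearest neighbors), testAcc is the holdout accuracy component described under "Holdout sets", and the name `fitters` is purely illustrative.

```r
library(qeML)
data(mlb1)

# Illustrative list of qe-series functions; each takes (data, yName) first.
fitters <- list(qeLin = qeLin, qeKNN = qeKNN, qeRF = qeRF, qeGBoost = qeGBoost)

# Fit each model with default settings and collect its holdout accuracy.
accs <- sapply(fitters, function(f) f(mlb1, 'Weight')$testAcc)
print(accs)
```

Since each fit draws its own random holdout set, the numbers will vary from run to run and are only a rough comparison.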

Default values are used in the above calls, but nondefault values can be specified, e.g.

qeRF(mlb1,'Weight',nTree=200)

Prediction

Each qe-series function is paired with a predict method, e.g. to predict player weight:

> z <- qeGBoost(mlb1,'Weight')
> x <- data.frame(Position='Catcher',Height=73,Age=28)
> predict(z,x)
[1] 204.2406

A catcher of height 73 and age 28 would be predicted to have weight about 204.

Categorical variables can be predicted too. Where possible, class probabilities are computed in addition to class. Let’s predict player position from the physical characteristics:

> w <- qeGBoost(mlb1,'Position')
> predict(w,data.frame(Height=73,Weight=185,Age=28))
$predClasses
[1] "Relief_Pitcher"

$probs
        Catcher First_Baseman Outfielder Relief_Pitcher Second_Baseman
[1,] 0.02396515    0.03167778  0.2369061      0.2830575      0.1421796
     Shortstop Starting_Pitcher Third_Baseman
[1,] 0.0592867        0.1824601    0.04046717

A player of height 73, weight 185 and age 28 would be predicted to be a relief pitcher, with probability 0.28. The second most-likely position would be outfielder, and so on.
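To read off the ranking of positions directly, the probabilities can be sorted. The sketch below uses base R only and hard-codes the probabilities printed above so that it runs without fitting a model. With a real fit, the matrix would come from the probs component of the predict() output.

```r
# Class probabilities copied from the output above (one row, eight classes).
probs <- matrix(
  c(0.02396515, 0.03167778, 0.2369061, 0.2830575,
    0.1421796, 0.0592867, 0.1824601, 0.04046717),
  nrow = 1,
  dimnames = list(NULL, c('Catcher', 'First_Baseman', 'Outfielder',
                          'Relief_Pitcher', 'Second_Baseman', 'Shortstop',
                          'Starting_Pitcher', 'Third_Baseman'))
)

# Sort the single row in decreasing order to rank the positions.
ranked <- sort(probs[1, ], decreasing = TRUE)
print(head(ranked, 3))
```

The first three entries are Relief_Pitcher, Outfielder and Starting_Pitcher, matching the interpretation given above.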

Holdout sets

By default, the qe functions reserve a holdout set on which to assess model accuracy. The remaining data form the training set. After a model is fit to the training set, we use it to predict the holdout data, so as to assess the predictive power of our model. (To specify no holdout, set holdout=NULL in the call.)

> z <- qeRF(mlb1,'Weight')
holdout set has  101 rows
> z$testAcc
[1] 14.45285
> z$baseAcc
[1] 17.22356

The mean absolute prediction error (MAPE) on the holdout data was about 14.5 pounds. On the other hand, if we had simply predicted every player using the overall mean weight, the MAPE would be about 17.2. So, using height, age and player position for our prediction did improve things.
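For readers who want to see what such a holdout assessment involves, here is a rough base-R sketch of the idea. It is not the qeML implementation: lm() and the built-in mtcars data are stand-ins, the holdout size of 8 is arbitrary, and using the training-set mean for the baseline is an assumption about the details.

```r
# Base-R sketch of holdout evaluation; lm() and mtcars are stand-ins.
set.seed(1)                                   # for reproducibility only
holdIdx <- sample(nrow(mtcars), 8)            # random holdout rows
trn <- mtcars[-holdIdx, ]                     # training set
tst <- mtcars[holdIdx, ]                      # holdout set

fit <- lm(mpg ~ wt + hp, data = trn)          # fit on training data only
preds <- predict(fit, newdata = tst)          # predict the holdout cases

testAcc <- mean(abs(preds - tst$mpg))         # MAPE using the features
baseAcc <- mean(abs(mean(trn$mpg) - tst$mpg)) # MAPE using only the mean
print(c(testAcc = testAcc, baseAcc = baseAcc))
```

Comparing the two numbers plays the same role as comparing testAcc and baseAcc in the qeRF output: the features are useful to the extent that the first is smaller than the second.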

Of course, since the holdout set is chosen at random, the accuracy numbers above are random as well. To gauge the predictive power of a model over many holdout sets, one can use replicMeans(), which is available in qeML via automatic loading of the regtools package. Say for 100 holdout sets:

> replicMeans(100,"qeRF(mlb1,'Weight')$testAcc")
[1] 13.6354
attr(,"stderr")
[1] 0.1147791

So the true MAPE for this model on new data is estimated to be 13.6. The standard error is also output, to gauge whether 100 replicates is enough.
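The averaging idea can be sketched in base R as follows. This is an illustration, not the regtools code: lm() and mtcars are stand-ins, `oneSplitMAPE` is a hypothetical helper, and the standard error is assumed to be the usual standard deviation divided by the square root of the number of replicates.

```r
# One random holdout split; returns the holdout MAPE (lm() is a stand-in).
oneSplitMAPE <- function(nHold = 8) {
  holdIdx <- sample(nrow(mtcars), nHold)
  fit <- lm(mpg ~ wt + hp, data = mtcars[-holdIdx, ])
  preds <- predict(fit, newdata = mtcars[holdIdx, ])
  mean(abs(preds - mtcars$mpg[holdIdx]))
}

set.seed(1)
accs <- replicate(100, oneSplitMAPE())   # 100 independent holdout splits

avgAcc <- mean(accs)                     # estimated MAPE on new data
stdErr <- sd(accs) / sqrt(length(accs))  # standard error of that mean
print(c(mean = avgAcc, stderr = stdErr))
```

A standard error that is small relative to the mean suggests that the number of replicates is adequate. Otherwise, more holdout sets can be drawn.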

Tutorials

The package includes tutorials for those with no background in machine learning, as well as tutorials on advanced topics. A few examples (showing how they are invoked):

vignette('ML_Overview'): for those with no prior ML background
vignette('Overfitting'): plugging “overfitting” into Google yielded 49,400,000 results, but what is overfitting REALLY about? Read here!
vignette('Feature_Selection'): we often need to pare down our set of predictor variables, both to save computation and to prevent overfitting; how can this be done, especially in qeML?
vignette('PCA_and_UMAP'): this vignette first takes a closer, more practical look at Principal Components Analysis, then gives an overview of UMAP, a relatively new nonlinear alternative to PCA

Full function list, by category

Type vignette('Function_list').

Package author: Norm Matloff, UC Davis

Professor of computer science, and former professor of statistics; 2017 Ziegel Award for the book Statistical Regression and Classification: from Linear Models to Machine Learning; Distinguished Teaching Award and Outstanding Public Service Award, UC Davis.

Quick Start