The explore package offers a simplified way to use machine learning and make a prediction.
explain_tree() creates a decision tree
explain_forest() creates a random forest
explain_xgboost() creates an xgboost model
explain_logreg() creates a logistic regression
predict_target() uses a model to make a prediction

We use synthetic data in this example.
library(dplyr)
library(explore)
train <- create_data_buy(obs = 1000, seed = 1)
glimpse(train)
#> Rows: 1,000
#> Columns: 13
#> $ period <int> 202012, 202012, 202012, 202012, 202012, 202012, 202012~
#> $ buy <int> 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, ~
#> $ age <int> 46, 42, 69, 51, 55, 58, 69, 73, 59, 34, 20, 36, 48, 45~
#> $ city_ind <int> 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, ~
#> $ female_ind <int> 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, ~
#> $ fixedvoice_ind <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ fixeddata_ind <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ fixedtv_ind <int> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, ~
#> $ mobilevoice_ind <int> 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, ~
#> $ mobiledata_prd <chr> "NO", "NO", "BUSINESS", "BUSINESS", "BUSINESS", "NO", ~
#> $ bbi_speed_ind <int> 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, ~
#> $ bbi_usg_gb <int> 93, 66, 72, 48, 81, 65, 48, 42, 40, 63, 64, 72, 69, 84~
#> $ hh_single <int> 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ~
First we create a decision tree model, using buy as the target (buy contains only the values 0 and 1).
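A minimal sketch of this step, assuming the target is passed as target = buy; the exact call is not shown in the text above, so treat it as illustrative rather than the definitive invocation:

```r
library(dplyr)
library(explore)

# Same synthetic training data as above
train <- create_data_buy(obs = 1000, seed = 1)

# Fit and plot a decision tree with buy as the binary target
# (assumption: the target argument accepts the bare column name)
train %>% explain_tree(target = buy)
```

The result is a plot of the tree, so it is meant to be inspected visually rather than stored.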
We see some clear patterns. Now we create a random forest model, which is usually more accurate than a single decision tree. To get the model itself, use the parameter out = "model".
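A sketch of how the model object could be obtained. The out = "model" parameter is the one named above; passing the target as target = buy and leaving all other settings at their defaults are assumptions:

```r
library(dplyr)
library(explore)

# Same synthetic training data as above
train <- create_data_buy(obs = 1000, seed = 1)

# out = "model" returns the fitted model instead of drawing a plot
# (assumption: all other arguments are left at their defaults)
model <- train %>% explain_forest(target = buy, out = "model")
```

Keeping the fitted object in model is what allows it to be passed to predict_target() later.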
Now we create test data and use the model for a prediction. We use a different seed so we get different data.
test <- create_data_buy(obs = 1000, seed = 2)
glimpse(test)
#> Rows: 1,000
#> Columns: 13
#> $ period <int> 202012, 202012, 202012, 202012, 202012, 202012, 202012~
#> $ buy <int> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, ~
#> $ age <int> 40, 61, 76, 39, 59, 47, 37, 65, 34, 64, 53, 46, 56, 67~
#> $ city_ind <int> 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, ~
#> $ female_ind <int> 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, ~
#> $ fixedvoice_ind <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ~
#> $ fixeddata_ind <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ fixedtv_ind <int> 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, ~
#> $ mobilevoice_ind <int> 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, ~
#> $ mobiledata_prd <chr> "BUSINESS", "MOBILE STICK", "NO", "MOBILE STICK", "NO"~
#> $ bbi_speed_ind <int> 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, ~
#> $ bbi_usg_gb <int> 77, 68, 8, 63, 49, 75, 66, 45, 82, 55, 45, 53, 49, 53,~
#> $ hh_single <int> 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, ~
test <- test %>% predict_target(model = model)
glimpse(test)
#> Rows: 1,000
#> Columns: 15
#> $ period <int> 202012, 202012, 202012, 202012, 202012, 202012, 202012~
#> $ buy <int> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, ~
#> $ age <int> 40, 61, 76, 39, 59, 47, 37, 65, 34, 64, 53, 46, 56, 67~
#> $ city_ind <int> 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, ~
#> $ female_ind <int> 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, ~
#> $ fixedvoice_ind <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ~
#> $ fixeddata_ind <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ fixedtv_ind <int> 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, ~
#> $ mobilevoice_ind <int> 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, ~
#> $ mobiledata_prd <chr> "BUSINESS", "MOBILE STICK", "NO", "MOBILE STICK", "NO"~
#> $ bbi_speed_ind <int> 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, ~
#> $ bbi_usg_gb <int> 77, 68, 8, 63, 49, 75, 66, 45, 82, 55, 45, 53, 49, 53,~
#> $ hh_single <int> 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, ~
#> $ prediction_0 <dbl> 0.06, 0.40, 1.00, 0.00, 0.74, 0.40, 0.04, 1.00, 0.00, ~
#> $ prediction_1 <dbl> 0.94, 0.60, 0.00, 1.00, 0.26, 0.60, 0.96, 0.00, 1.00, ~
We now have two new variables: prediction_0 (the probability of buy == 0) and prediction_1 (the probability of buy == 1). We can check the predictions by comparing prediction_1 with the real values of buy.
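One way to make this comparison is to plot prediction_1 split by buy. The sketch below rebuilds the whole pipeline so it runs on its own; using explore() with a target for the plot is an assumption about how the check might be done, not necessarily the call used originally:

```r
library(dplyr)
library(explore)

# Rebuild training data, model, and scored test data
train <- create_data_buy(obs = 1000, seed = 1)
model <- train %>% explain_forest(target = buy, out = "model")
test <- create_data_buy(obs = 1000, seed = 2) %>%
  predict_target(model = model)

# Plot the distribution of prediction_1 separately for buy == 0 and buy == 1
test %>% explore(prediction_1, target = buy)
```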
There is a clear difference in prediction_1 between buy == 0 and buy == 1, so the prediction works.
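The visual check can be backed by a simple numeric summary. The following is only a sketch: the column names mean_prediction_1 and predicted_buy, and the 0.5 cutoff, are illustrative choices that do not come from the package.

```r
library(dplyr)
library(explore)

# Rebuild training data, model, and scored test data
train <- create_data_buy(obs = 1000, seed = 1)
model <- train %>% explain_forest(target = buy, out = "model")
test <- create_data_buy(obs = 1000, seed = 2) %>%
  predict_target(model = model)

# Average predicted probability of buy == 1 within each actual group
by_buy <- test %>%
  group_by(buy) %>%
  summarise(mean_prediction_1 = mean(prediction_1), n = n())
print(by_buy)

# Share of correct predictions at a hypothetical cutoff of 0.5
accuracy <- test %>%
  mutate(predicted_buy = as.integer(prediction_1 >= 0.5)) %>%
  summarise(accuracy = mean(predicted_buy == buy)) %>%
  pull(accuracy)
print(accuracy)
```

If the model separates the groups well, mean_prediction_1 should be noticeably higher for buy == 1 than for buy == 0.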