library(pacman)
p_load(tidyverse, tidymodels, gt, finetune, bonsai)
Using option_add to Tune Different Models of a Workflow Set
Olamide Adu
June 11, 2024
The tidymodels package is a game-changer for the R ecosystem, providing a streamlined and intuitive approach to modeling. Built on the tidyverse foundation, it offers a cohesive framework that simplifies the journey from data wrangling to robust models. What makes tidymodels stand out is its consistent workflow, which reduces the learning curve for data scientists and ensures compatibility across different modeling packages (Kuhn and Silge 2022).
The workflows package is one of the standout components of tidymodels, making the iterative machine learning process in R more manageable. By bundling data preprocessing and model fitting steps into a single coherent object, workflows simplifies the complexities of the machine learning pipeline, ensuring each step is clearly defined and reproducible. This iterative machine learning process, as covered in "Tidy Modeling with R" (Kuhn and Silge 2022), is illustrated below:
The focus of this post, the workflowsets package, builds on workflows by extending its capabilities to handle multiple machine learning models. Since the best model for any given task is not known in advance, it is crucial to test several models and compare their performance. workflowsets is designed to manage multiple workflows, making it easier to compare different modeling approaches and preprocessing strategies.

This blog post introduces the option_add() function of the workflowsets package, which controls options for workflow set evaluation functions such as fit_resamples() and tune_grid(). For more information on this function, see the documentation with ?option_add.
We start by loading the packages we will use for this post.

For this post we'll use the heart disease dataset from kaggle.com. A preview of the data is given in Table 1.
Table 1: Heart Diseases

| Age | Gender | Cholesterol | Blood Pressure | Heart Rate | Smoking | Alcohol Intake | Exercise Hours | Family History | Diabetes | Obesity | Stress Level | Blood Sugar | Exercise Induced Angina | Chest Pain Type | Heart Disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 75 | Female | 228 | 119 | 66 | Current | Heavy | 1 | No | No | Yes | 8 | 119 | Yes | Atypical Angina | 1 |
| 48 | Male | 204 | 165 | 62 | Current | None | 5 | No | No | No | 9 | 70 | Yes | Typical Angina | 0 |
| 53 | Male | 234 | 91 | 67 | Never | Heavy | 3 | Yes | No | Yes | 5 | 196 | Yes | Atypical Angina | 1 |
| 69 | Female | 192 | 90 | 72 | Current | None | 4 | No | Yes | No | 7 | 107 | Yes | Non-anginal Pain | 0 |
| 62 | Female | 172 | 163 | 93 | Never | None | 6 | No | Yes | No | 2 | 183 | Yes | Asymptomatic | 0 |
| 77 | Male | 309 | 110 | 73 | Never | None | 0 | No | Yes | Yes | 4 | 122 | Yes | Asymptomatic | 1 |
skimr::skim_without_charts(heart_disease) |>
gt() |>
tab_spanner(
label = "Character",
columns = character.min:character.whitespace
) |>
tab_spanner(
label = "Numeric",
columns = starts_with("numeric")
) |>
cols_label(
skim_type ~ "Type",
skim_variable ~"Variable",
n_missing ~ "Missing?",
complete_rate ~ "Complete?",
character.min ~ "Min",
character.max ~ "Max",
character.empty ~ "Empty",
character.n_unique ~ "Unique",
character.whitespace ~ "Gap",
numeric.mean ~ "Mean",
numeric.sd ~ "SD",
numeric.p0 ~ "Min",
numeric.p25 ~ "25%",
numeric.p50 ~ "Median",
numeric.p75 ~ "75%",
numeric.p100 ~ "Max"
) |>
cols_width(
skim_type ~ px(80),
everything() ~ px(70)
) |>
  opt_stylize(
    style = 2,
    color = "cyan"
  ) |>
as_raw_html()
Table 2: Data summary (the Min–Gap columns apply to Character variables; Mean–Max to Numeric variables)

| Type | Variable | Missing? | Complete? | Min | Max | Empty | Unique | Gap | Mean | SD | Min | 25% | Median | 75% | Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
character | Gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Smoking | 0 | 1 | 5 | 7 | 0 | 3 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Alcohol Intake | 0 | 1 | 4 | 8 | 0 | 3 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Family History | 0 | 1 | 2 | 3 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Diabetes | 0 | 1 | 2 | 3 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Obesity | 0 | 1 | 2 | 3 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Exercise Induced Angina | 0 | 1 | 2 | 3 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Chest Pain Type | 0 | 1 | 12 | 16 | 0 | 4 | 0 | NA | NA | NA | NA | NA | NA | NA |
numeric | Age | 0 | 1 | NA | NA | NA | NA | NA | 52.293 | 15.727126 | 25 | 39.00 | 52.0 | 66 | 79 |
numeric | Cholesterol | 0 | 1 | NA | NA | NA | NA | NA | 249.939 | 57.914673 | 150 | 200.00 | 248.0 | 299 | 349 |
numeric | Blood Pressure | 0 | 1 | NA | NA | NA | NA | NA | 135.281 | 26.388300 | 90 | 112.75 | 136.0 | 159 | 179 |
numeric | Heart Rate | 0 | 1 | NA | NA | NA | NA | NA | 79.204 | 11.486092 | 60 | 70.00 | 79.0 | 89 | 99 |
numeric | Exercise Hours | 0 | 1 | NA | NA | NA | NA | NA | 4.529 | 2.934241 | 0 | 2.00 | 4.5 | 7 | 9 |
numeric | Stress Level | 0 | 1 | NA | NA | NA | NA | NA | 5.646 | 2.831024 | 1 | 3.00 | 6.0 | 8 | 10 |
numeric | Blood Sugar | 0 | 1 | NA | NA | NA | NA | NA | 134.941 | 36.699624 | 70 | 104.00 | 135.0 | 167 | 199 |
numeric | Heart Disease | 0 | 1 | NA | NA | NA | NA | NA | 0.392 | 0.488441 | 0 | 0.00 | 0.0 | 1 | 1 |
Table 2 shows there are no missing values, so we can proceed with our analysis.
Next, we will convert all character variables to factors.
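A minimal sketch of that conversion. The object name `heart_disease` is an assumption, and the later tables suggest the integer-valued `exercise_hours` and `stress_level` columns (and the 0/1 outcome) were converted as well, so those steps are included here as assumptions about the author's code:

```r
heart_disease <- heart_disease |>
  mutate(
    # character columns -> factors
    across(where(is.character), as.factor),
    # Tables 3-4 suggest these integer columns were treated as factors too
    across(c(exercise_hours, stress_level), as.factor),
    # recode the 0/1 outcome to a labelled factor for classification
    heart_disease = factor(heart_disease, levels = c(0, 1), labels = c("No", "Yes"))
  )
```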
We won’t spend time on EDA and proceed with our modeling workflow.
We will split our data into 75% for training and 25% for testing, using the outcome variable (heart_disease) as the strata to ensure a balanced split. Additionally, we will create validation folds to evaluate the models.
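A sketch of the split and resampling setup (object names, the seed, and the default 10 folds are assumptions):

```r
set.seed(2024)

# 75/25 split, stratified on the outcome to keep class proportions balanced
hd_split <- initial_split(heart_disease, prop = 0.75, strata = heart_disease)
hd_train <- training(hd_split)
hd_test  <- testing(hd_split)

# cross-validation folds on the training data, also stratified
hd_folds <- vfold_cv(hd_train, strata = heart_disease)
```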
age | gender | cholesterol | blood_pressure | heart_rate | smoking | alcohol_intake | exercise_hours | family_history | diabetes | obesity | stress_level | blood_sugar | exercise_induced_angina | chest_pain_type | heart_disease |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
48 | Male | 204 | 165 | 62 | Current | None | 5 | No | No | No | 9 | 70 | Yes | Typical Angina | No |
62 | Female | 172 | 163 | 93 | Never | None | 6 | No | Yes | No | 2 | 183 | Yes | Asymptomatic | No |
37 | Female | 317 | 137 | 66 | Current | Heavy | 3 | No | Yes | Yes | 5 | 114 | No | Non-anginal Pain | No |
43 | Male | 155 | 169 | 82 | Current | Heavy | 8 | Yes | Yes | No | 2 | 163 | No | Typical Angina | No |
44 | Female | 250 | 111 | 66 | Former | None | 6 | Yes | No | Yes | 3 | 121 | Yes | Non-anginal Pain | No |
43 | Female | 279 | 173 | 81 | Current | Moderate | 9 | Yes | No | No | 7 | 150 | No | Asymptomatic | No |
We will use two models for our analysis:

- K-nearest neighbors (KNN)
- Generalized linear model (GLM)
Below is the specification we have set for the KNN model:
K-Nearest Neighbor Model Specification (classification)
Main Arguments:
neighbors = tune()
weight_func = tune()
dist_power = tune()
Computational engine: kknn
Model fit template:
kknn::train.kknn(formula = missing_arg(), data = missing_arg(),
ks = min_rows(tune(), data, 5), kernel = tune(), distance = tune())
The KNN model specification has three tuning parameters. For the GLM model we have the following:
Logistic Regression Model Specification (classification)
Engine-Specific Arguments:
family = stats::binomial(link = "logit")
Computational engine: glm
Model fit template:
stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
family = stats::binomial(link = "logit"))
The GLM specification has no tuning parameters.
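The two specifications printed above can be reproduced with parsnip along these lines (the object names are assumptions):

```r
# KNN: all three main arguments flagged for tuning
knn_spec <- nearest_neighbor(
  neighbors   = tune(),
  weight_func = tune(),
  dist_power  = tune()
) |>
  set_engine("kknn") |>
  set_mode("classification")

# GLM: plain logistic regression, no tuning parameters
glm_spec <- logistic_reg() |>
  set_engine("glm", family = stats::binomial(link = "logit")) |>
  set_mode("classification")
```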
As seen in the model specifications above, the formula is missing. In the next step we'll define the formula for all models, along with the preprocessing/feature-engineering options we want to include, using the recipes package.
We have three preprocessing specifications. The first defines the formula we will use, the second additionally normalizes all numeric predictors, and the third also creates dummy variables for our categorical variables.
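A plausible construction of the three recipes, assuming the training data is in an object named `hd_train` and that the non-numeric columns were already converted to factors (the recipe object names are also assumptions):

```r
# 1. formula only: predict heart_disease from everything else
form_rec <- recipe(heart_disease ~ ., data = hd_train)

# 2. formula + centering/scaling of the numeric predictors
norm_rec <- form_rec |>
  step_normalize(all_numeric_predictors())

# 3. formula + normalizing + dummy encoding of factor predictors
dum_rec <- norm_rec |>
  step_dummy(all_nominal_predictors())
```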
age | gender | cholesterol | blood_pressure | heart_rate | smoking | alcohol_intake | exercise_hours | family_history | diabetes | obesity | stress_level | blood_sugar | exercise_induced_angina | chest_pain_type | heart_disease |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-0.2983420 | Male | -0.784268085 | 1.13748976 | -1.4962849 | Current | None | 5 | No | No | No | 9 | -1.7689353 | Yes | Typical Angina | No |
0.6020865 | Female | -1.332244271 | 1.06281212 | 1.2243394 | Never | None | 6 | No | Yes | No | 2 | 1.3197224 | Yes | Asymptomatic | No |
-1.0058214 | Female | 1.150772824 | 0.09200285 | -1.1452366 | Current | Heavy | 3 | No | Yes | Yes | 5 | -0.5662721 | No | Non-anginal Pain | No |
-0.6199235 | Male | -1.623356620 | 1.28684503 | 0.2589566 | Current | Heavy | 8 | Yes | Yes | No | 2 | 0.7730573 | No | Typical Angina | No |
-0.5556072 | Female | 0.003447684 | -0.87880642 | -1.1452366 | Former | None | 6 | Yes | No | Yes | 3 | -0.3749394 | Yes | Non-anginal Pain | No |
-0.6199235 | Female | 0.500051103 | 1.43620030 | 0.1711946 | Current | Moderate | 9 | Yes | No | No | 7 | 0.4177250 | No | Asymptomatic | No |
Table 3 previews how the data looks after normalizing, the second feature-engineering technique. Table 4 shows the data after creating dummy variables for the categorical variables.
age | cholesterol | blood_pressure | heart_rate | blood_sugar | heart_disease | gender_Male | smoking_Former | smoking_Never | alcohol_intake_Moderate | alcohol_intake_None | exercise_hours_X1 | exercise_hours_X2 | exercise_hours_X3 | exercise_hours_X4 | exercise_hours_X5 | exercise_hours_X6 | exercise_hours_X7 | exercise_hours_X8 | exercise_hours_X9 | family_history_Yes | diabetes_Yes | obesity_Yes | stress_level_X2 | stress_level_X3 | stress_level_X4 | stress_level_X5 | stress_level_X6 | stress_level_X7 | stress_level_X8 | stress_level_X9 | stress_level_X10 | exercise_induced_angina_Yes | chest_pain_type_Atypical.Angina | chest_pain_type_Non.anginal.Pain | chest_pain_type_Typical.Angina |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-0.2983420 | -0.784268085 | 1.13748976 | -1.4962849 | -1.7689353 | No | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
0.6020865 | -1.332244271 | 1.06281212 | 1.2243394 | 1.3197224 | No | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
-1.0058214 | 1.150772824 | 0.09200285 | -1.1452366 | -0.5662721 | No | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
-0.6199235 | -1.623356620 | 1.28684503 | 0.2589566 | 0.7730573 | No | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
-0.5556072 | 0.003447684 | -0.87880642 | -1.1452366 | -0.3749394 | No | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
-0.6199235 | 0.500051103 | 1.43620030 | 0.1711946 | 0.4177250 | No | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Using the workflow_set() function, we've tied the three recipes to the two models, giving six workflows in total. The K-nearest neighbors model needs tuning, as mentioned earlier.
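A minimal sketch of that call (the recipe and model object names are assumptions):

```r
# cross 3 preprocessors with 2 models -> 6 workflows;
# ids combine the list names, e.g. "form_glm", "norm_knn", "dum_knn"
hd_wf_set <- workflow_set(
  preproc = list(form = form_rec, norm = norm_rec, dum = dum_rec),
  models  = list(glm = glm_spec, knn = knn_spec),
  cross   = TRUE
)
```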
Warning: `grid_latin_hypercube()` was deprecated in dials 1.3.0.
ℹ Please use `grid_space_filling()` instead.
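The grid-construction code isn't shown, but judging by the plots below and the deprecation warning above, it was along these lines (the parameter-set object name and grid sizes are assumptions; `grid_space_filling()` is the replacement the warning suggests):

```r
# tunable parameters extracted from the KNN specification
knn_params <- extract_parameter_set_dials(knn_spec)

# a regular grid and a space-filling (Latin-hypercube-style) grid
knn_grid  <- grid_regular(knn_params, levels = 5)
knn_latin <- grid_space_filling(knn_params, size = 100)
```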
grid_control <- control_race(
save_pred = TRUE,
save_workflow = TRUE
)
knn_grid |>
ggplot(aes(dist_power, neighbors, col = weight_func)) +
geom_point() +
ggthemes::scale_color_colorblind() +
labs(
x = "Minkowski distance",
y = "Number of Neighbors",
title = "k-NN Regular Grid"
) +
facet_wrap(~weight_func) +
theme(
legend.position = "none"
)
knn_latin |>
ggplot(aes(dist_power, neighbors, col = weight_func)) +
geom_point() +
ggthemes::scale_color_tableau() +
labs(
x = "Minkowski distance",
y = "Number of Neighbors",
title = "k-NN Latin Hypercube Grid"
) +
facet_wrap(~weight_func) +
theme(
legend.position = "none"
)
We set the tuning grids for the model and use the option_add() function to specify them. We will test two different grid structures, as shown in Figure 4.
Using option_add to Specify Model Grids

We can specify the grid to use for each model with the option_add() function. Below is an image of the hd_wf_set object we defined earlier; let's interpret its output.

The image above shows that both the option and result columns contain empty (zero-element) entries for every workflow.
# Note: re-supplying the `grid` option for the same id replaces the earlier
# value, so each KNN workflow ends up with the grid supplied last plus `control`
hd_tune <- hd_wf_set |>
option_add(
id = "norm_knn",
grid = knn_grid,
control = grid_control
) |>
option_add(
id = "form_knn",
grid = knn_grid,
control = grid_control
) |>
option_add(
id = "norm_knn",
grid = knn_latin,
control = grid_control
) |>
option_add(
id = "form_knn",
grid = knn_latin,
control = grid_control
) |>
option_add(
id = "dum_knn",
grid = knn_grid,
control = grid_control
) |>
option_add(
id = "dum_knn",
grid = knn_latin,
control = grid_control
)
After using the option_add() function, we can see that each KNN model specification has two options (grid and control) added to it. We can now proceed to tune our models.
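The tuning call itself isn't shown; given the control_race() object defined above, it was presumably run through workflow_map() with the racing method from finetune, roughly as follows (the result object name, resamples object, and metric are assumptions):

```r
hd_res <- hd_tune |>
  workflow_map(
    "tune_race_anova",            # racing method, matching control_race()
    resamples = hd_folds,         # validation folds created earlier
    metrics   = metric_set(accuracy),
    verbose   = TRUE,
    seed      = 123
  )
```

Workflows without tuning parameters (the GLM ones) automatically fall back to fit_resamples(), which is exactly what the log below reports.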
i No tuning parameters. `fit_resamples()` will be attempted
i 1 of 6 resampling: form_glm
✔ 1 of 6 resampling: form_glm (825ms)
i 2 of 6 tuning: form_knn
✔ 2 of 6 tuning: form_knn (1m 44.3s)
i No tuning parameters. `fit_resamples()` will be attempted
i 3 of 6 resampling: norm_glm
✔ 3 of 6 resampling: norm_glm (1s)
i 4 of 6 tuning: norm_knn
✔ 4 of 6 tuning: norm_knn (1m 46.6s)
i No tuning parameters. `fit_resamples()` will be attempted
i 5 of 6 resampling: dum_glm
✔ 5 of 6 resampling: dum_glm (1.1s)
i 6 of 6 tuning: dum_knn
✔ 6 of 6 tuning: dum_knn (2m 25.6s)
Model ID | Model Number | mean | std_err | rank |
---|---|---|---|---|
form_knn | Preprocessor1_Model236 | 0.8653333 | 0.013546445 | 1 |
norm_knn | Preprocessor1_Model236 | 0.8653333 | 0.013546445 | 2 |
form_glm | Preprocessor1_Model1 | 0.8613333 | 0.013799266 | 3 |
norm_glm | Preprocessor1_Model1 | 0.8613333 | 0.013799266 | 4 |
dum_glm | Preprocessor1_Model1 | 0.8613333 | 0.013799266 | 5 |
norm_knn | Preprocessor1_Model134 | 0.8520000 | 0.012000000 | 6 |
form_knn | Preprocessor1_Model134 | 0.8520000 | 0.012000000 | 7 |
dum_knn | Preprocessor1_Model134 | 0.7426667 | 0.008899993 | 8 |
dum_knn | Preprocessor1_Model137 | 0.7413333 | 0.009573626 | 9 |
dum_knn | Preprocessor1_Model129 | 0.7386667 | 0.011958777 | 10 |
dum_knn | Preprocessor1_Model103 | 0.7360000 | 0.010850272 | 11 |
dum_knn | Preprocessor1_Model222 | 0.7333333 | 0.015267168 | 12 |
dum_knn | Preprocessor1_Model106 | 0.7333333 | 0.008663817 | 13 |
dum_knn | Preprocessor1_Model221 | 0.7226667 | 0.013303671 | 14 |
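The leaderboard above can be produced with rank_results() (the tuned object name is an assumption, and the ranked metric isn't shown, so the default of the first metric is used here):

```r
# rank every configuration across all six workflows by the first metric
rank_results(hd_res, select_best = FALSE) |>
  select(wflow_id, .config, mean, std_err, rank)
```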
Based on the results, the KNN model with no additional preprocessing (form_knn) is the best-performing model, essentially tied with its normalized counterpart.
The strong performance of our KNN model, even with minimal preprocessing, underscores the usefulness of the option_add() function. By using option_add(), we efficiently defined and refined the model's tuning grids, allowing us to systematically explore and optimize hyperparameters. This approach not only supports model performance but also keeps the tuning setup explicit and reproducible in our predictive analytics pipeline.