pacman::p_load(nanoparquet, tidymodels, knitr, ggthemr)
ggthemr(palette = "earth", layout = "minimal")
Olamide Adu
September 10, 2024
This project is a reuse of the earlier prediction of fire occurrence in Algeria's forests, with a little twist. Instead of predicting fire occurrence, this project predicts the forest Fire Weather Index (FWI). It also differs from the former project in the algorithm used: the former project used logistic regression with a principal component analysis (PCA) preprocessing step to reduce multicollinearity between the features, while this project uses regularized regression to predict the FWI.
The objectives of this project are:
The data used for this project is exactly the same as in the fire occurrence prediction project; check here for the data definition. For a deeper understanding of the data and the correlations between the different variables, check that post, as I will dive straight into model development here.
The processed and cleaned data is already available and will be imported. Prior to that, we have to load the necessary libraries for this analysis.
We will dive into model development straight away. We begin with splitting the data, creating resamples, and setting the model specifications before feature engineering and, finally, model development.
## Data Splitting
The data will be split in a 70-30 proportion: 70% for training and 30% for testing.
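The split might look like the sketch below, using rsample's initial_split(); the name of the imported data object (`algeria`) is an assumption:

```r
# Reproducible 70-30 train/test split (data object name assumed)
set.seed(123)
algeria_split <- initial_split(algeria, prop = 0.7)
algeria_train <- training(algeria_split)
algeria_test <- testing(algeria_split)
```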
Due to the small size of the data (243 rows), a bootstrap resampling technique will be employed.
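A minimal sketch of the resamples, assuming the training set is named `algeria_train` (the `algeria_bstrap` name is the one used in the tuning code later):

```r
# Bootstrap resamples of the training data for tuning
set.seed(123)
algeria_bstrap <- bootstraps(algeria_train)
```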
We do not know the penalty to use for our regularized regression, so we tune this parameter. The best value for the elastic-net mixture is also unknown and will be tuned as well.
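The specifications might look like this sketch. The `*_spec` and `tune_*` names match those used later in the post; the exact grid sizes are assumptions:

```r
# Elastic net: tune both penalty (lambda) and mixture (alpha)
elastic_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

# Lasso: mixture fixed at 1, only the penalty is tuned
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

# Ridge: mixture fixed at 0, only the penalty is tuned
ridge_spec <- linear_reg(penalty = tune(), mixture = 0) |>
  set_engine("glmnet")

# Tuning grids (sizes are illustrative)
tune_elastic <- grid_regular(penalty(), mixture(), levels = c(10, 5))
tune_lasso <- grid_regular(penalty(), levels = 10)
tune_ridge <- grid_regular(penalty(), levels = 10)
```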
The preprocessing steps carried out before, except PCA, will be employed here.
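A sketch of the recipe that would produce the printout below; the `fwi ~ .` formula and training-set name are assumptions:

```r
# Recipe mirroring the printed steps: filter, transform, scale, encode
algeria_rec <- recipe(fwi ~ ., data = algeria_train) |>
  step_zv(all_numeric_predictors()) |>       # drop zero-variance columns
  step_nzv(all_numeric_predictors()) |>      # drop sparse/unbalanced columns
  step_YeoJohnson(all_numeric_predictors()) |> # reduce skewness
  step_scale(all_numeric_predictors()) |>    # put predictors on one scale
  step_dummy(all_factor_predictors())        # one-hot encode factors
```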
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 14
── Operations
• Zero variance filter on: all_numeric_predictors()
• Sparse, unbalanced variable filter on: all_numeric_predictors()
• Yeo-Johnson transformation on: all_numeric_predictors()
• Scaling for: all_numeric_predictors()
• Dummy variables from: all_factor_predictors()
The data after undergoing feature engineering is shown in Table 1:
day | month | temperature | rh | ws | rain | ffmc | dmc | dc | isi | bui | fwi | region_Sidi.Bel.Abbes | classes_fire |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1.9099765 | 6.257817 | 4.469877 | 5.018083 | 16.04237 | 1.5146337 | 0.6112248 | 1.971387 | 3.257687 | 1.004836 | 2.219341 | 0.5 | 0 | 0 |
3.5100906 | 8.045765 | 2.975905 | 3.822778 | 14.61728 | 0.9683845 | 0.9364749 | 1.666566 | 3.227303 | 1.068358 | 2.007790 | 0.5 | 1 | 0 |
1.7969550 | 6.257817 | 5.622383 | 3.605656 | 13.78430 | 1.8495797 | 1.3793852 | 2.404867 | 3.562018 | 1.187334 | 2.602155 | 0.8 | 0 | 0 |
3.0974426 | 5.363843 | 4.746365 | 3.822778 | 15.71108 | 0.0000000 | 2.6879614 | 3.413360 | 4.732499 | 2.802260 | 3.732665 | 10.6 | 0 | 1 |
0.7144441 | 6.257817 | 5.322644 | 3.968818 | 14.21285 | 0.0000000 | 2.5741665 | 2.375816 | 3.764466 | 2.406109 | 2.692231 | 4.9 | 0 | 1 |
1.0917983 | 7.151791 | 6.567699 | 3.249173 | 12.84074 | 0.0000000 | 2.7659263 | 2.826989 | 3.433657 | 2.499136 | 2.910182 | 5.9 | 1 | 1 |
3.6120029 | 6.257817 | 6.567699 | 3.178716 | 14.61728 | 0.0000000 | 3.0242252 | 4.049493 | 4.665000 | 3.067498 | 4.080263 | 14.5 | 1 | 1 |
1.7969550 | 7.151791 | 6.567699 | 2.159044 | 13.78430 | 0.0000000 | 3.4069902 | 3.716532 | 4.469620 | 3.403472 | 3.773189 | 15.7 | 1 | 1 |
2.1330093 | 8.045765 | 5.622383 | 2.423705 | 13.32829 | 0.0000000 | 3.5003297 | 3.885782 | 4.603991 | 3.450648 | 3.980076 | 17.5 | 1 | 1 |
2.5687041 | 8.045765 | 5.929822 | 1.773226 | 15.36427 | 0.0000000 | 3.4534197 | 3.794553 | 5.041329 | 3.724684 | 4.130098 | 21.6 | 1 | 1 |
In the models specified earlier we have one to two parameters to tune. These are the parameters with tune() as their placeholder value.
We set the control to save the predictions and the workflow.
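A sketch of that control object; `grid_control` is the name passed to tune_grid() below:

```r
# Keep out-of-sample predictions and the workflow with the tuning results
grid_control <- control_grid(
  save_pred = TRUE,
  save_workflow = TRUE
)
```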
A workflow object will be created for each model specification. Below is a breakdown of the process, from model specification to feature engineering, tied together.
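The three workflows can be built like this sketch, bundling the shared recipe with each specification (names as used in the tuning code):

```r
# One workflow per regularized model, all sharing the same recipe
elastic_wf <- workflow() |>
  add_recipe(algeria_rec) |>
  add_model(elastic_spec)

lasso_wf <- workflow() |>
  add_recipe(algeria_rec) |>
  add_model(lasso_spec)

ridge_wf <- workflow() |>
  add_recipe(algeria_rec) |>
  add_model(ridge_spec)
```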
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_zv()
• step_nzv()
• step_YeoJohnson()
• step_scale()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Main Arguments:
penalty = tune()
mixture = tune()
Computational engine: glmnet
Now we tune the parameter(s) of each model.
elastic_tune <- tune_grid(
elastic_wf,
resamples = algeria_bstrap,
grid = tune_elastic,
control = grid_control
)
lasso_tune <- tune_grid(
lasso_wf,
resamples = algeria_bstrap,
grid = tune_lasso,
control = grid_control
)
ridge_tune <- tune_grid(
ridge_wf,
resamples = algeria_bstrap,
grid = tune_ridge,
control = grid_control
)
Let’s see the performance of the regularized parameters.
elastic_tune |>
collect_metrics() |>
mutate(
model = "elastic"
) |>
bind_rows(
lasso_tune |>
collect_metrics() |>
mutate(
model = "lasso"
)
) |>
bind_rows(
ridge_tune |>
collect_metrics() |>
mutate(
model = "ridge"
)
) |>
ggplot(aes(penalty, mean, color = model)) +
geom_point() +
geom_smooth(
se = FALSE,
method = "loess",
formula = "y ~ x"
) +
facet_wrap(~.metric, nrow = 2, scales = "free") +
theme_bw()
As shown in Figure 1, the elastic-net regularized model performed the best. We can pick the best model from this and get the final model.
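Selecting and finalizing might look like this sketch; `final_wf` is an assumed name, chosen to produce the finalized workflow printed below:

```r
# Pick the best penalty/mixture combination by RMSE, then lock it in
best_elastic <- select_best(elastic_tune, metric = "rmse")
final_wf <- finalize_workflow(elastic_wf, best_elastic)
final_wf
```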
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_zv()
• step_nzv()
• step_YeoJohnson()
• step_scale()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Main Arguments:
penalty = 0.0562341325190349
mixture = 0.0895833333333333
Computational engine: glmnet
The final workflow above shows the parameters fitted accordingly; note that the tune() placeholders from Section 2.4 have been replaced with the selected values. Before we finally fit the model to the whole data, we can investigate the most important variables.
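One way to inspect importance, sketched here under the assumption that coefficient magnitude at the selected penalty is used as the importance measure:

```r
# Coefficients of the finalized glmnet fit, largest magnitude first
final_wf |>
  fit(algeria_train) |>
  extract_fit_parsnip() |>
  tidy() |>
  arrange(desc(abs(estimate)))
```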
Figure 2 shows the effect of the penalty, aka λ, as it ranges from 0 to infinity.
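The test-set metrics below can be produced with a final fit on the original split; object names (`final_wf`, `algeria_split`) are assumptions carried over from the sketches above:

```r
# Fit on the training set and evaluate once on the held-out test set
final_res <- last_fit(final_wf, algeria_split)
collect_metrics(final_res)
```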
.metric | .estimator | .estimate | .config |
---|---|---|---|
rmse | standard | 3.3764246 | Preprocessor1_Model1 |
rsq | standard | 0.8563439 | Preprocessor1_Model1 |
With an RMSE of 3.38, the model's prediction of the forest fire weather index is reliable, and the model explains about 86% of the variance in the data.
This project compared three regularized models and used them as an alternative to creating a model with a PCA preprocessing step, as regularized models penalize the coefficients of features, pushing them towards zero or making them exactly zero (lasso regression). This helped minimize the impact of the multicollinearity present in the data.
I tried using workflow_map() to tune the three models together, combined in a workflow_set(). When metrics were collected to evaluate the models, the tuning parameters were absent and only the results were returned. You can run the code below to confirm this.
algerian_wf_set <- workflow_set(
preproc = list(rec = algeria_rec),
models = list(
lasso = lasso_spec,
ridge = ridge_spec,
elastic = elastic_spec
),
cross = TRUE
) |>
option_add(
id = "rec_elastic",
grid = tune_elastic
) |>
option_add(
id = "rec_lasso",
grid = tune_lasso
) |>
option_add(
id = "rec_ridge",
grid = tune_ridge
)
tune_res <- workflow_map(
algerian_wf_set,
resamples = algeria_bstrap,
verbose = TRUE,
seed = 123,
fn = "tune_grid",
grid = 20,
control = grid_control
)
tune_res |>
collect_metrics()
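As a workaround sketch, the per-model tuning results (including the parameter columns) can still be pulled out of the workflow set individually with extract_workflow_set_result(); the `"rec_elastic"` id matches the one set via option_add() above:

```r
# Recover the full tuning result for one workflow in the set
tune_res |>
  extract_workflow_set_result("rec_elastic") |>
  collect_metrics()
```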