Predicting Algeria’s Forest Fire Weather Index (FWI)

Machine Learning
Logistic Regression
Regularization
Published

September 10, 2024

Introduction

This project revisits the prediction of fire occurrence in Algeria's forests with a little twist. Instead of predicting fire occurrence, this project predicts the forest Fire Weather Index (FWI). The algorithm also differs from the former project: that one used logistic regression with a principal component analysis (PCA) preprocessing step to reduce multicollinearity between the features, while this project uses regularized regression to predict the FWI.

Objective

The objectives of this project are:

  • Developing an FWI prediction model.
  • Evaluating which of the regularized regressions performs best (a significant difference in performance is not the goal, but rather getting the best performance).

Data

The data used for this project is exactly the same as the data used in the fire occurrence prediction project; check here for the data definition. For more understanding of the data and the correlation between the different variables, check that post, as this one dives straight into model development.

The processed and cleaned data is already available and will be imported. Before that, we load the necessary libraries for this analysis.

Show the code
pacman::p_load(nanoparquet, tidymodels, knitr, ggthemr)
ggthemr(palette = "earth", layout = "minimal")
Show the code
algeria_ff <- read_parquet("data/algeria.parquet")

Modelling

We will dive into model development straight away. We begin with splitting the data, creating resamples, and setting the model specifications, before feature engineering and finally model development.

Data Splitting

The data will be split in a 70-30 proportion: 70% for training and 30% for testing.

Show the code
set.seed(123)
algeria_split <- initial_split(algeria_ff, prop = .7, strata = fwi)
algeria_train <- training(algeria_split)

Due to the small size of the data (243 rows), a bootstrap resampling technique will be employed.

Show the code
set.seed(123)
algeria_bstrap <- bootstraps(algeria_train, strata = fwi)

Model Specification

We do not know the penalty to use for our regularized regressions, so we tune this parameter. The best mixture value for the elastic net is also unknown and will be tuned as well.
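For reference, in glmnet's formulation `penalty` corresponds to $\lambda$ and `mixture` to the elastic-net parameter $\alpha$, with the objective

$$
\min_{\beta_0,\, \beta} \; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]
$$

so `mixture = 1` gives the lasso, `mixture = 0` gives ridge, and values in between give the elastic net.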

Show the code
lasso_spec <- linear_reg(
  penalty = tune(),
  mixture = 1
) |> 
  set_engine("glmnet")

ridge_spec <- linear_reg(
  penalty = tune(),
  mixture = 0
) |> 
  set_engine("glmnet")

elastic_spec <- linear_reg(
  penalty = tune(),
  mixture = tune()
) |> 
  set_engine("glmnet")

Feature engineering

The preprocessing steps carried out before, except PCA, will be employed here.

Show the code
algeria_rec <- recipe(fwi ~ ., data = algeria_train) |> 
  step_zv(all_numeric_predictors()) |> 
  step_nzv(all_numeric_predictors()) |> 
  step_YeoJohnson(all_numeric_predictors()) |> 
  step_scale(all_numeric_predictors()) |> 
  step_dummy(all_factor_predictors())

algeria_rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 14
── Operations 
• Zero variance filter on: all_numeric_predictors()
• Sparse, unbalanced variable filter on: all_numeric_predictors()
• Yeo-Johnson transformation on: all_numeric_predictors()
• Scaling for: all_numeric_predictors()
• Dummy variables from: all_factor_predictors()

The data after undergoing feature engineering is shown in Table 1:

Show the code
set.seed(123)
algeria_rec |> 
  prep() |> 
  juice() |> 
  car::some() |> 
  kable()
Table 1: Data preview after preprocessing
day month temperature rh ws rain ffmc dmc dc isi bui fwi region_Sidi.Bel.Abbes classes_fire
1.9099765 6.257817 4.469877 5.018083 16.04237 1.5146337 0.6112248 1.971387 3.257687 1.004836 2.219341 0.5 0 0
3.5100906 8.045765 2.975905 3.822778 14.61728 0.9683845 0.9364749 1.666566 3.227303 1.068358 2.007790 0.5 1 0
1.7969550 6.257817 5.622383 3.605656 13.78430 1.8495797 1.3793852 2.404867 3.562018 1.187334 2.602155 0.8 0 0
3.0974426 5.363843 4.746365 3.822778 15.71108 0.0000000 2.6879614 3.413360 4.732499 2.802260 3.732665 10.6 0 1
0.7144441 6.257817 5.322644 3.968818 14.21285 0.0000000 2.5741665 2.375816 3.764466 2.406109 2.692231 4.9 0 1
1.0917983 7.151791 6.567699 3.249173 12.84074 0.0000000 2.7659263 2.826989 3.433657 2.499136 2.910182 5.9 1 1
3.6120029 6.257817 6.567699 3.178716 14.61728 0.0000000 3.0242252 4.049493 4.665000 3.067498 4.080263 14.5 1 1
1.7969550 7.151791 6.567699 2.159044 13.78430 0.0000000 3.4069902 3.716532 4.469620 3.403472 3.773189 15.7 1 1
2.1330093 8.045765 5.622383 2.423705 13.32829 0.0000000 3.5003297 3.885782 4.603991 3.450648 3.980076 17.5 1 1
2.5687041 8.045765 5.929822 1.773226 15.36427 0.0000000 3.4534197 3.794553 5.041329 3.724684 4.130098 21.6 1 1

Tune Grid

In the models specified earlier we have one to two parameters to tune. These are the parameters set to tune().

Show the code
set.seed(123)

tune_elastic <- extract_parameter_set_dials(elastic_spec) |> 
  grid_regular(
    levels = 25
  )

tune_lasso <- extract_parameter_set_dials(lasso_spec) |> 
  grid_regular(levels =  20)

tune_ridge <- extract_parameter_set_dials(ridge_spec) |> 
  grid_random(size = 20)

We set the control to save the predictions and the workflow.

Show the code
grid_control <- control_grid(
  save_pred = TRUE,
  save_workflow = TRUE
)

Workflow

A workflow object will be created for each model specification.

Show the code
elastic_wf <- workflow() |> 
  add_recipe(algeria_rec) |> 
  add_model(elastic_spec)

lasso_wf <- workflow() |> 
  add_recipe(algeria_rec) |> 
  add_model(lasso_spec)

ridge_wf <- workflow() |> 
  add_recipe(algeria_rec) |> 
  add_model(ridge_spec)

Below is a breakdown of the workflow, with the feature engineering recipe and the model specification tied together.

Show the code
elastic_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_zv()
• step_nzv()
• step_YeoJohnson()
• step_scale()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = tune()

Computational engine: glmnet 

Tuning

Now we tune the parameter(s) of each model.

Show the code
elastic_tune <- tune_grid(
  elastic_wf,
  resamples = algeria_bstrap,
  grid = tune_elastic,
  control = grid_control
)

lasso_tune <- tune_grid(
  lasso_wf,
  resamples = algeria_bstrap,
  grid = tune_lasso,
  control = grid_control
)

ridge_tune <- tune_grid(
  ridge_wf,
  resamples = algeria_bstrap,
  grid = tune_ridge,
  control = grid_control
)

Tune Performance

Let's see how the models perform across the tuned regularization parameters.

Show the code
elastic_tune |> 
  collect_metrics() |>
  mutate(
    model = "elastic"
  ) |> 
  bind_rows(
    lasso_tune |> 
      collect_metrics() |> 
      mutate(
        model = "lasso"
      )
  ) |> 
  bind_rows(
    ridge_tune |> 
      collect_metrics() |> 
      mutate(
        model = "ridge"
      )
  ) |> 
  ggplot(aes(penalty, mean, color = model)) +
  geom_point() +
  geom_smooth(
    se = FALSE,
    method = "loess",
    formula = "y ~ x"
  ) + 
  facet_wrap(~.metric, nrow = 2, scales = "free") +
  theme_bw()
Figure 1: Regularization Performance

Final workflow

As shown in Figure 1, the elastic-net regularized model performed best. We can pick the best parameter combination from it and finalize the workflow.

Show the code
best_tune <- elastic_tune |> 
  select_best(metric = "rmse")

final_wf <- finalize_workflow(
  x = elastic_wf,
  parameters = best_tune
)  

final_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_zv()
• step_nzv()
• step_YeoJohnson()
• step_scale()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.0562341325190349
  mixture = 0.0895833333333333

Computational engine: glmnet 

Feature Importance

In the final workflow above, the tune() placeholders from the model specification have been replaced with the selected parameter values. Before we finally fit the model to the whole training data, we can investigate the most important variables.

Show the code
library(vip)

Show the code
vip(final_wf |> 
      fit(algeria_train))

Final Fit

Show the code
last_fit <- final_wf |> 
  last_fit(split = algeria_split)

Figure 2 shows how the coefficients change as the penalty, aka 𝜆, grows from 0 to infinity.

Show the code
additional_colors <- c("#af4242", "#535364", "#FFC300", "#e09263", "#123367")
set_swatch(c(unique(swatch()), additional_colors))


extract_fit_parsnip(last_fit) |> 
  autoplot() +
  labs(
    x = "Lambda",
    y = "Coefficients"
  ) +
  theme_bw()
Figure 2: Coefficients for our elastic-net regression model as 𝜆 grows from 0 → ∞

Model Evaluation

Show the code
last_fit |> 
  collect_metrics() |> 
  kable()
.metric .estimator .estimate .config
rmse standard 3.3764246 Preprocessor1_Model1
rsq standard 0.8563439 Preprocessor1_Model1

With an RMSE of 3.38, the model's prediction of the forest Fire Weather Index is reliable, and the model also explains about 86% of the variance in the data.
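As a sketch of how the finalized model could be used, the fitted workflow can be extracted from the last_fit object and used to predict the FWI for unseen observations; new_obs below is a hypothetical data frame with the same predictor columns as algeria_train.

```r
# Extract the workflow fitted on the training data
final_model <- extract_workflow(last_fit)

# Predict FWI for new observations (new_obs is hypothetical,
# with the same predictor columns as the training data)
predict(final_model, new_data = new_obs)
```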

Conclusion

This project compared three regularized regression models as an alternative to building a model with a PCA preprocessing step. Regularized models penalize the coefficients of features, shrinking them towards zero or, in the case of lasso regression, making them exactly zero. This helped minimize the impact of the multicollinearity present in the data.

Reflection

I tried using workflow_map() to tune the three models together, combined in a workflow_set(). When metrics were collected to evaluate the models, the tuning parameters were absent and only the results were returned. You can run the code below to confirm this.

Show the code
algerian_wf_set <- workflow_set(
  preproc = list(rec = algeria_rec),
  models = list(
    lasso = lasso_spec,
    ridge = ridge_spec,
    elastic = elastic_spec
  ),
  cross = TRUE
 ) |>
  option_add(
     id = "rec_elastic",
     grid = tune_elastic
  ) |> 
  option_add(
    id = "rec_lasso",
    grid = tune_lasso
  ) |> 
  option_add(
    id = "rec_ridge",
    grid = tune_ridge
  )

tune_res <- workflow_map(
  algerian_wf_set,
  resamples = algeria_bstrap,
  verbose = TRUE,
  seed = 123,
  fn = "tune_grid",
  grid = 20,
  control = grid_control
)

tune_res |> 
  collect_metrics()