Forest Type Classification using K-Nearest Neighbors
Olamide Adu
November 10, 2024

For this project, I will use the K-Nearest Neighbors (KNN) machine learning algorithm to predict the types of forest in the forest type mapping data on the UC Irvine ML repository. The data are multi-temporal remote sensing data of a forested area in Japan, collected using ASTER satellite imagery. Using these spectral data, the forest was mapped into four classes: 's' (Sugi), 'h' (Hinoki), 'd' (mixed deciduous), and 'o' (other/non-forest land).

To begin, I load the necessary packages and the dataset.

pacman::p_load(tidyverse, tidymodels, ggthemr)
ggthemr(palette = "flat dark")
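The chunk that reads the data is not visible in the post; a minimal sketch, assuming the training.csv and testing.csv files from the UCI repository have been downloaded into a local data/ folder and that their column names match those used below:

# hypothetical file paths; adjust to wherever the UCI files were saved
forest_type_training <- read_csv("data/training.csv")
forest_type_testing  <- read_csv("data/testing.csv")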
Initial exploration and a preview are necessary to understand the data.
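The summary that follows has the shape of skimr output; a one-line sketch of the call that would produce it (assuming the skimr package is installed):

skimr::skim(forest_type_training)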
Table 1 (a): Data summary

| | |
|---|---|
| Name | forest_type_training |
| Number of rows | 198 |
| Number of columns | 28 |
| Column type frequency: character | 1 |
| Column type frequency: numeric | 27 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace | 
|---|---|---|---|---|---|---|---|
| class | 0 | 1 | 1 | 1 | 0 | 4 | 0 | 
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist | 
|---|---|---|---|---|---|---|---|---|---|---|
| b1 | 0 | 1 | 62.95 | 12.78 | 34.00 | 54.00 | 60.00 | 70.75 | 105.00 | ▂▇▃▂▁ | 
| b2 | 0 | 1 | 41.02 | 17.83 | 25.00 | 28.00 | 31.50 | 50.75 | 160.00 | ▇▂▁▁▁ | 
| b3 | 0 | 1 | 63.68 | 17.31 | 47.00 | 52.00 | 57.00 | 69.00 | 196.00 | ▇▂▁▁▁ | 
| b4 | 0 | 1 | 101.41 | 14.80 | 54.00 | 92.25 | 99.50 | 111.75 | 172.00 | ▁▇▆▁▁ | 
| b5 | 0 | 1 | 58.73 | 12.39 | 44.00 | 49.00 | 55.00 | 65.00 | 98.00 | ▇▅▂▁▁ | 
| b6 | 0 | 1 | 100.65 | 11.19 | 84.00 | 92.00 | 98.00 | 107.00 | 136.00 | ▇▇▅▂▁ | 
| b7 | 0 | 1 | 90.60 | 15.59 | 54.00 | 80.00 | 91.00 | 101.00 | 139.00 | ▂▆▇▃▁ | 
| b8 | 0 | 1 | 28.69 | 8.98 | 21.00 | 24.00 | 25.00 | 27.00 | 82.00 | ▇▁▁▁▁ | 
| b9 | 0 | 1 | 61.12 | 9.79 | 50.00 | 55.00 | 58.00 | 63.00 | 109.00 | ▇▂▁▁▁ | 
| pred_minus_obs_H_b1 | 0 | 1 | 50.82 | 12.84 | 7.66 | 40.67 | 53.03 | 59.92 | 83.32 | ▁▃▅▇▁ | 
| pred_minus_obs_H_b2 | 0 | 1 | 9.81 | 18.01 | -112.60 | 0.27 | 18.80 | 22.26 | 29.79 | ▁▁▁▃▇ | 
| pred_minus_obs_H_b3 | 0 | 1 | 32.54 | 17.74 | -106.12 | 27.20 | 37.61 | 43.33 | 55.97 | ▁▁▁▂▇ | 
| pred_minus_obs_H_b4 | 0 | 1 | -3.90 | 15.08 | -77.01 | -15.92 | -2.18 | 6.66 | 40.82 | ▁▁▆▇▁ | 
| pred_minus_obs_H_b5 | 0 | 1 | -33.42 | 12.30 | -73.29 | -39.76 | -29.16 | -23.89 | -19.49 | ▁▁▂▃▇ | 
| pred_minus_obs_H_b6 | 0 | 1 | -40.45 | 10.87 | -76.09 | -46.16 | -37.51 | -32.94 | -25.68 | ▁▁▃▆▇ | 
| pred_minus_obs_H_b7 | 0 | 1 | -13.91 | 15.40 | -62.74 | -23.59 | -14.84 | -3.25 | 24.33 | ▁▂▇▅▂ | 
| pred_minus_obs_H_b8 | 0 | 1 | 1.01 | 9.07 | -52.00 | 1.98 | 4.14 | 5.50 | 10.83 | ▁▁▁▁▇ | 
| pred_minus_obs_H_b9 | 0 | 1 | -5.59 | 9.77 | -53.53 | -6.63 | -2.26 | 0.25 | 5.74 | ▁▁▁▂▇ | 
| pred_minus_obs_S_b1 | 0 | 1 | -20.04 | 4.95 | -32.95 | -23.33 | -20.02 | -17.79 | 5.13 | ▂▇▂▁▁ | 
| pred_minus_obs_S_b2 | 0 | 1 | -1.01 | 1.78 | -8.80 | -1.86 | -0.97 | -0.04 | 12.46 | ▁▇▃▁▁ | 
| pred_minus_obs_S_b3 | 0 | 1 | -4.36 | 2.35 | -11.21 | -5.79 | -4.35 | -2.88 | 7.37 | ▁▇▆▁▁ | 
| pred_minus_obs_S_b4 | 0 | 1 | -21.00 | 6.49 | -40.37 | -24.09 | -20.46 | -17.95 | 1.88 | ▁▃▇▁▁ | 
| pred_minus_obs_S_b5 | 0 | 1 | -0.97 | 0.70 | -3.27 | -1.29 | -0.94 | -0.64 | 3.44 | ▁▇▂▁▁ | 
| pred_minus_obs_S_b6 | 0 | 1 | -4.60 | 1.74 | -8.73 | -5.75 | -4.54 | -3.62 | 3.94 | ▂▇▃▁▁ | 
| pred_minus_obs_S_b7 | 0 | 1 | -18.84 | 5.25 | -34.14 | -22.24 | -19.20 | -16.23 | 3.67 | ▁▇▆▁▁ | 
| pred_minus_obs_S_b8 | 0 | 1 | -1.57 | 1.81 | -8.87 | -2.37 | -1.42 | -0.66 | 8.84 | ▁▅▇▁▁ | 
| pred_minus_obs_S_b9 | 0 | 1 | -4.16 | 1.98 | -10.83 | -5.12 | -4.12 | -3.11 | 7.79 | ▁▇▃▁▁ | 
The result in Table 1 (a) shows that the data has one character variable, our target variable class, while the rest are numeric. There are no missing values for either the target variable, Table 1 (b), or the features, Table 1 (c).
A quick check on the number of occurrences of each forest class is given below.
forest_type_training |> 
  mutate(
    class = case_when(
      class == "d" ~ "Mixed",
      class == "s" ~ "Sugi",
      class == "h" ~ "Hinoki",
      .default = "Others"
    )
  ) |> 
  ggplot(aes(fct_infreq(class))) +
  geom_bar(
    col = "gray90",
    fill = "tomato4"
  ) +
  geom_text(
    aes(
      y = after_stat(count),
      label = after_stat(count)),
    stat = "count",
    vjust = -.4
  ) +
  labs(
    x = "Forest Class",
    y = "Count",
    title = "Frequency of The Forest Classes"
  ) +
  expand_limits(y = c(0, 65)) +
  guides(fill = "none")Furthermore, Figure 2 hows the heatmap of the numeric variables
forest_type_training |> 
  select(where(is.double)) |> 
  cor() |> 
  as.data.frame() |> 
  rownames_to_column() |> 
  pivot_longer(
    cols = b1:pred_minus_obs_S_b9,
    names_to = "variables",
    values_to = "value"
  ) |> 
  ggplot(aes(rowname, variables, fill = value)) +
  geom_tile() +
  scale_fill_distiller() +
  theme(
    axis.text.x = element_text(angle = 90),
    axis.title = element_blank()
  )

Since the data comes already split into training and testing sets, I will proceed with resampling for model evaluation.
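The resampling object itself is not shown; a minimal sketch, assuming 10-fold cross-validation stratified on class (consistent with the n = 10 column in the tuning results later) and the assumed object name ft_folds:

set.seed(123)  # assumed seed; the original is not shown
ft_folds <- vfold_cv(forest_type_training, v = 10, strata = class)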
A classification model specification will be set using parsnip’s nearest_neighbor() function.
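A sketch of the specification whose printout follows; the object name knn_spec is assumed:

knn_spec <- nearest_neighbor(
  neighbors = tune(),   # k, tuned on the resamples
  dist_power = tune()   # Minkowski distance power (1 = Manhattan, 2 = Euclidean)
) |> 
  set_engine("kknn") |> 
  set_mode("classification")

knn_spec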
K-Nearest Neighbor Model Specification (classification)
Main Arguments:
  neighbors = tune()
  dist_power = tune()
Computational engine: kknn 
Model fit template:
kknn::train.kknn(formula = missing_arg(), data = missing_arg(), 
    ks = min_rows(tune(), data, 5), distance = tune())
K-NN requires careful preprocessing because distance measures are sensitive to outliers and to the scale of the features. To prevent features measured on larger scales from dominating the distance between samples, I will use the following preprocessing steps:
ft_rec <- recipe(
  class ~ .,
  data = forest_type_training
) |> 
  step_nzv(all_numeric_predictors()) |>       # remove near-zero-variance predictors
  step_zv(all_numeric_predictors()) |>        # remove zero-variance predictors
  step_normalize(all_numeric_predictors()) |> # center and scale every numeric predictor
  step_center(all_numeric_predictors()) |>    # redundant: step_normalize() already centers
  step_scale(all_numeric_predictors()) |>     # redundant: step_normalize() already scales
  step_pca(all_numeric_predictors())          # reduce the predictors to principal components

ft_rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 27
── Operations 
• Sparse, unbalanced variable filter on: all_numeric_predictors()
• Zero variance filter on: all_numeric_predictors()
• Centering and scaling for: all_numeric_predictors()
• Centering for: all_numeric_predictors()
• Scaling for: all_numeric_predictors()
• PCA extraction with: all_numeric_predictors()
The preprocessed data can be seen in Table 2. The features have been reduced to five principal components, excluding the target variable.
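Table 2 looks like a preview of the baked recipe; a possible way to produce it (the 50-row preview suggests only a slice of the prepped training data is shown):

ft_rec |> 
  prep() |>                  # estimate the preprocessing steps on the training data
  bake(new_data = NULL) |>   # apply them back to the training data
  slice_head(n = 50)         # preview, matching the 50-row table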
# A tibble: 50 × 6
   class    PC1    PC2    PC3    PC4     PC5
   <fct>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
 1 d     -1.46  3.04   -0.620  1.95  -0.0355
 2 h     -1.81  2.03    4.13   1.68   1.63  
 3 s     -2.56  0.511  -0.532  1.44   1.14  
 4 s     -3.55  3.25    2.00   1.90   1.52  
 5 d      0.561 1.25    0.772  0.684 -1.70  
 6 h     -1.30  1.79    3.50  -2.95   0.256 
 7 s     -2.26  0.0430 -1.98  -1.49   0.125 
 8 d     -1.71  3.19   -1.26   1.92  -0.194 
 9 s     -3.89  4.19   -0.774 -2.46   0.639 
10 o      5.20  3.27   -1.36   0.610  1.02  
# ℹ 40 more rows
Next, to streamline the whole process, the model specification and the recipe (preprocessing object) are combined into a workflow so they run together.
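A sketch of the workflow that produces the printout below; the name ft_wf is assumed:

ft_wf <- workflow() |> 
  add_recipe(ft_rec) |> 
  add_model(knn_spec)

ft_wf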
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()
── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps
• step_nzv()
• step_zv()
• step_normalize()
• step_center()
• step_scale()
• step_pca()
── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (classification)
Main Arguments:
  neighbors = tune()
  dist_power = tune()
Computational engine: kknn 
To get the best value for k (the number of neighbors) and the right distance metric, both the neighbors and the Minkowski distance power are tuned; the distance power determines whether Manhattan (power = 1) or Euclidean (power = 2) distance is optimal.
Below, the tuning is done on the resamples and the results are displayed in Figure 3.
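The tuning call is not shown; a sketch under the assumption of a space-filling grid of 10 candidates (the 30 metric rows below correspond to 10 candidates times 3 metrics) and the three metrics that appear in the results:

set.seed(234)  # assumed seed
ft_tune <- tune_grid(
  ft_wf,
  resamples = ft_folds,
  grid = 10,                                            # 10 candidate combinations of neighbors and dist_power
  metrics = metric_set(accuracy, roc_auc, brier_class)  # metrics reported below
)

ft_tune |> collect_metrics()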
# A tibble: 30 × 8
   neighbors dist_power .metric     .estimator   mean     n std_err .config     
       <int>      <dbl> <chr>       <chr>       <dbl> <int>   <dbl> <chr>       
 1         1      0.733 accuracy    multiclass 0.940     10 0.0171  pre0_mod01_…
 2         1      0.733 brier_class multiclass 0.0598    10 0.0171  pre0_mod01_…
 3         1      0.733 roc_auc     hand_till  0.956     10 0.0132  pre0_mod01_…
 4         2      1.58  accuracy    multiclass 0.944     10 0.0122  pre0_mod02_…
 5         2      1.58  brier_class multiclass 0.0468    10 0.00995 pre0_mod02_…
 6         2      1.58  roc_auc     hand_till  0.982     10 0.00760 pre0_mod02_…
 7         4      0.1   accuracy    multiclass 0.836     10 0.0208  pre0_mod03_…
 8         4      0.1   brier_class multiclass 0.107     10 0.0137  pre0_mod03_…
 9         4      0.1   roc_auc     hand_till  0.966     10 0.00915 pre0_mod03_…
10         5      0.944 accuracy    multiclass 0.938     10 0.0177  pre0_mod04_…
# ℹ 20 more rows
ft_tune |> 
  collect_metrics() |> 
  janitor::clean_names() |> 
  pivot_longer(
    cols = c(neighbors, dist_power),
    names_to = "params",
    values_to = "values"
  ) |> 
  ggplot(aes(values, mean, col = metric)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(0, 15, 1)) +
  facet_grid(metric ~ params, scales = "free")
The three evaluation metrics peak at different points: roc_auc and accuracy have their best mean at neighbors = 11 and distance power = 2, while the Brier class score has its best value at neighbors = 7 and a distance power of about 1.79.
Since I am interested in how well the model ranks the classes, I will use roc_auc. Given that, we will fit the best parameters based on roc_auc to the workflow.
Euclidean distance (distance power = 2) with 11 neighbors is the best combination. The final workflow with the best parameters fitted is given below:
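A sketch of selecting the best parameters by roc_auc and refitting; the object names best_knn and ft_fit are assumed:

best_knn <- select_best(ft_tune, metric = "roc_auc")  # neighbors = 11, dist_power = 2

ft_fit <- ft_wf |> 
  finalize_workflow(best_knn) |> 
  fit(data = forest_type_training)

ft_fit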
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()
── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps
• step_nzv()
• step_zv()
• step_normalize()
• step_center()
• step_scale()
• step_pca()
── Model ───────────────────────────────────────────────────────────────────────
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(11L,     data, 5), distance = ~2)
Type of response variable: nominal
Minimal misclassification: 0.2523077
Best kernel: optimal
Best k: 11
Next, we fit the model on the training data and see how well it performs on the test data. To evaluate its performance, I first predict class probabilities on the test set and compute the ROC AUC.
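A sketch of the evaluation, assuming the test set is stored as forest_type_testing and the probability columns follow the class levels (.pred_d, .pred_h, .pred_o, .pred_s):

ft_test_pred <- augment(ft_fit, new_data = forest_type_testing) |> 
  mutate(class = factor(class))  # make the truth column a factor for yardstick

# Multiclass ROC AUC (hand_till estimator), Table 3
ft_test_pred |> 
  roc_auc(class, .pred_d, .pred_h, .pred_o, .pred_s)

# Per-class ROC curves, Figure 4
ft_test_pred |> 
  roc_curve(class, .pred_d, .pred_h, .pred_o, .pred_s) |> 
  autoplot()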
The model has a roc_auc of 0.93, Table 3; the roc_curve is shown in Figure 4.
| .metric | .estimator | .estimate | 
|---|---|---|
| roc_auc | hand_till | 0.9293994 | 
Next, I checked the confusion matrix, Table 4.
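A sketch of the confusion-matrix call, reusing the prediction tibble from the previous step:

ft_test_pred |> 
  conf_mat(truth = class, estimate = .pred_class)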
          Truth
Prediction   d   h   o   s
         d  89   0   9  13
         h   1  33   0  11
         o   3   0  34   0
         s  12   5   3 112
## Conclusion

After fitting the final K-NN workflow on the training data and predicting on the test set, we calculated the ROC AUC to evaluate the model’s classification performance based on the predicted probabilities. The ROC AUC helps assess how well the model discriminates between the different classes. For multi-class classification, the AUC can be computed for each class to assess overall performance.
The plotted ROC curve visually represents the trade-off between the true positive rate (sensitivity) and the false positive rate at different threshold settings, giving further insight into how well the model separates the classes.
An AUC of 0.93 indicates very good discriminatory power. Based on the ROC curve and AUC score, the model is well suited for this classification task.