Sugi or Hinoki?

Forest Type Classification using K-Nearest Neighbors

Machine Learning
Nearest Neighbor
Author

Olamide Adu

Published

November 10, 2024

For this project, I will be using the K-Nearest Neighbors (KNN) machine learning algorithm to predict forest types using the Forest Type Mapping data from the UC Irvine ML Repository. The data are multi-temporal remote sensing data of a forested area in Japan, collected with ASTER satellite imagery. Using these spectral data, the forest was mapped into four classes: Sugi ('s'), Hinoki ('h'), mixed deciduous ('d'), and other ('o').

Load Packages and Data

To begin, I will load the necessary packages and the dataset.

Show the code
pacman::p_load(tidyverse, tidymodels, ggthemr)
ggthemr(palette = "flat dark")
Show the code
forest_type_training <- read_csv("data/training.csv")
forest_type_test <- read_csv("data/testing.csv")

Data Summary

An initial exploration and preview is necessary to understand the data.

Show the code
skimr::skim(forest_type_training)
Table 1: Data Summary
(a)
Name forest_type_training
Number of rows 198
Number of columns 28
_______________________
Column type frequency:
character 1
numeric 27
________________________
Group variables None

Variable type: character

(b)
skim_variable n_missing complete_rate min max empty n_unique whitespace
class 0 1 1 1 0 4 0

Variable type: numeric

(c)
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
b1 0 1 62.95 12.78 34.00 54.00 60.00 70.75 105.00 ▂▇▃▂▁
b2 0 1 41.02 17.83 25.00 28.00 31.50 50.75 160.00 ▇▂▁▁▁
b3 0 1 63.68 17.31 47.00 52.00 57.00 69.00 196.00 ▇▂▁▁▁
b4 0 1 101.41 14.80 54.00 92.25 99.50 111.75 172.00 ▁▇▆▁▁
b5 0 1 58.73 12.39 44.00 49.00 55.00 65.00 98.00 ▇▅▂▁▁
b6 0 1 100.65 11.19 84.00 92.00 98.00 107.00 136.00 ▇▇▅▂▁
b7 0 1 90.60 15.59 54.00 80.00 91.00 101.00 139.00 ▂▆▇▃▁
b8 0 1 28.69 8.98 21.00 24.00 25.00 27.00 82.00 ▇▁▁▁▁
b9 0 1 61.12 9.79 50.00 55.00 58.00 63.00 109.00 ▇▂▁▁▁
pred_minus_obs_H_b1 0 1 50.82 12.84 7.66 40.67 53.03 59.92 83.32 ▁▃▅▇▁
pred_minus_obs_H_b2 0 1 9.81 18.01 -112.60 0.27 18.80 22.26 29.79 ▁▁▁▃▇
pred_minus_obs_H_b3 0 1 32.54 17.74 -106.12 27.20 37.61 43.33 55.97 ▁▁▁▂▇
pred_minus_obs_H_b4 0 1 -3.90 15.08 -77.01 -15.92 -2.18 6.66 40.82 ▁▁▆▇▁
pred_minus_obs_H_b5 0 1 -33.42 12.30 -73.29 -39.76 -29.16 -23.89 -19.49 ▁▁▂▃▇
pred_minus_obs_H_b6 0 1 -40.45 10.87 -76.09 -46.16 -37.51 -32.94 -25.68 ▁▁▃▆▇
pred_minus_obs_H_b7 0 1 -13.91 15.40 -62.74 -23.59 -14.84 -3.25 24.33 ▁▂▇▅▂
pred_minus_obs_H_b8 0 1 1.01 9.07 -52.00 1.98 4.14 5.50 10.83 ▁▁▁▁▇
pred_minus_obs_H_b9 0 1 -5.59 9.77 -53.53 -6.63 -2.26 0.25 5.74 ▁▁▁▂▇
pred_minus_obs_S_b1 0 1 -20.04 4.95 -32.95 -23.33 -20.02 -17.79 5.13 ▂▇▂▁▁
pred_minus_obs_S_b2 0 1 -1.01 1.78 -8.80 -1.86 -0.97 -0.04 12.46 ▁▇▃▁▁
pred_minus_obs_S_b3 0 1 -4.36 2.35 -11.21 -5.79 -4.35 -2.88 7.37 ▁▇▆▁▁
pred_minus_obs_S_b4 0 1 -21.00 6.49 -40.37 -24.09 -20.46 -17.95 1.88 ▁▃▇▁▁
pred_minus_obs_S_b5 0 1 -0.97 0.70 -3.27 -1.29 -0.94 -0.64 3.44 ▁▇▂▁▁
pred_minus_obs_S_b6 0 1 -4.60 1.74 -8.73 -5.75 -4.54 -3.62 3.94 ▂▇▃▁▁
pred_minus_obs_S_b7 0 1 -18.84 5.25 -34.14 -22.24 -19.20 -16.23 3.67 ▁▇▆▁▁
pred_minus_obs_S_b8 0 1 -1.57 1.81 -8.87 -2.37 -1.42 -0.66 8.84 ▁▅▇▁▁
pred_minus_obs_S_b9 0 1 -4.16 1.98 -10.83 -5.12 -4.12 -3.11 7.79 ▁▇▃▁▁

The result in Table 1 (a) shows that the data has one character variable, our target variable class, while the rest are numeric. There are no missing values for the target variable, Table 1 (b), or for the features, Table 1 (c).

Exploratory Data Analysis

A quick check on the number of occurrences of each forest class is given below.

Show the code
forest_type_training |> 
  mutate(
    class = case_when(
      class == "d" ~ "Mixed",
      class == "s" ~ "Sugi",
      class == "h" ~ "Hinoki",
      .default = "Others"
    )
  ) |> 
  ggplot(aes(fct_infreq(class))) +
  geom_bar(
    col = "gray90",
    fill = "tomato4"
  ) +
  geom_text(
    aes(
      y = after_stat(count),
      label = after_stat(count)),
    stat = "count",
    vjust = -.4
  ) +
  labs(
    x = "Forest Class",
    y = "Count",
    title = "Frequency of The Forest Classes"
  ) +
  expand_limits(y = c(0, 65)) +
  guides(fill = "none")
Figure 1: Frequency of Forest Classes

Furthermore, Figure 2 shows the correlation heatmap of the numeric variables.

Show the code
forest_type_training |> 
  select(where(is.double)) |> 
  cor() |> 
  as.data.frame() |> 
  rownames_to_column() |> 
  pivot_longer(
    cols = b1:pred_minus_obs_S_b9,
    names_to = "variables",
    values_to = "value"
  ) |> 
  ggplot(aes(rowname, variables, fill = value)) +
  geom_tile() +
  scale_fill_distiller() +
  theme(
    axis.text.x = element_text(angle = 90),
    axis.title = element_blank()
  )
Figure 2: Correlation Matrix

Modeling

Since the data comes already split into training and testing sets, I will proceed with resampling for model evaluation.

Show the code
ft_folds <- vfold_cv(forest_type_training, v = 10, strata = class)
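
As a quick sanity check, the folds can be inspected to confirm that stratification keeps the class proportions roughly stable. This is a minimal sketch using rsample's assessment() accessor on the first fold; it is not part of the modeling pipeline itself.

# Class proportions in the assessment set of the first fold;
# stratified sampling should keep these close to the overall proportions
ft_folds$splits[[1]] |> 
  assessment() |> 
  count(class) |> 
  mutate(prop = n / sum(n))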

Model Specification

A classification model specification will be set up using parsnip’s nearest_neighbor() function.

Show the code
ft_knn <- nearest_neighbor(
  dist_power = tune(),
  neighbors = tune()
) |> 
  set_mode("classification") |> 
  set_engine("kknn")

ft_knn |> translate()
K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  neighbors = tune()
  dist_power = tune()

Computational engine: kknn 

Model fit template:
kknn::train.kknn(formula = missing_arg(), data = missing_arg(), 
    ks = min_rows(tune(), data, 5), distance = tune())

Feature Engineering

K-NN requires careful preprocessing because distance measures such as the Euclidean distance are sensitive to the scale of the features and to outliers. To prevent features measured on large scales from contributing the most to the distance between samples, I will use the following preprocessing steps (a small sketch of the scale effect follows the list):

  • remove near zero or zero variance features
  • normalize, center and scale numeric predictors
  • perform PCA to decorrelate features
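
To illustrate the scale problem, here is a minimal sketch comparing the distance between the first two training samples before and after standardization, using b2 and pred_minus_obs_S_b5 as an arbitrary pair of features on very different scales:

# b2 spans roughly 25-160 while pred_minus_obs_S_b5 spans roughly -3 to 3
raw_pair <- forest_type_training |> 
  select(b2, pred_minus_obs_S_b5)

# On the raw scale, b2 dominates the distance almost entirely
dist(raw_pair[1:2, ])

# After centering and scaling, both features contribute comparably
scaled_pair <- raw_pair |> 
  mutate(across(everything(), ~ as.numeric(scale(.x))))

dist(scaled_pair[1:2, ])
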
Show the code
ft_rec <- recipe(
  class ~ .,
  data = forest_type_training
) |> 
  step_nzv(all_numeric_predictors()) |>  # drop near-zero-variance features
  step_zv(all_numeric_predictors()) |>   # drop zero-variance features
  # step_normalize() already centers and scales, so the explicit
  # step_center()/step_scale() below are redundant but harmless
  step_normalize(all_numeric_predictors()) |> 
  step_center(all_numeric_predictors()) |> 
  step_scale(all_numeric_predictors()) |> 
  step_pca(all_numeric_predictors())     # decorrelate via PCA (keeps 5 components by default)

ft_rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 27
── Operations 
• Sparse, unbalanced variable filter on: all_numeric_predictors()
• Zero variance filter on: all_numeric_predictors()
• Centering and scaling for: all_numeric_predictors()
• Centering for: all_numeric_predictors()
• Scaling for: all_numeric_predictors()
• PCA extraction with: all_numeric_predictors()

The preprocessed data can be seen in Table 2. The features have been reduced to five principal components, excluding the target variable.

Show the code
ft_rec |> 
  prep() |> 
  juice() |> 
  head(n = 50)
Table 2
# A tibble: 50 × 6
   class    PC1    PC2    PC3    PC4     PC5
   <fct>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
 1 d     -1.46  3.04   -0.620  1.95  -0.0355
 2 h     -1.81  2.03    4.13   1.68   1.63  
 3 s     -2.56  0.511  -0.532  1.44   1.14  
 4 s     -3.55  3.25    2.00   1.90   1.52  
 5 d      0.561 1.25    0.772  0.684 -1.70  
 6 h     -1.30  1.79    3.50  -2.95   0.256 
 7 s     -2.26  0.0430 -1.98  -1.49   0.125 
 8 d     -1.71  3.19   -1.26   1.92  -0.194 
 9 s     -3.89  4.19   -0.774 -2.46   0.639 
10 o      5.20  3.27   -1.36   0.610  1.02  
# ℹ 40 more rows
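
It is worth knowing how much variance those five components retain. As a quick sketch, the check below runs prcomp() directly on the standardized numeric predictors, which mirrors what step_pca() does inside the recipe (assuming the variance filters drop no columns):

# Proportion of variance captured by the first five principal components
pca_check <- forest_type_training |> 
  select(where(is.double)) |> 
  prcomp(center = TRUE, scale. = TRUE)

summary(pca_check)$importance[, 1:5]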

Workflow

Next, to streamline the whole process, the model specification and the recipe (preprocessing) object are combined into a workflow so that they run together.

Show the code
ft_wf <- workflow() |> 
  add_model(ft_knn) |> 
  add_recipe(ft_rec)

ft_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps

• step_nzv()
• step_zv()
• step_normalize()
• step_center()
• step_scale()
• step_pca()

── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  neighbors = tune()
  dist_power = tune()

Computational engine: kknn 

Tuning

Tune Grid

To get the best value for k, that is, the number of neighbors, and the right distance metric, both parameters are tuned. The distance power of the Minkowski distance determines whether a Manhattan distance (power = 1), a Euclidean distance (power = 2), or something in between is optimal.
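
As a small illustration of what the distance power controls, the sketch below computes the Minkowski distance between two toy points for a few powers; this is purely didactic and separate from the tuning grid that follows.

# Minkowski distance between (0, 0) and (3, 4) for different powers
pts <- rbind(c(0, 0), c(3, 4))

dist(pts, method = "minkowski", p = 1)    # Manhattan: 3 + 4 = 7
dist(pts, method = "minkowski", p = 2)    # Euclidean: sqrt(9 + 16) = 5
dist(pts, method = "minkowski", p = 1.5)  # an intermediate Minkowski distance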

Show the code
set.seed(2323)
ft_grid <- ft_knn |> 
  extract_parameter_set_dials() |> 
  grid_regular(levels = 15)

ft_grid
# A tibble: 225 × 2
   neighbors dist_power
       <int>      <dbl>
 1         1        0.1
 2         2        0.1
 3         3        0.1
 4         4        0.1
 5         5        0.1
 6         6        0.1
 7         7        0.1
 8         8        0.1
 9         9        0.1
10        10        0.1
# ℹ 215 more rows

Parameter tuning

Below, the tuning is done on the resamples; the collected metrics are shown here and visualized in Figure 3.

Show the code
ft_tune <- 
  ft_wf |> 
  tune_grid(
    resamples = ft_folds,
    # grid = 10 asks tune_grid() to build a 10-candidate grid itself;
    # the regular ft_grid created above could be supplied instead (grid = ft_grid)
    grid = 10,
    control = control_grid(save_pred = TRUE, save_workflow = TRUE)
  )

ft_tune |> 
  collect_metrics()
# A tibble: 30 × 8
   neighbors dist_power .metric     .estimator   mean     n std_err .config     
       <int>      <dbl> <chr>       <chr>       <dbl> <int>   <dbl> <chr>       
 1         4      0.1   accuracy    multiclass 0.891     10 0.0305  Preprocesso…
 2         4      0.1   brier_class multiclass 0.0920    10 0.0183  Preprocesso…
 3         4      0.1   roc_auc     hand_till  0.972     10 0.0107  Preprocesso…
 4         8      0.311 accuracy    multiclass 0.911     10 0.0262  Preprocesso…
 5         8      0.311 brier_class multiclass 0.0676    10 0.0121  Preprocesso…
 6         8      0.311 roc_auc     hand_till  0.986     10 0.00560 Preprocesso…
 7        13      0.522 accuracy    multiclass 0.908     10 0.0218  Preprocesso…
 8        13      0.522 brier_class multiclass 0.0628    10 0.0111  Preprocesso…
 9        13      0.522 roc_auc     hand_till  0.987     10 0.00513 Preprocesso…
10         1      0.733 accuracy    multiclass 0.939     10 0.0165  Preprocesso…
# ℹ 20 more rows

Tune Evaluation

Show the code
ft_tune |> 
  collect_metrics() |> 
  janitor::clean_names() |> 
  pivot_longer(
    cols = c(neighbors, dist_power),
    names_to = "params",
    values_to = "values"
  ) |> 
  ggplot(aes(values, mean, col = metric)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(0, 15, 1)) +
  facet_grid(metric~params, scales = "free")
Figure 3: Tune Result

The three evaluation metrics peak at different points: roc_auc and accuracy have their best mean at neighbors = 15 and a distance power of about 1.37, while the Brier score has its best value at neighbors = 5 and a distance power of about 0.94.
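
The same comparison can be made numerically; a quick sketch with tune's show_best() lists the top candidates per metric (the Brier score is better when lower, which show_best() accounts for):

# Top three candidates by ROC AUC
ft_tune |> 
  show_best(metric = "roc_auc", n = 3)

# Top three candidates by Brier score (lower is better)
ft_tune |> 
  show_best(metric = "brier_class", n = 3)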

Since I am interested in how well the model ranks the classes relative to one another, I will use roc_auc. Given that, the best parameters based on roc_auc will be fitted to the workflow.

Show the code
best_tune <- ft_tune |> 
  select_best(metric = "roc_auc")

ft_workflow <- ft_tune |> 
  extract_workflow()

best_tune |> 
  knitr::kable()
neighbors dist_power .config
15 1.366667 Preprocessor1_Model07
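
As an alternative selection rule, tune also provides select_by_one_std_err(); the sketch below would pick, among candidates whose roc_auc is within one standard error of the best, the one with the most neighbors (i.e. the smoothest fit). It is shown only for comparison with the select_best() choice above.

# A more conservative pick: the largest neighborhood whose roc_auc is
# within one standard error of the numerically best candidate
ft_tune |> 
  select_by_one_std_err(desc(neighbors), metric = "roc_auc")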

Final Model Fit

The best parameters are 15 neighbors with a Minkowski distance power of about 1.37, which lies between the Manhattan and Euclidean distances. The final workflow with the best parameters plugged in is given below:

Show the code
final_wf <- finalize_workflow(
  x = ft_workflow,
  parameters = best_tune
)


# Caution: this fits the workflow on the test set; the model used for
# evaluation below is refit on the training data
final_wf |> 
  fit(forest_type_test)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps

• step_nzv()
• step_zv()
• step_normalize()
• step_center()
• step_scale()
• step_pca()

── Model ───────────────────────────────────────────────────────────────────────

Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(15L,     data, 5), distance = ~1.36666666666667)

Type of response variable: nominal
Minimal misclassification: 0.2492308
Best kernel: optimal
Best k: 15

Next, we fit the model on the training data and see how well it performs on the test data.

Show the code
final_fit <- final_wf |> 
  fit(forest_type_training)

Model Evaluation

Now let’s evaluate how well the model performs. First, I predict class probabilities on the test data and bind them to the observed classes.

Show the code
model_res <- predict(final_fit, forest_type_test, type = "prob") |> 
  bind_cols(forest_type_test) |> 
  mutate(
    across(where(is.character), factor)
  )
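
Equivalently, augment() on the fitted workflow attaches both the hard class predictions and the class probabilities in one call; this sketch is an alternative to the separate predict()/bind_cols() steps used here and below for the confusion matrix:

# augment() appends .pred_class and the .pred_* probability columns
model_aug <- augment(final_fit, new_data = forest_type_test) |> 
  mutate(class = factor(class))

model_aug |> 
  select(class, .pred_class, starts_with(".pred_")) |> 
  head()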

Area Under the Curve (ROC AUC)

The model has a roc_auc of about 0.94 (Table 3); the ROC curve is shown in Figure 4.

Show the code
roc_auc(
  model_res,
  truth =  class,
  contains(".pred_")
) |> 
  knitr::kable()
Table 3: Evaluation metric of model
.metric .estimator .estimate
roc_auc hand_till 0.9392002
Show the code
roc_curve(
  model_res,
  truth = class,
  contains(".pred_")
) |> 
  autoplot()
Figure 4: Area Under the Curve

Next, I checked the confusion matrix, Table 4.

Show the code
predict(final_fit, forest_type_test) |> 
  bind_cols(forest_type_test) |> 
  mutate(class = factor(class)) |> 
  conf_mat(class, .pred_class)
Table 4: Confusion Matrix
          Truth
Prediction   d   h   o   s
         d  88   0  10  11
         h   1  31   0  11
         o   2   0  33   0
         s  14   7   3 114
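
The same conf_mat object can be turned into per-class and overall statistics; here is a short sketch using yardstick's summary() method (metrics such as sensitivity and specificity are macro-averaged across the four classes by default):

# Accuracy, kappa, sensitivity, specificity, etc. from the confusion matrix
predict(final_fit, forest_type_test) |> 
  bind_cols(forest_type_test) |> 
  mutate(class = factor(class)) |> 
  conf_mat(class, .pred_class) |> 
  summary()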

Conclusion

After fitting the final K-NN workflow on the training data and predicting on the test set, I calculated the ROC AUC to evaluate the model’s classification performance based on the predicted probabilities. The ROC AUC assesses how well the model discriminates between the different classes; for multi-class classification it is summarized across classes (here with the hand_till estimator) to give an overall measure of performance.

The plotted ROC curve visually represents the trade-off between the true positive rate (sensitivity) and the false positive rate at different threshold settings, giving further insight into how well the model separates the classes.

An AUC of about 0.94 indicates very good discriminatory power. Based on the ROC curve and the AUC score, the model is well suited to this classification task.