The Algorithmic Grape

Decoding Wine Quality with Machine Learning

Machine Learning
Nearest Neighbor
Linear Regression
Random Forest
Decision Trees
Published

May 3, 2024

Introduction

wine qualities with chemicals


Wine appreciation is an art form enjoyed by many. However, beyond the subjective experience of taste, there lies a science behind wine quality. Winemakers strive to produce exceptional vintages by carefully controlling various factors during the production process. This project explores the fascinating world of wine quality, specifically focusing on the ability to predict it based on measurable chemical properties.

Objective

This project aims to develop a model that can estimate the quality of a wine using its chemical composition. By analyzing features such as acidity, residual sugar content, and sulfur dioxide levels, we aim to unlock valuable insights into the factors that contribute to a superior wine. This model can potentially benefit wine producers, retailers, and even consumers seeking a more informed approach to wine selection.

Data

Data Source

The data for this project was obtained from the UCI Machine Learning Repository, a renowned resource for publicly available datasets. This repository offers a wealth of datasets for various machine learning tasks, making it a valuable resource for researchers and data scientists.

The specific dataset we utilized is titled Wine Quality and focuses on Portuguese Vinho Verde wines.
Vinho wine logo

Vinho Verde is a light-bodied wine known for its crisp acidity and refreshing taste. The dataset encompasses two separate datasets, one for red wines and another for white wines. For this project, we chose to focus on the white wine dataset.

Data Definition

The dataset contains more than 4500 wine samples, each characterized by 11 chemical properties measured during the winemaking process. These features include factors like:

  • Fixed Acidity: Level of acidity due to tartaric acid.

  • Volatile Acidity: Level of acidity due to volatile acids like acetic acid.

  • Alcohol: Percentage of alcohol content by volume

  • Residual Sugar: Amount of remaining sugar after fermentation.

  • pH: Measure of acidity on a logarithmic scale.

  • Citric Acid: Minute quantity naturally present in grapes. Winemakers may add small amounts to increase tartness and prevent haze formation.

  • Chlorides: Level of chloride salts that can influence a wine’s saltiness and bitterness.

  • Free Sulfur Dioxide: Amount of unbound sulfur dioxide (SO2) gas, a preservative commonly used to prevent spoilage by bacteria and oxidation.

  • Total Sulfur Dioxide: The total level of SO2 gas, including both free and bound forms.

  • Density: The relative density of the wine compared to water. It can be an indicator of the wine’s alcohol content and sugar level.

  • Sulphates: The level of sulfate salts, which can influence a wine’s overall mouth-feel and perception of bitterness.

  • Quality: This is the target variable, a score between 0 and 10 representing the perceived sensory quality of the wine according to human experts. This score serves as the target variable for our machine learning model, allowing us to predict the quality of a new wine based on its chemical makeup.

Setting up the Analysis Environment

wine qualities with chemicals


To begin our analysis, we’ll import essential libraries using the pacman package for efficient dependency management.

Show the code
library(pacman)

p_load(tidyverse, tidymodels, gt, kableExtra, patchwork)

Exploring the Data

Next, we’ll import the wine quality dataset and get a glimpse of its contents. Here, we directly use the janitor package because it’s only needed for this specific step. This approach avoids unnecessary library loading throughout the document.

Show the code
wine_tbl <- read_delim("data/winequality-white.csv", delim = ";") |> 
  janitor::clean_names()

Data Inspection: A Crucial Step

A crucial step in data analysis, after defining our goals and collecting data, is to inspect its quality and structure. This involves examining the data to identify potential issues like missing values, inconsistencies, or formatting errors. Table 1 below gives a preview of our data.

Table 1: Data Preview
Data Preview: Wine Quality for White Wine
Acidity
Citric Acid Residual Sugar Chlorides
Sulfur dioxide
Density pH Sulphates Alcohol Quality
Fixed Volatile Free Total
7.0 0.27 0.36 20.7 0.045 45 170 1.0010 3.00 0.45 8.8 6
6.3 0.30 0.34 1.6 0.049 14 132 0.9940 3.30 0.49 9.5 6
8.1 0.28 0.40 6.9 0.050 30 97 0.9951 3.26 0.44 10.1 6
7.2 0.23 0.32 8.5 0.058 47 186 0.9956 3.19 0.40 9.9 6
7.2 0.23 0.32 8.5 0.058 47 186 0.9956 3.19 0.40 9.9 6
8.1 0.28 0.40 6.9 0.050 30 97 0.9951 3.26 0.44 10.1 6
Show the code
glimpse(wine_tbl)
Rows: 4,898
Columns: 12
$ fixed_acidity        <dbl> 7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, 6.3, 8.1,…
$ volatile_acidity     <dbl> 0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0…
$ citric_acid          <dbl> 0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16, 0.36, 0…
$ residual_sugar       <dbl> 20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.00, 20.70,…
$ chlorides            <dbl> 0.045, 0.049, 0.050, 0.058, 0.058, 0.050, 0.045, …
$ free_sulfur_dioxide  <dbl> 45, 14, 30, 47, 47, 30, 30, 45, 14, 28, 11, 17, 1…
$ total_sulfur_dioxide <dbl> 170, 132, 97, 186, 186, 97, 136, 170, 132, 129, 6…
$ density              <dbl> 1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951, 0…
$ p_h                  <dbl> 3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3…
$ sulphates            <dbl> 0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.47, 0.45, 0…
$ alcohol              <dbl> 8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11…
$ quality              <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 7, 5, 7, 6…

Using the glimpse function we can get a high-level overview of the wine quality data and the data follows the expected structure in the data definition. We have 4898 rows and 12 columns. All the columns are numeric variables.

Show the code
skimr::skim(wine_tbl) |> 
  select(-skim_type) |> 
  gt() |> 
  cols_label(
    skim_variable = "Variable",
    n_missing = "Missing Value",
    complete_rate = "Complete",
    numeric.mean = "Mean",
    numeric.sd = "Standard Deviation",
    numeric.p0 = "Min",
    numeric.p25 = "1st Percentile",
    numeric.p50 = "Median",
    numeric.p75 = "3rd Percentile",
    numeric.p100 = "Max",
    numeric.hist = "Histogram"
  ) |> 
  tab_header(
    title = "White Wine Quality Descriptive Statistics"
  ) |> 
  opt_stylize(
    style = 3,
    color = "gray"
  ) |> 
  as_raw_html()
Table 2: Wine data summary
White Wine Quality Descriptive Statistics
Variable Missing Value Complete Mean Standard Deviation Min 1st Percentile Median 3rd Percentile Max Histogram
fixed_acidity 0 1 6.85478767 0.843868228 3.80000 6.3000000 6.80000 7.3000 14.20000 ▁▇▁▁▁
volatile_acidity 0 1 0.27824112 0.100794548 0.08000 0.2100000 0.26000 0.3200 1.10000 ▇▅▁▁▁
citric_acid 0 1 0.33419151 0.121019804 0.00000 0.2700000 0.32000 0.3900 1.66000 ▇▆▁▁▁
residual_sugar 0 1 6.39141486 5.072057784 0.60000 1.7000000 5.20000 9.9000 65.80000 ▇▁▁▁▁
chlorides 0 1 0.04577236 0.021847968 0.00900 0.0360000 0.04300 0.0500 0.34600 ▇▁▁▁▁
free_sulfur_dioxide 0 1 35.30808493 17.007137325 2.00000 23.0000000 34.00000 46.0000 289.00000 ▇▁▁▁▁
total_sulfur_dioxide 0 1 138.36065741 42.498064554 9.00000 108.0000000 134.00000 167.0000 440.00000 ▂▇▂▁▁
density 0 1 0.99402738 0.002990907 0.98711 0.9917225 0.99374 0.9961 1.03898 ▇▂▁▁▁
p_h 0 1 3.18826664 0.151000600 2.72000 3.0900000 3.18000 3.2800 3.82000 ▁▇▇▂▁
sulphates 0 1 0.48984688 0.114125834 0.22000 0.4100000 0.47000 0.5500 1.08000 ▃▇▂▁▁
alcohol 0 1 10.51426705 1.230620568 8.00000 9.5000000 10.40000 11.4000 14.20000 ▃▇▆▃▁
quality 0 1 5.87790935 0.885638575 3.00000 5.0000000 6.00000 6.0000 9.00000 ▁▅▇▃▁

Table 2 summarizes the key characteristics of our data. It reveals:

  • No Missing Values: There are no missing data points, ensuring a complete dataset for our analysis.

  • Descriptive Statistics: Table 2 presents various descriptive statistics for each variable. This includes the minimum and maximum values highlighting the data’s range. Additionally, the first and third quartile provides an insight of the distribution of values in the data. Finally the mean and standard deviation indicate the average and spread of each variables relative to their mean.

  • Outliers: Initial exploration of the data suggests the presence of potential outliers in several features. Specifically, the maximum values for variables fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulphur_dioxide, and total_sulfur_dioxide seems deviate significantly from the rest of the distribution.

The next step is to statistically confirm these observations, we can’t rely solely on inspecting the maximum values. Instead, we’ll employ appropriate outlier detection tests. Due to the fairly large data size we have, we can use either of:

  • Grubb’s test

  • Rosner’s test

These outlier detection tests help us statistically determine if the observed extreme values are likely outliers within the dataset. Both tests listed here are particularly useful for identifying outliers in a dataset that is approximately normally distributed.

Univariate Inspection

Dealing with Outliers

Before diving into formal tests, it’s important we get a sense of our data. We can visualize each variables of our data to inspect their distribution.

Visual Inspection

Show the code
wine_tbl_longer <- wine_tbl |> 
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "value"
  )

wine_tbl_longer |> 
  filter(variable != "quality") |> 
  ggplot(aes(sample = value)) +
  geom_qq(alpha = .2, col = "violetred4") +
  geom_qq_line(col = "gray3") +
  facet_wrap(~variable, scales = "free") +
  labs(
    x = "Theoretical distribution",
    y = "Sample",
    title = "Variables are not normally distributed"
  ) +
  theme_minimal()

wine_tbl_longer |> 
  filter(variable !="quality") |> 
  ggplot(aes(value)) +
  facet_wrap(~variable, scales = "free") +
  geom_histogram(fill = "burlywood", col = "black") +
  geom_density(
    stat = "bin",
    col = "violetred"
  ) +
  labs(
    x = "Value",
    y = "Frequency",
    title = "Variables are fairly normal"
  ) +
  theme_minimal()
(a) QQ plot of all the variables shows they are not normally distributed
(b) histogram of all variables showing tails
  1. shows the qqplot with variables like total and free sulful_dioxide and p_h approximately following normal distribution,(b) exposes the distribution of all variables which are mostly right-skewed.
Figure 1

Figure 1 shows some fairly normal data, but it won’t be bad to test for normality using Shapiro-wilk’s test. We can also use the qqPlot() function from the car package which I really like to test for the normality.

Shapiro Test

Show the code
normality_test <- wine_tbl |> 
  map(~tidy(shapiro.test(.))) |> 
  bind_rows(.id = "variable") |> 
  select(-method) |> 
  janitor::clean_names() |> 
  mutate(
    variable = str_to_title(str_replace_all(variable, "_", " "))
  )
actual_colnames <- colnames(normality_test)
desired_colnames <- str_to_title(names(normality_test))
names(desired_colnames) <- actual_colnames
  
normality_test |> 
  filter(variable != "Quality") |> 
  gt()  |> 
  cols_label(.list = desired_colnames) |> 
  tab_header(
    title = "Shapiro-wilk Normality Test"
  ) |> 
  opt_stylize(
    style = 3,
    color = "gray"
  ) |> 
  as_raw_html()
Table 3: Shapiro-wilk’s normality test: All varibles are not normally distributed
Shapiro-wilk Normality Test
Variable Statistic P_value
Fixed Acidity 0.9765615 1.150151e-27
Volatile Acidity 0.9045497 4.586797e-48
Citric Acid 0.9222473 1.013179e-44
Residual Sugar 0.8845686 2.820710e-51
Chlorides 0.5908084 2.140584e-75
Free Sulfur Dioxide 0.9420691 3.857845e-40
Total Sulfur Dioxide 0.9890146 4.383453e-19
Density 0.9548048 1.780895e-36
P H 0.9880965 6.505521e-20
Sulphates 0.9516094 1.821979e-37
Alcohol 0.9553024 2.569014e-36

Car’s qqPlot Test

Show the code
car::qqPlot(
  wine_tbl$p_h,
  ylab = paste0("pH"),
  main = "Quantile-quantile plot for pH variable"
)
car::qqPlot(
  wine_tbl$alcohol,
  ylab = paste0("pH"),
  main = "Quantile-quantile plot for alcohol variable"
)
[1] 1251 1256
[1] 3919 4504
(a) pH
(b) alcohol
Figure 2: Normality test with the car’s package qqPlot function

Table 3 shows all the variables do not follow a normal distribution, Figure 2 stresses this and we can clearly see the presence of outlier.

Where Domain Knowledge Shines

While our initial exploration identified potential outliers in the data, it’s important to consider them in the context of wine making expertise. Domain knowledge from experienced professionals like Cathy Howard suggests that the expected range for residual sugar in wine can vary widely, from as low as 0.3 g/L to over 35 g/L (as referenced in this post).

This information helps us understand that some seemingly extreme values in our data might actually be plausible for certain wine types. Similarly, consulting resources like Waterhouse’s post on “What’s in Wine”. can provide valuable insights into the expected ranges for other wine components.

Given the new information about our data and the supposed outliers we investigated we can proceed with our analysis thanks to domain knowledge.

Response Variable (Quality)

There are two ways we can proceed with our analysis based on our response variable, we can either take quality as a categorical variable or a continuous variable, I prefer and will proceed with the later.

Show the code
wine_tbl |> 
  ggplot(aes(quality)) +
  geom_histogram(
    binwidth = 1,
    fill = "violetred4"
  ) +
  geom_density(
    stat = "count",
    col = "orangered"
  ) +
  scale_x_continuous(
    breaks = seq(1, 10, 1)
  ) +
  labs(
    x = "Quality",
    y = "Frequency",
    title = "White wine quality frequency"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = .5, face = "bold")
  )
Figure 3: Frequency of the white wine quality ratings shows that the wine is mostly rate between 5 to 7

The distribution of wine ratings in Figure 3 shows a range between 3 and 9. This is the same as what we have in Table 2. White wines rated 6 and 5 are the most occurring wine.

Correlation Matrix and Relationship Between Variables

It is important to see how the different variables relates to wine quality.

Show the code
GGally::ggcorr(
  wine_tbl,
  method = "pairwise",
  geom = "circle",
  palette = "Spectral",
  max_size = 10,
  min_size = 2,
  label = TRUE,
  size = 3,
  hjust = .95
) +
  ggtitle("Wine Quality Feature Correlations") +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold")
  )
Figure 4: Correlation Plot

Figure 4 shows some noteworthy relationships between the variables. Here are the key observations:

  • Quality and Alcohol: Alcohol shows a moderate positive correlation with quality of wine
  • Alcohol and Density: A strong negative correlation exists between alcohol content and density. This implies that as alcohol level increases, the density of the wine decreases.
  • Quality and Density: There’s moderate negative relationship between the quality and wine density.
  • Residual Sugar and Density: The density of wine is largely affected by the amount of residual sugar present in the wine
  • Total Sulfur Dioxide and Free Sulfur Dioxide: A moderate positive correlation is observed between total sulfur dioxide and free sulfur dioxide. This indicates that wines with higher total sulfur dioxide content also tend to have higher levels of unbound free sulfur dioxide.

Overall there’s a fair distribution of correlation across all the variables ranging from as low as ± 0.1 to ± 0.8.

Modeling: Predicting the Perfect Potion

cat connoisseur

Imagine you’re a wine connoisseur, swirling a glass and trying to guess how good a wine will be. That’s kind of what we’re setting out to do here, but with the help of some brainy computer programs!

We’re going to test out three different “recipes” to predict white wine quality. Think of them like different ways of mixing ingredients (the info we have about the wine) to come up with a final answer (how good the wine is). Here’s the lowdown on each recipe:

  • Generalized Linear Model : his is a foundational recipe, like a classic soup base. It establishes a relationship between the independent variables (like acidity, sugar content) and the quality score using a specific link function. For wine quality which might be a score or rank, a linear regression within a GLM framework would likely be ideal.

  • Decision Tree Model: This recipe is more like a detective story with a bunch of “yes or no” questions. It splits the wine data based on different factors, asking questions like “is the alcohol content high?” or “is it a certain grape variety?” to guess the quality.

  • Random Forest: Imagine having a whole bunch of detectives working together! This recipe combines lots of decision tree detectives, making it more accurate and less likely to get stuck on one idea.

  • Nearest Neighbor: This recipe is like asking other wine experts for their opinion. It finds similar wines in the data based on the available information and predicts the quality based on how those similar wines were rated.

Data Splitting Strategy

To ensure robust model evaluation, we’ll employ a data splitting strategy. This involves dividing the dataset into two portions:

  • Training Set (80%): This larger portion of the data will be used to train the models. The training process involves fitting the model parameters to the data, allowing the model to learn the underlying relationships between the features (independent variables) and the target variable (wine quality).

  • Test Set (20%): This smaller portion of the data will be reserved for testing the final model performance. We will not use the test set for training the models in any way. Instead, the trained model will be applied to the test set, and its predictions will be compared to the actual quality scores to assess the model’s generalizability and accuracy on unseen data.

Rationale for Splitting:

The 80/20 split is a common choice for data splitting, but depending on the size of your dataset, a 70/30 split might also be acceptable. The key idea is to have a sufficient amount of data for both training and testing purposes. The training set allows the model to learn, while the test set provides an unbiased assessment of how well the model performs on new data. To ensure balance in the distribution of the response variable (quality) between the test and the training data, we’ll execute the split taking into account the distribution of the response variable, as seen in Figure 3 the response variable is unevenly distributed.

To ensure reproducibility of the code, we’ll also set.seed into 34433

Show the code
set.seed(34433)
wine_split <- initial_split(wine_tbl, prop = .8, strata = quality)
wine_train <- training(wine_split)
wine_test <- testing(wine_split)

We also need to evaluate the developed models, for this, we will use a cross-fold validation.

Show the code
set.seed(33323)
wine_folds <- vfold_cv(wine_train, v = 10)

Model Specification

To ensure a good modeling workflow, we specify our model engines, and prepare to tune some parameters, for example, the decision tree model specification, we do not know the tree depth that would be optimal, so we tune this parameter. Other parameters which cannot be predetermine when fitting our model are also tuned.

Show the code
glm_spec <- linear_reg() |> # OLS specification
  set_engine("glm", family = "poisson") |> 
  set_mode("regression")

dec_tree_spec <- decision_tree( # Decision tree specification
  tree_depth = tune(),
  cost_complexity = tune()
) |> 
  set_engine("rpart") |> 
  set_mode("regression")

rand_for_spec <- rand_forest( # Random forest specification
  trees = 1000,
  mtry = tune(),
  min_n = tune()
) |> 
  set_engine("ranger", importance = "impurity") |> 
  set_mode("regression")

knn_spec <- nearest_neighbor(
  neighbors = tune(),
  dist_power = tune(),
  weight_func = tune()
) |> 
  set_engine("kknn") |> 
  set_mode("regression")

glm_spec
Linear Regression Model Specification (regression)

Engine-Specific Arguments:
  family = poisson

Computational engine: glm 
Show the code
dec_tree_spec
Decision Tree Model Specification (regression)

Main Arguments:
  cost_complexity = tune()
  tree_depth = tune()

Computational engine: rpart 
Show the code
rand_for_spec
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = 1000
  min_n = tune()

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 
Show the code
knn_spec
K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = tune()
  weight_func = tune()
  dist_power = tune()

Computational engine: kknn 

Fine Tuning the Ingredients: Feature Engineering

Now that we’ve chosen our recipe testers (models), let’s take a look at the ingredients (features) we’ll be using. Ideally, we want the ingredients to be in a format that the models can easily understand and use for their predictions. This process of preparing the data is called feature engineering.

There are many resources available to help with feature engineering, like this one link. that provides a reference for different models and recommended pre-processing techniques.

In our case, we’ll start by testing the models on the features without any preprocessing. Some models, like the ones we’ve chosen, can handle data in its raw form and might not necessarily require specific transformations.

However, that doesn’t mean there’s no room for improvement! We can explore additional options like:

  • Log transformation: This can be helpful for features with skewed distributions, making them more symmetrical and easier for some models to work with.

  • Scaled transformations (standardization or normalization): This addresses features with differing scales and units.

Formula Recipe

Here, we assume no preprocessing.

Show the code
formula_rec <- recipe(
  quality ~ .,
  data = wine_train
)
formula_rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 11

log_zv preproc

For the second preprocessing, we do the following: - Perform a log transformation on the all predictors, step_log.

  • Scale the variables so they are all on a similar scale.
Show the code
log_sc_rec <- formula_rec |> 
  step_log(all_predictors()) |> 
  step_zv(all_predictors())

log_sc_rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 11
── Operations 
• Log transformation on: all_predictors()
• Zero variance filter on: all_predictors()

A preview of the preprocess object is given below Table 4

Table 4: Variables after log transformation and removing zero variance variables
Preview of Preprocessed data
Acidity
Citric Acid Residual Sugar Chlorides
Sulfur dioxide
Density pH Sulphates Alcohol Quality
Fixed Volatile Free Total
2.091864 -1.3093333 -0.8915981 0.3715636 -3.411248 2.397895 4.143135 -0.009242581 1.095273 -0.5798185 2.484907 5
2.151762 -1.4696760 -0.9162907 1.4350845 -3.352407 2.833213 4.691348 -0.005314095 1.144223 -0.6348783 2.272126 5
2.066863 -1.7147984 -0.9942523 0.1823216 -3.218876 2.772589 4.317488 -0.008032172 1.156881 -0.4620355 2.379546 5
2.116256 -0.8675006 -0.4780358 2.9575111 -3.218876 3.713572 5.147494 0.000199980 1.091923 -0.4004776 2.272126 5
1.871802 -1.1711830 -1.9661129 2.0149030 -3.123566 3.526361 4.890349 -0.004510155 1.169381 -0.6931472 2.251292 5
2.028148 -0.4004776 -1.9661129 0.4054651 -2.603690 3.218876 5.123964 -0.006319929 1.115142 -0.6733446 2.230014 5

After applying a log transformation, it’s common to encounter values like NaN (Not a Number), NA (missing values), or Inf (infinity). To ensure our models can process the data effectively, we should check for these potential issues before fitting the model. A quick summary of the preprocessed data can reveal their presence.

Show the code
log_sc_preprocessed <- prep(log_sc_rec) |> 
  juice()

skimr::skim(log_sc_preprocessed) |> 
  gt() |> 
  as_raw_html()
Table 5: Preview of Preprocessed data
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric fixed_acidity 0 1 1.917645176 0.12159317 1.3350011 1.840549633 1.916922612 1.987874348 2.65324196 ▁▃▇▁▁
numeric volatile_acidity 0 1 -1.333889842 0.33357665 -2.5257286 -1.560647748 -1.347073648 -1.139434283 0.09531018 ▁▅▇▂▁
numeric citric_acid 0 1 -Inf NaN -Inf -1.309333320 -1.139434283 -0.941608540 0.50681760 ▁▁▁▇▁
numeric residual_sugar 0 1 1.481001154 0.92310358 -0.5108256 0.530628251 1.648658626 2.292534757 4.18661984 ▅▆▇▆▁
numeric chlorides 0 1 -3.148717269 0.33152116 -4.7105307 -3.324236341 -3.146555163 -2.995732274 -1.06131650 ▁▅▇▁▁
numeric free_sulfur_dioxide 0 1 3.437736555 0.54690744 0.6931472 3.135494216 3.526360525 3.828641396 5.66642669 ▁▁▇▆▁
numeric total_sulfur_dioxide 0 1 4.880002528 0.33468233 2.1972246 4.691347882 4.897839800 5.122471438 6.08677473 ▁▁▁▇▁
numeric density 0 1 -0.005988129 0.00302027 -0.0129738 -0.008304386 -0.006289739 -0.003907625 0.03823946 ▇▂▁▁▁
numeric p_h 0 1 1.157981867 0.04671982 1.0006319 1.124929597 1.156881197 1.187843422 1.34025042 ▁▅▇▂▁
numeric sulphates 0 1 -0.739139995 0.22328883 -1.5141277 -0.891598119 -0.755022584 -0.597837001 0.07696104 ▁▃▇▃▁
numeric alcohol 0 1 2.345977687 0.11535902 2.0794415 2.251291799 2.341805806 2.433613355 2.65324196 ▂▇▆▅▁
numeric quality 0 1 5.882848392 0.87445631 3.0000000 5.000000000 6.000000000 6.000000000 9.00000000 ▁▆▇▃▁

Based on Table 5, the citric_acid feature contains NaN and Inf values. These values can cause issues for machine learning models like KNN. To ensure our models can process the data effectively, we need to rewrite the preprocessing to resolve the issue by adding 1 to values of citric_acid.

Show the code
log_sc_rec2 <- formula_rec |> 
  step_mutate(citric_acid = citric_acid + 1) |> 
  step_log(all_predictors())

log_sc_rec2 |> 
  prep() |> 
  juice() |> 
  skimr::skim() |> 
  gt() |> 
  as_raw_html()
Table 6: Improved preprocessing
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric fixed_acidity 0 1 1.917645176 0.12159317 1.3350011 1.840549633 1.916922612 1.987874348 2.65324196 ▁▃▇▁▁
numeric volatile_acidity 0 1 -1.333889842 0.33357665 -2.5257286 -1.560647748 -1.347073648 -1.139434283 0.09531018 ▁▅▇▂▁
numeric citric_acid 0 1 0.285212528 0.08850576 0.0000000 0.239016900 0.277631737 0.329303747 0.97832612 ▁▇▁▁▁
numeric residual_sugar 0 1 1.481001154 0.92310358 -0.5108256 0.530628251 1.648658626 2.292534757 4.18661984 ▅▆▇▆▁
numeric chlorides 0 1 -3.148717269 0.33152116 -4.7105307 -3.324236341 -3.146555163 -2.995732274 -1.06131650 ▁▅▇▁▁
numeric free_sulfur_dioxide 0 1 3.437736555 0.54690744 0.6931472 3.135494216 3.526360525 3.828641396 5.66642669 ▁▁▇▆▁
numeric total_sulfur_dioxide 0 1 4.880002528 0.33468233 2.1972246 4.691347882 4.897839800 5.122471438 6.08677473 ▁▁▁▇▁
numeric density 0 1 -0.005988129 0.00302027 -0.0129738 -0.008304386 -0.006289739 -0.003907625 0.03823946 ▇▂▁▁▁
numeric p_h 0 1 1.157981867 0.04671982 1.0006319 1.124929597 1.156881197 1.187843422 1.34025042 ▁▅▇▂▁
numeric sulphates 0 1 -0.739139995 0.22328883 -1.5141277 -0.891598119 -0.755022584 -0.597837001 0.07696104 ▁▃▇▃▁
numeric alcohol 0 1 2.345977687 0.11535902 2.0794415 2.251291799 2.341805806 2.433613355 2.65324196 ▂▇▆▅▁
numeric quality 0 1 5.882848392 0.87445631 3.0000000 5.000000000 6.000000000 6.000000000 9.00000000 ▁▆▇▃▁

Our efforts in rewriting the preprocessing step have paid off! As shown in Table 6, the citric_acid feature is now free of NA, NAN, and Inf. We can now proceed with the model training process using the preprocessed data from Table 6.

Workflow

Now that we’ve chosen our feature engineering techniques and defined the models we’ll be using, it’s time to bring them together. To capture this entire workflow, we’ll create a workflow object

Think of this workflow object like a recipe book. It will hold all the ingredients (the recipe objects specifying the feature engineering steps) and the instructions for each dish (the model specifications). This allows us to easily apply both recipe options (feature engineering approaches) to each model we’re testing.

The workflow_set() function will be our handy tool for creating this comprehensive recipe book. It ensures we can seamlessly combine different feature engineering choices with various models, allowing for a more thorough analysis.

This approach allows us to explore different combinations and see which recipe (feature engineering) works best for each model in predicting white wine quality.

Show the code
wf_object <- workflow_set(
  preproc = list(
    no_preproc = formula_rec,
    log_zv = log_sc_rec2
  ),
  model = list(
    glm = glm_spec,
    dec_tree = dec_tree_spec,
    rf = rand_for_spec,
    knn = knn_spec
  ),
  cross = TRUE
)

Finding the Perfect Settings: Parameter Tuning

We’ve chosen our models and put together our workflow, but there’s one more step before we can unleash them on the wine data. Each model has its own set of “knobs” we can adjust to fine-tune its performance. These knobs are called “hyperparameters,” and finding the right settings for them can significantly impact how well the model predicts wine quality.

Imagine you’re baking a cake. You have a recipe to follow (the model in this case), but you can adjust things like the oven temperature or the baking time (the hyperparameters) to get the perfect result. Here, we’ll use a technique called “grid search” to explore different combinations of these hyperparameters for each model.

To make this process faster, we’ll use a package called doParallel. This helps us leverage multiple processing cores on your computer, essentially dividing the work and speeding things up.

The workflow_map() function comes in handy again. It allows us to apply the grid search technique to all the models we defined in our workflow. We’ll also set verbose=TRUE during this step, which means the computer will show us the progress (although we won’t include those details in this explanation).

By exploring different hyperparameter combinations, we aim to find the settings that make each model perform at its best when predicting white wine quality.

Show the code
doParallel::registerDoParallel(cores = 8)


wf_tune_fit <- workflow_map(
  wf_object,
  fn = "tune_grid",
  grid = 10,
  resamples = wine_folds,
  seed = 11,
  verbose = TRUE
)

doParallel::stopImplicitCluster()

Tune Result

Show the code
autoplot(wf_tune_fit) +
  theme(
    legend.position = "bottom"
  ) +
  theme_minimal()
Figure 5: Tune result

Figure 5 provides a visual breakdown of how the different models performed in predicting white wine quality. We used two key metrics to evaluate their success: RMSE and R-Squared.

The clear winner based on these metrics is the Random Forest Model! It achieved the lowest RMSE and highest R-Squared score, indicating both high accuracy (low error) and a strong correlation between its predictions and the actual wine quality score.

Following closely behind is the KNN (K-Nearest Neighbors) model. While it performs well, there’s a noticeable gap between its performance and Random Forest’s. This suggests that Random Forest might be a more robust choice for predicting wine quality in this particular dataset.

Examining Specific Model Results

As Figure 5 revealed, the Random Forest model is the top performer. To gain a better understanding, let’s take a closer look at the tuning process result.

Show the code
top_model <- rank_results(wf_tune_fit, rank_metric = "rmse", select_best = TRUE)

top_model |> 
  select(-c(preprocessor, n)) |> 
  mutate(
    .config = str_remove_all(.config, "Preprocessor1_Model"),
    wflow_id = case_when(
      str_detect(wflow_id, "no_preproc") ~ "No Preprocessing",
      .default = "Log + scale transformation"
    ),
    model = str_replace_all(model, "_", " "),
    model = str_to_title(model)
  ) |> 
  pivot_wider(
    id_cols = c(wflow_id, model, .config, rank),
    names_from = .metric,
    values_from = mean
  ) |> 
  gt() |> 
  cols_label(
    wflow_id = "Recipe",
    .config = "Resample Number",
    model = "Model Type",
    rank = "Rank",
    rmse = "RMSE",
    rsq = "R-Squared"
  ) |> 
  tab_spanner(
    columns = rmse:rsq,
    label = "Evaluation Metric"
  ) |> 
  tab_header(title = "Top Performing Models") |> 
  opt_stylize(
    style = 3,
    color = "blue",
    add_row_striping = TRUE
  ) |> 
  as_raw_html()
Table 7: Ranks of the model, random forest without preprocessing is ranked highest
Top Performing Models
Recipe Model Type Resample Number Rank
Evaluation Metric
RMSE R-Squared
No Preprocessing Rand Forest 08 1 0.6038689 0.5265060
Log + scale transformation Rand Forest 08 2 0.6039895 0.5262881
No Preprocessing Nearest Neighbor 05 3 0.6119481 0.5115739
Log + scale transformation Nearest Neighbor 05 4 0.6126732 0.5109407
No Preprocessing Decision Tree 04 5 0.7286573 0.3182304
Log + scale transformation Decision Tree 04 6 0.7286573 0.3182304
Log + scale transformation Linear Reg 1 7 0.7372858 0.2896381
No Preprocessing Linear Reg 1 8 0.7431961 0.2783234

Table 7 offers a comprehensive overview of each model’s performance, providing valuable insights beyond the visual representation in Figure 5. Here’s what we can learn from this table:

  • Model Performance: It clearly shows the best performing model based on the chosen evaluation metrics (e.g., RMSE, R-squared). This allows for a quick comparison and identification of the champion model.
  • Preprocessing Status: The table indicates whether each model was trained on data that underwent preprocessing (like feature scaling) or not. This information helps us understand if preprocessing played a role in the model’s performance.
  • Resample Selection: It specifies the resample object used to select the best performing model version during the tuning process. This provides context about the training and validation data used for each model.

Preprocessing does not improve the performance of all models except the Generalized Linear Model which showed slight improvement.

Next we get the workflow object and finalize the model

Finalize Model

Finalizing the model involves few steps:

  1. extract the workflow object of the best model which id is no_preproc_rf
Show the code
rf_model <- wf_tune_fit |> 
  extract_workflow(id = "no_preproc_rf")

rf_model
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = 1000
  min_n = tune()

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 
  1. extract the workflow tuning results on the resamples and select the best metric
Show the code
rf_workflow_res <- wf_tune_fit |> 
  extract_workflow_set_result("no_preproc_rf")

best_tune <- select_best(rf_workflow_res, metric = "rmse")
best_tune |> 
  gt() |> 
  as_raw_html()
Table 8: Best Tuning Parameter
mtry min_n .config
7 6 Preprocessor1_Model08
  1. The best tuning parameter is seen in Table 8. We use these values of the parameter with our recommended model with its recommended preprocessing and use the finalize_workflow() function to fit our final workflow.
Show the code
recommended_workflow <- finalize_workflow(
  rf_model,
  parameters = best_tune
)

recommended_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 7
  trees = 1000
  min_n = 6

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 

The result shows that our parameters from tuning have been inserted into the model automatically.

  1. The last step involves fitting the workflow recommended from our analysis to the split data set.
Show the code
final_model <- last_fit(
  object = recommended_workflow,
  split = wine_split
)

Model Evaluation

We can see how well the model performs on the test data. We applied ceiling() function to round our predictions as we they are count data.

Show the code
final_model |> 
  collect_predictions() |> 
  mutate(
    .pred = case_when(
      .pred < 5 ~ floor(.pred),
      .default = ceiling(.pred)
    )
  ) |> 
  select(observed = quality, predicted = .pred) |> 
  pivot_longer(
    cols = c(predicted, observed),
    names_to = "type",
    values_to = "value"
  ) |> 
  ggplot(aes(value, fill = type)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Quality",
    y = "Count",
    title = "Model Performnce on Test Data",
    subtitle = "Model do not capture low values (3, 4, and 5) properly",
    fill = "Value"
  ) +
  scale_fill_viridis_d() +
  theme_minimal()

Feature Importance

Feature importance sheds light on the factors that are driving the decisions which our model makes and the one which might be irrelevant. This can provide a lot of valuable insights on what truly matters to white wine consumers. We can investigate this using the vip library.

Show the code
library(vip)

final_model |>
  extract_fit_parsnip() |> 
  vip(
    num_features = 10,
    geom = "col",
    aesthetics = list(
      fill = "tan",
      col = "orange",
      alpha = .6
    )
  ) +
  theme_minimal()
Figure 6

Apparently, alcohol is the number one feature.

Conclusion

This analysis explored the use of machine learning to predict white wine quality based on its chemical components. We compared four models: Generalized Linear Model (GLM), Decision Tree, Random Forest, and K-Nearest Neighbors (KNN). Through training and evaluation, the Random Forest model emerged as the top performer, demonstrating strong predictive capabilities.

It’s important to acknowledge a potential limitation. While Random Forest excels at continuous predictions, the actual quality scores might be assigned discrete labels (e.g., 6 = good, 8 = excellent). This difference in representation between the model’s predictions and the actual quality scores could lead to minor inaccuracies.

Despite this consideration, the Random Forest model has proven to be a powerful tool for predicting wine quality within the 0-10 range. Future explorations could involve:

Classification Models: Investigating models specifically designed for predicting discrete quality categories (e.g., good, bad) might provide even more accurate results when directly aligned with the scoring system. Feature Engineering: Exploring alternative feature engineering techniques could potentially further refine the model’s ability to capture the nuances of wine quality based on its chemical components.

This analysis has successfully demonstrated the potential of machine learning for white wine quality prediction. By understanding the strengths and limitations of different models and data considerations, we can continue to refine our approach and develop even more accurate models for wine quality assessment.

homepage