Introduction

Wine appreciation is an art form enjoyed by many. However, beyond the subjective experience of taste, there lies a science behind wine quality. Winemakers strive to produce exceptional vintages by carefully controlling various factors during the production process. This project explores the fascinating world of wine quality, specifically focusing on the ability to predict it based on measurable chemical properties.

Objective

This project aims to develop a model that can estimate the quality of a wine using its chemical composition. By analyzing features such as acidity, residual sugar content, and sulfur dioxide levels, we aim to unlock valuable insights into the factors that contribute to a superior wine. This model can potentially benefit wine producers, retailers, and even consumers seeking a more informed approach to wine selection.

Data

Data Source

The data for this project was obtained from the UCI Machine Learning Repository, a renowned resource for publicly available datasets. This repository offers a wealth of datasets for various machine learning tasks, making it a valuable resource for researchers and data scientists.

The specific dataset we utilized is titled Wine Quality and focuses on Portuguese Vinho Verde wines.

Vinho Verde is a light-bodied wine known for its crisp acidity and refreshing taste. The dataset encompasses two separate datasets, one for red wines and another for white wines. For this project, we chose to focus on the white wine dataset.

Data Definition

The dataset contains more than 4500 wine samples, each characterized by 11 chemical properties measured during the winemaking process. These features include factors like:

Fixed Acidity: Level of acidity due to tartaric acid.
Volatile Acidity: Level of acidity due to volatile acids like acetic acid.
Alcohol: Percentage of alcohol content by volume
Residual Sugar: Amount of remaining sugar after fermentation.
pH: Measure of acidity on a logarithmic scale.
Citric Acid: Minute quantity naturally present in grapes. Winemakers may add small amounts to increase tartness and prevent haze formation.
Chlorides: Level of chloride salts that can influence a wine’s saltiness and bitterness.
Free Sulfur Dioxide: Amount of unbound sulfur dioxide (SO2) gas, a preservative commonly used to prevent spoilage by bacteria and oxidation.
Total Sulfur Dioxide: The total level of SO2 gas, including both free and bound forms.
Density: The relative density of the wine compared to water. It can be an indicator of the wine’s alcohol content and sugar level.
Sulphates: The level of sulfate salts, which can influence a wine’s overall mouth-feel and perception of bitterness.
Quality: This is the target variable, a score between 0 and 10 representing the perceived sensory quality of the wine according to human experts. This score serves as the target variable for our machine learning model, allowing us to predict the quality of a new wine based on its chemical makeup.

Setting up the Analysis Environment

To begin our analysis, we’ll import essential libraries using the pacman package for efficient dependency management.

Show the code

library(pacman)

p_load(tidyverse, tidymodels, gt, kableExtra, patchwork)

Exploring the Data

Next, we’ll import the wine quality dataset and get a glimpse of its contents. Here, we directly use the janitor package because it’s only needed for this specific step. This approach avoids unnecessary library loading throughout the document.

Show the code

wine_tbl <- read_delim("data/winequality-white.csv", delim = ";") |> 
  janitor::clean_names()

Data Inspection: A Crucial Step

A crucial step in data analysis, after defining our goals and collecting data, is to inspect its quality and structure. This involves examining the data to identify potential issues like missing values, inconsistencies, or formatting errors. Table 1 below gives a preview of our data.

Table 1: Data Preview

Data Preview: Wine Quality for White Wine
Acidity		Citric Acid	Residual Sugar	Chlorides	Sulfur dioxide		Density	pH	Sulphates	Alcohol	Quality
Fixed	Volatile	Citric Acid	Residual Sugar	Chlorides	Free	Total	Density	pH	Sulphates	Alcohol	Quality
7.0	0.27	0.36	20.7	0.045	45	170	1.0010	3.00	0.45	8.8	6
6.3	0.30	0.34	1.6	0.049	14	132	0.9940	3.30	0.49	9.5	6
8.1	0.28	0.40	6.9	0.050	30	97	0.9951	3.26	0.44	10.1	6
7.2	0.23	0.32	8.5	0.058	47	186	0.9956	3.19	0.40	9.9	6
7.2	0.23	0.32	8.5	0.058	47	186	0.9956	3.19	0.40	9.9	6
8.1	0.28	0.40	6.9	0.050	30	97	0.9951	3.26	0.44	10.1	6

Show the code

glimpse(wine_tbl)

Rows: 4,898
Columns: 12
$ fixed_acidity        <dbl> 7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, 6.3, 8.1,…
$ volatile_acidity     <dbl> 0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0…
$ citric_acid          <dbl> 0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16, 0.36, 0…
$ residual_sugar       <dbl> 20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.00, 20.70,…
$ chlorides            <dbl> 0.045, 0.049, 0.050, 0.058, 0.058, 0.050, 0.045, …
$ free_sulfur_dioxide  <dbl> 45, 14, 30, 47, 47, 30, 30, 45, 14, 28, 11, 17, 1…
$ total_sulfur_dioxide <dbl> 170, 132, 97, 186, 186, 97, 136, 170, 132, 129, 6…
$ density              <dbl> 1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951, 0…
$ p_h                  <dbl> 3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3…
$ sulphates            <dbl> 0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.47, 0.45, 0…
$ alcohol              <dbl> 8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11…
$ quality              <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 7, 5, 7, 6…

Using the glimpse function we can get a high-level overview of the wine quality data and the data follows the expected structure in the data definition. We have 4898 rows and 12 columns. All the columns are numeric variables.

Show the code

skimr::skim(wine_tbl) |> 
  select(-skim_type) |> 
  gt() |> 
  cols_label(
    skim_variable = "Variable",
    n_missing = "Missing Value",
    complete_rate = "Complete",
    numeric.mean = "Mean",
    numeric.sd = "Standard Deviation",
    numeric.p0 = "Min",
    numeric.p25 = "1st Percentile",
    numeric.p50 = "Median",
    numeric.p75 = "3rd Percentile",
    numeric.p100 = "Max",
    numeric.hist = "Histogram"
  ) |> 
  tab_header(
    title = "White Wine Quality Descriptive Statistics"
  ) |> 
  opt_stylize(
    style = 3,
    color = "gray"
  ) |> 
  as_raw_html()

Table 2: Wine data summary

White Wine Quality Descriptive Statistics
Variable	Missing Value	Complete	Mean	Standard Deviation	Min	1st Percentile	Median	3rd Percentile	Max	Histogram
fixed_acidity	0	1	6.85478767	0.843868228	3.80000	6.3000000	6.80000	7.3000	14.20000	▁▇▁▁▁
volatile_acidity	0	1	0.27824112	0.100794548	0.08000	0.2100000	0.26000	0.3200	1.10000	▇▅▁▁▁
citric_acid	0	1	0.33419151	0.121019804	0.00000	0.2700000	0.32000	0.3900	1.66000	▇▆▁▁▁
residual_sugar	0	1	6.39141486	5.072057784	0.60000	1.7000000	5.20000	9.9000	65.80000	▇▁▁▁▁
chlorides	0	1	0.04577236	0.021847968	0.00900	0.0360000	0.04300	0.0500	0.34600	▇▁▁▁▁
free_sulfur_dioxide	0	1	35.30808493	17.007137325	2.00000	23.0000000	34.00000	46.0000	289.00000	▇▁▁▁▁
total_sulfur_dioxide	0	1	138.36065741	42.498064554	9.00000	108.0000000	134.00000	167.0000	440.00000	▂▇▂▁▁
density	0	1	0.99402738	0.002990907	0.98711	0.9917225	0.99374	0.9961	1.03898	▇▂▁▁▁
p_h	0	1	3.18826664	0.151000600	2.72000	3.0900000	3.18000	3.2800	3.82000	▁▇▇▂▁
sulphates	0	1	0.48984688	0.114125834	0.22000	0.4100000	0.47000	0.5500	1.08000	▃▇▂▁▁
alcohol	0	1	10.51426705	1.230620568	8.00000	9.5000000	10.40000	11.4000	14.20000	▃▇▆▃▁
quality	0	1	5.87790935	0.885638575	3.00000	5.0000000	6.00000	6.0000	9.00000	▁▅▇▃▁

Table 2 summarizes the key characteristics of our data. It reveals:

No Missing Values: There are no missing data points, ensuring a complete dataset for our analysis.
Descriptive Statistics: Table 2 presents various descriptive statistics for each variable. This includes the minimum and maximum values highlighting the data’s range. Additionally, the first and third quartile provides an insight of the distribution of values in the data. Finally the mean and standard deviation indicate the average and spread of each variables relative to their mean.
Outliers: Initial exploration of the data suggests the presence of potential outliers in several features. Specifically, the maximum values for variables fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulphur_dioxide, and total_sulfur_dioxide seems deviate significantly from the rest of the distribution.

The next step is to statistically confirm these observations, we can’t rely solely on inspecting the maximum values. Instead, we’ll employ appropriate outlier detection tests. Due to the fairly large data size we have, we can use either of:

Grubb’s test
Rosner’s test

These outlier detection tests help us statistically determine if the observed extreme values are likely outliers within the dataset. Both tests listed here are particularly useful for identifying outliers in a dataset that is approximately normally distributed.

Univariate Inspection

Dealing with Outliers

Before diving into formal tests, it’s important we get a sense of our data. We can visualize each variables of our data to inspect their distribution.

Visual Inspection

Show the code

wine_tbl_longer <- wine_tbl |> 
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "value"
  )

wine_tbl_longer |> 
  filter(variable != "quality") |> 
  ggplot(aes(sample = value)) +
  geom_qq(alpha = .2, col = "violetred4") +
  geom_qq_line(col = "gray3") +
  facet_wrap(~variable, scales = "free") +
  labs(
    x = "Theoretical distribution",
    y = "Sample",
    title = "Variables are not normally distributed"
  ) +
  theme_minimal()

wine_tbl_longer |> 
  filter(variable !="quality") |> 
  ggplot(aes(value)) +
  facet_wrap(~variable, scales = "free") +
  geom_histogram(fill = "burlywood", col = "black") +
  geom_density(
    stat = "bin",
    col = "violetred"
  ) +
  labs(
    x = "Value",
    y = "Frequency",
    title = "Variables are fairly normal"
  ) +
  theme_minimal()

(a) QQ plot of all the variables shows they are not normally distributed

Figure 1 shows some fairly normal data, but it won’t be bad to test for normality using Shapiro-wilk’s test. We can also use the qqPlot() function from the car package which I really like to test for the normality.

Shapiro Test

Show the code

normality_test <- wine_tbl |> 
  map(~tidy(shapiro.test(.))) |> 
  bind_rows(.id = "variable") |> 
  select(-method) |> 
  janitor::clean_names() |> 
  mutate(
    variable = str_to_title(str_replace_all(variable, "_", " "))
  )
actual_colnames <- colnames(normality_test)
desired_colnames <- str_to_title(names(normality_test))
names(desired_colnames) <- actual_colnames
  
normality_test |> 
  filter(variable != "Quality") |> 
  gt()  |> 
  cols_label(.list = desired_colnames) |> 
  tab_header(
    title = "Shapiro-wilk Normality Test"
  ) |> 
  opt_stylize(
    style = 3,
    color = "gray"
  ) |> 
  as_raw_html()

Table 3: Shapiro-wilk’s normality test: All varibles are not normally distributed

Shapiro-wilk Normality Test
Variable	Statistic	P_value
Fixed Acidity	0.9765615	1.150151e-27
Volatile Acidity	0.9045497	4.586797e-48
Citric Acid	0.9222473	1.013179e-44
Residual Sugar	0.8845686	2.820710e-51
Chlorides	0.5908084	2.140584e-75
Free Sulfur Dioxide	0.9420691	3.857845e-40
Total Sulfur Dioxide	0.9890146	4.383453e-19
Density	0.9548048	1.780895e-36
P H	0.9880965	6.505521e-20
Sulphates	0.9516094	1.821979e-37
Alcohol	0.9553024	2.569014e-36

Car’s qqPlot Test

Show the code

car::qqPlot(
  wine_tbl$p_h,
  ylab = paste0("pH"),
  main = "Quantile-quantile plot for pH variable"
)
car::qqPlot(
  wine_tbl$alcohol,
  ylab = paste0("pH"),
  main = "Quantile-quantile plot for alcohol variable"
)

Table 3 shows all the variables do not follow a normal distribution, Figure 2 stresses this and we can clearly see the presence of outlier.

Where Domain Knowledge Shines

While our initial exploration identified potential outliers in the data, it’s important to consider them in the context of wine making expertise. Domain knowledge from experienced professionals like Cathy Howard suggests that the expected range for residual sugar in wine can vary widely, from as low as 0.3 g/L to over 35 g/L (as referenced in this post).

This information helps us understand that some seemingly extreme values in our data might actually be plausible for certain wine types. Similarly, consulting resources like Waterhouse’s post on “What’s in Wine”. can provide valuable insights into the expected ranges for other wine components.

Given the new information about our data and the supposed outliers we investigated we can proceed with our analysis thanks to domain knowledge.

Response Variable (Quality)

There are two ways we can proceed with our analysis based on our response variable, we can either take quality as a categorical variable or a continuous variable, I prefer and will proceed with the later.

Show the code

wine_tbl |> 
  ggplot(aes(quality)) +
  geom_histogram(
    binwidth = 1,
    fill = "violetred4"
  ) +
  geom_density(
    stat = "count",
    col = "orangered"
  ) +
  scale_x_continuous(
    breaks = seq(1, 10, 1)
  ) +
  labs(
    x = "Quality",
    y = "Frequency",
    title = "White wine quality frequency"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = .5, face = "bold")
  )

Figure 3: Frequency of the white wine quality ratings shows that the wine is mostly rate between 5 to 7

The distribution of wine ratings in Figure 3 shows a range between 3 and 9. This is the same as what we have in Table 2. White wines rated 6 and 5 are the most occurring wine.

Correlation Matrix and Relationship Between Variables

It is important to see how the different variables relates to wine quality.

Show the code

GGally::ggcorr(
  wine_tbl,
  method = "pairwise",
  geom = "circle",
  palette = "Spectral",
  max_size = 10,
  min_size = 2,
  label = TRUE,
  size = 3,
  hjust = .95
) +
  ggtitle("Wine Quality Feature Correlations") +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold")
  )

Figure 4 shows some noteworthy relationships between the variables. Here are the key observations:

Quality and Alcohol: Alcohol shows a moderate positive correlation with quality of wine
Alcohol and Density: A strong negative correlation exists between alcohol content and density. This implies that as alcohol level increases, the density of the wine decreases.
Quality and Density: There’s moderate negative relationship between the quality and wine density.
Residual Sugar and Density: The density of wine is largely affected by the amount of residual sugar present in the wine
Total Sulfur Dioxide and Free Sulfur Dioxide: A moderate positive correlation is observed between total sulfur dioxide and free sulfur dioxide. This indicates that wines with higher total sulfur dioxide content also tend to have higher levels of unbound free sulfur dioxide.

Overall there’s a fair distribution of correlation across all the variables ranging from as low as ± 0.1 to ± 0.8.

Modeling: Predicting the Perfect Potion

Imagine you’re a wine connoisseur, swirling a glass and trying to guess how good a wine will be. That’s kind of what we’re setting out to do here, but with the help of some brainy computer programs!

We’re going to test out three different “recipes” to predict white wine quality. Think of them like different ways of mixing ingredients (the info we have about the wine) to come up with a final answer (how good the wine is). Here’s the lowdown on each recipe:

Generalized Linear Model : his is a foundational recipe, like a classic soup base. It establishes a relationship between the independent variables (like acidity, sugar content) and the quality score using a specific link function. For wine quality which might be a score or rank, a linear regression within a GLM framework would likely be ideal.
Decision Tree Model: This recipe is more like a detective story with a bunch of “yes or no” questions. It splits the wine data based on different factors, asking questions like “is the alcohol content high?” or “is it a certain grape variety?” to guess the quality.
Random Forest: Imagine having a whole bunch of detectives working together! This recipe combines lots of decision tree detectives, making it more accurate and less likely to get stuck on one idea.
Nearest Neighbor: This recipe is like asking other wine experts for their opinion. It finds similar wines in the data based on the available information and predicts the quality based on how those similar wines were rated.

Data Splitting Strategy

To ensure robust model evaluation, we’ll employ a data splitting strategy. This involves dividing the dataset into two portions:

Training Set (80%): This larger portion of the data will be used to train the models. The training process involves fitting the model parameters to the data, allowing the model to learn the underlying relationships between the features (independent variables) and the target variable (wine quality).
Test Set (20%): This smaller portion of the data will be reserved for testing the final model performance. We will not use the test set for training the models in any way. Instead, the trained model will be applied to the test set, and its predictions will be compared to the actual quality scores to assess the model’s generalizability and accuracy on unseen data.

Rationale for Splitting:

The 80/20 split is a common choice for data splitting, but depending on the size of your dataset, a 70/30 split might also be acceptable. The key idea is to have a sufficient amount of data for both training and testing purposes. The training set allows the model to learn, while the test set provides an unbiased assessment of how well the model performs on new data. To ensure balance in the distribution of the response variable (quality) between the test and the training data, we’ll execute the split taking into account the distribution of the response variable, as seen in Figure 3 the response variable is unevenly distributed.

To ensure reproducibility of the code, we’ll also set.seed into 34433

Show the code

set.seed(34433)
wine_split <- initial_split(wine_tbl, prop = .8, strata = quality)
wine_train <- training(wine_split)
wine_test <- testing(wine_split)

We also need to evaluate the developed models, for this, we will use a cross-fold validation.

Show the code

set.seed(33323)
wine_folds <- vfold_cv(wine_train, v = 10)

Model Specification

To ensure a good modeling workflow, we specify our model engines, and prepare to tune some parameters, for example, the decision tree model specification, we do not know the tree depth that would be optimal, so we tune this parameter. Other parameters which cannot be predetermine when fitting our model are also tuned.

Show the code

glm_spec <- linear_reg() |> # OLS specification
  set_engine("glm", family = "poisson") |> 
  set_mode("regression")

dec_tree_spec <- decision_tree( # Decision tree specification
  tree_depth = tune(),
  cost_complexity = tune()
) |> 
  set_engine("rpart") |> 
  set_mode("regression")

rand_for_spec <- rand_forest( # Random forest specification
  trees = 1000,
  mtry = tune(),
  min_n = tune()
) |> 
  set_engine("ranger", importance = "impurity") |> 
  set_mode("regression")

knn_spec <- nearest_neighbor(
  neighbors = tune(),
  dist_power = tune(),
  weight_func = tune()
) |> 
  set_engine("kknn") |> 
  set_mode("regression")

glm_spec

Linear Regression Model Specification (regression)

Engine-Specific Arguments:
  family = poisson

Computational engine: glm

Show the code

dec_tree_spec

Decision Tree Model Specification (regression)

Main Arguments:
  cost_complexity = tune()
  tree_depth = tune()

Computational engine: rpart

Show the code

rand_for_spec

Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = 1000
  min_n = tune()

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger

Show the code

knn_spec

K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = tune()
  weight_func = tune()
  dist_power = tune()

Computational engine: kknn

Fine Tuning the Ingredients: Feature Engineering

Now that we’ve chosen our recipe testers (models), let’s take a look at the ingredients (features) we’ll be using. Ideally, we want the ingredients to be in a format that the models can easily understand and use for their predictions. This process of preparing the data is called feature engineering.

There are many resources available to help with feature engineering, like this one link. that provides a reference for different models and recommended pre-processing techniques.

In our case, we’ll start by testing the models on the features without any preprocessing. Some models, like the ones we’ve chosen, can handle data in its raw form and might not necessarily require specific transformations.

However, that doesn’t mean there’s no room for improvement! We can explore additional options like:

Log transformation: This can be helpful for features with skewed distributions, making them more symmetrical and easier for some models to work with.
Scaled transformations (standardization or normalization): This addresses features with differing scales and units.

Formula Recipe

Here, we assume no preprocessing.

Show the code

formula_rec <- recipe(
  quality ~ .,
  data = wine_train
)
formula_rec

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:    1
predictor: 11

log_zv preproc

For the second preprocessing, we do the following: - Perform a log transformation on the all predictors, step_log.

Scale the variables so they are all on a similar scale.

Show the code

log_sc_rec <- formula_rec |> 
  step_log(all_predictors()) |> 
  step_zv(all_predictors())

log_sc_rec

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:    1
predictor: 11

── Operations

• Log transformation on: all_predictors()

• Zero variance filter on: all_predictors()

A preview of the preprocess object is given below Table 4

Table 4: Variables after log transformation and removing zero variance variables

Preview of Preprocessed data
Acidity		Citric Acid	Residual Sugar	Chlorides	Sulfur dioxide		Density	pH	Sulphates	Alcohol	Quality
Fixed	Volatile	Citric Acid	Residual Sugar	Chlorides	Free	Total	Density	pH	Sulphates	Alcohol	Quality
2.091864	-1.3093333	-0.8915981	0.3715636	-3.411248	2.397895	4.143135	-0.009242581	1.095273	-0.5798185	2.484907	5
2.151762	-1.4696760	-0.9162907	1.4350845	-3.352407	2.833213	4.691348	-0.005314095	1.144223	-0.6348783	2.272126	5
2.066863	-1.7147984	-0.9942523	0.1823216	-3.218876	2.772589	4.317488	-0.008032172	1.156881	-0.4620355	2.379546	5
2.116256	-0.8675006	-0.4780358	2.9575111	-3.218876	3.713572	5.147494	0.000199980	1.091923	-0.4004776	2.272126	5
1.871802	-1.1711830	-1.9661129	2.0149030	-3.123566	3.526361	4.890349	-0.004510155	1.169381	-0.6931472	2.251292	5
2.028148	-0.4004776	-1.9661129	0.4054651	-2.603690	3.218876	5.123964	-0.006319929	1.115142	-0.6733446	2.230014	5

After applying a log transformation, it’s common to encounter values like NaN (Not a Number), NA (missing values), or Inf (infinity). To ensure our models can process the data effectively, we should check for these potential issues before fitting the model. A quick summary of the preprocessed data can reveal their presence.

Show the code

log_sc_preprocessed <- prep(log_sc_rec) |> 
  juice()

skimr::skim(log_sc_preprocessed) |> 
  gt() |> 
  as_raw_html()

Table 5: Preview of Preprocessed data

skim_type	skim_variable	complete_rate	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
numeric	fixed_acidity	1	1.917645176	0.12159317	1.3350011	1.840549633	1.916922612	1.987874348	2.65324196	▁▃▇▁▁
numeric	volatile_acidity	1	-1.333889842	0.33357665	-2.5257286	-1.560647748	-1.347073648	-1.139434283	0.09531018	▁▅▇▂▁
numeric	citric_acid	1	-Inf	NaN	-Inf	-1.309333320	-1.139434283	-0.941608540	0.50681760	▁▁▁▇▁
numeric	residual_sugar	1	1.481001154	0.92310358	-0.5108256	0.530628251	1.648658626	2.292534757	4.18661984	▅▆▇▆▁
numeric	chlorides	1	-3.148717269	0.33152116	-4.7105307	-3.324236341	-3.146555163	-2.995732274	-1.06131650	▁▅▇▁▁
numeric	free_sulfur_dioxide	1	3.437736555	0.54690744	0.6931472	3.135494216	3.526360525	3.828641396	5.66642669	▁▁▇▆▁
numeric	total_sulfur_dioxide	1	4.880002528	0.33468233	2.1972246	4.691347882	4.897839800	5.122471438	6.08677473	▁▁▁▇▁
numeric	density	1	-0.005988129	0.00302027	-0.0129738	-0.008304386	-0.006289739	-0.003907625	0.03823946	▇▂▁▁▁
numeric	p_h	1	1.157981867	0.04671982	1.0006319	1.124929597	1.156881197	1.187843422	1.34025042	▁▅▇▂▁
numeric	sulphates	1	-0.739139995	0.22328883	-1.5141277	-0.891598119	-0.755022584	-0.597837001	0.07696104	▁▃▇▃▁
numeric	alcohol	1	2.345977687	0.11535902	2.0794415	2.251291799	2.341805806	2.433613355	2.65324196	▂▇▆▅▁
numeric	quality	1	5.882848392	0.87445631	3.0000000	5.000000000	6.000000000	6.000000000	9.00000000	▁▆▇▃▁

Based on Table 5, the citric_acid feature contains NaN and Inf values. These values can cause issues for machine learning models like KNN. To ensure our models can process the data effectively, we need to rewrite the preprocessing to resolve the issue by adding 1 to values of citric_acid.

Show the code

log_sc_rec2 <- formula_rec |> 
  step_mutate(citric_acid = citric_acid + 1) |> 
  step_log(all_predictors())

log_sc_rec2 |> 
  prep() |> 
  juice() |> 
  skimr::skim() |> 
  gt() |> 
  as_raw_html()

Table 6: Improved preprocessing

skim_type	skim_variable	complete_rate	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
numeric	fixed_acidity	1	1.917645176	0.12159317	1.3350011	1.840549633	1.916922612	1.987874348	2.65324196	▁▃▇▁▁
numeric	volatile_acidity	1	-1.333889842	0.33357665	-2.5257286	-1.560647748	-1.347073648	-1.139434283	0.09531018	▁▅▇▂▁
numeric	citric_acid	1	0.285212528	0.08850576	0.0000000	0.239016900	0.277631737	0.329303747	0.97832612	▁▇▁▁▁
numeric	residual_sugar	1	1.481001154	0.92310358	-0.5108256	0.530628251	1.648658626	2.292534757	4.18661984	▅▆▇▆▁
numeric	chlorides	1	-3.148717269	0.33152116	-4.7105307	-3.324236341	-3.146555163	-2.995732274	-1.06131650	▁▅▇▁▁
numeric	free_sulfur_dioxide	1	3.437736555	0.54690744	0.6931472	3.135494216	3.526360525	3.828641396	5.66642669	▁▁▇▆▁
numeric	total_sulfur_dioxide	1	4.880002528	0.33468233	2.1972246	4.691347882	4.897839800	5.122471438	6.08677473	▁▁▁▇▁
numeric	density	1	-0.005988129	0.00302027	-0.0129738	-0.008304386	-0.006289739	-0.003907625	0.03823946	▇▂▁▁▁
numeric	p_h	1	1.157981867	0.04671982	1.0006319	1.124929597	1.156881197	1.187843422	1.34025042	▁▅▇▂▁
numeric	sulphates	1	-0.739139995	0.22328883	-1.5141277	-0.891598119	-0.755022584	-0.597837001	0.07696104	▁▃▇▃▁
numeric	alcohol	1	2.345977687	0.11535902	2.0794415	2.251291799	2.341805806	2.433613355	2.65324196	▂▇▆▅▁
numeric	quality	1	5.882848392	0.87445631	3.0000000	5.000000000	6.000000000	6.000000000	9.00000000	▁▆▇▃▁

Our efforts in rewriting the preprocessing step have paid off! As shown in Table 6, the citric_acid feature is now free of NA, NAN, and Inf. We can now proceed with the model training process using the preprocessed data from Table 6.

Workflow

Now that we’ve chosen our feature engineering techniques and defined the models we’ll be using, it’s time to bring them together. To capture this entire workflow, we’ll create a workflow object

Think of this workflow object like a recipe book. It will hold all the ingredients (the recipe objects specifying the feature engineering steps) and the instructions for each dish (the model specifications). This allows us to easily apply both recipe options (feature engineering approaches) to each model we’re testing.

The workflow_set() function will be our handy tool for creating this comprehensive recipe book. It ensures we can seamlessly combine different feature engineering choices with various models, allowing for a more thorough analysis.

This approach allows us to explore different combinations and see which recipe (feature engineering) works best for each model in predicting white wine quality.

Show the code

wf_object <- workflow_set(
  preproc = list(
    no_preproc = formula_rec,
    log_zv = log_sc_rec2
  ),
  model = list(
    glm = glm_spec,
    dec_tree = dec_tree_spec,
    rf = rand_for_spec,
    knn = knn_spec
  ),
  cross = TRUE
)

Finding the Perfect Settings: Parameter Tuning

We’ve chosen our models and put together our workflow, but there’s one more step before we can unleash them on the wine data. Each model has its own set of “knobs” we can adjust to fine-tune its performance. These knobs are called “hyperparameters,” and finding the right settings for them can significantly impact how well the model predicts wine quality.

Imagine you’re baking a cake. You have a recipe to follow (the model in this case), but you can adjust things like the oven temperature or the baking time (the hyperparameters) to get the perfect result. Here, we’ll use a technique called “grid search” to explore different combinations of these hyperparameters for each model.

To make this process faster, we’ll use a package called doParallel. This helps us leverage multiple processing cores on your computer, essentially dividing the work and speeding things up.

The workflow_map() function comes in handy again. It allows us to apply the grid search technique to all the models we defined in our workflow. We’ll also set verbose=TRUE during this step, which means the computer will show us the progress (although we won’t include those details in this explanation).

By exploring different hyperparameter combinations, we aim to find the settings that make each model perform at its best when predicting white wine quality.

Show the code

doParallel::registerDoParallel(cores = 8)


wf_tune_fit <- workflow_map(
  wf_object,
  fn = "tune_grid",
  grid = 10,
  resamples = wine_folds,
  seed = 11,
  verbose = TRUE
)

doParallel::stopImplicitCluster()

Tune Result

Show the code

autoplot(wf_tune_fit) +
  theme(
    legend.position = "bottom"
  ) +
  theme_minimal()

Figure 5 provides a visual breakdown of how the different models performed in predicting white wine quality. We used two key metrics to evaluate their success: RMSE and R-Squared.

The clear winner based on these metrics is the Random Forest Model! It achieved the lowest RMSE and highest R-Squared score, indicating both high accuracy (low error) and a strong correlation between its predictions and the actual wine quality score.

Following closely behind is the KNN (K-Nearest Neighbors) model. While it performs well, there’s a noticeable gap between its performance and Random Forest’s. This suggests that Random Forest might be a more robust choice for predicting wine quality in this particular dataset.

Examining Specific Model Results

As Figure 5 revealed, the Random Forest model is the top performer. To gain a better understanding, let’s take a closer look at the tuning process result.

Show the code

top_model <- rank_results(wf_tune_fit, rank_metric = "rmse", select_best = TRUE)

top_model |> 
  select(-c(preprocessor, n)) |> 
  mutate(
    .config = str_remove_all(.config, "Preprocessor1_Model"),
    wflow_id = case_when(
      str_detect(wflow_id, "no_preproc") ~ "No Preprocessing",
      .default = "Log + scale transformation"
    ),
    model = str_replace_all(model, "_", " "),
    model = str_to_title(model)
  ) |> 
  pivot_wider(
    id_cols = c(wflow_id, model, .config, rank),
    names_from = .metric,
    values_from = mean
  ) |> 
  gt() |> 
  cols_label(
    wflow_id = "Recipe",
    .config = "Resample Number",
    model = "Model Type",
    rank = "Rank",
    rmse = "RMSE",
    rsq = "R-Squared"
  ) |> 
  tab_spanner(
    columns = rmse:rsq,
    label = "Evaluation Metric"
  ) |> 
  tab_header(title = "Top Performing Models") |> 
  opt_stylize(
    style = 3,
    color = "blue",
    add_row_striping = TRUE
  ) |> 
  as_raw_html()

Table 7: Ranks of the model, random forest without preprocessing is ranked highest

Top Performing Models
Recipe	Model Type	Resample Number	Rank	Evaluation Metric
Recipe	Model Type	Resample Number	Rank	RMSE	R-Squared
No Preprocessing	Rand Forest	08	1	0.6038689	0.5265060
Log + scale transformation	Rand Forest	08	2	0.6039895	0.5262881
No Preprocessing	Nearest Neighbor	05	3	0.6119481	0.5115739
Log + scale transformation	Nearest Neighbor	05	4	0.6126732	0.5109407
No Preprocessing	Decision Tree	04	5	0.7286573	0.3182304
Log + scale transformation	Decision Tree	04	6	0.7286573	0.3182304
Log + scale transformation	Linear Reg	1	7	0.7372858	0.2896381
No Preprocessing	Linear Reg	1	8	0.7431961	0.2783234

Table 7 offers a comprehensive overview of each model’s performance, providing valuable insights beyond the visual representation in Figure 5. Here’s what we can learn from this table:

Model Performance: It clearly shows the best performing model based on the chosen evaluation metrics (e.g., RMSE, R-squared). This allows for a quick comparison and identification of the champion model.
Preprocessing Status: The table indicates whether each model was trained on data that underwent preprocessing (like feature scaling) or not. This information helps us understand if preprocessing played a role in the model’s performance.
Resample Selection: It specifies the resample object used to select the best performing model version during the tuning process. This provides context about the training and validation data used for each model.

Preprocessing does not improve the performance of all models except the Generalized Linear Model which showed slight improvement.

Next we get the workflow object and finalize the model

Finalize Model

Finalizing the model involves few steps:

extract the workflow object of the best model which id is no_preproc_rf

Show the code

rf_model <- wf_tune_fit |> 
  extract_workflow(id = "no_preproc_rf")

rf_model

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = 1000
  min_n = tune()

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger

extract the workflow tuning results on the resamples and select the best metric

Show the code

rf_workflow_res <- wf_tune_fit |> 
  extract_workflow_set_result("no_preproc_rf")

best_tune <- select_best(rf_workflow_res, metric = "rmse")
best_tune |> 
  gt() |> 
  as_raw_html()

Table 8: Best Tuning Parameter

mtry	min_n	.config
7	6	Preprocessor1_Model08

The best tuning parameter is seen in Table 8. We use these values of the parameter with our recommended model with its recommended preprocessing and use the finalize_workflow() function to fit our final workflow.

Show the code

recommended_workflow <- finalize_workflow(
  rf_model,
  parameters = best_tune
)

recommended_workflow

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 7
  trees = 1000
  min_n = 6

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger

The result shows that our parameters from tuning have been inserted into the model automatically.

The last step involves fitting the workflow recommended from our analysis to the split data set.

Show the code

final_model <- last_fit(
  object = recommended_workflow,
  split = wine_split
)

Model Evaluation

We can see how well the model performs on the test data. We applied ceiling() function to round our predictions as we they are count data.

Show the code

final_model |> 
  collect_predictions() |> 
  mutate(
    .pred = case_when(
      .pred < 5 ~ floor(.pred),
      .default = ceiling(.pred)
    )
  ) |> 
  select(observed = quality, predicted = .pred) |> 
  pivot_longer(
    cols = c(predicted, observed),
    names_to = "type",
    values_to = "value"
  ) |> 
  ggplot(aes(value, fill = type)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Quality",
    y = "Count",
    title = "Model Performnce on Test Data",
    subtitle = "Model do not capture low values (3, 4, and 5) properly",
    fill = "Value"
  ) +
  scale_fill_viridis_d() +
  theme_minimal()

Feature Importance

Feature importance sheds light on the factors that are driving the decisions which our model makes and the one which might be irrelevant. This can provide a lot of valuable insights on what truly matters to white wine consumers. We can investigate this using the vip library.

Show the code

library(vip)

final_model |>
  extract_fit_parsnip() |> 
  vip(
    num_features = 10,
    geom = "col",
    aesthetics = list(
      fill = "tan",
      col = "orange",
      alpha = .6
    )
  ) +
  theme_minimal()

Apparently, alcohol is the number one feature.

Conclusion

This analysis explored the use of machine learning to predict white wine quality based on its chemical components. We compared four models: Generalized Linear Model (GLM), Decision Tree, Random Forest, and K-Nearest Neighbors (KNN). Through training and evaluation, the Random Forest model emerged as the top performer, demonstrating strong predictive capabilities.

It’s important to acknowledge a potential limitation. While Random Forest excels at continuous predictions, the actual quality scores might be assigned discrete labels (e.g., 6 = good, 8 = excellent). This difference in representation between the model’s predictions and the actual quality scores could lead to minor inaccuracies.

Despite this consideration, the Random Forest model has proven to be a powerful tool for predicting wine quality within the 0-10 range. Future explorations could involve:

Classification Models: Investigating models specifically designed for predicting discrete quality categories (e.g., good, bad) might provide even more accurate results when directly aligned with the scoring system. Feature Engineering: Exploring alternative feature engineering techniques could potentially further refine the model’s ability to capture the nuances of wine quality based on its chemical components.

This analysis has successfully demonstrated the potential of machine learning for white wine quality prediction. By understanding the strengths and limitations of different models and data considerations, we can continue to refine our approach and develop even more accurate models for wine quality assessment.

homepage