Show the code
library(pacman)
p_load(tidyverse, tidymodels, gt, kableExtra, patchwork)
Decoding Wine Quality with Machine Learning
Olamide Adu
May 3, 2024
Wine appreciation is an art form enjoyed by many. However, beyond the subjective experience of taste, there lies a science behind wine quality. Winemakers strive to produce exceptional vintages by carefully controlling various factors during the production process. This project explores the fascinating world of wine quality, specifically focusing on the ability to predict it based on measurable chemical properties.
This project aims to develop a model that can estimate the quality of a wine using its chemical composition. By analyzing features such as acidity, residual sugar content, and sulfur dioxide levels, we aim to unlock valuable insights into the factors that contribute to a superior wine. This model can potentially benefit wine producers, retailers, and even consumers seeking a more informed approach to wine selection.
The data for this project was obtained from the UCI Machine Learning Repository, a renowned resource for publicly available datasets. This repository offers a wealth of datasets for various machine learning tasks, making it a valuable resource for researchers and data scientists.
The specific dataset we utilized is titled Wine Quality and focuses on Portuguese Vinho Verde wines.
Vinho Verde is a light-bodied wine known for its crisp acidity and refreshing taste. The dataset encompasses two separate datasets, one for red wines and another for white wines. For this project, we chose to focus on the white wine dataset.
The dataset contains more than 4500 wine samples, each characterized by 11 chemical properties measured during the winemaking process. These features include factors like:
Fixed Acidity: Level of acidity due to tartaric acid.
Volatile Acidity: Level of acidity due to volatile acids like acetic acid.
Alcohol: Percentage of alcohol content by volume
Residual Sugar: Amount of remaining sugar after fermentation.
pH: Measure of acidity on a logarithmic scale.
Citric Acid: Minute quantity naturally present in grapes. Winemakers may add small amounts to increase tartness and prevent haze formation.
Chlorides: Level of chloride salts that can influence a wine’s saltiness and bitterness.
Free Sulfur Dioxide: Amount of unbound sulfur dioxide (SO2) gas, a preservative commonly used to prevent spoilage by bacteria and oxidation.
Total Sulfur Dioxide: The total level of SO2 gas, including both free and bound forms.
Density: The relative density of the wine compared to water. It can be an indicator of the wine’s alcohol content and sugar level.
Sulphates: The level of sulfate salts, which can influence a wine’s overall mouth-feel and perception of bitterness.
Quality: This is the target variable, a score between 0 and 10 representing the perceived sensory quality of the wine according to human experts. This score serves as the target variable for our machine learning model, allowing us to predict the quality of a new wine based on its chemical makeup.
To begin our analysis, we’ll import essential libraries using the pacman
package for efficient dependency management.
Next, we’ll import the wine quality dataset and get a glimpse of its contents. Here, we directly use the janitor
package because it’s only needed for this specific step. This approach avoids unnecessary library loading throughout the document.
A crucial step in data analysis, after defining our goals and collecting data, is to inspect its quality and structure. This involves examining the data to identify potential issues like missing values, inconsistencies, or formatting errors. Table 1 below gives a preview of our data.
Data Preview: Wine Quality for White Wine | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Acidity
|
Citric Acid | Residual Sugar | Chlorides |
Sulfur dioxide
|
Density | pH | Sulphates | Alcohol | Quality | ||
Fixed | Volatile | Free | Total | ||||||||
7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
Rows: 4,898
Columns: 12
$ fixed_acidity <dbl> 7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, 6.3, 8.1,…
$ volatile_acidity <dbl> 0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0…
$ citric_acid <dbl> 0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16, 0.36, 0…
$ residual_sugar <dbl> 20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.00, 20.70,…
$ chlorides <dbl> 0.045, 0.049, 0.050, 0.058, 0.058, 0.050, 0.045, …
$ free_sulfur_dioxide <dbl> 45, 14, 30, 47, 47, 30, 30, 45, 14, 28, 11, 17, 1…
$ total_sulfur_dioxide <dbl> 170, 132, 97, 186, 186, 97, 136, 170, 132, 129, 6…
$ density <dbl> 1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951, 0…
$ p_h <dbl> 3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3…
$ sulphates <dbl> 0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.47, 0.45, 0…
$ alcohol <dbl> 8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11…
$ quality <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 7, 5, 7, 6…
Using the glimpse
function we can get a high-level overview of the wine quality data and the data follows the expected structure in the data definition. We have 4898 rows and 12 columns. All the columns are numeric variables.
skimr::skim(wine_tbl) |>
select(-skim_type) |>
gt() |>
cols_label(
skim_variable = "Variable",
n_missing = "Missing Value",
complete_rate = "Complete",
numeric.mean = "Mean",
numeric.sd = "Standard Deviation",
numeric.p0 = "Min",
numeric.p25 = "1st Percentile",
numeric.p50 = "Median",
numeric.p75 = "3rd Percentile",
numeric.p100 = "Max",
numeric.hist = "Histogram"
) |>
tab_header(
title = "White Wine Quality Descriptive Statistics"
) |>
opt_stylize(
style = 3,
color = "gray"
) |>
as_raw_html()
White Wine Quality Descriptive Statistics | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Variable | Missing Value | Complete | Mean | Standard Deviation | Min | 1st Percentile | Median | 3rd Percentile | Max | Histogram |
fixed_acidity | 0 | 1 | 6.85478767 | 0.843868228 | 3.80000 | 6.3000000 | 6.80000 | 7.3000 | 14.20000 | ▁▇▁▁▁ |
volatile_acidity | 0 | 1 | 0.27824112 | 0.100794548 | 0.08000 | 0.2100000 | 0.26000 | 0.3200 | 1.10000 | ▇▅▁▁▁ |
citric_acid | 0 | 1 | 0.33419151 | 0.121019804 | 0.00000 | 0.2700000 | 0.32000 | 0.3900 | 1.66000 | ▇▆▁▁▁ |
residual_sugar | 0 | 1 | 6.39141486 | 5.072057784 | 0.60000 | 1.7000000 | 5.20000 | 9.9000 | 65.80000 | ▇▁▁▁▁ |
chlorides | 0 | 1 | 0.04577236 | 0.021847968 | 0.00900 | 0.0360000 | 0.04300 | 0.0500 | 0.34600 | ▇▁▁▁▁ |
free_sulfur_dioxide | 0 | 1 | 35.30808493 | 17.007137325 | 2.00000 | 23.0000000 | 34.00000 | 46.0000 | 289.00000 | ▇▁▁▁▁ |
total_sulfur_dioxide | 0 | 1 | 138.36065741 | 42.498064554 | 9.00000 | 108.0000000 | 134.00000 | 167.0000 | 440.00000 | ▂▇▂▁▁ |
density | 0 | 1 | 0.99402738 | 0.002990907 | 0.98711 | 0.9917225 | 0.99374 | 0.9961 | 1.03898 | ▇▂▁▁▁ |
p_h | 0 | 1 | 3.18826664 | 0.151000600 | 2.72000 | 3.0900000 | 3.18000 | 3.2800 | 3.82000 | ▁▇▇▂▁ |
sulphates | 0 | 1 | 0.48984688 | 0.114125834 | 0.22000 | 0.4100000 | 0.47000 | 0.5500 | 1.08000 | ▃▇▂▁▁ |
alcohol | 0 | 1 | 10.51426705 | 1.230620568 | 8.00000 | 9.5000000 | 10.40000 | 11.4000 | 14.20000 | ▃▇▆▃▁ |
quality | 0 | 1 | 5.87790935 | 0.885638575 | 3.00000 | 5.0000000 | 6.00000 | 6.0000 | 9.00000 | ▁▅▇▃▁ |
Table 2 summarizes the key characteristics of our data. It reveals:
No Missing Values: There are no missing data points, ensuring a complete dataset for our analysis.
Descriptive Statistics: Table 2 presents various descriptive statistics for each variable. This includes the minimum and maximum values highlighting the data’s range. Additionally, the first and third quartile provides an insight of the distribution of values in the data. Finally the mean and standard deviation indicate the average and spread of each variables relative to their mean.
Outliers: Initial exploration of the data suggests the presence of potential outliers in several features. Specifically, the maximum values for variables fixed_acidity
, volatile_acidity
, citric_acid
, residual_sugar
, chlorides
, free_sulphur_dioxide
, and total_sulfur_dioxide
seems deviate significantly from the rest of the distribution.
The next step is to statistically confirm these observations, we can’t rely solely on inspecting the maximum values. Instead, we’ll employ appropriate outlier detection tests. Due to the fairly large data size we have, we can use either of:
Grubb’s test
Rosner’s test
These outlier detection tests help us statistically determine if the observed extreme values are likely outliers within the dataset. Both tests listed here are particularly useful for identifying outliers in a dataset that is approximately normally distributed.
Before diving into formal tests, it’s important we get a sense of our data. We can visualize each variables of our data to inspect their distribution.
wine_tbl_longer <- wine_tbl |>
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "value"
)
wine_tbl_longer |>
filter(variable != "quality") |>
ggplot(aes(sample = value)) +
geom_qq(alpha = .2, col = "violetred4") +
geom_qq_line(col = "gray3") +
facet_wrap(~variable, scales = "free") +
labs(
x = "Theoretical distribution",
y = "Sample",
title = "Variables are not normally distributed"
) +
theme_minimal()
wine_tbl_longer |>
filter(variable !="quality") |>
ggplot(aes(value)) +
facet_wrap(~variable, scales = "free") +
geom_histogram(fill = "burlywood", col = "black") +
geom_density(
stat = "bin",
col = "violetred"
) +
labs(
x = "Value",
y = "Frequency",
title = "Variables are fairly normal"
) +
theme_minimal()
Figure 1 shows some fairly normal data, but it won’t be bad to test for normality using Shapiro-wilk’s test. We can also use the qqPlot()
function from the car
package which I really like to test for the normality.
normality_test <- wine_tbl |>
map(~tidy(shapiro.test(.))) |>
bind_rows(.id = "variable") |>
select(-method) |>
janitor::clean_names() |>
mutate(
variable = str_to_title(str_replace_all(variable, "_", " "))
)
actual_colnames <- colnames(normality_test)
desired_colnames <- str_to_title(names(normality_test))
names(desired_colnames) <- actual_colnames
normality_test |>
filter(variable != "Quality") |>
gt() |>
cols_label(.list = desired_colnames) |>
tab_header(
title = "Shapiro-wilk Normality Test"
) |>
opt_stylize(
style = 3,
color = "gray"
) |>
as_raw_html()
Shapiro-wilk Normality Test | ||
---|---|---|
Variable | Statistic | P_value |
Fixed Acidity | 0.9765615 | 1.150151e-27 |
Volatile Acidity | 0.9045497 | 4.586797e-48 |
Citric Acid | 0.9222473 | 1.013179e-44 |
Residual Sugar | 0.8845686 | 2.820710e-51 |
Chlorides | 0.5908084 | 2.140584e-75 |
Free Sulfur Dioxide | 0.9420691 | 3.857845e-40 |
Total Sulfur Dioxide | 0.9890146 | 4.383453e-19 |
Density | 0.9548048 | 1.780895e-36 |
P H | 0.9880965 | 6.505521e-20 |
Sulphates | 0.9516094 | 1.821979e-37 |
Alcohol | 0.9553024 | 2.569014e-36 |
[1] 1251 1256
[1] 3919 4504
Table 3 shows all the variables do not follow a normal distribution, Figure 2 stresses this and we can clearly see the presence of outlier.
While our initial exploration identified potential outliers in the data, it’s important to consider them in the context of wine making expertise. Domain knowledge from experienced professionals like Cathy Howard suggests that the expected range for residual sugar in wine can vary widely, from as low as 0.3 g/L to over 35 g/L (as referenced in this post).
This information helps us understand that some seemingly extreme values in our data might actually be plausible for certain wine types. Similarly, consulting resources like Waterhouse’s post on “What’s in Wine”. can provide valuable insights into the expected ranges for other wine components.
Given the new information about our data and the supposed outliers we investigated we can proceed with our analysis thanks to domain knowledge.
There are two ways we can proceed with our analysis based on our response variable, we can either take quality as a categorical variable or a continuous variable, I prefer and will proceed with the later.
wine_tbl |>
ggplot(aes(quality)) +
geom_histogram(
binwidth = 1,
fill = "violetred4"
) +
geom_density(
stat = "count",
col = "orangered"
) +
scale_x_continuous(
breaks = seq(1, 10, 1)
) +
labs(
x = "Quality",
y = "Frequency",
title = "White wine quality frequency"
) +
theme_classic() +
theme(
plot.title = element_text(hjust = .5, face = "bold")
)
The distribution of wine ratings in Figure 3 shows a range between 3 and 9. This is the same as what we have in Table 2. White wines rated 6 and 5 are the most occurring wine.
It is important to see how the different variables relates to wine quality.
Figure 4 shows some noteworthy relationships between the variables. Here are the key observations:
Overall there’s a fair distribution of correlation across all the variables ranging from as low as ± 0.1 to ± 0.8.
Imagine you’re a wine connoisseur, swirling a glass and trying to guess how good a wine will be. That’s kind of what we’re setting out to do here, but with the help of some brainy computer programs!
We’re going to test out three different “recipes” to predict white wine quality. Think of them like different ways of mixing ingredients (the info we have about the wine) to come up with a final answer (how good the wine is). Here’s the lowdown on each recipe:
Generalized Linear Model : his is a foundational recipe, like a classic soup base. It establishes a relationship between the independent variables (like acidity, sugar content) and the quality score using a specific link function. For wine quality which might be a score or rank, a linear regression within a GLM framework would likely be ideal.
Decision Tree Model: This recipe is more like a detective story with a bunch of “yes or no” questions. It splits the wine data based on different factors, asking questions like “is the alcohol content high?” or “is it a certain grape variety?” to guess the quality.
Random Forest: Imagine having a whole bunch of detectives working together! This recipe combines lots of decision tree detectives, making it more accurate and less likely to get stuck on one idea.
Nearest Neighbor: This recipe is like asking other wine experts for their opinion. It finds similar wines in the data based on the available information and predicts the quality based on how those similar wines were rated.
To ensure robust model evaluation, we’ll employ a data splitting strategy. This involves dividing the dataset into two portions:
Training Set (80%): This larger portion of the data will be used to train the models. The training process involves fitting the model parameters to the data, allowing the model to learn the underlying relationships between the features (independent variables) and the target variable (wine quality).
Test Set (20%): This smaller portion of the data will be reserved for testing the final model performance. We will not use the test set for training the models in any way. Instead, the trained model will be applied to the test set, and its predictions will be compared to the actual quality scores to assess the model’s generalizability and accuracy on unseen data.
Rationale for Splitting:
The 80/20 split is a common choice for data splitting, but depending on the size of your dataset, a 70/30 split might also be acceptable. The key idea is to have a sufficient amount of data for both training and testing purposes. The training set allows the model to learn, while the test set provides an unbiased assessment of how well the model performs on new data. To ensure balance in the distribution of the response variable (quality) between the test and the training data, we’ll execute the split taking into account the distribution of the response variable, as seen in Figure 3 the response variable is unevenly distributed.
To ensure reproducibility of the code, we’ll also set.seed into 34433
We also need to evaluate the developed models, for this, we will use a cross-fold validation.
To ensure a good modeling workflow, we specify our model engines, and prepare to tune some parameters, for example, the decision tree model specification, we do not know the tree depth that would be optimal, so we tune this parameter. Other parameters which cannot be predetermine when fitting our model are also tuned.
glm_spec <- linear_reg() |> # OLS specification
set_engine("glm", family = "poisson") |>
set_mode("regression")
dec_tree_spec <- decision_tree( # Decision tree specification
tree_depth = tune(),
cost_complexity = tune()
) |>
set_engine("rpart") |>
set_mode("regression")
rand_for_spec <- rand_forest( # Random forest specification
trees = 1000,
mtry = tune(),
min_n = tune()
) |>
set_engine("ranger", importance = "impurity") |>
set_mode("regression")
knn_spec <- nearest_neighbor(
neighbors = tune(),
dist_power = tune(),
weight_func = tune()
) |>
set_engine("kknn") |>
set_mode("regression")
glm_spec
Linear Regression Model Specification (regression)
Engine-Specific Arguments:
family = poisson
Computational engine: glm
Decision Tree Model Specification (regression)
Main Arguments:
cost_complexity = tune()
tree_depth = tune()
Computational engine: rpart
Random Forest Model Specification (regression)
Main Arguments:
mtry = tune()
trees = 1000
min_n = tune()
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
K-Nearest Neighbor Model Specification (regression)
Main Arguments:
neighbors = tune()
weight_func = tune()
dist_power = tune()
Computational engine: kknn
Now that we’ve chosen our recipe testers (models), let’s take a look at the ingredients (features) we’ll be using. Ideally, we want the ingredients to be in a format that the models can easily understand and use for their predictions. This process of preparing the data is called feature engineering.
There are many resources available to help with feature engineering, like this one link. that provides a reference for different models and recommended pre-processing techniques.
In our case, we’ll start by testing the models on the features without any preprocessing. Some models, like the ones we’ve chosen, can handle data in its raw form and might not necessarily require specific transformations.
However, that doesn’t mean there’s no room for improvement! We can explore additional options like:
Log transformation: This can be helpful for features with skewed distributions, making them more symmetrical and easier for some models to work with.
Scaled transformations (standardization or normalization): This addresses features with differing scales and units.
Here, we assume no preprocessing.
For the second preprocessing, we do the following: - Perform a log transformation on the all predictors, step_log
.
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 11
── Operations
• Log transformation on: all_predictors()
• Zero variance filter on: all_predictors()
A preview of the preprocess object is given below Table 4
Preview of Preprocessed data | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Acidity
|
Citric Acid | Residual Sugar | Chlorides |
Sulfur dioxide
|
Density | pH | Sulphates | Alcohol | Quality | ||
Fixed | Volatile | Free | Total | ||||||||
2.091864 | -1.3093333 | -0.8915981 | 0.3715636 | -3.411248 | 2.397895 | 4.143135 | -0.009242581 | 1.095273 | -0.5798185 | 2.484907 | 5 |
2.151762 | -1.4696760 | -0.9162907 | 1.4350845 | -3.352407 | 2.833213 | 4.691348 | -0.005314095 | 1.144223 | -0.6348783 | 2.272126 | 5 |
2.066863 | -1.7147984 | -0.9942523 | 0.1823216 | -3.218876 | 2.772589 | 4.317488 | -0.008032172 | 1.156881 | -0.4620355 | 2.379546 | 5 |
2.116256 | -0.8675006 | -0.4780358 | 2.9575111 | -3.218876 | 3.713572 | 5.147494 | 0.000199980 | 1.091923 | -0.4004776 | 2.272126 | 5 |
1.871802 | -1.1711830 | -1.9661129 | 2.0149030 | -3.123566 | 3.526361 | 4.890349 | -0.004510155 | 1.169381 | -0.6931472 | 2.251292 | 5 |
2.028148 | -0.4004776 | -1.9661129 | 0.4054651 | -2.603690 | 3.218876 | 5.123964 | -0.006319929 | 1.115142 | -0.6733446 | 2.230014 | 5 |
After applying a log transformation, it’s common to encounter values like NaN (Not a Number), NA (missing values), or Inf (infinity). To ensure our models can process the data effectively, we should check for these potential issues before fitting the model. A quick summary of the preprocessed data can reveal their presence.
skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
---|---|---|---|---|---|---|---|---|---|---|---|
numeric | fixed_acidity | 0 | 1 | 1.917645176 | 0.12159317 | 1.3350011 | 1.840549633 | 1.916922612 | 1.987874348 | 2.65324196 | ▁▃▇▁▁ |
numeric | volatile_acidity | 0 | 1 | -1.333889842 | 0.33357665 | -2.5257286 | -1.560647748 | -1.347073648 | -1.139434283 | 0.09531018 | ▁▅▇▂▁ |
numeric | citric_acid | 0 | 1 | -Inf | NaN | -Inf | -1.309333320 | -1.139434283 | -0.941608540 | 0.50681760 | ▁▁▁▇▁ |
numeric | residual_sugar | 0 | 1 | 1.481001154 | 0.92310358 | -0.5108256 | 0.530628251 | 1.648658626 | 2.292534757 | 4.18661984 | ▅▆▇▆▁ |
numeric | chlorides | 0 | 1 | -3.148717269 | 0.33152116 | -4.7105307 | -3.324236341 | -3.146555163 | -2.995732274 | -1.06131650 | ▁▅▇▁▁ |
numeric | free_sulfur_dioxide | 0 | 1 | 3.437736555 | 0.54690744 | 0.6931472 | 3.135494216 | 3.526360525 | 3.828641396 | 5.66642669 | ▁▁▇▆▁ |
numeric | total_sulfur_dioxide | 0 | 1 | 4.880002528 | 0.33468233 | 2.1972246 | 4.691347882 | 4.897839800 | 5.122471438 | 6.08677473 | ▁▁▁▇▁ |
numeric | density | 0 | 1 | -0.005988129 | 0.00302027 | -0.0129738 | -0.008304386 | -0.006289739 | -0.003907625 | 0.03823946 | ▇▂▁▁▁ |
numeric | p_h | 0 | 1 | 1.157981867 | 0.04671982 | 1.0006319 | 1.124929597 | 1.156881197 | 1.187843422 | 1.34025042 | ▁▅▇▂▁ |
numeric | sulphates | 0 | 1 | -0.739139995 | 0.22328883 | -1.5141277 | -0.891598119 | -0.755022584 | -0.597837001 | 0.07696104 | ▁▃▇▃▁ |
numeric | alcohol | 0 | 1 | 2.345977687 | 0.11535902 | 2.0794415 | 2.251291799 | 2.341805806 | 2.433613355 | 2.65324196 | ▂▇▆▅▁ |
numeric | quality | 0 | 1 | 5.882848392 | 0.87445631 | 3.0000000 | 5.000000000 | 6.000000000 | 6.000000000 | 9.00000000 | ▁▆▇▃▁ |
Based on Table 5, the citric_acid feature contains NaN and Inf values. These values can cause issues for machine learning models like KNN. To ensure our models can process the data effectively, we need to rewrite the preprocessing to resolve the issue by adding 1 to values of citric_acid.
skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
---|---|---|---|---|---|---|---|---|---|---|---|
numeric | fixed_acidity | 0 | 1 | 1.917645176 | 0.12159317 | 1.3350011 | 1.840549633 | 1.916922612 | 1.987874348 | 2.65324196 | ▁▃▇▁▁ |
numeric | volatile_acidity | 0 | 1 | -1.333889842 | 0.33357665 | -2.5257286 | -1.560647748 | -1.347073648 | -1.139434283 | 0.09531018 | ▁▅▇▂▁ |
numeric | citric_acid | 0 | 1 | 0.285212528 | 0.08850576 | 0.0000000 | 0.239016900 | 0.277631737 | 0.329303747 | 0.97832612 | ▁▇▁▁▁ |
numeric | residual_sugar | 0 | 1 | 1.481001154 | 0.92310358 | -0.5108256 | 0.530628251 | 1.648658626 | 2.292534757 | 4.18661984 | ▅▆▇▆▁ |
numeric | chlorides | 0 | 1 | -3.148717269 | 0.33152116 | -4.7105307 | -3.324236341 | -3.146555163 | -2.995732274 | -1.06131650 | ▁▅▇▁▁ |
numeric | free_sulfur_dioxide | 0 | 1 | 3.437736555 | 0.54690744 | 0.6931472 | 3.135494216 | 3.526360525 | 3.828641396 | 5.66642669 | ▁▁▇▆▁ |
numeric | total_sulfur_dioxide | 0 | 1 | 4.880002528 | 0.33468233 | 2.1972246 | 4.691347882 | 4.897839800 | 5.122471438 | 6.08677473 | ▁▁▁▇▁ |
numeric | density | 0 | 1 | -0.005988129 | 0.00302027 | -0.0129738 | -0.008304386 | -0.006289739 | -0.003907625 | 0.03823946 | ▇▂▁▁▁ |
numeric | p_h | 0 | 1 | 1.157981867 | 0.04671982 | 1.0006319 | 1.124929597 | 1.156881197 | 1.187843422 | 1.34025042 | ▁▅▇▂▁ |
numeric | sulphates | 0 | 1 | -0.739139995 | 0.22328883 | -1.5141277 | -0.891598119 | -0.755022584 | -0.597837001 | 0.07696104 | ▁▃▇▃▁ |
numeric | alcohol | 0 | 1 | 2.345977687 | 0.11535902 | 2.0794415 | 2.251291799 | 2.341805806 | 2.433613355 | 2.65324196 | ▂▇▆▅▁ |
numeric | quality | 0 | 1 | 5.882848392 | 0.87445631 | 3.0000000 | 5.000000000 | 6.000000000 | 6.000000000 | 9.00000000 | ▁▆▇▃▁ |
Our efforts in rewriting the preprocessing step have paid off! As shown in Table 6, the citric_acid feature is now free of NA, NAN, and Inf. We can now proceed with the model training process using the preprocessed data from Table 6.
Now that we’ve chosen our feature engineering techniques and defined the models we’ll be using, it’s time to bring them together. To capture this entire workflow, we’ll create a workflow object
Think of this workflow object like a recipe book. It will hold all the ingredients (the recipe objects specifying the feature engineering steps) and the instructions for each dish (the model specifications). This allows us to easily apply both recipe options (feature engineering approaches) to each model we’re testing.
The workflow_set() function will be our handy tool for creating this comprehensive recipe book. It ensures we can seamlessly combine different feature engineering choices with various models, allowing for a more thorough analysis.
This approach allows us to explore different combinations and see which recipe (feature engineering) works best for each model in predicting white wine quality.
We’ve chosen our models and put together our workflow, but there’s one more step before we can unleash them on the wine data. Each model has its own set of “knobs” we can adjust to fine-tune its performance. These knobs are called “hyperparameters,” and finding the right settings for them can significantly impact how well the model predicts wine quality.
Imagine you’re baking a cake. You have a recipe to follow (the model in this case), but you can adjust things like the oven temperature or the baking time (the hyperparameters) to get the perfect result. Here, we’ll use a technique called “grid search” to explore different combinations of these hyperparameters for each model.
To make this process faster, we’ll use a package called doParallel
. This helps us leverage multiple processing cores on your computer, essentially dividing the work and speeding things up.
The workflow_map()
function comes in handy again. It allows us to apply the grid search technique to all the models we defined in our workflow. We’ll also set verbose=TRUE
during this step, which means the computer will show us the progress (although we won’t include those details in this explanation).
By exploring different hyperparameter combinations, we aim to find the settings that make each model perform at its best when predicting white wine quality.
Figure 5 provides a visual breakdown of how the different models performed in predicting white wine quality. We used two key metrics to evaluate their success: RMSE and R-Squared.
The clear winner based on these metrics is the Random Forest Model! It achieved the lowest RMSE and highest R-Squared score, indicating both high accuracy (low error) and a strong correlation between its predictions and the actual wine quality score.
Following closely behind is the KNN (K-Nearest Neighbors) model. While it performs well, there’s a noticeable gap between its performance and Random Forest’s. This suggests that Random Forest might be a more robust choice for predicting wine quality in this particular dataset.
As Figure 5 revealed, the Random Forest model is the top performer. To gain a better understanding, let’s take a closer look at the tuning process result.
top_model <- rank_results(wf_tune_fit, rank_metric = "rmse", select_best = TRUE)
top_model |>
select(-c(preprocessor, n)) |>
mutate(
.config = str_remove_all(.config, "Preprocessor1_Model"),
wflow_id = case_when(
str_detect(wflow_id, "no_preproc") ~ "No Preprocessing",
.default = "Log + scale transformation"
),
model = str_replace_all(model, "_", " "),
model = str_to_title(model)
) |>
pivot_wider(
id_cols = c(wflow_id, model, .config, rank),
names_from = .metric,
values_from = mean
) |>
gt() |>
cols_label(
wflow_id = "Recipe",
.config = "Resample Number",
model = "Model Type",
rank = "Rank",
rmse = "RMSE",
rsq = "R-Squared"
) |>
tab_spanner(
columns = rmse:rsq,
label = "Evaluation Metric"
) |>
tab_header(title = "Top Performing Models") |>
opt_stylize(
style = 3,
color = "blue",
add_row_striping = TRUE
) |>
as_raw_html()
Top Performing Models | |||||
---|---|---|---|---|---|
Recipe | Model Type | Resample Number | Rank |
Evaluation Metric
|
|
RMSE | R-Squared | ||||
No Preprocessing | Rand Forest | 08 | 1 | 0.6038689 | 0.5265060 |
Log + scale transformation | Rand Forest | 08 | 2 | 0.6039895 | 0.5262881 |
No Preprocessing | Nearest Neighbor | 05 | 3 | 0.6119481 | 0.5115739 |
Log + scale transformation | Nearest Neighbor | 05 | 4 | 0.6126732 | 0.5109407 |
No Preprocessing | Decision Tree | 04 | 5 | 0.7286573 | 0.3182304 |
Log + scale transformation | Decision Tree | 04 | 6 | 0.7286573 | 0.3182304 |
Log + scale transformation | Linear Reg | 1 | 7 | 0.7372858 | 0.2896381 |
No Preprocessing | Linear Reg | 1 | 8 | 0.7431961 | 0.2783234 |
Table 7 offers a comprehensive overview of each model’s performance, providing valuable insights beyond the visual representation in Figure 5. Here’s what we can learn from this table:
Preprocessing does not improve the performance of all models except the Generalized Linear Model which showed slight improvement.
Next we get the workflow object and finalize the model
Finalizing the model involves few steps:
no_preproc_rf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)
Main Arguments:
mtry = tune()
trees = 1000
min_n = tune()
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
mtry | min_n | .config |
---|---|---|
7 | 6 | Preprocessor1_Model08 |
finalize_workflow()
function to fit our final workflow.══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)
Main Arguments:
mtry = 7
trees = 1000
min_n = 6
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
The result shows that our parameters from tuning have been inserted into the model automatically.
We can see how well the model performs on the test data. We applied ceiling()
function to round our predictions as we they are count data.
final_model |>
collect_predictions() |>
mutate(
.pred = case_when(
.pred < 5 ~ floor(.pred),
.default = ceiling(.pred)
)
) |>
select(observed = quality, predicted = .pred) |>
pivot_longer(
cols = c(predicted, observed),
names_to = "type",
values_to = "value"
) |>
ggplot(aes(value, fill = type)) +
geom_bar(position = "dodge") +
labs(
x = "Quality",
y = "Count",
title = "Model Performnce on Test Data",
subtitle = "Model do not capture low values (3, 4, and 5) properly",
fill = "Value"
) +
scale_fill_viridis_d() +
theme_minimal()
Feature importance sheds light on the factors that are driving the decisions which our model makes and the one which might be irrelevant. This can provide a lot of valuable insights on what truly matters to white wine consumers. We can investigate this using the vip
library.
Apparently, alcohol is the number one feature.
This analysis explored the use of machine learning to predict white wine quality based on its chemical components. We compared four models: Generalized Linear Model (GLM), Decision Tree, Random Forest, and K-Nearest Neighbors (KNN). Through training and evaluation, the Random Forest model emerged as the top performer, demonstrating strong predictive capabilities.
It’s important to acknowledge a potential limitation. While Random Forest excels at continuous predictions, the actual quality scores might be assigned discrete labels (e.g., 6 = good, 8 = excellent). This difference in representation between the model’s predictions and the actual quality scores could lead to minor inaccuracies.
Despite this consideration, the Random Forest model has proven to be a powerful tool for predicting wine quality within the 0-10 range. Future explorations could involve:
Classification Models: Investigating models specifically designed for predicting discrete quality categories (e.g., good, bad) might provide even more accurate results when directly aligned with the scoring system. Feature Engineering: Exploring alternative feature engineering techniques could potentially further refine the model’s ability to capture the nuances of wine quality based on its chemical components.
This analysis has successfully demonstrated the potential of machine learning for white wine quality prediction. By understanding the strengths and limitations of different models and data considerations, we can continue to refine our approach and develop even more accurate models for wine quality assessment.