The aim of this blog post is to use a decision tree machine learning algorithm to classify dry beans based on a set of features. The data comes from a Kaggle challenge dataset. Images of 13,611 grains of 7 different registered dry bean varieties were taken with a high-resolution camera. A total of 16 features (12 dimensions and 4 shape forms) were obtained from the grains. The features of the data are:
Dry Beans Featured.

| Feature | Description |
|---|---|
| Area (A) | The area of a bean zone and the number of pixels within its boundaries. |
| Perimeter (P) | Bean circumference is defined as the length of its border. |
| Major axis length (L) | The distance between the ends of the longest line that can be drawn from a bean. |
| Minor axis length (l) | The longest line that can be drawn from the bean while standing perpendicular to the main axis. |
| Aspect ratio (K) | Defines the relationship between L and l. |
| Eccentricity (Ec) | Eccentricity of the ellipse having the same moments as the region. |
| Convex area (C) | Number of pixels in the smallest convex polygon that can contain the area of a bean seed. |
| Equivalent diameter (Ed) | The diameter of a circle having the same area as a bean seed area. |
| Extent (Ex) | The ratio of the pixels in the bounding box to the bean area. |
| Solidity (S) | Also known as convexity. The ratio of the pixels in the convex shell to those found in beans. |
| Roundness (R) | Calculated with the following formula: (4πA)/(P^2) |
| Compactness (CO) | Measures the roundness of an object: Ed/L |
| ShapeFactor1 (SF1) | |
| ShapeFactor2 (SF2) | |
| ShapeFactor3 (SF3) | |
| ShapeFactor4 (SF4) | |
| Class | Seker, Barbunya, Bombay, Cali, Dermason, Horoz and Sira |
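To make the derived shape features concrete, here is a small worked example in base R. It is only a sketch: the measurements describe an idealized, perfectly circular "bean" with a radius of 100 pixels (made-up values, not taken from the dataset), and the aspect ratio is assumed to be L / l, since the table only says it relates the two axes.

```r
# Hypothetical measurements for an idealized circular "bean" (radius 100 px).
# These values are illustrative only and do not come from the dataset.
r <- 100
A <- pi * r^2      # area (A)
P <- 2 * pi * r    # perimeter (P)
L <- 2 * r         # major axis length (L)
l <- 2 * r         # minor axis length (l)

K  <- L / l                 # aspect ratio (assumed here to be L / l)
Ed <- sqrt(4 * A / pi)      # equivalent diameter: circle with the same area
R  <- (4 * pi * A) / P^2    # roundness, as given in the table
CO <- Ed / L                # compactness, as given in the table

shape_features <- c(K = K, Ed = Ed, R = R, CO = CO)
round(shape_features, 4)
```

For a perfect circle, the aspect ratio, roundness and compactness all equal 1, and the equivalent diameter equals the actual diameter. Real beans are elongated, so their roundness and compactness fall below 1.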
Load Packages
To begin, we load the necessary packages. I also set extra swatch colors, in case we need more than the default provided by ggthemr. I have recently fallen in love with the ggthemr package by Mikata-Project.
This is the first time I am seeing data with a .arff extension, so I immediately searched online to see if there is a package for importing .arff data in R. The best place to search for this is CRAN; Google will of course also give very good results, but I chose CRAN regardless. Fortunately, there is the farff package, developed by mlr-org since 2015. The package is pretty straightforward to use. Conveniently, it imports the data as a data.frame, which I then converted to a tibble.
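The import step can be sketched as follows. To keep the example self-contained, it first writes a tiny ARFF file with farff's `writeARFF()` and then reads it back with `readARFF()`. The two rows and the temporary file path are made up for illustration and are not the real bean data.

```r
library(farff)
library(tibble)

# Made-up demo rows so the example runs without the real file
demo_df <- data.frame(
  Area = c(30000, 150000),
  Perimeter = c(620.5, 1500.2),
  Class = factor(c("SEKER", "BOMBAY"))
)

# Write the demo data to a temporary .arff file
arff_path <- tempfile(fileext = ".arff")
writeARFF(demo_df, path = arff_path)

# readARFF() returns a plain data.frame ...
bean_df <- readARFF(arff_path)

# ... which can then be converted to a tibble
bean_tbl <- as_tibble(bean_df)
bean_tbl
```

For the real data, pass the path of the downloaded .arff file to `readARFF()` instead of the temporary file.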
Next is EDA, which will be quick and short. Table 1 gives a good summary of the data, including the data types and information on the completeness of the data. From the results in Table 1 (a) and Table 1 (b), the data is complete. Figure 1 shows the correlation matrix of the numeric variables.
```r
skim(bean_tbl)
```
Table 1: Data summary

(a)

| | |
|---|---|
| Name | bean_tbl |
| Number of rows | 13611 |
| Number of columns | 17 |
| Column type frequency: | |
| factor | 1 |
| numeric | 16 |
| Group variables | None |

(b) Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| class | 0 | 1 | FALSE | 7 | DER: 3546, SIR: 2636, SEK: 2027, HOR: 1928 |

(c) Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| area | 0 | 1 | 53048.28 | 29324.10 | 20420.00 | 36328.00 | 44652.00 | 61332.00 | 254616.00 | ▇▂▁▁▁ |
| perimeter | 0 | 1 | 855.28 | 214.29 | 524.74 | 703.52 | 794.94 | 977.21 | 1985.37 | ▇▆▁▁▁ |
| major_axis_length | 0 | 1 | 320.14 | 85.69 | 183.60 | 253.30 | 296.88 | 376.50 | 738.86 | ▇▆▂▁▁ |
| minor_axis_length | 0 | 1 | 202.27 | 44.97 | 122.51 | 175.85 | 192.43 | 217.03 | 460.20 | ▇▇▁▁▁ |
| aspect_ration | 0 | 1 | 1.58 | 0.25 | 1.02 | 1.43 | 1.55 | 1.71 | 2.43 | ▂▇▅▂▁ |
| eccentricity | 0 | 1 | 0.75 | 0.09 | 0.22 | 0.72 | 0.76 | 0.81 | 0.91 | ▁▁▂▇▇ |
| convex_area | 0 | 1 | 53768.20 | 29774.92 | 20684.00 | 36714.50 | 45178.00 | 62294.00 | 263261.00 | ▇▂▁▁▁ |
| equiv_diameter | 0 | 1 | 253.06 | 59.18 | 161.24 | 215.07 | 238.44 | 279.45 | 569.37 | ▇▆▁▁▁ |
| extent | 0 | 1 | 0.75 | 0.05 | 0.56 | 0.72 | 0.76 | 0.79 | 0.87 | ▁▁▅▇▂ |
| solidity | 0 | 1 | 0.99 | 0.00 | 0.92 | 0.99 | 0.99 | 0.99 | 0.99 | ▁▁▁▁▇ |
| roundness | 0 | 1 | 0.87 | 0.06 | 0.49 | 0.83 | 0.88 | 0.92 | 0.99 | ▁▁▂▇▇ |
| compactness | 0 | 1 | 0.80 | 0.06 | 0.64 | 0.76 | 0.80 | 0.83 | 0.99 | ▂▅▇▂▁ |
| shape_factor1 | 0 | 1 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 | ▁▃▇▃▁ |
| shape_factor2 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▇▇▇▃▁ |
| shape_factor3 | 0 | 1 | 0.64 | 0.10 | 0.41 | 0.58 | 0.64 | 0.70 | 0.97 | ▂▇▇▃▁ |
| shape_factor4 | 0 | 1 | 1.00 | 0.00 | 0.95 | 0.99 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
```r
# Correlation matrix of the 16 numeric columns, drawn as the lower triangle
corrplot(
  cor(bean_tbl[, 1:16]),
  method = "circle",
  addrect = 2,
  pch = 5,
  title = "Correlation of Numeric Features in Dry Bean Data",
  type = "lower"
)
```
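A correlation plot is easy to scan, but it can also help to list the strongly correlated pairs as numbers. The base R sketch below does this on a small simulated stand-in for `bean_tbl` (the values are generated, not the real data), and the 0.9 cutoff is an arbitrary choice for illustration.

```r
# Simulated stand-in for bean_tbl: three size features that move together
# plus one unrelated shape feature. These are not the real measurements.
set.seed(1)
area <- runif(200, 20000, 250000)
toy_tbl <- data.frame(
  area = area,
  convex_area = area * 1.01 + rnorm(200, sd = 500),
  equiv_diameter = sqrt(4 * area / pi),
  extent = runif(200, 0.55, 0.87)
)

cor_mat <- cor(toy_tbl)

# Keep each pair once by blanking the upper triangle and the diagonal
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA

# Reshape to one row per pair and keep the strongly correlated ones
cor_pairs <- as.data.frame(as.table(cor_mat), responseName = "r")
cor_pairs <- cor_pairs[!is.na(cor_pairs$r) & abs(cor_pairs$r) > 0.9, ]
cor_pairs[order(-abs(cor_pairs$r)), ]
```

On the real data, the same steps can be applied to `cor(bean_tbl[, 1:16])`.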
The frequency of the different types of dry bean is shown in Figure 2.
```r
# Bar chart of bean varieties, ordered from most to least frequent
bean_tbl |>
  ggplot(aes(fct_infreq(class))) +
  geom_bar() +
  labs(
    x = "Dry Bean Type",
    y = "Frequency",
    title = "Frequency Distribution of Dry Bean Varieties"
  )
```
Modeling
Data Splitting
The data was split into two parts: a training set with 70% of the records and a testing set with the remaining 30%. A seed was set to ensure reproducibility.
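A 70/30 split of this kind can be sketched with the rsample package. Using rsample is an assumption here, as are the seed value and the simulated stand-in data, since the post does not show the original code.

```r
library(rsample)

# Simulated stand-in for bean_tbl (generated values, not the real data)
set.seed(2024)  # hypothetical seed; any fixed value makes the split repeatable
toy_tbl <- data.frame(
  area = runif(1000, 20000, 250000),
  class = factor(sample(c("DERMASON", "SIRA", "SEKER"), 1000, replace = TRUE))
)

# 70% of the rows go to training, the remaining 30% to testing
bean_split <- initial_split(toy_tbl, prop = 0.7)
bean_train <- training(bean_split)
bean_test  <- testing(bean_split)

c(train = nrow(bean_train), test = nrow(bean_test))
```

Because the bean classes are not equally frequent, passing `strata = class` to `initial_split()` is worth considering, as it keeps the class proportions similar in both sets.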
The tuning results are shown in Table 2 and represented visually in Figure 4, with three metrics used to judge which combination of parameters gives the best result.
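For readers who want to see what such a tuning run can look like, here is a minimal sketch with tidymodels. Everything specific in it is an assumption rather than the setup used in this post: the built-in `iris` data stands in for the bean training set, and the tuned parameters, grid size, number of folds, seed and the three metrics are all illustrative choices.

```r
library(tidymodels)

set.seed(2024)  # hypothetical seed

# iris is a stand-in for the bean training data
folds <- vfold_cv(iris, v = 5, strata = Species)

# Decision tree with two hyperparameters left open for tuning (assumed choice)
tree_spec <- decision_tree(cost_complexity = tune(), tree_depth = tune()) |>
  set_engine("rpart") |>
  set_mode("classification")

tree_wf <- workflow() |>
  add_formula(Species ~ .) |>
  add_model(tree_spec)

# 3 x 3 regular grid of candidate parameter combinations
tree_grid <- grid_regular(cost_complexity(), tree_depth(), levels = 3)

tree_res <- tune_grid(
  tree_wf,
  resamples = folds,
  grid = tree_grid,
  metrics = metric_set(accuracy, roc_auc, sens)  # assumed metrics
)

# One row per parameter combination and metric
tuning_tbl <- collect_metrics(tree_res)
tuning_tbl

autoplot(tree_res)                           # visual comparison of the metrics
select_best(tree_res, metric = "accuracy")   # best combination by accuracy
```

`collect_metrics()` gives a table of the same kind as a tuning results table, and `autoplot()` gives the matching plot.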