June 19, 2016
“Beer is made by men, wine by God.” - Martin Luther, circa 1500s
This analysis is part of the Udacity/Facebook course “Data Analysis with R”. It examines the chemical parameters and expert quality ratings of 4898 white wines to find out how those parameters are distributed and what impact they have on quality. So let’s get started:
The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods.
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
Output variable (based on sensory data):
summary(white_wine)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
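The class imbalance mentioned in the dataset description can be checked by tabulating the quality scores. Here is a minimal sketch, assuming a small invented vector `toy_quality` as a stand-in for `white_wine$quality` (the scores below are made up for illustration and are not taken from the dataset):

```r
# Hypothetical stand-in for white_wine$quality; these scores are invented.
toy_quality <- c(3, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 8)

# Absolute number of wines per quality score.
quality_counts <- table(toy_quality)

# Share of each score, rounded to two decimals.
quality_share <- round(prop.table(quality_counts), 2)

print(quality_counts)
print(quality_share)
```

With the real data loaded, `table(white_wine$quality)` should give the per-score counts in the same way.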
Let’s look at some general distributions first:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
So for the white wine we find a roughly normal distribution, with most wines rated 5, 6 or 7 in quality. This is consistent with the mean of 5.878 and the median of 6. So let’s take a look at some other variables:
#### Alcohol Distribution
ggplot(data = white_wine,
aes(x = alcohol)) +
geom_histogram(color = I('black'), fill = I('#2b8cbe'), binwidth =0.1) +
ggtitle('Alcohol Distribution of White Wines') + xlab('Alcohol Level') + ylab('Number of wines')
#### pH Distribution
ggplot(data = white_wine,
aes(x = pH)) +
geom_histogram(color = I('black'), fill = I('#2b8cbe'), binwidth =0.02) +
ggtitle('pH Distribution of White Wines') + xlab('pH Level') + ylab('Number of wines')
#### residual sugar Distribution
ggplot(data = white_wine,
aes(x = residual.sugar)) +
geom_histogram(color = I('black'), fill = I('#2b8cbe'), binwidth =1) +
ggtitle('Residual sugar Distribution of White Wines') + xlab('Residual sugar Level') + ylab('Number of wines')
We find that there are a few outliers, so let’s take a look at the summary statistics for this variable:
summary(white_wine$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The most extreme outlier is at 65.8 g/dm^3 of residual sugar, far above the mean of 6.391 and the 3rd quartile of 9.9. Hence it makes sense to omit, say, the top 0.5% of values:
#### residual sugar Distribution - Removed Outliers
ggplot(data = white_wine,
aes(x = residual.sugar)) +
geom_histogram(color = I('black'), fill = I('#2b8cbe'), binwidth =1) +
xlim(0, quantile(white_wine$residual.sugar, 0.995)) +
ggtitle('Residual sugar Distribution of White Wines without Outliers') + xlab('Residual sugar Level') + ylab('Number of wines')
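The plot above hides the top 0.5% of values by clipping the x axis. An alternative is to compute the cutoff once and filter the data before plotting. This is a sketch only, assuming a hypothetical vector `toy_sugar` of invented residual sugar values instead of the real column:

```r
# Hypothetical residual sugar values in g/dm^3; invented for illustration only.
toy_sugar <- c(seq(0.6, 20, length.out = 199), 65.8)

# Value below which 99.5% of the observations fall.
cutoff <- quantile(toy_sugar, 0.995)

# Keep only the observations at or below the cutoff.
trimmed <- toy_sugar[toy_sugar <= cutoff]

length(toy_sugar) - length(trimmed)  # number of values dropped
max(trimmed)                         # largest value that remains
```

Filtering first keeps the plotting code free of the "removed rows" warnings that axis limits tend to produce, and the trimmed vector can be reused for summary statistics.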
Here we go. As we can see, we have a unimodal, long-tailed distribution. Let’s see what it looks like with a log10 transformation:
#### residual sugar Distribution - log10 Transformation
ggplot(data = white_wine,
aes(x = residual.sugar)) +
geom_histogram(color = I('black'), fill = I('#2b8cbe'), binwidth =0.05) +
scale_x_log10() +
ggtitle('Residual sugar Distribution of White Wines log(10)') + xlab('Residual sugar Level (log10)') + ylab('Number of wines')
On the log10 scale, a second peak in residual sugar appears at about 8 - 12 g/dm^3, so the distribution looks bimodal rather than unimodal.
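One detail worth keeping in mind when reading the log10 histogram: with `scale_x_log10()`, the `binwidth` of 0.05 is measured in log10 units, so each bin covers a constant ratio of sugar values rather than a constant difference. The sketch below illustrates this with base R. The bin edges are constructed by hand for illustration and may not line up exactly with the ones ggplot2 chooses:

```r
# Bin width used in the log10 histogram (in log10 units).
log_binwidth <- 0.05

# Hand-built bin edges in log10 space, covering the observed range 0.6 to 65.8 g/dm^3.
log_breaks <- seq(log10(0.6), log10(65.8) + log_binwidth, by = log_binwidth)

# The same edges expressed in g/dm^3.
breaks <- 10^log_breaks

# Ratio of each bin's upper edge to its lower edge; it is the same for every bin.
edge_ratio <- breaks[-1] / breaks[-length(breaks)]
range(edge_ratio)
```

Each bin is therefore about 12% wider than the one before it, which is why the long right tail is compressed and the second peak becomes visible.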
ggplot(data = white_wine,
aes(x = quality, y = pH)) +
geom_jitter(alpha = 0.5, color = I('#2b8cbe')) +
ggtitle('Quality vs. pH Value') + xlab('Quality level') + ylab('pH Value')
This graph doesn’t reveal any trend beyond the fact that most wines sit at a quality level of 5 or 6. So let’s bring in another variable to see if there is a trend. The next plot shows alcohol against pH, with the points colored by quality:
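library(RColorBrewer)  # provides brewer.pal()
# Reversed Spectral palette, interpolated to any number of colors
myPalette <- colorRampPalette(rev(brewer.pal(11, "Spectral")))
ggplot(data = white_wine,
aes(x = alcohol, y = pH, color = quality)) +
geom_jitter(alpha = 0.5) +
ggtitle('Quality of different wines') + xlab('Alcohol Level') + ylab('pH Value') +
# Quality ranges from 3 to 9 in this dataset; limits of c(1, 8) would grey out the 9s
scale_colour_gradientn(colours = myPalette(100), limits = c(3, 9)) +
geom_smooth(method = "lm")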
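The straight line added by `geom_smooth(method = "lm")` is an ordinary least-squares fit of the y variable on the x variable, here pH on alcohol. To read off the slope and the strength of the fit, the same model can be fitted directly with `lm()`. The sketch below uses a synthetic data frame `toy_wine` whose values are randomly generated, and the slope of 0.02 built into it is an arbitrary assumption, not a result from the wine data:

```r
# Synthetic stand-in for the alcohol and pH columns; generated values, not real data.
set.seed(42)
toy_wine <- data.frame(alcohol = runif(200, min = 8, max = 14.2))
toy_wine$pH <- 3.0 + 0.02 * toy_wine$alcohol + rnorm(200, sd = 0.15)

# Ordinary least-squares fit of pH on alcohol (y ~ x, as geom_smooth uses by default).
fit <- lm(pH ~ alcohol, data = toy_wine)

coef(fit)               # intercept and slope of the fitted line
summary(fit)$r.squared  # share of the pH variance explained by alcohol
```

Running `lm(pH ~ alcohol, data = white_wine)` on the real data should give the coefficients of the line drawn in the plot.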