Aaron Oertel

June 19, 2016

“Beer is made by men, wine by God.” - Martin Luther, circa 1500s


1. Introduction

This analysis is part of the Udacity/Facebook course “Data Analysis with R”. It examines the chemical parameters and expert quality ratings of 4898 white wines, looking at how those parameters are distributed and what impact they have on quality. So let’s get started:


2. The Dataset

The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods.



Input variables (based on physicochemical tests):

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)

Output variable (based on sensory data):

  1. quality (score between 0 and 10)

summary(white_wine)
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

3. Univariate Plots Section

Let’s look at some general distributions first:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200


For the white wines, quality follows a roughly normal distribution, with most wines rated 5, 6 or 7. This is consistent with the mean of 5.878 and the median of 6 in the summary table. So let’s take a look at some other variables:
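As a rough way to check that claim numerically, the scores can be tabulated per quality level. This is only a sketch: `quality_counts` is a hypothetical helper that is not part of the original analysis, and `toy_quality` is a made-up vector used so the snippet runs on its own.

```r
# Hypothetical helper: count wines per quality score, keeping the
# 0-10 scale in order even for scores that never occur.
quality_counts <- function(quality, levels = 0:10) {
  table(factor(quality, levels = levels, ordered = TRUE))
}

# Made-up scores for illustration only (not the wine data).
toy_quality <- c(5, 6, 6, 7, 5, 6, 8, 6, 4, 7)

counts <- quality_counts(toy_quality)

# Share of wines rated 5, 6 or 7.
share_mid <- sum(counts[c("5", "6", "7")]) / sum(counts)

print(counts)
print(share_mid)
```

On the real data the call would presumably be `quality_counts(white_wine$quality)`.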

#### Alcohol Distribution

ggplot(data = white_wine,
       aes(x = alcohol)) +
  geom_histogram(color = I('black'), fill = I('#2b8cbe'), binwidth =0.1) +
  ggtitle('Alcohol Distribution of White Wines') + xlab('Alcohol Level') + ylab('Number of wines')


#### pH Distribution

ggplot(data = white_wine,
       aes(x = pH)) +
  geom_histogram(color = I('black'), fill = I('#2b8cbe'), binwidth =0.02) +
  ggtitle('pH Distribution of White Wines') + xlab('pH Level') + ylab('Number of wines')


#### residual sugar Distribution

ggplot(data = white_wine,
       aes(x = residual.sugar)) +
  geom_histogram(color = I('black'), fill = I('#2b8cbe'), binwidth =1) +
  ggtitle('Residual sugar Distribution of White Wines') + xlab('Residual sugar Level') + ylab('Number of wines')


We find that there are a few outliers, so let’s take a look at the summary statistics for this variable:

summary(white_wine$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The outlier is at 65.8 g / dm^3 of residual sugar, far above the mean of 6.391 and the 3rd quartile of 9.9. Hence it makes sense to omit, say, the top 0.5% of values:

#### residual sugar Distribution - Removed Outliers

ggplot(data = white_wine,
       aes(x = residual.sugar)) +
  geom_histogram(color = I('black'), fill = I('#2b8cbe'), binwidth =1) +
  xlim(0, quantile(white_wine$residual.sugar, 0.995)) +
  ggtitle('Residual sugar Distribution of White Wines without Outliers') + xlab('Residual sugar Level') + ylab('Number of wines')
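To get a feel for what a quantile cutoff like this does, here is a minimal sketch. `trim_top` is a hypothetical helper (not part of the original analysis), and the `toy` vector is made up for illustration; only its largest value mirrors the 65.8 outlier. With just ten values, the sketch uses `p = 0.9` instead of 0.995 so that the effect is visible.

```r
# Hypothetical helper: split a numeric vector at its p-th quantile.
trim_top <- function(x, p = 0.995) {
  cutoff <- quantile(x, probs = p, na.rm = TRUE, names = FALSE)
  list(
    cutoff  = cutoff,                       # value of the p-th quantile
    kept    = x[!is.na(x) & x <= cutoff],   # values at or below the cutoff
    dropped = sum(x > cutoff, na.rm = TRUE) # how many values exceed it
  )
}

# Made-up values for illustration only (not the wine data).
toy <- c(1.2, 1.6, 2.0, 5.2, 7.5, 9.9, 12.0, 14.5, 18.0, 65.8)

res <- trim_top(toy, p = 0.9)
print(res$cutoff)
print(res$dropped)
```

Note that `xlim()` only hides the values beyond the limit in the plot (ggplot2 warns about removed rows); the data frame itself is left unchanged.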


Here we go. With the outliers removed, we see a unimodal, long-tailed distribution. Let’s see what it looks like with a log10 transformation:


#### residual sugar Distribution - log10 Transformation

ggplot(data = white_wine,
       aes(x = residual.sugar)) +
  geom_histogram(color = I('black'), fill = I('#2b8cbe'), binwidth =0.05) +
  scale_x_log10() +
  ggtitle('Residual sugar Distribution of White Wines log(10)') + xlab('Residual sugar Level (log10)') + ylab('Number of wines') 

On the log10 scale a second peak appears in residual sugar at about 8 - 12 g / dm^3.
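One detail worth keeping in mind when reading this plot: with `scale_x_log10()` the `binwidth` is measured in log10 units, so each bin covers a constant ratio rather than a constant number of g / dm^3. The sketch below works through that arithmetic; the variable names are chosen here for illustration.

```r
# binwidth as used in the log10 histogram.
binwidth_log10 <- 0.05

# Each bin's upper edge is this many times its lower edge (about +12%).
ratio <- 10^binwidth_log10

# Bin edges, in original units, that fall between 8 and 12 g / dm^3
# when starting from 8.
edges <- 10^seq(log10(8), log10(12), by = binwidth_log10)

print(ratio)
print(round(edges, 2))
```

So the second peak at roughly 8 - 12 g / dm^3 spans only a handful of bins, each a little wider than the one before it.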


4. Multivariate Plots Section


ggplot(data = white_wine,
       aes(x = quality, y = pH)) +
  geom_jitter(alpha = 0.5, color = I('#2b8cbe')) +
  ggtitle('Quality vs. pH Value') + xlab('Quality level') + ylab('pH Value') 


This graph doesn’t reveal any trend beyond the fact that most wines sit at a quality level of 5 or 6. So let’s bring in another variable: the next plot puts alcohol on the x-axis and pH on the y-axis, with the points colored by quality:


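library(ggplot2)
library(RColorBrewer)  # provides brewer.pal()

# Reversed "Spectral" palette, interpolated to any number of colors
myPalette <- colorRampPalette(rev(brewer.pal(11, "Spectral")))

# Color is mapped only in geom_jitter so geom_smooth fits one overall line.
# limits cover the observed quality range (3 to 9); with c(1, 8) the
# wines rated 9 would fall outside the scale.
ggplot(data = white_wine,
       aes(x = alcohol, y = pH)) +
  geom_jitter(aes(color = quality), alpha = 0.5) +
  ggtitle('Quality of different wines') + xlab('Alcohol Level') + ylab('pH Value') +
  scale_colour_gradientn(colours = myPalette(100), limits = c(3, 9)) +
  geom_smooth(method = "lm")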
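One thing to watch with `limits` on a continuous color scale: by default, ggplot2 treats values outside the limits as missing and draws them in the NA color. Quality runs from 3 to 9 in this dataset, so an upper limit of 8 would grey out the top-rated wines. The sketch below illustrates this with `scales::censor`, which is ggplot2's default out-of-bounds handler for continuous scales; the variable names are assumptions made for this example.

```r
library(scales)

# Observed minimum and maximum quality from the summary table.
quality_range <- c(3, 9)

# Limits that stop at 8: the score 9 is censored to NA.
narrow <- censor(quality_range, range = c(1, 8))

# Limits that cover the observed range: nothing is lost.
wide <- censor(quality_range, range = c(3, 9))

print(narrow)
print(wide)
```

Choosing limits that span the full observed range keeps every point on the color gradient.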