Aaron Oertel

June 19, 2016

“A diamond is forever” - Frances Gerety


The Diamond Dataset is part of the ggplot2 package and contains 10 variables for 53940 observations. Those include:

summary(diamonds)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 

Descriptive Analysis

So let’s start by looking at some variable distributions. Here is the prize distribution:

ggplot(diamonds, aes(x = price)) + 
  geom_histogram(color = "black", fill = "#337AB7", binwidth = 400) + 
  scale_x_continuous(labels = dollar, breaks = seq(0, 20000, 1000)) + 
  theme(axis.text.x = element_text(angle = 90)) + 
  xlab("Price") + ylab("Count")


We can see that this gives us a long tailed distribution with most of the diamonds at a cheaper price. So it seems like the market for diamonds is mainly focused on cheaper diamonds. So let’s transform the price variable with a log10 transformation to see what happens:

plot1 <- qplot(data = diamonds, x = price, binwidth = 100, color = I('black'), 
      fill = I('#337AB7')) + 
  ggtitle('Price')

plot2 <- qplot(data = diamonds, x = price, binwidth = 0.01, color = I('black'), 
      fill = I('#337AB7')) +
  scale_x_log10() +
  ggtitle('Price (log10)') +
  scale_color_brewer(type = 'div')

grid.arrange(plot1, plot2, ncol = 1)


On a log10 scale the prices seem to be much better behaved. There is also evidence of bimodality which insinuates that there could be markets for rich and poor buyers.

So let’s investigate the pricing factors. For that we can visualize the price per carat for each color using a boxplot:

ggplot(data = subset(diamonds, !is.na(color)),
       aes(x = color, y = price/carat, fill = color)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(1000, 6000)) +
  xlab('Color') + ylab('Price per Carat') +
  scale_y_continuous(labels=dollar)


What’s interesting here is that even though the color D is the best, the prices per carat aren’t the highest. So let’s do some summary statistics to find out about those price ranges for each Color:

diamonds %>%
  group_by(color) %>%
  summarise(max_price = max(price),
            min_price = min(price),
            median_price = median(price),
            median_carat = median(carat),
            median_price_per_carat = median(price/carat))
## Source: local data frame [7 x 6]
## 
##    color max_price min_price median_price median_carat
##   (fctr)     (int)     (int)        (dbl)        (dbl)
## 1      D     18693       357       1838.0         0.53
## 2      E     18731       326       1739.0         0.53
## 3      F     18791       342       2343.5         0.70
## 4      G     18818       354       2242.0         0.70
## 5      H     18803       337       3460.0         0.90
## 6      I     18823       334       3730.0         1.00
## 7      J     18710       335       4234.0         1.11
## Variables not shown: median_price_per_carat (dbl)

So as we can see there is actually no big difference in the maximum prices and not really in price per carat. As a matter of fact the median pricer per carat is on average higher for the worse colors (J, I, H). But there is a trend in diamond carat which is that diamonds of a better color are on average smaller (0.53 median carat for color D)

Influence of Carat on Price

Therefore i feel like looking at the color or cut is a dead end, so let’s take a closer look at the carat sizes. Here is the distribution of carat in the diamond dataset:


Ok now let’s take a look at some prices:

ggplot(data = diamonds, aes(x = carat, y = price)) + 
  geom_point(alpha = 1/10, color = I('#337AB7')) +
  geom_vline(xintercept = c(0.5, 0.7, 1, 1.2, 1.5, 2), 
             color = "red", linetype="longdash", alpha = 0.5) +
  xlab("Carat Size") + ylab("Price") +
  scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
  scale_y_continuous(breaks = seq(0, 18000, 2000),
                     limits = c(0, quantile(diamonds$price, 0.99)),
                     labels = dollar)


This visualization drives creates 2 interesting segments of diamonds:

  1. Diamonds below 1 carat
  2. Diamonds above 1 carat

We can see that for the diamonds above 1 carat, the variance in prices is significantly higher than below 1 carat. This leads to the conclusion that the market for lower carat, cheaper diamonds is more competetitive and the one for above 1 carat is more prone to highly varying prices.

We can also see that there are many diamonds at flat carat numbers as the red lines show. And we will find the highest fluctuations there.

Now, let’s create a new variable out of x, y, z called volume to test our hypothesis:

diamonds$volume <- diamonds$x * diamonds$y * diamonds$z

ggplot(data = subset(diamonds, !(volume == 0 | volume >= 800) ), 
       aes(x = volume, y = price)) + 
  geom_point(alpha = 1/20) +
  geom_smooth(method = "lm") +
  geom_vline(xintercept = c(160), 
             color = "red", linetype="longdash", alpha = 0.5) +
  xlab("Diamond's Volume") + ylab("Price") +
  scale_x_continuous(limits = c(0, 500)) +
  scale_y_continuous(limits = c(0, 20000))