*June 19, 2016*

“A diamond is forever” - Frances Gerety

The Diamond Dataset is part of the ggplot2 package and contains 10 variables for 53940 observations. Those include:

- Carat (weight of the diamond)
- Cut (quality of the cut)
- Color (diamond color, from J (worst) to D (best))
- Clarity (measurement of how clear the diamond is)
- Depth (total depth percentage)
- Table (width of top of diamond relative to widest point)
- Price (in USD)
- x (length in mm)
- y (width in mm)
- z (depth in mm)

`summary(diamonds)`

```
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
```

So let’s start by looking at some variable distributions. Here is the **prize distribution**:

```
ggplot(diamonds, aes(x = price)) +
geom_histogram(color = "black", fill = "#337AB7", binwidth = 400) +
scale_x_continuous(labels = dollar, breaks = seq(0, 20000, 1000)) +
theme(axis.text.x = element_text(angle = 90)) +
xlab("Price") + ylab("Count")
```

We can see that this gives us a **long tailed distribution** with most of the diamonds at a cheaper price. So it seems like the market for diamonds is mainly focused on cheaper diamonds. So let’s transform the price variable with a log10 transformation to see what happens:

```
plot1 <- qplot(data = diamonds, x = price, binwidth = 100, color = I('black'),
fill = I('#337AB7')) +
ggtitle('Price')
plot2 <- qplot(data = diamonds, x = price, binwidth = 0.01, color = I('black'),
fill = I('#337AB7')) +
scale_x_log10() +
ggtitle('Price (log10)') +
scale_color_brewer(type = 'div')
grid.arrange(plot1, plot2, ncol = 1)
```

On a log10 scale the prices seem to be much better behaved. There is also evidence of bimodality which insinuates that there could be **markets for rich and poor buyers**.

So let’s investigate the pricing factors. For that we can visualize the price per carat for each color using a boxplot:

```
ggplot(data = subset(diamonds, !is.na(color)),
aes(x = color, y = price/carat, fill = color)) +
geom_boxplot() +
coord_cartesian(ylim = c(1000, 6000)) +
xlab('Color') + ylab('Price per Carat') +
scale_y_continuous(labels=dollar)
```

What’s interesting here is that even though the color D is the best, the prices per carat aren’t the highest. So let’s do some summary statistics to find out about those price ranges for each Color:

```
diamonds %>%
group_by(color) %>%
summarise(max_price = max(price),
min_price = min(price),
median_price = median(price),
median_carat = median(carat),
median_price_per_carat = median(price/carat))
```

```
## Source: local data frame [7 x 6]
##
## color max_price min_price median_price median_carat
## (fctr) (int) (int) (dbl) (dbl)
## 1 D 18693 357 1838.0 0.53
## 2 E 18731 326 1739.0 0.53
## 3 F 18791 342 2343.5 0.70
## 4 G 18818 354 2242.0 0.70
## 5 H 18803 337 3460.0 0.90
## 6 I 18823 334 3730.0 1.00
## 7 J 18710 335 4234.0 1.11
## Variables not shown: median_price_per_carat (dbl)
```

So as we can see there is actually no big difference in the maximum prices and not really in price per carat. As a matter of fact the median pricer per carat is on average higher for the worse colors (J, I, H). But there is a trend in diamond carat which is that diamonds of a better color are on average smaller (0.53 median carat for color D)

Therefore i feel like looking at the color or cut is a dead end, so let’s take a closer look at the carat sizes. Here is the **distribution of carat** in the diamond dataset:

Ok now let’s take a look at some prices:

```
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point(alpha = 1/10, color = I('#337AB7')) +
geom_vline(xintercept = c(0.5, 0.7, 1, 1.2, 1.5, 2),
color = "red", linetype="longdash", alpha = 0.5) +
xlab("Carat Size") + ylab("Price") +
scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
scale_y_continuous(breaks = seq(0, 18000, 2000),
limits = c(0, quantile(diamonds$price, 0.99)),
labels = dollar)
```

This visualization drives creates **2 interesting segments** of diamonds:

- Diamonds below 1 carat
- Diamonds above 1 carat

We can see that for the diamonds above 1 carat, the **variance in prices is significantly higher** than below 1 carat. This leads to the conclusion that the market for lower carat, cheaper diamonds is more competetitive and the one for above 1 carat is more prone to highly varying prices.

We can also see that there are many diamonds at flat carat numbers as the red lines show. And we will find the highest fluctuations there.

Now, let’s create a new variable out of x, y, z called **volume** to test our hypothesis:

```
diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
ggplot(data = subset(diamonds, !(volume == 0 | volume >= 800) ),
aes(x = volume, y = price)) +
geom_point(alpha = 1/20) +
geom_smooth(method = "lm") +
geom_vline(xintercept = c(160),
color = "red", linetype="longdash", alpha = 0.5) +
xlab("Diamond's Volume") + ylab("Price") +
scale_x_continuous(limits = c(0, 500)) +
scale_y_continuous(limits = c(0, 20000))
```