1 Introduction

Diamonds are the Precious stone consisting of a clear and colourless Crystalline form of pure carbon. They are the hardest Gemstones known to man and can be scratched only by other Diamonds. Diamonds are formed deep within the Earth about 100 miles or so below the surface in the upper mantle. Obviously in that part of the Earth it’s very hot. There’s a lot of pressure, the weight of the overlying rock bearing down, so that combination of high temperature and high pressure is what’s necessary to grow diamond crystals in the Earth.

2 Summary of the Data and Explanations

diamonds is a dataset containing the prices and other attributes of almost 54,000 diamonds. It is built-in dataset of R. Here are those attributes:

  • Carat : Carat weight of the Diamond.

  • Cut : Describe cut quality of the diamond. Quality in increasing order Fair, Good, Very Good, Premium, Ideal .

  • Color : Color of the Diamond. With D being the best and J the worst.

  • Clarity : Diamond Clarity refers to the absence of the Inclusions and Blemishes. (In order from Best to Worst, FL = flawless, I3= level 3 inclusions) FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3

  • Depth : The Height of a Diamond, measured from the Culet to the table, divided by its average Girdle Diameter.

  • Table : The Width of the Diamond’s Table expressed as a Percentage of its Average Diameter.

  • Price : the Price of the Diamond.

  • X : Length of the Diamond in mm.

  • Y : Width of the Diamond in mm.

  • Z : Height of the Diamond in mm.

Qualitative Features (Categorical) : Cut, Color, Clarity.

Quantitative Features (Numerical) : Carat, Depth , Table , Price , X , Y, Z.

2.1 Importing Libraries

library(tidyverse) 
library(rpart) 
library(rpart.plot) 
library(rattle)
library(reshape2)
library(gridExtra)
library(grid)
library(ggplot2)

3 Data Preprocessing

summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

When we did the first operation on dataset, we have realized some irrational outputs. It does not look possible that a diamond’s length, width, or height equals to 0. However, we witness that minimum values of x,y, and z equal to 0. Let’s fix it.

diamonds_v2 <- diamonds %>%
  filter(x != 0 & y != 0 & z != 0) %>%
  glimpse()
## Rows: 53,920
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
diamonds_v2$cut <- as.factor(diamonds_v2$cut)
diamonds_v2$color <- as.factor(diamonds_v2$color) 
diamonds_v2$clarity <- as.factor(diamonds_v2$clarity) 

To make easier the control of categorical variables we convert them factors.

colSums(is.na(diamonds_v2))
##   carat     cut   color clarity   depth   table   price       x       y       z 
##       0       0       0       0       0       0       0       0       0       0

It is relieving that knowing there is no NAs in dataset for now.

categorical_var <- diamonds_v2[,c(2,3,4)]

continous_var <- diamonds_v2[,-c(2,3,4)]
data_main <- cbind(continous_var,categorical_var) 

str(data_main) 
## 'data.frame':    53920 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...

4 Exploratory Data Analysis

4.1 Correlation

corr <- cor(cor(data_main[-c(8, 9, 10)])) 

melted_corr <- melt(corr)

ggplot(melted_corr, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() +
  scale_fill_gradient(low="white", high="coral3") +
  theme(axis.text.x = element_text(angle = 45), legend.position = "top",
        legend.title = element_blank()) +
  labs(x = "Variable X",
       y = "Variable Y") +
  ggtitle("Correlation Heatmap of Diamonds Features")

To see the correlation between attributes of diamonds dataset, we melt the data with the help of rshape2 and create a heatmap. As you see:

  1. Depth is inversely related to Price.

  2. The Price of the Diamond is highly correlated to Carat.

  3. The Length(x) , Width(y) and Height(z) seems to be higly related to Price and even each other.

  4. Self Relation ie. of a feature to itself is 1 as expected.

This part of our exploratory data analysis will also help us in the following steps on creating a new model to predict price of diamonds.

4.2 Visualization of the Features

Let’s visualize the attributes against the price values.

par(mfrow=c(1,1))

4.2.1 Cut vs Price

cut_price <- ggplot(data = data_main)+
  geom_bar(aes(x=cut,y=price),stat = "summary",fun.y="mean",alpha=1,fill="blue")+
  xlab("Cut Type")+
  ylab("Average Price")+
  ggtitle("Cut vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(cut_price)

According to the average prices, the most expensive diamonds are belongs to Premium if we just look for the cut variable.

4.2.2 Color vs Price

color_price <- ggplot(data = data_main)+
  geom_bar(aes(x = color, y = price),stat = "summary",fun.y="mean",alpha=1,fill="red")+
  xlab("Color") +
  ylab("Average Price") +
  ggtitle("Color vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(color_price)

According to the average prices, the most expensive diamonds are belongs to J if we just look for the color variable.

4.2.3 Clarity vs Price

clarity_price <- ggplot(data = data_main)+
  geom_bar(aes(x = clarity, y = price),stat = "summary",fun.y="mean",alpha=1,fill="green")+
  xlab("Clarity Type")+
  ylab("Average Price")+
  ggtitle("Clarity vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(clarity_price)

According to the average prices, the most expensive diamonds are belongs to SI2 if we just look for the clarity variable.

4.2.4 Carat vs Price

carat_price <- ggplot(data = data_main)+
  geom_bar(aes(x = carat, y = price),stat = "summary",fun.y="mean",alpha=1,fill="orange")+
  xlab("Carat")+
  ylab("Average Price")+
  ggtitle("Carat vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(carat_price)

4.2.5 Depth vs Price

depth_price <- ggplot(data = data_main)+
  geom_bar(aes(x = depth, y = price),stat = "summary",fun.y="mean",alpha=1,fill="pink")+
  xlab("Depth")+
  ylab("Average Price")+
  ggtitle("Depth vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(depth_price)

4.2.6 Table vs Price

table_price <- ggplot(data = data_main)+
  geom_bar(aes(x = table, y = price),stat = "summary",fun.y="mean",alpha=1,fill="purple")+
  xlab("Table")+
  ylab("Average Price")+
  ggtitle("Table vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(table_price)

4.2.7 X vs Price

x_price <- ggplot(data = data_main)+
  geom_bar(aes(x = x, y = price),stat = "summary",fun.y="mean",alpha=1,fill="purple")+
  xlab("X")+
  ylab("Average Price")+
  ggtitle("X vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(x_price)

4.2.8 Y vs Price

y_price <- ggplot(data = data_main)+
  geom_bar(aes(x = y, y = price),stat = "summary",fun.y="mean",alpha=1,fill="purple")+
  xlab("Y")+
  ylab("Average Price")+
  ggtitle("Y vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(y_price)

4.2.9 Z vs Price

z_price <- ggplot(data = data_main)+
  geom_bar(aes(x = z, y = price),stat = "summary",fun.y="mean",alpha=1,fill="purple")+
  xlab("Z")+
  ylab("Average Price")+
  ggtitle("Z vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(z_price)

grid.arrange(cut_price, color_price, clarity_price, carat_price, depth_price, table_price, x_price, y_price, z_price, ncol = 3, nrow = 3)

5 Price Estimation by the Model

5.1 Train and Test Datasets

set.seed(503)
library(tidyverse)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>% 
    group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()

diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()), 
    diamonds_test, by = "diamond_id")

diamonds_train
## # A tibble: 43,143 x 11
##    carat cut       color clarity depth table price     x     y     z diamond_id
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>      <int>
##  1 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31          2
##  2 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31          3
##  3 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63          4
##  4 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75          5
##  5 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47          7
##  6 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53          8
##  7 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49          9
##  8 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39         10
##  9 0.3   Good      J     SI1      64      55   339  4.25  4.28  2.73         11
## 10 0.23  Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46         12
## # ... with 43,133 more rows
diamonds_test
## # A tibble: 10,797 x 11
##    carat cut   color clarity depth table price     x     y     z diamond_id
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>      <int>
##  1 1.7   Fair  D     I1       64.7    56  5617  7.46  7.37  4.8       13776
##  2 0.580 Fair  D     SI2      65.1    58  1156  5.25  5.22  3.41      40690
##  3 1     Fair  D     SI2      65.6    66  3767  6.1   6.01  3.97       5152
##  4 1.05  Fair  D     SI2      65.4    59  3816  6.3   6.24  4.1        5359
##  5 0.7   Fair  D     SI2      64.5    60  2167  5.53  5.47  3.55      49828
##  6 0.9   Fair  D     SI2      64.6    59  3847  6.04  6.01  3.89       5510
##  7 1.01  Fair  D     SI2      64.6    56  3003  6.31  6.24  4.05       1555
##  8 1.1   Fair  D     SI2      64.6    54  4725  6.56  6.49  4.22      10151
##  9 0.62  Fair  D     SI2      64.6    57  1410  5.39  5.33  3.46      43397
## 10 1.5   Fair  D     SI2      68.8    57  7469  6.9   6.86  4.73      18359
## # ... with 10,787 more rows

5.2 PCA

diamonds_train <- diamonds_train %>%
  select(-diamond_id)

diamonds_pca <- princomp(as.matrix(diamonds_train[,sapply(diamonds_train, class) == "numeric"]),cor=T)

summary(diamonds_pca,loadings=TRUE)
## Importance of components:
##                          Comp.1    Comp.2    Comp.3    Comp.4      Comp.5
## Standard deviation     1.979800 1.1336225 0.8255050 0.2273068 0.218071172
## Proportion of Variance 0.653268 0.2141833 0.1135764 0.0086114 0.007925839
## Cumulative Proportion  0.653268 0.8674513 0.9810277 0.9896391 0.997564967
##                             Comp.6
## Standard deviation     0.120872645
## Proportion of Variance 0.002435033
## Cumulative Proportion  1.000000000
## 
## Loadings:
##       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## carat  0.496                0.638  0.442  0.385
## depth         0.734 -0.671                     
## table  0.122 -0.669 -0.733                     
## x      0.501                0.114        -0.855
## y      0.494               -0.760  0.361  0.201
## z      0.493  0.103               -0.817  0.279

From the summary of PCA model that we create, we can see that two can almost explain 87% of the data. We can choose to use two components for creating a model. Now, we need to add these components to both train and test datasets.

pca_results <- predict(diamonds_pca, newdata = diamonds_train)

diamonds_train$pca1 = pca_results[, 1]
diamonds_train$pca2 = pca_results[, 2]


pca_results_test <- predict(diamonds_pca, newdata = diamonds_test)

diamonds_test$pca1 = pca_results_test[, 1]
diamonds_test$pca2 = pca_results_test[, 2]

5.3 Price Estimation with CART

diamonds_fit <- rpart(price~. - pca1 - pca2, data=diamonds_train, method = "anova")



fancyRpartPlot(diamonds_fit)

Method argument of the rpart() function is “anova” because the price variable is numeric rather than class.

rpart.plot(diamonds_fit)

plotcp(diamonds_fit)

print(diamonds_fit$cptable)
##           CP nsplit rel error    xerror        xstd
## 1 0.60913716      0 1.0000000 1.0000357 0.009817373
## 2 0.18517972      1 0.3908628 0.3909391 0.004372856
## 3 0.03361554      2 0.2056831 0.2061950 0.002303084
## 4 0.02661853      3 0.1720676 0.1726668 0.002299963
## 5 0.02583190      4 0.1454491 0.1525199 0.002060203
## 6 0.01000000      5 0.1196172 0.1209142 0.001742561

5.4 Improvement of the Model

Since we deduce that as splitting number increases xerror and rel error are affected inversely from Complexity Parameter(CP) plot and table, we will try to improve our model by lowering Complexity Parameter.

diamonds_fit <- rpart(price~. - pca1 - pca2, data = diamonds_train, method = "anova", cp = 0.00001)



fancyRpartPlot(diamonds_fit)

rpart.plot(diamonds_fit)

plotcp(diamonds_fit)

print(tail(diamonds_fit$cptable))
##               CP nsplit  rel error     xerror         xstd
## 444 1.019450e-05    489 0.01704495 0.02576320 0.0005527474
## 445 1.014082e-05    490 0.01703476 0.02573855 0.0005525477
## 446 1.012304e-05    491 0.01702462 0.02572650 0.0005521675
## 447 1.001973e-05    492 0.01701450 0.02573169 0.0005528368
## 448 1.001469e-05    494 0.01699446 0.02573291 0.0005531837
## 449 1.000000e-05    495 0.01698444 0.02573252 0.0005531871

5.5 Prediction

diamonds_prediction <- predict(diamonds_fit, newdata = diamonds_test, type = "vector")

diamonds_fit_r2 <- 1 - (sum((diamonds_prediction - diamonds_test$price) ^2)/sum((diamonds_test$price - mean(diamonds_train$price)) ^2))

diamonds_fit_r2
## [1] 0.9746056

Here, we have calculated Coefficient of Determination (R^2) to see how well the regression predictions approximate the real data points.

5.6 Errors

Finally, we can measure accuracy for continuous variables that we used by calculating RMSE and MAE.

Root Mean Squared Error (RMSE) and Mean Absolute (MAE) Error terms for predictions:

errors <- c(RMSE = Metrics::rmse(actual = diamonds_test$price, predicted = diamonds_prediction), MAE = 
Metrics::mae(actual = diamonds_test$price, predicted = diamonds_prediction))

errors
##     RMSE      MAE 
## 630.6222 341.7021