1 Diamond Price Estimation

1.1 Introduction

In India in the 4th century BC the earliest diamonds were found, while the youngest of these deposits was formed 900 million years ago. Most of these early stones were transported along the trade route network that linked India and China, commonly known as the Silk Road. Diamonds were admired at the time of their discovery for their strength and beauty, and for their ability to refract light and engrave metal. Diamonds were worn as ornamentation, used as cutting tools, acted as a talisman to defend against evil, and were thought to provide defense in war. Diamonds were also used as medicinal aid in the Dark Ages, and were believed when swallowed to cure disease and heal wounds.

My assignment consists of finding the price of a diamond given its properties. I will use diamonds data set to estimate price of any diamond.

1.1.1 Package Load

library(tidyverse) 
## -- Attaching packages ---------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## <U+221A> ggplot2 3.3.2     <U+221A> purrr   0.3.4
## <U+221A> tibble  3.0.3     <U+221A> dplyr   1.0.1
## <U+221A> tidyr   1.1.1     <U+221A> stringr 1.4.0
## <U+221A> readr   1.3.1     <U+221A> forcats 0.5.0
## -- Conflicts ------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(dplyr)

library(rpart) library(rpart.plot) library(rattle) library(FactoMineR) library(corrplot)

str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

1.1.2 Variables

Carat: Diamond weight, 200 mg equivalent (should be a good indicator)

Cut: Quality of the cut

Color: Diamond color from J to D (worst to best)

Clarity: Clearity of Diamond; I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)

Depth: Total depth percentage (relative to x and y). Will likely be collinear.

Table: Width of top of diamond relative to widest point (43–95)

Price: US dollars ($)

x, y, z: Dimensions of the diamond

1.2 Explanatory Data Analysis

1.2.1 Determining Diamond Price

ggplot(diamonds, aes(x=price)) + geom_histogram()+ggtitle("Quantity of Diamonds")+xlab("Quantity")+ylab("Price")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The graph above shows that how many diamonds in the dataset in terms of price.

diamonds %>%
  ggplot(aes(x=(price))) +
  geom_histogram(stat="bin",binwidth= 500) +
  facet_wrap(~cut, scales = "free")+
  ggtitle("Change in price(Carat and Cut)")+xlab("Price")+ylab("Quantity")

Graph above shows number of diamonds with their “Cut”s in term of price.

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point()+
  ggtitle("Change in price(Quantity and Cut)")+xlab("Price")+ylab("Quantity")

This helps us to examine the relationship of each of these variables, and how it affects the relationship between carat and price.

ggplot(diamonds, aes(x=color, y=price)) + geom_boxplot() + scale_y_log10()+
  ggtitle("Change in price(Color)")+xlab("Color")+ylab("Price")

The graph above, unlike the other graphs, we anaylzed the change of the price according to colors using histogram. This histogram shows the collapses more clearly.

new %>% spread(cut,n2)
## # A tibble: 7 x 6
##   color   Fair   Good `Very Good` Premium  Ideal
##   <ord>  <dbl>  <dbl>       <dbl>   <dbl>  <dbl>
## 1 D     0.101  0.135       0.125   0.116  0.132 
## 2 E     0.139  0.190       0.199   0.169  0.181 
## 3 F     0.194  0.185       0.179   0.169  0.178 
## 4 G     0.195  0.178       0.190   0.212  0.227 
## 5 H     0.188  0.143       0.151   0.171  0.145 
## 6 I     0.109  0.106       0.0997  0.104  0.0971
## 7 J     0.0739 0.0626      0.0561  0.0586 0.0416
ggplot(new,aes(x=color, y=cut)) +
  geom_tile(aes(fill=n2*100), colour = "white") +
  scale_fill_gradient(low="white",high="black") +
  labs(fill = "Density")+
  ggtitle("Density Graph(Color and Cut)")+xlab("Color")+ylab("Cut")