Exploratory Data Analysis

R has diamonds dataset by default. The data are shown in details below;

glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ve…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1,…
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 6…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 5…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 3…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2…

First 6 rows of the data;

head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Summary of data with respect to variables is;

summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

By checking correlation between price and numeric elements of diamond, we can figure out which variables are important for our price estimation model;

correlation_wrt_price <- data.frame(
                cor_carat_to_price = cor(diamonds$carat, diamonds$price),
                cor_depth_to_price = cor(diamonds$depth, diamonds$price),
                cor_table_to_price = cor(diamonds$table, diamonds$price),
                cor_x_to_price = cor(diamonds$x, diamonds$price),
                cor_y_to_price = cor(diamonds$y, diamonds$price),
                cor_z_to_price = cor(diamonds$z, diamonds$price)
)
print(correlation_wrt_price)     
##   cor_carat_to_price cor_depth_to_price cor_table_to_price cor_x_to_price
## 1          0.9215913         -0.0106474          0.1271339      0.8844352
##   cor_y_to_price cor_z_to_price
## 1      0.8654209      0.8612494

As is seen above, price is highly correlated with carat, x, y and z since correlation values are close to one for those variables. However, depth and table are not important statistically.

Diamond price vs. carat of diamond scatterplot is shown below;

ggplot(diamonds) + 
  geom_point(aes(x = carat, y = price), alpha=0.4, color = "red") + 
  labs(title = "Diamond Price vs. Carat", x = "Carat of diamond", y = "Price of Diamond (in $)")

Diamond price vs. length scatterplot is shown below;

ggplot(diamonds) + 
  geom_point(aes(x = x, y = price), alpha=0.4, color = "dark grey") + 
  labs(title = "Diamond Price vs. Length", x = "Length of diamond (mm)", y = "Price of Diamond (in $)")

Diamond price vs. height scatterplot is shown below;

ggplot(diamonds) + 
  geom_point(aes(x = y, y = price), alpha=0.4, color = "blue") + 
  labs(title = "Diamond Price vs. Height", x = "Height of diamond (mm)", y = "Price of Diamond (in $)")

Diamond price vs. width scatterplot is shown below;

ggplot(diamonds) + 
  geom_point(aes(x = z, y = price), alpha=0.3, color = "green") + 
  labs(title = "Diamond Price vs. Width", x = "Width of diamond (mm)", y = "Price of Diamond (in $)")

Scatterplots also give some insight about correlation of variables.

In order to look into diamond classification data regarding diamond price and carat, we draw three colored scatterplots by clarity, cut, and color classes.

ggplot(diamonds) + 
  geom_point(aes(x = carat, y = price, color = clarity), alpha=0.65) + 
  labs(title = "Diamond Price vs. Carat by Clarity Classes", x = "Carat of Diamond", y = "Price of Diamond (in $)")

Clarity classification plot implies that clarity of diamond mean something for diamond’s price.

The other classification plots are as follow;

ggplot(diamonds) + 
  geom_point(aes(x = carat, y = price, color = cut), alpha=0.65) + 
  labs(title = "Diamond Price vs. Carat by Cut Classes", x = "Carat of Diamond", y = "Price of Diamond (in $)")

ggplot(diamonds) + 
  geom_point(aes(x = carat, y = price, color = color), alpha=0.65) + 
  labs(title = "Diamond Price vs. Carat by Color Classes", x = "Carat of Diamond", y = "Price of Diamond (in $)")

CART Modelling

In order to get price estimation model, we divide our diamonds data for training and testing, 80% and 20% respectively.

#Seeding
set.seed(503)

#Diamonds Train Data and Test Data is defined.

diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>% 
    group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()

diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()), 
    diamonds_test, by = "diamond_id")

Diamonds Train Data are shown below;

glimpse(diamonds_train)
## Rows: 43,143
## Columns: 11
## $ carat      <dbl> 0.21, 0.23, 0.29, 0.31, 0.24, 0.26, 0.22, 0.23, 0.30, 0.23…
## $ cut        <ord> Premium, Good, Premium, Good, Very Good, Very Good, Fair, …
## $ color      <ord> E, E, I, J, I, H, E, H, J, J, F, J, E, I, J, J, J, H, J, G…
## $ clarity    <ord> SI1, VS1, VS2, SI2, VVS1, SI1, VS2, VS1, SI1, VS1, SI1, SI…
## $ depth      <dbl> 59.8, 56.9, 62.4, 63.3, 62.3, 61.9, 65.1, 59.4, 64.0, 62.8…
## $ table      <dbl> 61, 65, 58, 58, 57, 55, 61, 61, 55, 56, 61, 54, 62, 54, 54…
## $ price      <int> 326, 327, 334, 335, 336, 337, 337, 338, 339, 340, 342, 344…
## $ x          <dbl> 3.89, 4.05, 4.20, 4.34, 3.95, 4.07, 3.87, 4.00, 4.25, 3.93…
## $ y          <dbl> 3.84, 4.07, 4.23, 4.35, 3.98, 4.11, 3.78, 4.05, 4.28, 3.90…
## $ z          <dbl> 2.31, 2.31, 2.63, 2.75, 2.47, 2.53, 2.49, 2.39, 2.73, 2.46…
## $ diamond_id <int> 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 2…

Diamonds Test Data are shown below;

glimpse(diamonds_test)
## Rows: 10,797
## Columns: 11
## $ carat      <dbl> 1.70, 0.58, 1.00, 1.05, 0.70, 0.90, 1.01, 1.10, 0.62, 1.50…
## $ cut        <ord> Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair…
## $ color      <ord> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D…
## $ clarity    <ord> I1, SI2, SI2, SI2, SI2, SI2, SI2, SI2, SI2, SI2, SI2, SI2,…
## $ depth      <dbl> 64.7, 65.1, 65.6, 65.4, 64.5, 64.6, 64.6, 64.6, 64.6, 68.8…
## $ table      <dbl> 56, 58, 66, 59, 60, 59, 56, 54, 57, 57, 58, 66, 60, 55, 58…
## $ price      <int> 5617, 1156, 3767, 3816, 2167, 3847, 3003, 4725, 1410, 7469…
## $ x          <dbl> 7.46, 5.25, 6.10, 6.30, 5.53, 6.04, 6.31, 6.56, 5.39, 6.90…
## $ y          <dbl> 7.37, 5.22, 6.01, 6.24, 5.47, 6.01, 6.24, 6.49, 5.33, 6.86…
## $ z          <dbl> 4.80, 3.41, 3.97, 4.10, 3.55, 3.89, 4.05, 4.22, 3.46, 4.73…
## $ diamond_id <int> 13776, 40690, 5152, 5359, 49828, 5510, 1555, 10151, 43397,…

In the exploratory data analysis, we found that depth and table are not important statistically for diamond price estimation model.

print(correlation_wrt_price)
##   cor_carat_to_price cor_depth_to_price cor_table_to_price cor_x_to_price
## 1          0.9215913         -0.0106474          0.1271339      0.8844352
##   cor_y_to_price cor_z_to_price
## 1      0.8654209      0.8612494

Therefore, in our price estimation model, we take predictors as carat, cut, color, clarity. x. y and z from training data.

price_estimation_model <- rpart(price ~ carat+cut+color+clarity+x+y+z , data=diamonds_train, method='anova')

summary(price_estimation_model)
## Call:
## rpart(formula = price ~ carat + cut + color + clarity + x + y + 
##     z, data = diamonds_train, method = "anova")
##   n= 43143 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.60913716      0 1.0000000 1.0000357 0.009817373
## 2 0.18517972      1 0.3908628 0.3909391 0.004372856
## 3 0.03361554      2 0.2056831 0.2061950 0.002303084
## 4 0.02661853      3 0.1720676 0.1726668 0.002299963
## 5 0.02583190      4 0.1454491 0.1525199 0.002060203
## 6 0.01000000      5 0.1196172 0.1209142 0.001742561
## 
## Variable importance
##   carat       y       x       z clarity   color 
##      25      24      24      23       3       1 
## 
## Node number 1: 43143 observations,    complexity param=0.6091372
##   mean=3938.805, MSE=1.597919e+07 
##   left son=2 (27894 obs) right son=3 (15249 obs)
##   Primary splits:
##       carat < 0.995 to the left,  improve=0.60913720, (0 missing)
##       y     < 6.345 to the left,  improve=0.60805950, (0 missing)
##       x     < 6.315 to the left,  improve=0.60428840, (0 missing)
##       z     < 3.915 to the left,  improve=0.59942370, (0 missing)
##       color splits as  LLLLRRR,   improve=0.02261306, (0 missing)
##   Surrogate splits:
##       x       < 6.265 to the left,  agree=0.983, adj=0.953, (0 split)
##       y       < 6.285 to the left,  agree=0.981, adj=0.946, (0 split)
##       z       < 3.895 to the left,  agree=0.978, adj=0.937, (0 split)
##       clarity splits as  RRLLLLLL,  agree=0.679, adj=0.091, (0 split)
##       color   splits as  LLLLLRR,   agree=0.661, adj=0.040, (0 split)
## 
## Node number 2: 27894 observations,    complexity param=0.03361554
##   mean=1632.056, MSE=1248233 
##   left son=4 (19820 obs) right son=5 (8074 obs)
##   Primary splits:
##       carat   < 0.625 to the left,  improve=0.66557750, (0 missing)
##       y       < 5.535 to the left,  improve=0.66520910, (0 missing)
##       x       < 5.495 to the left,  improve=0.66238230, (0 missing)
##       z       < 3.375 to the left,  improve=0.66219560, (0 missing)
##       clarity splits as  RRRLLLLL,  improve=0.01216575, (0 missing)
##   Surrogate splits:
##       x       < 5.485 to the left,  agree=0.993, adj=0.976, (0 split)
##       y       < 5.515 to the left,  agree=0.991, adj=0.969, (0 split)
##       z       < 3.385 to the left,  agree=0.991, adj=0.969, (0 split)
##       clarity splits as  RRLLLLLL,  agree=0.725, adj=0.051, (0 split)
##       cut     splits as  RLLLL,     agree=0.719, adj=0.028, (0 split)
## 
## Node number 3: 15249 observations,    complexity param=0.1851797
##   mean=8158.389, MSE=1.538716e+07 
##   left son=6 (10296 obs) right son=7 (4953 obs)
##   Primary splits:
##       y       < 7.195 to the left,  improve=0.54407490, (0 missing)
##       x       < 7.195 to the left,  improve=0.53738530, (0 missing)
##       carat   < 1.495 to the left,  improve=0.53615420, (0 missing)
##       z       < 4.435 to the left,  improve=0.52631830, (0 missing)
##       clarity splits as  LLLRRRRR,  improve=0.05470891, (0 missing)
##   Surrogate splits:
##       x     < 7.185 to the left,  agree=0.984, adj=0.950, (0 split)
##       carat < 1.455 to the left,  agree=0.980, adj=0.937, (0 split)
##       z     < 4.435 to the left,  agree=0.964, adj=0.890, (0 split)
##       color splits as  LLLLLLR,   agree=0.679, adj=0.012, (0 split)
## 
## Node number 4: 19820 observations
##   mean=1050.302, MSE=264735.3 
## 
## Node number 5: 8074 observations
##   mean=3060.143, MSE=792289.4 
## 
## Node number 6: 10296 observations,    complexity param=0.02661853
##   mean=6151.569, MSE=4751812 
##   left son=12 (7824 obs) right son=13 (2472 obs)
##   Primary splits:
##       clarity splits as  LLLLRRRR,  improve=0.3750779, (0 missing)
##       y       < 6.785 to the left,  improve=0.1210156, (0 missing)
##       carat   < 1.195 to the left,  improve=0.1066457, (0 missing)
##       color   splits as  RRRRLLL,   improve=0.1054268, (0 missing)
##       x       < 6.805 to the left,  improve=0.1042365, (0 missing)
##   Surrogate splits:
##       y < 4.175 to the right, agree=0.76, adj=0.001, (0 split)
## 
## Node number 7: 4953 observations,    complexity param=0.0258319
##   mean=12330.05, MSE=1.172079e+07 
##   left son=14 (3223 obs) right son=15 (1730 obs)
##   Primary splits:
##       y       < 7.855 to the left,  improve=0.30675820, (0 missing)
##       x       < 7.845 to the left,  improve=0.30102780, (0 missing)
##       carat   < 1.915 to the left,  improve=0.29420540, (0 missing)
##       z       < 4.805 to the left,  improve=0.27774230, (0 missing)
##       clarity splits as  LRRRRRRR,  improve=0.07274662, (0 missing)
##   Surrogate splits:
##       x       < 7.905 to the left,  agree=0.983, adj=0.952, (0 split)
##       carat   < 1.895 to the left,  agree=0.974, adj=0.924, (0 split)
##       z       < 4.825 to the left,  agree=0.952, adj=0.862, (0 split)
##       clarity splits as  RRLLLLLL,  agree=0.675, adj=0.068, (0 split)
## 
## Node number 12: 7824 observations
##   mean=5401.156, MSE=2073671 
## 
## Node number 13: 2472 observations
##   mean=8526.661, MSE=5804893 
## 
## Node number 14: 3223 observations
##   mean=10940.83, MSE=8720276 
## 
## Node number 15: 1730 observations
##   mean=14918.16, MSE=7016977

Summary tells us that carat, x, y, and z are important variables for the model. Clarity is slightly important.

Decision tree of the model is as follows;

fancyRpartPlot(price_estimation_model)

Price estimation model in details;

printcp(price_estimation_model)
## 
## Regression tree:
## rpart(formula = price ~ carat + cut + color + clarity + x + y + 
##     z, data = diamonds_train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] carat   clarity y      
## 
## Root node error: 6.8939e+11/43143 = 15979193
## 
## n= 43143 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.609137      0   1.00000 1.00004 0.0098174
## 2 0.185180      1   0.39086 0.39094 0.0043729
## 3 0.033616      2   0.20568 0.20619 0.0023031
## 4 0.026619      3   0.17207 0.17267 0.0023000
## 5 0.025832      4   0.14545 0.15252 0.0020602
## 6 0.010000      5   0.11962 0.12091 0.0017426
plotcp(price_estimation_model)

By using diamonds_test data, we test our price estimation model and predict diamond prices. First twenty estimations are as follow;

predicted_Diamond_data <- predict(price_estimation_model, newdata = diamonds_test)

head(predicted_Diamond_data, 20)
##         1         2         3         4         5         6         7         8 
## 10940.832  1050.302  5401.156  5401.156  3060.143  3060.143  5401.156  5401.156 
##         9        10        11        12        13        14        15        16 
##  1050.302  5401.156  5401.156  5401.156  3060.143 10940.832  1050.302  1050.302 
##        17        18        19        20 
##  5401.156  3060.143  3060.143  3060.143

Price estimation model’s performance regarding real data that we used for testing can be considered as 0.88. R-squared value of our model says that the model works succesfully.

rsquared <- format(cor(predicted_Diamond_data,diamonds_test$price)^2, digits=2)

rsquared
## [1] "0.88"