Importing Necessary Libraries
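
The code in this report relies on the following packages, a minimal list inferred from the functions called below:

library(tidyverse)  # dplyr verbs, ggplot2, and the diamonds dataset
library(gridExtra)  # grid.arrange() for arranging plots side by side
library(corrr)      # correlate() and focus() for correlation analysis
library(rpart)      # rpart() and printcp() for CART models
library(rattle)     # fancyRpartPlot() for tree visualization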

Dataset

The diamonds dataset contains the prices of diamonds along with their physical properties: 53,940 observations with one price column and nine property columns.

  • cut, color and clarity columns are categorical.
  • carat, depth, table, x, y and z columns are numerical.
summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

Some of the x, y and z values are equal to 0, which is physically impossible for a diamond's dimensions, so we exclude those observations to clean the dataset.

diamonds <- diamonds %>% filter(x!=0 & y!=0 & z!=0)

We excluded 20 observations from the dataset.
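
As a quick sanity check, assuming the untouched copy of the data is still available as ggplot2::diamonds, we can confirm the count of dropped rows:

# Difference between the original and the filtered dataset; should print 20
nrow(ggplot2::diamonds) - nrow(diamonds)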

Data Analysis

Before building a model, we first need to examine how each variable relates to price.

For the categorical columns, we can compare the mean price across levels.

cut_df_mean <- diamonds %>%
  group_by(cut) %>%
  summarise(AveragePrice = mean(price))

color_df_mean <- diamonds %>%
  group_by(color) %>%
  summarise(AveragePrice = mean(price))

clarity_df_mean <- diamonds %>%
  group_by(clarity) %>%
  summarise(AveragePrice = mean(price))


cut_plot_mean <- ggplot(cut_df_mean, aes(x=cut, y=AveragePrice)) + 
  geom_bar(stat="identity", aes(fill=cut)) +
  theme(legend.position = "none")
color_plot_mean <- ggplot(color_df_mean, aes(x=color, y=AveragePrice)) + 
  geom_bar(stat="identity", aes(fill=color)) +
  theme(legend.position = "none") + labs(y="")
clarity_plot_mean <- ggplot(clarity_df_mean, aes(x=clarity, y=AveragePrice)) + 
  geom_bar(stat="identity", aes(fill=clarity)) +
  theme(legend.position = "none") + labs(y="")

grid.arrange(cut_plot_mean, color_plot_mean, clarity_plot_mean, ncol=3)

Counterintuitively, Ideal, the best cut, has the lowest mean price; likewise for color and clarity, the best grades tend to have the lowest mean prices. Since the mean is sensitive to outliers, let us also compare the median prices.

cut_df_median <- diamonds %>%
  group_by(cut) %>%
  summarise(MedianPrice = median(price))

color_df_median <- diamonds %>%
  group_by(color) %>%
  summarise(MedianPrice = median(price))

clarity_df_median <- diamonds %>%
  group_by(clarity) %>%
  summarise(MedianPrice = median(price))


cut_plot_median <- ggplot(cut_df_median, aes(x=cut, y=MedianPrice)) + 
  geom_bar(stat="identity", aes(fill=cut)) +
  theme(legend.position = "none")
color_plot_median <- ggplot(color_df_median, aes(x=color, y=MedianPrice)) + 
  geom_bar(stat="identity", aes(fill=color)) +
  theme(legend.position = "none") + labs(y="")
clarity_plot_median <- ggplot(clarity_df_median, aes(x=clarity, y=MedianPrice)) + 
  geom_bar(stat="identity", aes(fill=clarity)) +
  theme(legend.position = "none") + labs(y="")

grid.arrange(cut_plot_median, color_plot_median, clarity_plot_median, ncol=3)

The same pattern holds here: the best grades have the lowest median prices, so this is not an artifact of outliers. We should keep it in mind when building our model.

Now, we can find the correlation between our numerical variables.

corr_df <- correlate(diamonds %>% select(-cut,-clarity,-color))
corr_df
## # A tibble: 7 x 8
##   term    carat   depth  table   price       x       y       z
##   <chr>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 carat NA       0.0283  0.182  0.922   0.978   0.954   0.961 
## 2 depth  0.0283 NA      -0.296 -0.0107 -0.0250 -0.0291  0.0950
## 3 table  0.182  -0.296  NA      0.127   0.196   0.184   0.152 
## 4 price  0.922  -0.0107  0.127 NA       0.887   0.868   0.868 
## 5 x      0.978  -0.0250  0.196  0.887  NA       0.975   0.975 
## 6 y      0.954  -0.0291  0.184  0.868   0.975  NA       0.957 
## 7 z      0.961   0.0950  0.152  0.868   0.975   0.957  NA

We can identify the strongest correlations with price by filtering on the absolute value of the correlation coefficient.

corr_df %>% 
  focus(price) %>%
  filter(abs(price) > 0.8) %>%
  transmute(Property = term, Correlation = round(price,4))
## # A tibble: 4 x 2
##   Property Correlation
##   <chr>          <dbl>
## 1 carat          0.922
## 2 x              0.887
## 3 y              0.868
## 4 z              0.868

Only carat, x, y and z have a correlation with price above 0.8. Note from the matrix above that these four are also highly correlated with one another (0.95 and above), so they carry largely redundant information. When building the model, we might consider excluding the remaining columns.

Building Models

First, we split the data into training and test sets. Grouping by cut, color and clarity before sampling stratifies the 20% test split, so both sets have similar distributions of the categorical variables.

set.seed(503)

diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>% 
  group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()

diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()), 
                            diamonds_test, by = "diamond_id")
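
As a quick check, not part of the original analysis, the split should come out at roughly 80/20:

nrow(diamonds_train)  # 43126 observations (matches the n reported by printcp below)
nrow(diamonds_test)   # 10794 observations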

Model 1 - CART with ANOVA

model_1 <- rpart(price~.-diamond_id, data=diamonds_train, method = "anova")
fancyRpartPlot(model_1, caption = "")

printcp(model_1)
## 
## Regression tree:
## rpart(formula = price ~ . - diamond_id, data = diamonds_train, 
##     method = "anova")
## 
## Variables actually used in tree construction:
## [1] carat   clarity color   y      
## 
## Root node error: 6.8738e+11/43126 = 15938776
## 
## n= 43126 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.609624      0  1.000000 1.00006 0.0098265
## 2 0.184802      1  0.390376 0.39042 0.0043747
## 3 0.033672      2  0.205574 0.20562 0.0022953
## 4 0.026470      3  0.171902 0.17203 0.0022923
## 5 0.025926      4  0.145431 0.14795 0.0020325
## 6 0.010196      5  0.119505 0.12095 0.0017433
## 7 0.010000      7  0.099112 0.11065 0.0016092
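
As an optional refinement, a sketch rather than part of the original analysis, the tree could be pruned at the complexity parameter with the lowest cross-validated error. In the table above, xerror is still falling at the last row, so pruning would leave this particular tree unchanged:

# Pick the CP value minimizing the cross-validated error, then prune
best_cp <- model_1$cptable[which.min(model_1$cptable[, "xerror"]), "CP"]
model_1_pruned <- prune(model_1, cp = best_cp)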

Now we predict on the test data and measure the model's performance using the Mean Absolute Error (MAE): the mean of the absolute differences between the actual and predicted prices. For this model, the MAE comes out to 837.71.

prediction_1 <- predict(model_1, diamonds_test)
mean_abs_err_1 <- mean(abs(diamonds_test$price - prediction_1))
print(paste0("Mean Absolute Error : ", round(mean_abs_err_1,2)))
## [1] "Mean Absolute Error : 837.71"

Model 2 - Generalized Linear Models
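
Price is positive and right-skewed, so a Gamma family is a reasonable choice here. With an identity link, glm() often fails to find valid starting values on its own, which is why we pass a start vector with one initial value per coefficient (24 in total: the intercept plus the 23 contrast and numeric terms listed in the summary below).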

model_2 <- glm(price~.-diamond_id, data=diamonds_train, family = Gamma(link = "identity"), start = rep(0.5, 24))
summary(model_2)
## 
## Call:
## glm(formula = price ~ . - diamond_id, family = Gamma(link = "identity"), 
##     data = diamonds_train, start = rep(0.5, 24))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9389  -0.1884  -0.0453   0.1478   4.6570  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1072.6217   435.4727   2.463  0.01378 *  
## carat       6096.9840    67.7107  90.045  < 2e-16 ***
## cut.L         49.4370    28.8913   1.711  0.08706 .  
## cut.Q        -29.6650    23.1591  -1.281  0.20023    
## cut.C        -59.9690    18.4383  -3.252  0.00115 ** 
## cut^4        -37.5264    12.0368  -3.118  0.00182 ** 
## color.L     -251.6945    15.9879 -15.743  < 2e-16 ***
## color.Q      -66.1849    15.3618  -4.308 1.65e-05 ***
## color.C       21.3193    13.8491   1.539  0.12371    
## color^4       49.6046    12.6117   3.933 8.39e-05 ***
## color^5       -2.1316    11.6665  -0.183  0.85502    
## color^6      -49.1810    11.2997  -4.352 1.35e-05 ***
## clarity.L    600.2553    33.2919  18.030  < 2e-16 ***
## clarity.Q   -176.3864    31.8666  -5.535 3.13e-08 ***
## clarity.C    122.5323    26.7297   4.584 4.57e-06 ***
## clarity^4   -122.0145    20.7338  -5.885 4.01e-09 ***
## clarity^5    114.3344    15.7773   7.247 4.34e-13 ***
## clarity^6    -57.0050    12.5953  -4.526 6.03e-06 ***
## clarity^7     58.8607    10.9191   5.391 7.06e-08 ***
## depth        -29.1333     4.8997  -5.946 2.77e-09 ***
## table         -0.8093     2.9066  -0.278  0.78069    
## x            -59.2253    32.5655  -1.819  0.06897 .  
## y            -39.8610    21.4311  -1.860  0.06290 .  
## z            -35.3536    25.9934  -1.360  0.17381    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.6917126)
## 
##     Null deviance: 42342.1  on 43125  degrees of freedom
## Residual deviance:  3497.2  on 43102  degrees of freedom
## AIC: 686300
## 
## Number of Fisher Scoring iterations: 25

The Mean Absolute Error of the second model is 937.92, worse than the CART model's 837.71.

prediction_2 <- predict(model_2, diamonds_test)
mean_abs_err_2 <- mean(abs(diamonds_test$price - prediction_2))
print(paste0("Mean Absolute Error : ", round(mean_abs_err_2,2)))
## [1] "Mean Absolute Error : 937.92"

Model 3 - Linear Models

model_3 <- lm(price~.-diamond_id, data=diamonds_train)
summary(model_3)
## 
## Call:
## lm(formula = price ~ . - diamond_id, data = diamonds_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22022.3   -586.1   -182.3    372.5  10754.4 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  6552.822    445.198   14.719  < 2e-16 ***
## carat       11582.968     57.702  200.737  < 2e-16 ***
## cut.L         588.681     25.095   23.458  < 2e-16 ***
## cut.Q        -298.598     20.044  -14.897  < 2e-16 ***
## cut.C         153.292     17.240    8.892  < 2e-16 ***
## cut^4         -26.108     13.749   -1.899  0.05759 .  
## color.L     -1959.744     19.312 -101.477  < 2e-16 ***
## color.Q      -686.413     17.553  -39.104  < 2e-16 ***
## color.C      -164.255     16.369  -10.034  < 2e-16 ***
## color^4        28.394     15.037    1.888  0.05899 .  
## color^5       -99.937     14.203   -7.036 2.00e-12 ***
## color^6       -58.213     12.910   -4.509 6.53e-06 ***
## clarity.L    4079.765     33.742  120.911  < 2e-16 ***
## clarity.Q   -1942.226     31.481  -61.695  < 2e-16 ***
## clarity.C     978.432     26.921   36.344  < 2e-16 ***
## clarity^4    -370.350     21.477  -17.244  < 2e-16 ***
## clarity^5     239.237     17.524   13.652  < 2e-16 ***
## clarity^6      14.165     15.246    0.929  0.35287    
## clarity^7      85.858     13.453    6.382 1.76e-10 ***
## depth         -66.888      5.084  -13.158  < 2e-16 ***
## table         -28.313      3.231   -8.762  < 2e-16 ***
## x           -1118.347     36.909  -30.300  < 2e-16 ***
## y              15.908     19.452    0.818  0.41346    
## z             -99.063     38.376   -2.581  0.00984 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1123 on 43102 degrees of freedom
## Multiple R-squared:  0.9209, Adjusted R-squared:  0.9208 
## F-statistic: 2.181e+04 on 23 and 43102 DF,  p-value: < 2.2e-16

The Mean Absolute Error of the third model is 740.89, the lowest of the three.

prediction_3 <- predict(model_3, diamonds_test)
mean_abs_err_3 <- mean(abs(diamonds_test$price - prediction_3))
print(paste0("Mean Absolute Error : ", round(mean_abs_err_3,2)))
## [1] "Mean Absolute Error : 740.89"

Conclusion

Among the three models, the Linear Model is the best, since it gives the lowest Mean Absolute Error. However, a better model could be developed in future attempts through feature engineering and by adding new predictors to the model.
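
For reference, a compact side-by-side summary, built from the MAE values computed above:

# Collect the three MAE values into one table, best model first
tibble(
  Model = c("Linear Model", "CART with ANOVA", "Gamma GLM"),
  MAE   = round(c(mean_abs_err_3, mean_abs_err_1, mean_abs_err_2), 2)
) %>% arrange(MAE)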