The diamonds dataset consists of the prices of diamonds with different properties. It contains 53,940 observations with 1 price column and 9 property columns. The cut, color, and clarity columns are categorical; the carat, depth, table, x, y, and z columns are numerical.
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
Some x, y, and z values are equal to 0, which is physically impossible for a diamond's dimensions, so we exclude those observations to clean the dataset.
diamonds <- diamonds %>% filter(x != 0 & y != 0 & z != 0)
We excluded 20 observations from the dataset.
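As a sanity check, the number of dropped rows can be confirmed directly on the original data (a base-R sketch; `ggplot2::diamonds` is used so the check works even after the filtered `diamonds` has replaced it):

```r
library(ggplot2)

# count rows in the original ggplot2 diamonds data where any
# physical dimension (x, y, z) is recorded as zero
n_zero <- sum(ggplot2::diamonds$x == 0 |
              ggplot2::diamonds$y == 0 |
              ggplot2::diamonds$z == 0)
n_zero
# [1] 20
```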
To build a model, we first need to examine the relationship of each variable with price.
For the categorical columns, we can compare the mean price across levels.
cut_df_mean <- diamonds %>%
  group_by(cut) %>%
  summarise(AveragePrice = mean(price))
color_df_mean <- diamonds %>%
  group_by(color) %>%
  summarise(AveragePrice = mean(price))
clarity_df_mean <- diamonds %>%
  group_by(clarity) %>%
  summarise(AveragePrice = mean(price))
cut_plot_mean <- ggplot(cut_df_mean, aes(x = cut, y = AveragePrice)) +
  geom_bar(stat = "identity", aes(fill = cut)) +
  theme(legend.position = "none")
color_plot_mean <- ggplot(color_df_mean, aes(x = color, y = AveragePrice)) +
  geom_bar(stat = "identity", aes(fill = color)) +
  theme(legend.position = "none") + labs(y = "")
clarity_plot_mean <- ggplot(clarity_df_mean, aes(x = clarity, y = AveragePrice)) +
  geom_bar(stat = "identity", aes(fill = clarity)) +
  theme(legend.position = "none") + labs(y = "")
grid.arrange(cut_plot_mean, color_plot_mean, clarity_plot_mean, ncol = 3)
Counterintuitively, the best cut, Ideal, has the lowest mean price, and the best color and clarity grades likewise show low average prices. Since the mean is sensitive to outliers, let us also compare the median prices.
cut_df_median <- diamonds %>%
  group_by(cut) %>%
  summarise(MedianPrice = median(price))
color_df_median <- diamonds %>%
  group_by(color) %>%
  summarise(MedianPrice = median(price))
clarity_df_median <- diamonds %>%
  group_by(clarity) %>%
  summarise(MedianPrice = median(price))
cut_plot_median <- ggplot(cut_df_median, aes(x = cut, y = MedianPrice)) +
  geom_bar(stat = "identity", aes(fill = cut)) +
  theme(legend.position = "none")
color_plot_median <- ggplot(color_df_median, aes(x = color, y = MedianPrice)) +
  geom_bar(stat = "identity", aes(fill = color)) +
  theme(legend.position = "none") + labs(y = "")
clarity_plot_median <- ggplot(clarity_df_median, aes(x = clarity, y = MedianPrice)) +
  geom_bar(stat = "identity", aes(fill = clarity)) +
  theme(legend.position = "none") + labs(y = "")
grid.arrange(cut_plot_median, color_plot_median, clarity_plot_median, ncol = 3)
The same pattern holds for the medians: the best grades have the lowest median prices. We should keep this in mind when building our model.
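One likely explanation, and an assumption worth verifying, is that carat dominates price and higher-grade stones tend to be smaller. A quick sketch comparing the average carat by cut:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# average carat weight per cut grade, smallest first
diamonds %>%
  group_by(cut) %>%
  summarise(AverageCarat = mean(carat)) %>%
  arrange(AverageCarat)
```

In the full dataset, Ideal stones have the smallest average carat, which helps explain their low average price despite the superior cut.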
Now, we can compute the correlations between the numerical variables using correlate() from the corrr package.
corr_df <- correlate(diamonds %>% select(-cut, -clarity, -color))
corr_df
## # A tibble: 7 x 8
## term carat depth table price x y z
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 carat NA 0.0283 0.182 0.922 0.978 0.954 0.961
## 2 depth 0.0283 NA -0.296 -0.0107 -0.0250 -0.0291 0.0950
## 3 table 0.182 -0.296 NA 0.127 0.196 0.184 0.152
## 4 price 0.922 -0.0107 0.127 NA 0.887 0.868 0.868
## 5 x 0.978 -0.0250 0.196 0.887 NA 0.975 0.975
## 6 y 0.954 -0.0291 0.184 0.868 0.975 NA 0.957
## 7 z 0.961 0.0950 0.152 0.868 0.975 0.957 NA
We can identify the variables most strongly correlated with price by filtering on the absolute value of the correlation coefficient.
corr_df %>%
  focus(price) %>%
  filter(abs(price) > 0.8) %>%
  transmute(Property = term, Correlation = round(price, 4))
## # A tibble: 4 x 2
## Property Correlation
## <chr> <dbl>
## 1 carat 0.922
## 2 x 0.887
## 3 y 0.868
## 4 z 0.868
Only carat, x, y, and z have a correlation with price above 0.8. When building the model, we might consider excluding the other columns.
First, we define our test and train datasets.
set.seed(503)
diamonds_test <- diamonds %>%
  mutate(diamond_id = row_number()) %>%
  group_by(cut, color, clarity) %>%
  sample_frac(0.2) %>%
  ungroup()
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
                            diamonds_test, by = "diamond_id")
model_1 <- rpart(price~.-diamond_id, data=diamonds_train, method = "anova")
fancyRpartPlot(model_1, caption = "")
printcp(model_1)
##
## Regression tree:
## rpart(formula = price ~ . - diamond_id, data = diamonds_train,
## method = "anova")
##
## Variables actually used in tree construction:
## [1] carat clarity color y
##
## Root node error: 6.8738e+11/43126 = 15938776
##
## n= 43126
##
## CP nsplit rel error xerror xstd
## 1 0.609624 0 1.000000 1.00006 0.0098265
## 2 0.184802 1 0.390376 0.39042 0.0043747
## 3 0.033672 2 0.205574 0.20562 0.0022953
## 4 0.026470 3 0.171902 0.17203 0.0022923
## 5 0.025926 4 0.145431 0.14795 0.0020325
## 6 0.010196 5 0.119505 0.12095 0.0017433
## 7 0.010000 7 0.099112 0.11065 0.0016092
Now, we predict on the test data to measure the model's performance. The resulting mean absolute error (MAE) is 837.71.
prediction_1 <- predict(model_1, diamonds_test)
mean_abs_err_1 <- mean(abs(diamonds_test$price - prediction_1))
print(paste0("Mean Absolute Error : ", round(mean_abs_err_1,2)))
## [1] "Mean Absolute Error : 837.71"
model_2 <- glm(price~.-diamond_id, data=diamonds_train, family = Gamma(link = "identity"), start = rep(0.5, 24))
summary(model_2)
##
## Call:
## glm(formula = price ~ . - diamond_id, family = Gamma(link = "identity"),
## data = diamonds_train, start = rep(0.5, 24))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9389 -0.1884 -0.0453 0.1478 4.6570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1072.6217 435.4727 2.463 0.01378 *
## carat 6096.9840 67.7107 90.045 < 2e-16 ***
## cut.L 49.4370 28.8913 1.711 0.08706 .
## cut.Q -29.6650 23.1591 -1.281 0.20023
## cut.C -59.9690 18.4383 -3.252 0.00115 **
## cut^4 -37.5264 12.0368 -3.118 0.00182 **
## color.L -251.6945 15.9879 -15.743 < 2e-16 ***
## color.Q -66.1849 15.3618 -4.308 1.65e-05 ***
## color.C 21.3193 13.8491 1.539 0.12371
## color^4 49.6046 12.6117 3.933 8.39e-05 ***
## color^5 -2.1316 11.6665 -0.183 0.85502
## color^6 -49.1810 11.2997 -4.352 1.35e-05 ***
## clarity.L 600.2553 33.2919 18.030 < 2e-16 ***
## clarity.Q -176.3864 31.8666 -5.535 3.13e-08 ***
## clarity.C 122.5323 26.7297 4.584 4.57e-06 ***
## clarity^4 -122.0145 20.7338 -5.885 4.01e-09 ***
## clarity^5 114.3344 15.7773 7.247 4.34e-13 ***
## clarity^6 -57.0050 12.5953 -4.526 6.03e-06 ***
## clarity^7 58.8607 10.9191 5.391 7.06e-08 ***
## depth -29.1333 4.8997 -5.946 2.77e-09 ***
## table -0.8093 2.9066 -0.278 0.78069
## x -59.2253 32.5655 -1.819 0.06897 .
## y -39.8610 21.4311 -1.860 0.06290 .
## z -35.3536 25.9934 -1.360 0.17381
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.6917126)
##
## Null deviance: 42342.1 on 43125 degrees of freedom
## Residual deviance: 3497.2 on 43102 degrees of freedom
## AIC: 686300
##
## Number of Fisher Scoring iterations: 25
The mean absolute error of the second model, a Gamma GLM with identity link, is 937.92.
prediction_2 <- predict(model_2, diamonds_test)
mean_abs_err_2 <- mean(abs(diamonds_test$price - prediction_2))
print(paste0("Mean Absolute Error : ", round(mean_abs_err_2,2)))
## [1] "Mean Absolute Error : 937.92"
model_3 <- lm(price~.-diamond_id, data=diamonds_train)
summary(model_3)
##
## Call:
## lm(formula = price ~ . - diamond_id, data = diamonds_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22022.3 -586.1 -182.3 372.5 10754.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6552.822 445.198 14.719 < 2e-16 ***
## carat 11582.968 57.702 200.737 < 2e-16 ***
## cut.L 588.681 25.095 23.458 < 2e-16 ***
## cut.Q -298.598 20.044 -14.897 < 2e-16 ***
## cut.C 153.292 17.240 8.892 < 2e-16 ***
## cut^4 -26.108 13.749 -1.899 0.05759 .
## color.L -1959.744 19.312 -101.477 < 2e-16 ***
## color.Q -686.413 17.553 -39.104 < 2e-16 ***
## color.C -164.255 16.369 -10.034 < 2e-16 ***
## color^4 28.394 15.037 1.888 0.05899 .
## color^5 -99.937 14.203 -7.036 2.00e-12 ***
## color^6 -58.213 12.910 -4.509 6.53e-06 ***
## clarity.L 4079.765 33.742 120.911 < 2e-16 ***
## clarity.Q -1942.226 31.481 -61.695 < 2e-16 ***
## clarity.C 978.432 26.921 36.344 < 2e-16 ***
## clarity^4 -370.350 21.477 -17.244 < 2e-16 ***
## clarity^5 239.237 17.524 13.652 < 2e-16 ***
## clarity^6 14.165 15.246 0.929 0.35287
## clarity^7 85.858 13.453 6.382 1.76e-10 ***
## depth -66.888 5.084 -13.158 < 2e-16 ***
## table -28.313 3.231 -8.762 < 2e-16 ***
## x -1118.347 36.909 -30.300 < 2e-16 ***
## y 15.908 19.452 0.818 0.41346
## z -99.063 38.376 -2.581 0.00984 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1123 on 43102 degrees of freedom
## Multiple R-squared: 0.9209, Adjusted R-squared: 0.9208
## F-statistic: 2.181e+04 on 23 and 43102 DF, p-value: < 2.2e-16
The mean absolute error of the third model, a standard linear model, is 740.89.
prediction_3 <- predict(model_3, diamonds_test)
mean_abs_err_3 <- mean(abs(diamonds_test$price - prediction_3))
print(paste0("Mean Absolute Error : ", round(mean_abs_err_3,2)))
## [1] "Mean Absolute Error : 740.89"
Among the three models, the linear model performs best, giving the lowest mean absolute error. However, a better model could be developed in future work through feature engineering and by adding new features to the model.
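As one concrete feature-engineering direction (a sketch, not part of the report's models): price and carat are both heavily right-skewed, so a log-log specification is a common improvement.

```r
library(ggplot2)  # provides the diamonds dataset

# linear model on the log scale; the log(carat) coefficient
# acts as a price elasticity with respect to weight
model_log <- lm(log(price) ~ log(carat) + cut + color + clarity,
                data = diamonds)
summary(model_log)$r.squared
```

Note that to compare its MAE with the models above, predictions must be back-transformed with exp() (ideally with a bias correction), since the model is fit on log(price).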