The diamonds data consists of 10 variables: 3 of them are categorical (cut, color and clarity) and the rest are numeric. Train and test datasets are generated from the diamonds dataset. The test dataset is created by randomly sampling 20% of the observations (stratified by cut, color and clarity), and the train dataset contains the remaining 80%. The first rows of the dataset are shown below.
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
# packages used throughout: dplyr/ggplot2 (via tidyverse), rpart, rpart.plot, Metrics
library(tidyverse)
library(rpart)
library(rpart.plot)
library(Metrics)
# 20% stratified sample (by cut, color, clarity) forms the test set
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>%
  group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
#diamonds_test$cut <- as.factor(diamonds_test$cut)
#diamonds_test$color <- as.factor(diamonds_test$color)
#diamonds_test$clarity <- as.factor(diamonds_test$clarity)
# the remaining 80% of the rows form the train set
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
                            diamonds_test, by = "diamond_id")
train <- diamonds_train %>%
  select(Price = price, Carat = carat, Cut = cut, Color = color, Clarity = clarity,
         depth = depth, table = table, x = x, y = y, z = z)
test <- diamonds_test %>%
  select(Price = price, Carat = carat, Cut = cut, Color = color, Clarity = clarity,
         depth = depth, table = table, x = x, y = y, z = z)
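Since the sampling is stratified rather than an exact cut, the resulting sizes are only roughly 80% and 20%; a quick check of the split (not part of the original chunks) could be done like this:
# sanity check: roughly 80% / 20% of the 53,940 rows
nrow(train)
nrow(test)
nrow(test) / nrow(diamonds)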
In order to build a CART model, a regression tree is created. Each internal node of the tree holds a boolean condition on one of the variables, and the result of that condition decides which branch is followed next. At the bottom of the tree, each leaf reports the observations that satisfy the conditions along its path and the corresponding percentage of the data.
# fit a regression tree (method = "anova") for Price on all predictors
tree <- rpart(Price ~ Carat+Cut+Color+Clarity+depth+table+x+y+z, data = train, method = "anova")
rpart.plot(tree, type = 3, digits = 3, fallen.leaves = TRUE)
When the tree is done, it is time to make predictions. The CART model partitions the observations into groups (the leaves of the tree) and predicts the mean price of the leaf an observation falls into; the tree itself was grown on the train dataset. The first few predictions are shown below.
## 1 2 3 4 5 6
## 10940.832 1050.302 5401.156 5401.156 3060.143 3060.143
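The chunk that generated these predictions is not echoed above; a minimal sketch, assuming the fitted tree is applied to the test set and using the hypothetical object name pred, would be:
# predictions from the first tree (assumption: generated on the test set)
pred <- predict(tree, newdata = test)
head(pred)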
These predictions need to be evaluated somehow. For that, the Metrics package is used to compute the mean squared error (MSE) and mean absolute error (MAE).
MSE
## [1] 1902054
MAE
## [1] 890.0513
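The evaluation chunk is likewise not echoed; assuming the hypothetical pred vector above, the metrics could be computed with the mse() and mae() functions from Metrics:
# error metrics for the first tree (assumption: evaluated against the test prices)
MSE <- mse(test$Price, pred)
MAE <- mae(test$Price, pred)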
However, this result can be improved. Fitting a linear model shows which variables significantly affect the dependent variable. According to the model summary below, the y and z variables are not significant and can be eliminated.
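The fitting chunk itself is not shown, but based on the Call in the summary below it would be along these lines (the object name model is only an assumption):
# linear model used to screen for significant predictors
model <- lm(Price ~ Carat + Cut + Color + Clarity + depth + table + x + y + z, data = train)
summary(model)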
##
## Call:
## lm(formula = Price ~ Carat + Cut + Color + Clarity + depth +
## table + x + y + z, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21446.0 -593.0 -182.4 378.8 10701.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6032.795 440.677 13.690 < 2e-16 ***
## Carat 11306.047 55.032 205.445 < 2e-16 ***
## Cut.L 587.057 25.144 23.347 < 2e-16 ***
## Cut.Q -300.964 20.114 -14.963 < 2e-16 ***
## Cut.C 148.352 17.317 8.567 < 2e-16 ***
## Cut^4 -26.568 13.826 -1.922 0.0547 .
## Color.L -1949.838 19.404 -100.484 < 2e-16 ***
## Color.Q -672.931 17.640 -38.148 < 2e-16 ***
## Color.C -161.554 16.458 -9.816 < 2e-16 ***
## Color^4 25.973 15.117 1.718 0.0858 .
## Color^5 -98.819 14.278 -6.921 4.55e-12 ***
## Color^6 -59.129 12.977 -4.556 5.22e-06 ***
## Clarity.L 4129.825 33.880 121.897 < 2e-16 ***
## Clarity.Q -1954.978 31.602 -61.863 < 2e-16 ***
## Clarity.C 997.867 27.030 36.918 < 2e-16 ***
## Clarity^4 -382.021 21.572 -17.709 < 2e-16 ***
## Clarity^5 242.941 17.611 13.795 < 2e-16 ***
## Clarity^6 12.562 15.328 0.820 0.4125
## Clarity^7 87.466 13.525 6.467 1.01e-10 ***
## depth -65.123 4.992 -13.045 < 2e-16 ***
## table -29.084 3.250 -8.950 < 2e-16 ***
## x -1020.664 34.775 -29.350 < 2e-16 ***
## y -1.438 19.396 -0.074 0.9409
## z -38.006 33.858 -1.123 0.2616
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1130 on 43119 degrees of freedom
## Multiple R-squared: 0.9202, Adjusted R-squared: 0.9202
## F-statistic: 2.162e+04 on 23 and 43119 DF, p-value: < 2.2e-16
# refit the regression tree without the insignificant y and z variables
tree2 <- rpart(Price ~ Carat+Cut+Color+Clarity+depth+table+x, data = train, method = "anova")
rpart.plot(tree2, type = 3, digits = 3, fallen.leaves = TRUE)
When these variables are excluded, the resulting tree looks as shown above. The first few predictions from this revised tree are:
## 1 2 3 4 5 6
## 8293.524 1050.302 5397.237 5397.237 3060.143 3060.143
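As before, the prediction chunk is not echoed; under the same assumptions (test set, hypothetical name pred2) it would be:
# predictions from the revised tree (assumption: generated on the test set)
pred2 <- predict(tree2, newdata = test)
head(pred2)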
If the CART model is revised without these variables, the MSE and MAE are as below.
MSE
## [1] 1619502
MAE
## [1] 841.1341
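The corresponding metrics would be computed in the same way as before (again using the hypothetical pred2):
# error metrics for the revised tree
mse(test$Price, pred2)
mae(test$Price, pred2)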
It can be seen that the second approach improves on the first: the revised tree lowers the MAE by roughly 5% in this case.
## [1] -0.05495989
## [1] "-5%"
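The -0.055 value matches the relative change in MAE between the two trees; a sketch of that comparison, plugging in the MAE values reported above and assuming the percentage string was produced with round() and paste0(), is:
# relative change in MAE between the first and the revised tree
improvement <- (841.1341 - 890.0513) / 890.0513
improvement                             # about -0.055
paste0(round(improvement * 100), "%")   # "-5%"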