Data

The diamonds dataset consists of 10 variables: 3 of them are categorical (ordered factors) and the remaining 7 are numeric. Train and test datasets are generated from the diamonds data: the test set is a randomly sampled 20% of the observations, and the train set contains the remaining 80%.
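The split described above can be sketched as follows. This is a minimal illustration, not the report's exact code: the seed value and the object names `test_index`, `train`, and `test` are assumptions.

```r
library(ggplot2)  # provides the diamonds dataset

set.seed(503)  # hypothetical seed; the report does not state one

# Randomly pick 20% of the row indices for the test set
test_index <- sample(nrow(diamonds), size = floor(0.2 * nrow(diamonds)))
test  <- diamonds[test_index, ]
train <- diamonds[-test_index, ]
```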

## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows

To build a CART model, a decision tree has to be grown. Each internal node of the tree tests a boolean condition on one of the variables, and an observation follows the branch that matches the outcome of that test. Each leaf at the bottom of the tree shows the number of observations satisfying all the conditions along its path, together with their percentage of the data.

Once the tree is grown, it can be used for prediction. CART partitions the observations into leaves, and every observation falling into a leaf receives that leaf's mean price as its prediction. Here the predictions are generated on the train dataset.
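The fit-and-predict step can be sketched as below; default `rpart` settings and the object name `tree_model` are assumptions, not taken from the report.

```r
library(rpart)

# Grow a regression tree on the train set; price is the target
tree_model <- rpart(price ~ ., data = train)

# Each observation is assigned the mean price of the leaf it falls into
tree_pred <- predict(tree_model, newdata = train)
head(tree_pred)
```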

##         1         2         3         4         5         6 
## 10940.832  1050.302  5401.156  5401.156  3060.143  3060.143

These predictions should be tested somehow. For that reason the Metrics library is used to compute the mean squared error (MSE) and the mean absolute error (MAE).

MSE

## [1] 1902054

MAE

## [1] 890.0513
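The error metrics above can be computed with the Metrics package as sketched here, assuming `tree_pred` holds the tree's predictions on the train set (a hypothetical name, not from the report).

```r
library(Metrics)

# Compare actual prices against the tree's predictions
mse(train$price, tree_pred)  # mean squared error
mae(train$price, tree_pred)  # mean absolute error
```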

However, this result can be improved. Fitting a linear model reveals which variables significantly affect the dependent variable. According to that model, the y and z variables can be eliminated.
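A sketch of that linear model, mirroring the formula in the summary output below (the object name `lm_model` is an assumption); each coefficient's t-test p-value indicates whether the variable is significant.

```r
# Fit a linear model to screen for significant predictors of price
lm_model <- lm(Price ~ Carat + Cut + Color + Clarity + depth +
                 table + x + y + z, data = train)
summary(lm_model)
```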

## 
## Call:
## lm(formula = Price ~ Carat + Cut + Color + Clarity + depth + 
##     table + x + y + z, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21446.0   -593.0   -182.4    378.8  10701.8 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  6032.795    440.677   13.690  < 2e-16 ***
## Carat       11306.047     55.032  205.445  < 2e-16 ***
## Cut.L         587.057     25.144   23.347  < 2e-16 ***
## Cut.Q        -300.964     20.114  -14.963  < 2e-16 ***
## Cut.C         148.352     17.317    8.567  < 2e-16 ***
## Cut^4         -26.568     13.826   -1.922   0.0547 .  
## Color.L     -1949.838     19.404 -100.484  < 2e-16 ***
## Color.Q      -672.931     17.640  -38.148  < 2e-16 ***
## Color.C      -161.554     16.458   -9.816  < 2e-16 ***
## Color^4        25.973     15.117    1.718   0.0858 .  
## Color^5       -98.819     14.278   -6.921 4.55e-12 ***
## Color^6       -59.129     12.977   -4.556 5.22e-06 ***
## Clarity.L    4129.825     33.880  121.897  < 2e-16 ***
## Clarity.Q   -1954.978     31.602  -61.863  < 2e-16 ***
## Clarity.C     997.867     27.030   36.918  < 2e-16 ***
## Clarity^4    -382.021     21.572  -17.709  < 2e-16 ***
## Clarity^5     242.941     17.611   13.795  < 2e-16 ***
## Clarity^6      12.562     15.328    0.820   0.4125    
## Clarity^7      87.466     13.525    6.467 1.01e-10 ***
## depth         -65.123      4.992  -13.045  < 2e-16 ***
## table         -29.084      3.250   -8.950  < 2e-16 ***
## x           -1020.664     34.775  -29.350  < 2e-16 ***
## y              -1.438     19.396   -0.074   0.9409    
## z             -38.006     33.858   -1.123   0.2616    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1130 on 43119 degrees of freedom
## Multiple R-squared:  0.9202, Adjusted R-squared:  0.9202 
## F-statistic: 2.162e+04 on 23 and 43119 DF,  p-value: < 2.2e-16

When these variables are excluded and the tree is rebuilt, the new predictions on the train dataset are as follows.
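Refitting the tree without y and z can be sketched as below; the object names are assumptions.

```r
library(rpart)

# Drop y and z from the formula and regrow the tree
tree_model2 <- rpart(price ~ . - y - z, data = train)
tree_pred2  <- predict(tree_model2, newdata = train)
head(tree_pred2)
```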

##        1        2        3        4        5        6 
## 8293.524 1050.302 5397.237 5397.237 3060.143 3060.143

If the CART model is refit without these variables, the MSE and MAE are as follows.

MSE

## [1] 1619502

MAE

## [1] 841.1341

Comparing the two approaches shows an improvement: the MAE decreases by about 5%.
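The relative change can be computed directly from the two MAE values reported above; the variable names here are illustrative.

```r
mae_full    <- 890.0513   # MAE of the original tree
mae_reduced <- 841.1341   # MAE of the tree without y and z

# Relative change in MAE: negative means the reduced model is better
change <- (mae_reduced - mae_full) / mae_full
change
paste0(round(change * 100), "%")
```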

## [1] -0.05495989
## [1] "-5%"