Price Estimation Model on Diamonds Dataset

A Glance at the Data

Structure

The data frame consists of both numeric and categorical variables:

str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Visual Inspection

Preparing Data for ggplot

Since price will be the response variable in the model, we first need to check whether any of the numeric variables are correlated with it. We prepare a longer data frame for use with ggplot:

longer_diamonds <- diamonds %>%
  select(carat, depth, table, price, x, y, z) %>%
  pivot_longer(-price, names_to = "names", values_to = "values")
longer_diamonds
## # A tibble: 323,640 x 3
##    price names values
##    <int> <chr>  <dbl>
##  1   326 carat   0.23
##  2   326 depth  61.5 
##  3   326 table  55   
##  4   326 x       3.95
##  5   326 y       3.98
##  6   326 z       2.43
##  7   326 carat   0.21
##  8   326 depth  59.8 
##  9   326 table  61   
## 10   326 x       3.89
## # ... with 323,630 more rows

Correlation Detection by Visual Inspection

By visual inspection, the carat, x, y, and z variables appear strongly correlated with price:

ggplot(longer_diamonds, aes(x=price, y= values)) +
  geom_point() +
  facet_wrap(~ names) +
  scale_y_log10()

Examining Correlation with cor() Function

Based on the cor() results, we can drop the depth and table variables from the model, since they are only weakly correlated with price:

cor(diamonds$price, diamonds$carat)
## [1] 0.9215913
cor(diamonds$price, diamonds$depth)
## [1] -0.0106474
cor(diamonds$price, diamonds$table)
## [1] 0.1271339
cor(diamonds$price, diamonds$x)
## [1] 0.8844352
cor(diamonds$price, diamonds$y)
## [1] 0.8654209
cor(diamonds$price, diamonds$z)
## [1] 0.8612494
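The six separate cor() calls above can be collapsed into a single pass over the numeric columns with sapply(). A minimal base-R sketch, illustrated on a small synthetic data frame so it runs standalone (with the real data, replace df by diamonds and list all six predictor names):

```r
# Correlations of a response with several numeric columns in one pass.
# Toy data frame standing in for diamonds (first rows of price/carat/depth).
df <- data.frame(price = c(326, 326, 327, 334, 335),
                 carat = c(0.23, 0.21, 0.23, 0.29, 0.31),
                 depth = c(61.5, 59.8, 56.9, 62.4, 63.3))
num_vars <- setdiff(names(df), "price")
cors <- sapply(num_vars, function(v) cor(df$price, df[[v]]))
cors
```

Sorting the result with sort(cors, decreasing = TRUE) makes the strong/weak split visible at a glance.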

Properties of the New Dataframe Without depth and table

diamonds_new <- diamonds %>% select(price, cut, color, clarity, carat, x, y, z)
str(diamonds_new)
## tibble [53,940 x 8] (S3: tbl_df/tbl/data.frame)
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
head(diamonds_new)
## # A tibble: 6 x 8
##   price cut       color clarity carat     x     y     z
##   <int> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl>
## 1   326 Ideal     E     SI2     0.23   3.95  3.98  2.43
## 2   326 Premium   E     SI1     0.21   3.89  3.84  2.31
## 3   327 Good      E     VS1     0.23   4.05  4.07  2.31
## 4   334 Premium   I     VS2     0.290  4.2   4.23  2.63
## 5   335 Good      J     SI2     0.31   4.34  4.35  2.75
## 6   336 Very Good J     VVS2    0.24   3.94  3.96  2.48

Model Preparation - Splitting Dataset

We need two data sets: one for training and one for testing.

Total Row Number

First we need the total number of rows of the diamonds_new data frame:

n <- nrow(diamonds_new)

Definition of n_train

The split will be conducted with an 80/20 ratio:

n_train <- round(0.80 * n)

Definition of Randomness

For reproducibility of the whole process:

set.seed(87)

Training Indices

train_indices <- sample(1:n, n_train)

Definitions of Sets

Training Set

A subset containing the rows of diamonds_new at the training indices. We will use this data frame to train the model.

diamonds_train <- diamonds_new[train_indices, ]

Test Set

A subset containing the rows of diamonds_new not in the training indices. We will use this data frame to test the model.

diamonds_test <- diamonds_new[-train_indices, ]
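A quick self-contained check of the splitting logic: the training and test indices should partition 1:n with no overlap, at an ~80/20 ratio. Sketched here on a stand-in n = 100 rather than the real nrow(diamonds_new):

```r
# Sanity check of the 80/20 split (base R, no dependencies).
n <- 100                       # stand-in for nrow(diamonds_new)
n_train <- round(0.80 * n)
set.seed(87)
train_indices <- sample(1:n, n_train)
test_indices <- setdiff(1:n, train_indices)
length(train_indices)  # 80
length(test_indices)   # 20
```

Because sample() draws without replacement, every row lands in exactly one of the two sets.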

Model

Setting the Model

diamonds_model <- rpart(price ~ ., data = diamonds_train)

Summary of Model

summary(diamonds_model)
## Call:
## rpart(formula = price ~ ., data = diamonds_train)
##   n= 43152 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.60838568      0 1.0000000 1.0000383 0.009847565
## 2 0.18548941      1 0.3916143 0.3916439 0.004400847
## 3 0.03380623      2 0.2061249 0.2061762 0.002316826
## 4 0.02639393      3 0.1723187 0.1724946 0.002315555
## 5 0.02586428      4 0.1459248 0.1502857 0.002035272
## 6 0.01000000      5 0.1200605 0.1207104 0.001738025
## 
## Variable importance
##   carat       y       x       z clarity   color 
##      25      24      24      23       3       1 
## 
## Node number 1: 43152 observations,    complexity param=0.6083857
##   mean=3922.407, MSE=1.577363e+07 
##   left son=2 (27894 obs) right son=3 (15258 obs)
##   Primary splits:
##       carat < 0.995 to the left,  improve=0.60838570, (0 missing)
##       y     < 6.325 to the left,  improve=0.60678510, (0 missing)
##       x     < 6.335 to the left,  improve=0.60367980, (0 missing)
##       z     < 3.935 to the left,  improve=0.59772800, (0 missing)
##       color splits as  LLLLRRR,   improve=0.02114851, (0 missing)
##   Surrogate splits:
##       x       < 6.275 to the left,  agree=0.983, adj=0.953, (0 split)
##       y       < 6.265 to the left,  agree=0.981, adj=0.947, (0 split)
##       z       < 3.895 to the left,  agree=0.977, adj=0.936, (0 split)
##       clarity splits as  RRLLLLLL,  agree=0.679, adj=0.092, (0 split)
##       color   splits as  LLLLLRR,   agree=0.661, adj=0.041, (0 split)
## 
## Node number 2: 27894 observations,    complexity param=0.03380623
##   mean=1631.282, MSE=1239638 
##   left son=4 (19550 obs) right son=5 (8344 obs)
##   Primary splits:
##       carat   < 0.605 to the left,  improve=0.66546260, (0 missing)
##       y       < 5.535 to the left,  improve=0.66470980, (0 missing)
##       x       < 5.485 to the left,  improve=0.66228850, (0 missing)
##       z       < 3.375 to the left,  improve=0.66148210, (0 missing)
##       clarity splits as  RRRLLLLL,  improve=0.01218183, (0 missing)
##   Surrogate splits:
##       x       < 5.455 to the left,  agree=0.992, adj=0.972, (0 split)
##       y       < 5.465 to the left,  agree=0.990, adj=0.967, (0 split)
##       z       < 3.365 to the left,  agree=0.990, adj=0.966, (0 split)
##       clarity splits as  RRLLLLLL,  agree=0.718, adj=0.056, (0 split)
##       cut     splits as  RLLLL,     agree=0.709, adj=0.027, (0 split)
## 
## Node number 3: 15258 observations,    complexity param=0.1854894
##   mean=8110.94, MSE=1.520378e+07 
##   left son=6 (10339 obs) right son=7 (4919 obs)
##   Primary splits:
##       y       < 7.195 to the left,  improve=0.54425530, (0 missing)
##       carat   < 1.495 to the left,  improve=0.53760550, (0 missing)
##       x       < 7.195 to the left,  improve=0.53623800, (0 missing)
##       z       < 4.435 to the left,  improve=0.52603750, (0 missing)
##       clarity splits as  LLLRRRRR,  improve=0.05411494, (0 missing)
##   Surrogate splits:
##       x     < 7.215 to the left,  agree=0.984, adj=0.952, (0 split)
##       carat < 1.405 to the left,  agree=0.981, adj=0.941, (0 split)
##       z     < 4.435 to the left,  agree=0.965, adj=0.892, (0 split)
##       color splits as  LLLLLLR,   agree=0.681, adj=0.010, (0 split)
## 
## Node number 4: 19550 observations
##   mean=1037.916, MSE=252318.7 
## 
## Node number 5: 8344 observations
##   mean=3021.541, MSE=795177.8 
## 
## Node number 6: 10339 observations,    complexity param=0.02639393
##   mean=6126.782, MSE=4686199 
##   left son=12 (7885 obs) right son=13 (2454 obs)
##   Primary splits:
##       clarity splits as  LLLLRRRR,  improve=0.3707981, (0 missing)
##       y       < 6.755 to the left,  improve=0.1201337, (0 missing)
##       x       < 6.775 to the left,  improve=0.1072869, (0 missing)
##       carat   < 1.185 to the left,  improve=0.1059556, (0 missing)
##       color   splits as  RRRRLLL,   improve=0.1035364, (0 missing)
## 
## Node number 7: 4919 observations,    complexity param=0.02586428
##   mean=12281.34, MSE=1.164316e+07 
##   left son=14 (3209 obs) right son=15 (1710 obs)
##   Primary splits:
##       y       < 7.855 to the left,  improve=0.30738690, (0 missing)
##       x       < 7.895 to the left,  improve=0.29667200, (0 missing)
##       carat   < 1.905 to the left,  improve=0.28916230, (0 missing)
##       z       < 4.805 to the left,  improve=0.27542900, (0 missing)
##       clarity splits as  LRRRRRRR,  improve=0.07275093, (0 missing)
##   Surrogate splits:
##       x       < 7.905 to the left,  agree=0.982, adj=0.947, (0 split)
##       carat   < 1.865 to the left,  agree=0.972, adj=0.919, (0 split)
##       z       < 4.825 to the left,  agree=0.952, adj=0.861, (0 split)
##       clarity splits as  RRLLLLLL,  agree=0.680, adj=0.080, (0 split)
## 
## Node number 12: 7885 observations
##   mean=5391.396, MSE=2047998 
## 
## Node number 13: 2454 observations
##   mean=8489.668, MSE=5842194 
## 
## Node number 14: 3209 observations
##   mean=10900.35, MSE=8620826 
## 
## Node number 15: 1710 observations
##   mean=14872.92, MSE=7019639

Tree View of Model

fancyRpartPlot(diamonds_model)

Details of Model

printcp(diamonds_model)
## 
## Regression tree:
## rpart(formula = price ~ ., data = diamonds_train)
## 
## Variables actually used in tree construction:
## [1] carat   clarity y      
## 
## Root node error: 6.8066e+11/43152 = 15773633
## 
## n= 43152 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.608386      0   1.00000 1.00004 0.0098476
## 2 0.185489      1   0.39161 0.39164 0.0044008
## 3 0.033806      2   0.20612 0.20618 0.0023168
## 4 0.026394      3   0.17232 0.17249 0.0023156
## 5 0.025864      4   0.14592 0.15029 0.0020353
## 6 0.010000      5   0.12006 0.12071 0.0017380

Plot View

plotcp(diamonds_model)

Prediction

Prediction values from Model

prediction <- predict(diamonds_model, diamonds_test)

head(prediction,50)
##        1        2        3        4        5        6        7        8 
## 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 
##        9       10       11       12       13       14       15       16 
## 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 
##       17       18       19       20       21       22       23       24 
## 1037.916 1037.916 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541 
##       25       26       27       28       29       30       31       32 
## 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541 
##       33       34       35       36       37       38       39       40 
## 3021.541 3021.541 3021.541 3021.541 1037.916 1037.916 3021.541 3021.541 
##       41       42       43       44       45       46       47       48 
## 3021.541 3021.541 3021.541 3021.541 3021.541 1037.916 1037.916 3021.541 
##       49       50 
## 1037.916 3021.541

Pruning the Tree for an Optimal Decision Tree

prune_diamonds_model <- prune(diamonds_model, 
                cp=diamonds_model$cptable[which.min(diamonds_model$cptable[,"xerror"]),"CP"])

fancyRpartPlot(prune_diamonds_model)               
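The cp chosen above minimizes the cross-validated error (xerror). A common alternative is the one-standard-error rule: pick the simplest tree whose xerror is within one xstd of the minimum. A sketch using the cptable values copied from the printcp() output above (with a fitted model you would read diamonds_model$cptable directly):

```r
# cptable values transcribed from printcp(diamonds_model)
cptab <- data.frame(
  CP     = c(0.608386, 0.185489, 0.033806, 0.026394, 0.025864, 0.010000),
  xerror = c(1.00004, 0.39164, 0.20618, 0.17249, 0.15029, 0.12071),
  xstd   = c(0.0098476, 0.0044008, 0.0023168, 0.0023156, 0.0020353, 0.0017380)
)
# threshold = min xerror + its one standard error
thresh <- min(cptab$xerror) + cptab$xstd[which.min(cptab$xerror)]
# rows are ordered from simplest to most complex, so the first
# qualifying row is the simplest acceptable tree
cp_1se <- cptab$CP[which(cptab$xerror <= thresh)[1]]
cp_1se
```

With this particular cptable the 1-SE rule selects the same cp = 0.01 tree as the minimum-xerror rule, so pruning changes nothing here; on other data the two rules can differ.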

New Prediction with Pruned Model

prediction_11 <- predict(prune_diamonds_model, newdata = diamonds_test)

RESULT: Correlation of the Pruned Model's Predictions

The squared correlation between observed and predicted prices is high:

cor(diamonds_test$price, prediction_11)^2
## [1] 0.8804278
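The squared correlation measures linear association; the test-set RMSE, reported in the same units as price, is a useful complement. A minimal sketch, shown on toy vectors so it runs standalone (with the document's objects, substitute diamonds_test$price and prediction_11):

```r
# Root mean squared error of predictions against observed values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# toy illustration; replace with diamonds_test$price and prediction_11
rmse(c(326, 334, 335), c(300, 340, 330))
```

An RMSE of a few hundred dollars on diamonds priced in the thousands would indicate a usable model; RMSE near the standard deviation of price would not.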

EXTRA: Additional Version of Model

Turning the categorical variables into binary (dummy) variables:

diamonds_new_2 <- mutate(diamonds_new, 
                         clarity.I1=ifelse(clarity=='I1',1,0),
                         clarity.SI2=ifelse(clarity=='SI2',1,0),
                         clarity.SI1=ifelse(clarity=='SI1',1,0),
                         clarity.VS2=ifelse(clarity=='VS2',1,0),
                         clarity.VS1=ifelse(clarity=='VS1',1,0),
                         clarity.VVS2=ifelse(clarity=='VVS2',1,0),
                         clarity.VVS1=ifelse(clarity=='VVS1',1,0),
                         clarity.IF=ifelse(clarity=='IF',1,0),
                         color.J=ifelse(color=='J',1,0),
                         color.I=ifelse(color=='I',1,0),
                         color.H=ifelse(color=='H',1,0),
                         color.G=ifelse(color=='G',1,0),
                         color.F=ifelse(color=='F',1,0),
                         color.E=ifelse(color=='E',1,0),
                         color.D=ifelse(color=='D',1,0),
                         cut.Fair=ifelse(cut=='Fair',1,0),
                         cut.Good=ifelse(cut=='Good',1,0),
                         cut.VeryGood=ifelse(cut=='Very Good',1,0),
                         cut.Premium=ifelse(cut=='Premium',1,0),
                         cut.Ideal=ifelse(cut=='Ideal',1,0)
                         ) %>% select(-clarity) %>% select(-color) %>% select(-cut)
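The twenty ifelse() calls above can also be generated in one step with base R's model.matrix(), which builds an indicator column per factor level. A sketch on a small unordered factor (note: cut, color, and clarity in diamonds are ordered factors, so it may be safer to convert them first with factor(x, ordered = FALSE) to avoid polynomial contrasts):

```r
# One-hot encoding of a factor via model.matrix (base R).
cut <- factor(c("Ideal", "Premium", "Good"),
              levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
# `~ cut - 1` drops the intercept, giving one 0/1 column per level
dummies <- model.matrix(~ cut - 1)
colnames(dummies)
```

Each row of the result has exactly one 1, matching the hand-built ifelse() columns.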

The same steps as before:

str(diamonds_new_2)
## tibble [53,940 x 25] (S3: tbl_df/tbl/data.frame)
##  $ price       : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ carat       : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ x           : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y           : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z           : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  $ clarity.I1  : num [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ clarity.SI2 : num [1:53940] 1 0 0 0 1 0 0 0 0 0 ...
##  $ clarity.SI1 : num [1:53940] 0 1 0 0 0 0 0 1 0 0 ...
##  $ clarity.VS2 : num [1:53940] 0 0 0 1 0 0 0 0 1 0 ...
##  $ clarity.VS1 : num [1:53940] 0 0 1 0 0 0 0 0 0 1 ...
##  $ clarity.VVS2: num [1:53940] 0 0 0 0 0 1 0 0 0 0 ...
##  $ clarity.VVS1: num [1:53940] 0 0 0 0 0 0 1 0 0 0 ...
##  $ clarity.IF  : num [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ color.J     : num [1:53940] 0 0 0 0 1 1 0 0 0 0 ...
##  $ color.I     : num [1:53940] 0 0 0 1 0 0 1 0 0 0 ...
##  $ color.H     : num [1:53940] 0 0 0 0 0 0 0 1 0 1 ...
##  $ color.G     : num [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ color.F     : num [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ color.E     : num [1:53940] 1 1 1 0 0 0 0 0 1 0 ...
##  $ color.D     : num [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ cut.Fair    : num [1:53940] 0 0 0 0 0 0 0 0 1 0 ...
##  $ cut.Good    : num [1:53940] 0 0 1 0 1 0 0 0 0 0 ...
##  $ cut.VeryGood: num [1:53940] 0 0 0 0 0 1 1 1 0 1 ...
##  $ cut.Premium : num [1:53940] 0 1 0 1 0 0 0 0 0 0 ...
##  $ cut.Ideal   : num [1:53940] 1 0 0 0 0 0 0 0 0 0 ...
head(diamonds_new_2)
## # A tibble: 6 x 25
##   price carat     x     y     z clarity.I1 clarity.SI2 clarity.SI1 clarity.VS2
##   <int> <dbl> <dbl> <dbl> <dbl>      <dbl>       <dbl>       <dbl>       <dbl>
## 1   326 0.23   3.95  3.98  2.43          0           1           0           0
## 2   326 0.21   3.89  3.84  2.31          0           0           1           0
## 3   327 0.23   4.05  4.07  2.31          0           0           0           0
## 4   334 0.290  4.2   4.23  2.63          0           0           0           1
## 5   335 0.31   4.34  4.35  2.75          0           1           0           0
## 6   336 0.24   3.94  3.96  2.48          0           0           0           0
## # ... with 16 more variables: clarity.VS1 <dbl>, clarity.VVS2 <dbl>,
## #   clarity.VVS1 <dbl>, clarity.IF <dbl>, color.J <dbl>, color.I <dbl>,
## #   color.H <dbl>, color.G <dbl>, color.F <dbl>, color.E <dbl>, color.D <dbl>,
## #   cut.Fair <dbl>, cut.Good <dbl>, cut.VeryGood <dbl>, cut.Premium <dbl>,
## #   cut.Ideal <dbl>
n <- nrow(diamonds_new_2)
n_train <- round(0.80 * n)
set.seed(87)
train_indices <- sample(1:n, n_train)

diamonds_train_2 <- diamonds_new_2[train_indices, ]
diamonds_test_2 <- diamonds_new_2[-train_indices, ]

diamonds_model_2 <- rpart(price ~ ., data = diamonds_train_2)

summary(diamonds_model_2)
## Call:
## rpart(formula = price ~ ., data = diamonds_train_2)
##   n= 43152 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.60838568      0 1.0000000 1.0000383 0.009847565
## 2 0.18548941      1 0.3916143 0.3916439 0.004400847
## 3 0.03380623      2 0.2061249 0.2061762 0.002316826
## 4 0.02586428      3 0.1723187 0.1724946 0.002315555
## 5 0.01163139      4 0.1464544 0.1471034 0.002139989
## 6 0.01015340      5 0.1348230 0.1354726 0.002001479
## 7 0.01000000      6 0.1246696 0.1275272 0.001897575
## 
## Variable importance
##       carat           y           x           z clarity.SI2     color.J 
##          25          24          24          24           2           1 
## 
## Node number 1: 43152 observations,    complexity param=0.6083857
##   mean=3922.407, MSE=1.577363e+07 
##   left son=2 (27894 obs) right son=3 (15258 obs)
##   Primary splits:
##       carat       < 0.995 to the left,  improve=0.60838570, (0 missing)
##       y           < 6.325 to the left,  improve=0.60678510, (0 missing)
##       x           < 6.335 to the left,  improve=0.60367980, (0 missing)
##       z           < 3.935 to the left,  improve=0.59772800, (0 missing)
##       clarity.SI2 < 0.5   to the left,  improve=0.01708196, (0 missing)
##   Surrogate splits:
##       x           < 6.275 to the left,  agree=0.983, adj=0.953, (0 split)
##       y           < 6.265 to the left,  agree=0.981, adj=0.947, (0 split)
##       z           < 3.895 to the left,  agree=0.977, adj=0.936, (0 split)
##       clarity.SI2 < 0.5   to the left,  agree=0.673, adj=0.075, (0 split)
##       color.J     < 0.5   to the left,  agree=0.658, adj=0.032, (0 split)
## 
## Node number 2: 27894 observations,    complexity param=0.03380623
##   mean=1631.282, MSE=1239638 
##   left son=4 (19550 obs) right son=5 (8344 obs)
##   Primary splits:
##       carat       < 0.605 to the left,  improve=0.665462600, (0 missing)
##       y           < 5.535 to the left,  improve=0.664709800, (0 missing)
##       x           < 5.485 to the left,  improve=0.662288500, (0 missing)
##       z           < 3.375 to the left,  improve=0.661482100, (0 missing)
##       clarity.SI2 < 0.5   to the left,  improve=0.008864988, (0 missing)
##   Surrogate splits:
##       x           < 5.455 to the left,  agree=0.992, adj=0.972, (0 split)
##       y           < 5.465 to the left,  agree=0.990, adj=0.967, (0 split)
##       z           < 3.365 to the left,  agree=0.990, adj=0.966, (0 split)
##       clarity.SI2 < 0.5   to the left,  agree=0.715, adj=0.047, (0 split)
##       cut.Fair    < 0.5   to the left,  agree=0.709, adj=0.027, (0 split)
## 
## Node number 3: 15258 observations,    complexity param=0.1854894
##   mean=8110.94, MSE=1.520378e+07 
##   left son=6 (10339 obs) right son=7 (4919 obs)
##   Primary splits:
##       y          < 7.195 to the left,  improve=0.54425530, (0 missing)
##       carat      < 1.495 to the left,  improve=0.53760550, (0 missing)
##       x          < 7.195 to the left,  improve=0.53623800, (0 missing)
##       z          < 4.435 to the left,  improve=0.52603750, (0 missing)
##       clarity.I1 < 0.5   to the right, improve=0.01851794, (0 missing)
##   Surrogate splits:
##       x       < 7.215 to the left,  agree=0.984, adj=0.952, (0 split)
##       carat   < 1.405 to the left,  agree=0.981, adj=0.941, (0 split)
##       z       < 4.435 to the left,  agree=0.965, adj=0.892, (0 split)
##       color.J < 0.5   to the left,  agree=0.681, adj=0.010, (0 split)
## 
## Node number 4: 19550 observations
##   mean=1037.916, MSE=252318.7 
## 
## Node number 5: 8344 observations
##   mean=3021.541, MSE=795177.8 
## 
## Node number 6: 10339 observations,    complexity param=0.01163139
##   mean=6126.782, MSE=4686199 
##   left son=12 (2682 obs) right son=13 (7657 obs)
##   Primary splits:
##       clarity.SI2  < 0.5   to the right, improve=0.1634049, (0 missing)
##       clarity.VVS2 < 0.5   to the left,  improve=0.1452055, (0 missing)
##       y            < 6.755 to the left,  improve=0.1201337, (0 missing)
##       x            < 6.775 to the left,  improve=0.1072869, (0 missing)
##       carat        < 1.185 to the left,  improve=0.1059556, (0 missing)
##   Surrogate splits:
##       carat < 1.525 to the right, agree=0.741, adj=0.001, (0 split)
##       y     < 5.925 to the left,  agree=0.741, adj=0.001, (0 split)
## 
## Node number 7: 4919 observations,    complexity param=0.02586428
##   mean=12281.34, MSE=1.164316e+07 
##   left son=14 (3209 obs) right son=15 (1710 obs)
##   Primary splits:
##       y          < 7.855 to the left,  improve=0.30738690, (0 missing)
##       x          < 7.895 to the left,  improve=0.29667200, (0 missing)
##       carat      < 1.905 to the left,  improve=0.28916230, (0 missing)
##       z          < 4.805 to the left,  improve=0.27542900, (0 missing)
##       clarity.I1 < 0.5   to the right, improve=0.07275093, (0 missing)
##   Surrogate splits:
##       x           < 7.905 to the left,  agree=0.982, adj=0.947, (0 split)
##       carat       < 1.865 to the left,  agree=0.972, adj=0.919, (0 split)
##       z           < 4.825 to the left,  agree=0.952, adj=0.861, (0 split)
##       clarity.SI2 < 0.5   to the left,  agree=0.683, adj=0.088, (0 split)
## 
## Node number 12: 2682 observations
##   mean=4648.209, MSE=878219.5 
## 
## Node number 13: 7657 observations,    complexity param=0.0101534
##   mean=6644.679, MSE=4986046 
##   left son=26 (2782 obs) right son=27 (4875 obs)
##   Primary splits:
##       clarity.SI1  < 0.5   to the right, improve=0.1810212, (0 missing)
##       clarity.VVS2 < 0.5   to the left,  improve=0.1317752, (0 missing)
##       y            < 6.775 to the left,  improve=0.1245704, (0 missing)
##       x            < 6.775 to the left,  improve=0.1135592, (0 missing)
##       clarity.IF   < 0.5   to the left,  improve=0.1029377, (0 missing)
##   Surrogate splits:
##       clarity.VS2 < 0.5   to the left,  agree=0.646, adj=0.027, (0 split)
##       x           < 6.155 to the left,  agree=0.637, adj=0.000, (0 split)
## 
## Node number 14: 3209 observations
##   mean=10900.35, MSE=8620826 
## 
## Node number 15: 1710 observations
##   mean=14872.92, MSE=7019639 
## 
## Node number 26: 2782 observations
##   mean=5387.052, MSE=1236565 
## 
## Node number 27: 4875 observations
##   mean=7362.364, MSE=5708098
fancyRpartPlot(diamonds_model_2)

printcp(diamonds_model_2)
## 
## Regression tree:
## rpart(formula = price ~ ., data = diamonds_train_2)
## 
## Variables actually used in tree construction:
## [1] carat       clarity.SI1 clarity.SI2 y          
## 
## Root node error: 6.8066e+11/43152 = 15773633
## 
## n= 43152 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.608386      0   1.00000 1.00004 0.0098476
## 2 0.185489      1   0.39161 0.39164 0.0044008
## 3 0.033806      2   0.20612 0.20618 0.0023168
## 4 0.025864      3   0.17232 0.17249 0.0023156
## 5 0.011631      4   0.14645 0.14710 0.0021400
## 6 0.010153      5   0.13482 0.13547 0.0020015
## 7 0.010000      6   0.12467 0.12753 0.0018976
plotcp(diamonds_model_2)

prediction_2 <- predict(diamonds_model_2, newdata = diamonds_test_2)
head(prediction_2, 50)
##        1        2        3        4        5        6        7        8 
## 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 
##        9       10       11       12       13       14       15       16 
## 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 
##       17       18       19       20       21       22       23       24 
## 1037.916 1037.916 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541 
##       25       26       27       28       29       30       31       32 
## 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541 
##       33       34       35       36       37       38       39       40 
## 3021.541 3021.541 3021.541 3021.541 1037.916 1037.916 3021.541 3021.541 
##       41       42       43       44       45       46       47       48 
## 3021.541 3021.541 3021.541 3021.541 3021.541 1037.916 1037.916 3021.541 
##       49       50 
## 1037.916 3021.541
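To complete the comparison, the second model can be scored the same way as the first, via the squared correlation between observed and predicted prices. A small helper, shown on toy vectors so the snippet runs standalone (with the document's objects, call rsq(diamonds_test_2$price, prediction_2)):

```r
# R-squared helper, mirroring the earlier evaluation step
rsq <- function(actual, predicted) cor(actual, predicted)^2

# toy illustration; replace with diamonds_test_2$price and prediction_2
rsq(c(326, 334, 335, 400), c(300, 340, 330, 390))
```

Comparing this value against the 0.8804278 obtained for the pruned first model shows whether the hand-built dummy encoding changed predictive quality.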

References

- Pruning Tree
- Categorical to binary