Price Estimation Model on Diamonds Dataset
The data frame consists of both numeric and categorical variables:
str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Since price will be the response variable in the model, we first need to check whether any of the numeric variables are correlated with it. We prepare a longer data frame for use with ggplot:
longer_diamonds <- diamonds %>% select(1, 5,6,7,8,9, 10) %>% pivot_longer(-price, names_to = "names", values_to="values")
longer_diamonds
## # A tibble: 323,640 x 3
## price names values
## <int> <chr> <dbl>
## 1 326 carat 0.23
## 2 326 depth 61.5
## 3 326 table 55
## 4 326 x 3.95
## 5 326 y 3.98
## 6 326 z 2.43
## 7 326 carat 0.21
## 8 326 depth 59.8
## 9 326 table 61
## 10 326 x 3.89
## # ... with 323,630 more rows
By visual inspection, the carat, x, y and z variables appear to be strongly correlated with price:
ggplot(longer_diamonds, aes(x=price, y= values)) +
geom_point() +
facet_wrap(~ names) +
scale_y_log10()
The cor() results confirm that we can drop the depth and table variables from the model, since they are only weakly correlated with price:
cor(diamonds$price, diamonds$carat)
## [1] 0.9215913
cor(diamonds$price, diamonds$depth)
## [1] -0.0106474
cor(diamonds$price, diamonds$table)
## [1] 0.1271339
cor(diamonds$price, diamonds$x)
## [1] 0.8844352
cor(diamonds$price, diamonds$y)
## [1] 0.8654209
cor(diamonds$price, diamonds$z)
## [1] 0.8612494
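The six correlations above can also be computed in one call; a minimal sketch using base R's sapply():

```r
# Correlation of each numeric predictor with price, in one call
num_vars <- c("carat", "depth", "table", "x", "y", "z")
sapply(diamonds[num_vars], function(v) cor(diamonds$price, v))
```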
diamonds_new <- diamonds %>% select(7, 2, 3, 4, 1, 8, 9, 10)
str(diamonds_new)
## tibble [53,940 x 8] (S3: tbl_df/tbl/data.frame)
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
head(diamonds_new)
## # A tibble: 6 x 8
## price cut color clarity carat x y z
## <int> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl>
## 1 326 Ideal E SI2 0.23 3.95 3.98 2.43
## 2 326 Premium E SI1 0.21 3.89 3.84 2.31
## 3 327 Good E VS1 0.23 4.05 4.07 2.31
## 4 334 Premium I VS2 0.290 4.2 4.23 2.63
## 5 335 Good J SI2 0.31 4.34 4.35 2.75
## 6 336 Very Good J VVS2 0.24 3.94 3.96 2.48
We need two sets of data, one for training and one for testing. To split them, we first need the total number of rows in the diamonds data frame:
n <- nrow(diamonds_new)
The split will use an 80/20 ratio:
n_train <- round(0.80 * n)
For reproducibility of the whole process, we set a seed:
set.seed(87)
train_indices <- sample(1:n, n_train)
diamonds_train is the subset of rows whose indices were sampled into train_indices; we will use this data frame to train the model.
diamonds_train <- diamonds_new[train_indices, ]
diamonds_test is the subset of rows whose indices were not sampled; we will use this data frame to test the model.
diamonds_test <- diamonds_new[-train_indices, ]
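As a quick sanity check (a sketch added here, not part of the original analysis), the two subsets should partition the 53,940 rows 80/20:

```r
# 80% of the 53,940 rows go to training, the rest to testing
nrow(diamonds_train)   # 43152
nrow(diamonds_test)    # 10788
nrow(diamonds_train) + nrow(diamonds_test) == nrow(diamonds_new)  # TRUE
```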
diamonds_model <- rpart(price ~ ., data = diamonds_train)
summary(diamonds_model)
## Call:
## rpart(formula = price ~ ., data = diamonds_train)
## n= 43152
##
## CP nsplit rel error xerror xstd
## 1 0.60838568 0 1.0000000 1.0000383 0.009847565
## 2 0.18548941 1 0.3916143 0.3916439 0.004400847
## 3 0.03380623 2 0.2061249 0.2061762 0.002316826
## 4 0.02639393 3 0.1723187 0.1724946 0.002315555
## 5 0.02586428 4 0.1459248 0.1502857 0.002035272
## 6 0.01000000 5 0.1200605 0.1207104 0.001738025
##
## Variable importance
## carat y x z clarity color
## 25 24 24 23 3 1
##
## Node number 1: 43152 observations, complexity param=0.6083857
## mean=3922.407, MSE=1.577363e+07
## left son=2 (27894 obs) right son=3 (15258 obs)
## Primary splits:
## carat < 0.995 to the left, improve=0.60838570, (0 missing)
## y < 6.325 to the left, improve=0.60678510, (0 missing)
## x < 6.335 to the left, improve=0.60367980, (0 missing)
## z < 3.935 to the left, improve=0.59772800, (0 missing)
## color splits as LLLLRRR, improve=0.02114851, (0 missing)
## Surrogate splits:
## x < 6.275 to the left, agree=0.983, adj=0.953, (0 split)
## y < 6.265 to the left, agree=0.981, adj=0.947, (0 split)
## z < 3.895 to the left, agree=0.977, adj=0.936, (0 split)
## clarity splits as RRLLLLLL, agree=0.679, adj=0.092, (0 split)
## color splits as LLLLLRR, agree=0.661, adj=0.041, (0 split)
##
## Node number 2: 27894 observations, complexity param=0.03380623
## mean=1631.282, MSE=1239638
## left son=4 (19550 obs) right son=5 (8344 obs)
## Primary splits:
## carat < 0.605 to the left, improve=0.66546260, (0 missing)
## y < 5.535 to the left, improve=0.66470980, (0 missing)
## x < 5.485 to the left, improve=0.66228850, (0 missing)
## z < 3.375 to the left, improve=0.66148210, (0 missing)
## clarity splits as RRRLLLLL, improve=0.01218183, (0 missing)
## Surrogate splits:
## x < 5.455 to the left, agree=0.992, adj=0.972, (0 split)
## y < 5.465 to the left, agree=0.990, adj=0.967, (0 split)
## z < 3.365 to the left, agree=0.990, adj=0.966, (0 split)
## clarity splits as RRLLLLLL, agree=0.718, adj=0.056, (0 split)
## cut splits as RLLLL, agree=0.709, adj=0.027, (0 split)
##
## Node number 3: 15258 observations, complexity param=0.1854894
## mean=8110.94, MSE=1.520378e+07
## left son=6 (10339 obs) right son=7 (4919 obs)
## Primary splits:
## y < 7.195 to the left, improve=0.54425530, (0 missing)
## carat < 1.495 to the left, improve=0.53760550, (0 missing)
## x < 7.195 to the left, improve=0.53623800, (0 missing)
## z < 4.435 to the left, improve=0.52603750, (0 missing)
## clarity splits as LLLRRRRR, improve=0.05411494, (0 missing)
## Surrogate splits:
## x < 7.215 to the left, agree=0.984, adj=0.952, (0 split)
## carat < 1.405 to the left, agree=0.981, adj=0.941, (0 split)
## z < 4.435 to the left, agree=0.965, adj=0.892, (0 split)
## color splits as LLLLLLR, agree=0.681, adj=0.010, (0 split)
##
## Node number 4: 19550 observations
## mean=1037.916, MSE=252318.7
##
## Node number 5: 8344 observations
## mean=3021.541, MSE=795177.8
##
## Node number 6: 10339 observations, complexity param=0.02639393
## mean=6126.782, MSE=4686199
## left son=12 (7885 obs) right son=13 (2454 obs)
## Primary splits:
## clarity splits as LLLLRRRR, improve=0.3707981, (0 missing)
## y < 6.755 to the left, improve=0.1201337, (0 missing)
## x < 6.775 to the left, improve=0.1072869, (0 missing)
## carat < 1.185 to the left, improve=0.1059556, (0 missing)
## color splits as RRRRLLL, improve=0.1035364, (0 missing)
##
## Node number 7: 4919 observations, complexity param=0.02586428
## mean=12281.34, MSE=1.164316e+07
## left son=14 (3209 obs) right son=15 (1710 obs)
## Primary splits:
## y < 7.855 to the left, improve=0.30738690, (0 missing)
## x < 7.895 to the left, improve=0.29667200, (0 missing)
## carat < 1.905 to the left, improve=0.28916230, (0 missing)
## z < 4.805 to the left, improve=0.27542900, (0 missing)
## clarity splits as LRRRRRRR, improve=0.07275093, (0 missing)
## Surrogate splits:
## x < 7.905 to the left, agree=0.982, adj=0.947, (0 split)
## carat < 1.865 to the left, agree=0.972, adj=0.919, (0 split)
## z < 4.825 to the left, agree=0.952, adj=0.861, (0 split)
## clarity splits as RRLLLLLL, agree=0.680, adj=0.080, (0 split)
##
## Node number 12: 7885 observations
## mean=5391.396, MSE=2047998
##
## Node number 13: 2454 observations
## mean=8489.668, MSE=5842194
##
## Node number 14: 3209 observations
## mean=10900.35, MSE=8620826
##
## Node number 15: 1710 observations
## mean=14872.92, MSE=7019639
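The rpart() call above relies on the default control parameters. If a deeper tree were desired, they could be tuned via rpart.control; the settings below are purely illustrative, not part of the original model:

```r
# Hypothetical tuning: a smaller cp lets the tree keep weaker splits,
# producing a deeper tree than the default cp = 0.01
deeper_model <- rpart(price ~ ., data = diamonds_train,
                      control = rpart.control(cp = 0.001, minsplit = 20))
```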
fancyRpartPlot(diamonds_model)
printcp(diamonds_model)
##
## Regression tree:
## rpart(formula = price ~ ., data = diamonds_train)
##
## Variables actually used in tree construction:
## [1] carat clarity y
##
## Root node error: 6.8066e+11/43152 = 15773633
##
## n= 43152
##
## CP nsplit rel error xerror xstd
## 1 0.608386 0 1.00000 1.00004 0.0098476
## 2 0.185489 1 0.39161 0.39164 0.0044008
## 3 0.033806 2 0.20612 0.20618 0.0023168
## 4 0.026394 3 0.17232 0.17249 0.0023156
## 5 0.025864 4 0.14592 0.15029 0.0020353
## 6 0.010000 5 0.12006 0.12071 0.0017380
plotcp(diamonds_model)
prediction <- predict(diamonds_model, diamonds_test)
head(prediction,50)
## 1 2 3 4 5 6 7 8
## 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916
## 9 10 11 12 13 14 15 16
## 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916
## 17 18 19 20 21 22 23 24
## 1037.916 1037.916 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541
## 25 26 27 28 29 30 31 32
## 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541
## 33 34 35 36 37 38 39 40
## 3021.541 3021.541 3021.541 3021.541 1037.916 1037.916 3021.541 3021.541
## 41 42 43 44 45 46 47 48
## 3021.541 3021.541 3021.541 3021.541 3021.541 1037.916 1037.916 3021.541
## 49 50
## 1037.916 3021.541
prune_diamonds_model <- prune(diamonds_model,
cp=diamonds_model$cptable[which.min(diamonds_model$cptable[,"xerror"]),"CP"])
fancyRpartPlot(prune_diamonds_model)
prediction_11 <- predict(prune_diamonds_model, newdata = diamonds_test)
The squared correlation between actual and predicted prices shows a strong fit:
cor(diamonds_test$price, prediction_11)^2
## [1] 0.8804278
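The R² above measures linear association; a complementary check (an addition to this write-up, not in the original) is the test-set RMSE of the pruned model, which is on the scale of price itself:

```r
# Root-mean-square prediction error in dollars
rmse <- sqrt(mean((diamonds_test$price - prediction_11)^2))
rmse
```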
diamonds_new_2 <- mutate(diamonds_new,
clarity.I1=ifelse(clarity=='I1',1,0),
clarity.SI2=ifelse(clarity=='SI2',1,0),
clarity.SI1=ifelse(clarity=='SI1',1,0),
clarity.VS2=ifelse(clarity=='VS2',1,0),
clarity.VS1=ifelse(clarity=='VS1',1,0),
clarity.VVS2=ifelse(clarity=='VVS2',1,0),
clarity.VVS1=ifelse(clarity=='VVS1',1,0),
clarity.IF=ifelse(clarity=='IF',1,0),
color.J=ifelse(color=='J',1,0),
color.I=ifelse(color=='I',1,0),
color.H=ifelse(color=='H',1,0),
color.G=ifelse(color=='G',1,0),
color.F=ifelse(color=='F',1,0),
color.E=ifelse(color=='E',1,0),
color.D=ifelse(color=='D',1,0),
cut.Fair=ifelse(cut=='Fair',1,0),
cut.Good=ifelse(cut=='Good',1,0),
cut.VeryGood=ifelse(cut=='Very Good',1,0),
cut.Premium=ifelse(cut=='Premium',1,0),
cut.Ideal=ifelse(cut=='Ideal',1,0)
) %>% select(-clarity) %>% select(-color) %>% select(-cut)
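The manual ifelse() dummies above make the encoding explicit. A more compact alternative, sketched here under the assumption that plain 0/1 indicators are wanted (ordered factors otherwise get polynomial contrasts), is model.matrix():

```r
# Drop the ordering so levels are treated as unordered categories
dm <- diamonds
dm[c("cut", "color", "clarity")] <- lapply(dm[c("cut", "color", "clarity")],
                                           factor, ordered = FALSE)
# Without an intercept the first factor keeps all its levels; the
# others drop one reference level each (treatment contrasts)
dummies <- model.matrix(~ 0 + cut + color + clarity, data = dm)
```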
str(diamonds_new_2)
## tibble [53,940 x 25] (S3: tbl_df/tbl/data.frame)
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## $ clarity.I1 : num [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
## $ clarity.SI2 : num [1:53940] 1 0 0 0 1 0 0 0 0 0 ...
## $ clarity.SI1 : num [1:53940] 0 1 0 0 0 0 0 1 0 0 ...
## $ clarity.VS2 : num [1:53940] 0 0 0 1 0 0 0 0 1 0 ...
## $ clarity.VS1 : num [1:53940] 0 0 1 0 0 0 0 0 0 1 ...
## $ clarity.VVS2: num [1:53940] 0 0 0 0 0 1 0 0 0 0 ...
## $ clarity.VVS1: num [1:53940] 0 0 0 0 0 0 1 0 0 0 ...
## $ clarity.IF : num [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
## $ color.J : num [1:53940] 0 0 0 0 1 1 0 0 0 0 ...
## $ color.I : num [1:53940] 0 0 0 1 0 0 1 0 0 0 ...
## $ color.H : num [1:53940] 0 0 0 0 0 0 0 1 0 1 ...
## $ color.G : num [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
## $ color.F : num [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
## $ color.E : num [1:53940] 1 1 1 0 0 0 0 0 1 0 ...
## $ color.D : num [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
## $ cut.Fair : num [1:53940] 0 0 0 0 0 0 0 0 1 0 ...
## $ cut.Good : num [1:53940] 0 0 1 0 1 0 0 0 0 0 ...
## $ cut.VeryGood: num [1:53940] 0 0 0 0 0 1 1 1 0 1 ...
## $ cut.Premium : num [1:53940] 0 1 0 1 0 0 0 0 0 0 ...
## $ cut.Ideal : num [1:53940] 1 0 0 0 0 0 0 0 0 0 ...
head(diamonds_new_2)
## # A tibble: 6 x 25
## price carat x y z clarity.I1 clarity.SI2 clarity.SI1 clarity.VS2
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 326 0.23 3.95 3.98 2.43 0 1 0 0
## 2 326 0.21 3.89 3.84 2.31 0 0 1 0
## 3 327 0.23 4.05 4.07 2.31 0 0 0 0
## 4 334 0.290 4.2 4.23 2.63 0 0 0 1
## 5 335 0.31 4.34 4.35 2.75 0 1 0 0
## 6 336 0.24 3.94 3.96 2.48 0 0 0 0
## # ... with 16 more variables: clarity.VS1 <dbl>, clarity.VVS2 <dbl>,
## # clarity.VVS1 <dbl>, clarity.IF <dbl>, color.J <dbl>, color.I <dbl>,
## # color.H <dbl>, color.G <dbl>, color.F <dbl>, color.E <dbl>, color.D <dbl>,
## # cut.Fair <dbl>, cut.Good <dbl>, cut.VeryGood <dbl>, cut.Premium <dbl>,
## # cut.Ideal <dbl>
n <- nrow(diamonds_new_2)
n_train <- round(0.80 * n)
set.seed(87)
train_indices <- sample(1:n, n_train)
diamonds_train_2 <- diamonds_new_2[train_indices, ]
diamonds_test_2 <- diamonds_new_2[-train_indices, ]
diamonds_model_2 <- rpart(price ~ ., data = diamonds_train_2)
summary(diamonds_model_2)
## Call:
## rpart(formula = price ~ ., data = diamonds_train_2)
## n= 43152
##
## CP nsplit rel error xerror xstd
## 1 0.60838568 0 1.0000000 1.0000383 0.009847565
## 2 0.18548941 1 0.3916143 0.3916439 0.004400847
## 3 0.03380623 2 0.2061249 0.2061762 0.002316826
## 4 0.02586428 3 0.1723187 0.1724946 0.002315555
## 5 0.01163139 4 0.1464544 0.1471034 0.002139989
## 6 0.01015340 5 0.1348230 0.1354726 0.002001479
## 7 0.01000000 6 0.1246696 0.1275272 0.001897575
##
## Variable importance
## carat y x z clarity.SI2 color.J
## 25 24 24 24 2 1
##
## Node number 1: 43152 observations, complexity param=0.6083857
## mean=3922.407, MSE=1.577363e+07
## left son=2 (27894 obs) right son=3 (15258 obs)
## Primary splits:
## carat < 0.995 to the left, improve=0.60838570, (0 missing)
## y < 6.325 to the left, improve=0.60678510, (0 missing)
## x < 6.335 to the left, improve=0.60367980, (0 missing)
## z < 3.935 to the left, improve=0.59772800, (0 missing)
## clarity.SI2 < 0.5 to the left, improve=0.01708196, (0 missing)
## Surrogate splits:
## x < 6.275 to the left, agree=0.983, adj=0.953, (0 split)
## y < 6.265 to the left, agree=0.981, adj=0.947, (0 split)
## z < 3.895 to the left, agree=0.977, adj=0.936, (0 split)
## clarity.SI2 < 0.5 to the left, agree=0.673, adj=0.075, (0 split)
## color.J < 0.5 to the left, agree=0.658, adj=0.032, (0 split)
##
## Node number 2: 27894 observations, complexity param=0.03380623
## mean=1631.282, MSE=1239638
## left son=4 (19550 obs) right son=5 (8344 obs)
## Primary splits:
## carat < 0.605 to the left, improve=0.665462600, (0 missing)
## y < 5.535 to the left, improve=0.664709800, (0 missing)
## x < 5.485 to the left, improve=0.662288500, (0 missing)
## z < 3.375 to the left, improve=0.661482100, (0 missing)
## clarity.SI2 < 0.5 to the left, improve=0.008864988, (0 missing)
## Surrogate splits:
## x < 5.455 to the left, agree=0.992, adj=0.972, (0 split)
## y < 5.465 to the left, agree=0.990, adj=0.967, (0 split)
## z < 3.365 to the left, agree=0.990, adj=0.966, (0 split)
## clarity.SI2 < 0.5 to the left, agree=0.715, adj=0.047, (0 split)
## cut.Fair < 0.5 to the left, agree=0.709, adj=0.027, (0 split)
##
## Node number 3: 15258 observations, complexity param=0.1854894
## mean=8110.94, MSE=1.520378e+07
## left son=6 (10339 obs) right son=7 (4919 obs)
## Primary splits:
## y < 7.195 to the left, improve=0.54425530, (0 missing)
## carat < 1.495 to the left, improve=0.53760550, (0 missing)
## x < 7.195 to the left, improve=0.53623800, (0 missing)
## z < 4.435 to the left, improve=0.52603750, (0 missing)
## clarity.I1 < 0.5 to the right, improve=0.01851794, (0 missing)
## Surrogate splits:
## x < 7.215 to the left, agree=0.984, adj=0.952, (0 split)
## carat < 1.405 to the left, agree=0.981, adj=0.941, (0 split)
## z < 4.435 to the left, agree=0.965, adj=0.892, (0 split)
## color.J < 0.5 to the left, agree=0.681, adj=0.010, (0 split)
##
## Node number 4: 19550 observations
## mean=1037.916, MSE=252318.7
##
## Node number 5: 8344 observations
## mean=3021.541, MSE=795177.8
##
## Node number 6: 10339 observations, complexity param=0.01163139
## mean=6126.782, MSE=4686199
## left son=12 (2682 obs) right son=13 (7657 obs)
## Primary splits:
## clarity.SI2 < 0.5 to the right, improve=0.1634049, (0 missing)
## clarity.VVS2 < 0.5 to the left, improve=0.1452055, (0 missing)
## y < 6.755 to the left, improve=0.1201337, (0 missing)
## x < 6.775 to the left, improve=0.1072869, (0 missing)
## carat < 1.185 to the left, improve=0.1059556, (0 missing)
## Surrogate splits:
## carat < 1.525 to the right, agree=0.741, adj=0.001, (0 split)
## y < 5.925 to the left, agree=0.741, adj=0.001, (0 split)
##
## Node number 7: 4919 observations, complexity param=0.02586428
## mean=12281.34, MSE=1.164316e+07
## left son=14 (3209 obs) right son=15 (1710 obs)
## Primary splits:
## y < 7.855 to the left, improve=0.30738690, (0 missing)
## x < 7.895 to the left, improve=0.29667200, (0 missing)
## carat < 1.905 to the left, improve=0.28916230, (0 missing)
## z < 4.805 to the left, improve=0.27542900, (0 missing)
## clarity.I1 < 0.5 to the right, improve=0.07275093, (0 missing)
## Surrogate splits:
## x < 7.905 to the left, agree=0.982, adj=0.947, (0 split)
## carat < 1.865 to the left, agree=0.972, adj=0.919, (0 split)
## z < 4.825 to the left, agree=0.952, adj=0.861, (0 split)
## clarity.SI2 < 0.5 to the left, agree=0.683, adj=0.088, (0 split)
##
## Node number 12: 2682 observations
## mean=4648.209, MSE=878219.5
##
## Node number 13: 7657 observations, complexity param=0.0101534
## mean=6644.679, MSE=4986046
## left son=26 (2782 obs) right son=27 (4875 obs)
## Primary splits:
## clarity.SI1 < 0.5 to the right, improve=0.1810212, (0 missing)
## clarity.VVS2 < 0.5 to the left, improve=0.1317752, (0 missing)
## y < 6.775 to the left, improve=0.1245704, (0 missing)
## x < 6.775 to the left, improve=0.1135592, (0 missing)
## clarity.IF < 0.5 to the left, improve=0.1029377, (0 missing)
## Surrogate splits:
## clarity.VS2 < 0.5 to the left, agree=0.646, adj=0.027, (0 split)
## x < 6.155 to the left, agree=0.637, adj=0.000, (0 split)
##
## Node number 14: 3209 observations
## mean=10900.35, MSE=8620826
##
## Node number 15: 1710 observations
## mean=14872.92, MSE=7019639
##
## Node number 26: 2782 observations
## mean=5387.052, MSE=1236565
##
## Node number 27: 4875 observations
## mean=7362.364, MSE=5708098
fancyRpartPlot(diamonds_model_2)
printcp(diamonds_model_2)
##
## Regression tree:
## rpart(formula = price ~ ., data = diamonds_train_2)
##
## Variables actually used in tree construction:
## [1] carat clarity.SI1 clarity.SI2 y
##
## Root node error: 6.8066e+11/43152 = 15773633
##
## n= 43152
##
## CP nsplit rel error xerror xstd
## 1 0.608386 0 1.00000 1.00004 0.0098476
## 2 0.185489 1 0.39161 0.39164 0.0044008
## 3 0.033806 2 0.20612 0.20618 0.0023168
## 4 0.025864 3 0.17232 0.17249 0.0023156
## 5 0.011631 4 0.14645 0.14710 0.0021400
## 6 0.010153 5 0.13482 0.13547 0.0020015
## 7 0.010000 6 0.12467 0.12753 0.0018976
plotcp(diamonds_model_2)
prediction_2 <- predict(diamonds_model_2, newdata = diamonds_test_2)
head(prediction_2, 50)
## 1 2 3 4 5 6 7 8
## 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916
## 9 10 11 12 13 14 15 16
## 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916 1037.916
## 17 18 19 20 21 22 23 24
## 1037.916 1037.916 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541
## 25 26 27 28 29 30 31 32
## 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541 3021.541
## 33 34 35 36 37 38 39 40
## 3021.541 3021.541 3021.541 3021.541 1037.916 1037.916 3021.541 3021.541
## 41 42 43 44 45 46 47 48
## 3021.541 3021.541 3021.541 3021.541 3021.541 1037.916 1037.916 3021.541
## 49 50
## 1037.916 3021.541
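For a comparison with the first model, the same R² check can be applied to the dummy-encoded model (this step is added here for completeness; no value is reported in the original):

```r
# Squared correlation between actual and predicted test-set prices
cor(diamonds_test_2$price, prediction_2)^2
```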