R has diamonds dataset by default. The data are shown in details below;
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ve…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1,…
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 6…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 5…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 3…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2…
First 6 rows of the data;
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Summary of data with respect to variables is;
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
By checking correlation between price and numeric elements of diamond, we can figure out which variables are important for our price estimation model;
correlation_wrt_price <- data.frame(
cor_carat_to_price = cor(diamonds$carat, diamonds$price),
cor_depth_to_price = cor(diamonds$depth, diamonds$price),
cor_table_to_price = cor(diamonds$table, diamonds$price),
cor_x_to_price = cor(diamonds$x, diamonds$price),
cor_y_to_price = cor(diamonds$y, diamonds$price),
cor_z_to_price = cor(diamonds$z, diamonds$price)
)
print(correlation_wrt_price)
## cor_carat_to_price cor_depth_to_price cor_table_to_price cor_x_to_price
## 1 0.9215913 -0.0106474 0.1271339 0.8844352
## cor_y_to_price cor_z_to_price
## 1 0.8654209 0.8612494
As is seen above, price is highly correlated with carat, x, y and z since correlation values are close to one for those variables. However, depth and table are not important statistically.
Diamond price vs. carat of diamond scatterplot is shown below;
ggplot(diamonds) +
geom_point(aes(x = carat, y = price), alpha=0.4, color = "red") +
labs(title = "Diamond Price vs. Carat", x = "Carat of diamond", y = "Price of Diamond (in $)")
Diamond price vs. length scatterplot is shown below;
ggplot(diamonds) +
geom_point(aes(x = x, y = price), alpha=0.4, color = "dark grey") +
labs(title = "Diamond Price vs. Length", x = "Length of diamond (mm)", y = "Price of Diamond (in $)")
Diamond price vs. height scatterplot is shown below;
ggplot(diamonds) +
geom_point(aes(x = y, y = price), alpha=0.4, color = "blue") +
labs(title = "Diamond Price vs. Height", x = "Height of diamond (mm)", y = "Price of Diamond (in $)")
Diamond price vs. width scatterplot is shown below;
ggplot(diamonds) +
geom_point(aes(x = z, y = price), alpha=0.3, color = "green") +
labs(title = "Diamond Price vs. Width", x = "Width of diamond (mm)", y = "Price of Diamond (in $)")
Scatterplots also give some insight about correlation of variables.
In order to look into diamond classification data regarding diamond price and carat, we draw three colored scatterplots by clarity, cut, and color classes.
ggplot(diamonds) +
geom_point(aes(x = carat, y = price, color = clarity), alpha=0.65) +
labs(title = "Diamond Price vs. Carat by Clarity Classes", x = "Carat of Diamond", y = "Price of Diamond (in $)")
Clarity classification plot implies that clarity of diamond mean something for diamond’s price.
The other classification plots are as follow;
ggplot(diamonds) +
geom_point(aes(x = carat, y = price, color = cut), alpha=0.65) +
labs(title = "Diamond Price vs. Carat by Cut Classes", x = "Carat of Diamond", y = "Price of Diamond (in $)")
ggplot(diamonds) +
geom_point(aes(x = carat, y = price, color = color), alpha=0.65) +
labs(title = "Diamond Price vs. Carat by Color Classes", x = "Carat of Diamond", y = "Price of Diamond (in $)")
In order to get price estimation model, we divide our diamonds data for training and testing, 80% and 20% respectively.
#Seeding
set.seed(503)
#Diamonds Train Data and Test Data is defined.
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>%
group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
diamonds_test, by = "diamond_id")
Diamonds Train Data are shown below;
glimpse(diamonds_train)
## Rows: 43,143
## Columns: 11
## $ carat <dbl> 0.21, 0.23, 0.29, 0.31, 0.24, 0.26, 0.22, 0.23, 0.30, 0.23…
## $ cut <ord> Premium, Good, Premium, Good, Very Good, Very Good, Fair, …
## $ color <ord> E, E, I, J, I, H, E, H, J, J, F, J, E, I, J, J, J, H, J, G…
## $ clarity <ord> SI1, VS1, VS2, SI2, VVS1, SI1, VS2, VS1, SI1, VS1, SI1, SI…
## $ depth <dbl> 59.8, 56.9, 62.4, 63.3, 62.3, 61.9, 65.1, 59.4, 64.0, 62.8…
## $ table <dbl> 61, 65, 58, 58, 57, 55, 61, 61, 55, 56, 61, 54, 62, 54, 54…
## $ price <int> 326, 327, 334, 335, 336, 337, 337, 338, 339, 340, 342, 344…
## $ x <dbl> 3.89, 4.05, 4.20, 4.34, 3.95, 4.07, 3.87, 4.00, 4.25, 3.93…
## $ y <dbl> 3.84, 4.07, 4.23, 4.35, 3.98, 4.11, 3.78, 4.05, 4.28, 3.90…
## $ z <dbl> 2.31, 2.31, 2.63, 2.75, 2.47, 2.53, 2.49, 2.39, 2.73, 2.46…
## $ diamond_id <int> 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 2…
Diamonds Test Data are shown below;
glimpse(diamonds_test)
## Rows: 10,797
## Columns: 11
## $ carat <dbl> 1.70, 0.58, 1.00, 1.05, 0.70, 0.90, 1.01, 1.10, 0.62, 1.50…
## $ cut <ord> Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair…
## $ color <ord> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D…
## $ clarity <ord> I1, SI2, SI2, SI2, SI2, SI2, SI2, SI2, SI2, SI2, SI2, SI2,…
## $ depth <dbl> 64.7, 65.1, 65.6, 65.4, 64.5, 64.6, 64.6, 64.6, 64.6, 68.8…
## $ table <dbl> 56, 58, 66, 59, 60, 59, 56, 54, 57, 57, 58, 66, 60, 55, 58…
## $ price <int> 5617, 1156, 3767, 3816, 2167, 3847, 3003, 4725, 1410, 7469…
## $ x <dbl> 7.46, 5.25, 6.10, 6.30, 5.53, 6.04, 6.31, 6.56, 5.39, 6.90…
## $ y <dbl> 7.37, 5.22, 6.01, 6.24, 5.47, 6.01, 6.24, 6.49, 5.33, 6.86…
## $ z <dbl> 4.80, 3.41, 3.97, 4.10, 3.55, 3.89, 4.05, 4.22, 3.46, 4.73…
## $ diamond_id <int> 13776, 40690, 5152, 5359, 49828, 5510, 1555, 10151, 43397,…
In the exploratory data analysis, we found that depth and table are not important statistically for diamond price estimation model.
print(correlation_wrt_price)
## cor_carat_to_price cor_depth_to_price cor_table_to_price cor_x_to_price
## 1 0.9215913 -0.0106474 0.1271339 0.8844352
## cor_y_to_price cor_z_to_price
## 1 0.8654209 0.8612494
Therefore, in our price estimation model, we take predictors as carat, cut, color, clarity. x. y and z from training data.
price_estimation_model <- rpart(price ~ carat+cut+color+clarity+x+y+z , data=diamonds_train, method='anova')
summary(price_estimation_model)
## Call:
## rpart(formula = price ~ carat + cut + color + clarity + x + y +
## z, data = diamonds_train, method = "anova")
## n= 43143
##
## CP nsplit rel error xerror xstd
## 1 0.60913716 0 1.0000000 1.0000357 0.009817373
## 2 0.18517972 1 0.3908628 0.3909391 0.004372856
## 3 0.03361554 2 0.2056831 0.2061950 0.002303084
## 4 0.02661853 3 0.1720676 0.1726668 0.002299963
## 5 0.02583190 4 0.1454491 0.1525199 0.002060203
## 6 0.01000000 5 0.1196172 0.1209142 0.001742561
##
## Variable importance
## carat y x z clarity color
## 25 24 24 23 3 1
##
## Node number 1: 43143 observations, complexity param=0.6091372
## mean=3938.805, MSE=1.597919e+07
## left son=2 (27894 obs) right son=3 (15249 obs)
## Primary splits:
## carat < 0.995 to the left, improve=0.60913720, (0 missing)
## y < 6.345 to the left, improve=0.60805950, (0 missing)
## x < 6.315 to the left, improve=0.60428840, (0 missing)
## z < 3.915 to the left, improve=0.59942370, (0 missing)
## color splits as LLLLRRR, improve=0.02261306, (0 missing)
## Surrogate splits:
## x < 6.265 to the left, agree=0.983, adj=0.953, (0 split)
## y < 6.285 to the left, agree=0.981, adj=0.946, (0 split)
## z < 3.895 to the left, agree=0.978, adj=0.937, (0 split)
## clarity splits as RRLLLLLL, agree=0.679, adj=0.091, (0 split)
## color splits as LLLLLRR, agree=0.661, adj=0.040, (0 split)
##
## Node number 2: 27894 observations, complexity param=0.03361554
## mean=1632.056, MSE=1248233
## left son=4 (19820 obs) right son=5 (8074 obs)
## Primary splits:
## carat < 0.625 to the left, improve=0.66557750, (0 missing)
## y < 5.535 to the left, improve=0.66520910, (0 missing)
## x < 5.495 to the left, improve=0.66238230, (0 missing)
## z < 3.375 to the left, improve=0.66219560, (0 missing)
## clarity splits as RRRLLLLL, improve=0.01216575, (0 missing)
## Surrogate splits:
## x < 5.485 to the left, agree=0.993, adj=0.976, (0 split)
## y < 5.515 to the left, agree=0.991, adj=0.969, (0 split)
## z < 3.385 to the left, agree=0.991, adj=0.969, (0 split)
## clarity splits as RRLLLLLL, agree=0.725, adj=0.051, (0 split)
## cut splits as RLLLL, agree=0.719, adj=0.028, (0 split)
##
## Node number 3: 15249 observations, complexity param=0.1851797
## mean=8158.389, MSE=1.538716e+07
## left son=6 (10296 obs) right son=7 (4953 obs)
## Primary splits:
## y < 7.195 to the left, improve=0.54407490, (0 missing)
## x < 7.195 to the left, improve=0.53738530, (0 missing)
## carat < 1.495 to the left, improve=0.53615420, (0 missing)
## z < 4.435 to the left, improve=0.52631830, (0 missing)
## clarity splits as LLLRRRRR, improve=0.05470891, (0 missing)
## Surrogate splits:
## x < 7.185 to the left, agree=0.984, adj=0.950, (0 split)
## carat < 1.455 to the left, agree=0.980, adj=0.937, (0 split)
## z < 4.435 to the left, agree=0.964, adj=0.890, (0 split)
## color splits as LLLLLLR, agree=0.679, adj=0.012, (0 split)
##
## Node number 4: 19820 observations
## mean=1050.302, MSE=264735.3
##
## Node number 5: 8074 observations
## mean=3060.143, MSE=792289.4
##
## Node number 6: 10296 observations, complexity param=0.02661853
## mean=6151.569, MSE=4751812
## left son=12 (7824 obs) right son=13 (2472 obs)
## Primary splits:
## clarity splits as LLLLRRRR, improve=0.3750779, (0 missing)
## y < 6.785 to the left, improve=0.1210156, (0 missing)
## carat < 1.195 to the left, improve=0.1066457, (0 missing)
## color splits as RRRRLLL, improve=0.1054268, (0 missing)
## x < 6.805 to the left, improve=0.1042365, (0 missing)
## Surrogate splits:
## y < 4.175 to the right, agree=0.76, adj=0.001, (0 split)
##
## Node number 7: 4953 observations, complexity param=0.0258319
## mean=12330.05, MSE=1.172079e+07
## left son=14 (3223 obs) right son=15 (1730 obs)
## Primary splits:
## y < 7.855 to the left, improve=0.30675820, (0 missing)
## x < 7.845 to the left, improve=0.30102780, (0 missing)
## carat < 1.915 to the left, improve=0.29420540, (0 missing)
## z < 4.805 to the left, improve=0.27774230, (0 missing)
## clarity splits as LRRRRRRR, improve=0.07274662, (0 missing)
## Surrogate splits:
## x < 7.905 to the left, agree=0.983, adj=0.952, (0 split)
## carat < 1.895 to the left, agree=0.974, adj=0.924, (0 split)
## z < 4.825 to the left, agree=0.952, adj=0.862, (0 split)
## clarity splits as RRLLLLLL, agree=0.675, adj=0.068, (0 split)
##
## Node number 12: 7824 observations
## mean=5401.156, MSE=2073671
##
## Node number 13: 2472 observations
## mean=8526.661, MSE=5804893
##
## Node number 14: 3223 observations
## mean=10940.83, MSE=8720276
##
## Node number 15: 1730 observations
## mean=14918.16, MSE=7016977
Summary tells us that carat, x, y, and z are important variables for the model. Clarity is slightly important.
Decision tree of the model is as follows;
fancyRpartPlot(price_estimation_model)
Price estimation model in details;
printcp(price_estimation_model)
##
## Regression tree:
## rpart(formula = price ~ carat + cut + color + clarity + x + y +
## z, data = diamonds_train, method = "anova")
##
## Variables actually used in tree construction:
## [1] carat clarity y
##
## Root node error: 6.8939e+11/43143 = 15979193
##
## n= 43143
##
## CP nsplit rel error xerror xstd
## 1 0.609137 0 1.00000 1.00004 0.0098174
## 2 0.185180 1 0.39086 0.39094 0.0043729
## 3 0.033616 2 0.20568 0.20619 0.0023031
## 4 0.026619 3 0.17207 0.17267 0.0023000
## 5 0.025832 4 0.14545 0.15252 0.0020602
## 6 0.010000 5 0.11962 0.12091 0.0017426
plotcp(price_estimation_model)
By using diamonds_test data, we test our price estimation model and predict diamond prices. First twenty estimations are as follow;
predicted_Diamond_data <- predict(price_estimation_model, newdata = diamonds_test)
head(predicted_Diamond_data, 20)
## 1 2 3 4 5 6 7 8
## 10940.832 1050.302 5401.156 5401.156 3060.143 3060.143 5401.156 5401.156
## 9 10 11 12 13 14 15 16
## 1050.302 5401.156 5401.156 5401.156 3060.143 10940.832 1050.302 1050.302
## 17 18 19 20
## 5401.156 3060.143 3060.143 3060.143
Price estimation model’s performance regarding real data that we used for testing can be considered as 0.88. R-squared value of our model says that the model works succesfully.
rsquared <- format(cor(predicted_Diamond_data,diamonds_test$price)^2, digits=2)
rsquared
## [1] "0.88"