Diamonds are the Precious stone consisting of a clear and colourless Crystalline form of pure carbon. They are the hardest Gemstones known to man and can be scratched only by other Diamonds. Diamonds are formed deep within the Earth about 100 miles or so below the surface in the upper mantle. Obviously in that part of the Earth it’s very hot. There’s a lot of pressure, the weight of the overlying rock bearing down, so that combination of high temperature and high pressure is what’s necessary to grow diamond crystals in the Earth.
diamonds
is a dataset containing the prices and other attributes of almost 54,000 diamonds. It is built-in dataset of R. Here are those attributes:
Carat : Carat weight of the Diamond.
Cut : Describe cut quality of the diamond. Quality in increasing order Fair, Good, Very Good, Premium, Ideal .
Color : Color of the Diamond. With D being the best and J the worst.
Clarity : Diamond Clarity refers to the absence of the Inclusions and Blemishes. (In order from Best to Worst, FL = flawless, I3= level 3 inclusions) FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3
Depth : The Height of a Diamond, measured from the Culet to the table, divided by its average Girdle Diameter.
Table : The Width of the Diamond’s Table expressed as a Percentage of its Average Diameter.
Price : the Price of the Diamond.
X : Length of the Diamond in mm.
Y : Width of the Diamond in mm.
Z : Height of the Diamond in mm.
Qualitative Features (Categorical) : Cut, Color, Clarity.
Quantitative Features (Numerical) : Carat, Depth , Table , Price , X , Y, Z.
library(tidyverse)
library(rpart)
library(rpart.plot)
library(rattle)
library(reshape2)
library(gridExtra)
library(grid)
library(ggplot2)
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
When we did the first operation on dataset, we have realized some irrational outputs. It does not look possible that a diamond’s length, width, or height equals to 0. However, we witness that minimum values of x,y, and z equal to 0. Let’s fix it.
diamonds_v2 <- diamonds %>%
filter(x != 0 & y != 0 & z != 0) %>%
glimpse()
## Rows: 53,920
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
diamonds_v2$cut <- as.factor(diamonds_v2$cut)
diamonds_v2$color <- as.factor(diamonds_v2$color)
diamonds_v2$clarity <- as.factor(diamonds_v2$clarity)
To make easier the control of categorical variables we convert them factors.
colSums(is.na(diamonds_v2))
## carat cut color clarity depth table price x y z
## 0 0 0 0 0 0 0 0 0 0
It is relieving that knowing there is no NAs in dataset for now.
categorical_var <- diamonds_v2[,c(2,3,4)]
continous_var <- diamonds_v2[,-c(2,3,4)]
data_main <- cbind(continous_var,categorical_var)
str(data_main)
## 'data.frame': 53920 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
corr <- cor(cor(data_main[-c(8, 9, 10)]))
melted_corr <- melt(corr)
ggplot(melted_corr, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +
scale_fill_gradient(low="white", high="coral3") +
theme(axis.text.x = element_text(angle = 45), legend.position = "top",
legend.title = element_blank()) +
labs(x = "Variable X",
y = "Variable Y") +
ggtitle("Correlation Heatmap of Diamonds Features")
To see the correlation between attributes of diamonds
dataset, we melt the data with the help of rshape2
and create a heatmap. As you see:
Depth is inversely related to Price.
The Price of the Diamond is highly correlated to Carat.
The Length(x) , Width(y) and Height(z) seems to be higly related to Price and even each other.
Self Relation ie. of a feature to itself is 1 as expected.
This part of our exploratory data analysis will also help us in the following steps on creating a new model to predict price of diamonds.
Let’s visualize the attributes against the price values.
par(mfrow=c(1,1))
cut_price <- ggplot(data = data_main)+
geom_bar(aes(x=cut,y=price),stat = "summary",fun.y="mean",alpha=1,fill="blue")+
xlab("Cut Type")+
ylab("Average Price")+
ggtitle("Cut vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(cut_price)
According to the average prices, the most expensive diamonds are belongs to Premium if we just look for the cut variable.
color_price <- ggplot(data = data_main)+
geom_bar(aes(x = color, y = price),stat = "summary",fun.y="mean",alpha=1,fill="red")+
xlab("Color") +
ylab("Average Price") +
ggtitle("Color vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(color_price)
According to the average prices, the most expensive diamonds are belongs to J if we just look for the color variable.
clarity_price <- ggplot(data = data_main)+
geom_bar(aes(x = clarity, y = price),stat = "summary",fun.y="mean",alpha=1,fill="green")+
xlab("Clarity Type")+
ylab("Average Price")+
ggtitle("Clarity vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(clarity_price)
According to the average prices, the most expensive diamonds are belongs to SI2 if we just look for the clarity variable.
carat_price <- ggplot(data = data_main)+
geom_bar(aes(x = carat, y = price),stat = "summary",fun.y="mean",alpha=1,fill="orange")+
xlab("Carat")+
ylab("Average Price")+
ggtitle("Carat vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(carat_price)
depth_price <- ggplot(data = data_main)+
geom_bar(aes(x = depth, y = price),stat = "summary",fun.y="mean",alpha=1,fill="pink")+
xlab("Depth")+
ylab("Average Price")+
ggtitle("Depth vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(depth_price)
table_price <- ggplot(data = data_main)+
geom_bar(aes(x = table, y = price),stat = "summary",fun.y="mean",alpha=1,fill="purple")+
xlab("Table")+
ylab("Average Price")+
ggtitle("Table vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(table_price)
x_price <- ggplot(data = data_main)+
geom_bar(aes(x = x, y = price),stat = "summary",fun.y="mean",alpha=1,fill="purple")+
xlab("X")+
ylab("Average Price")+
ggtitle("X vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(x_price)
y_price <- ggplot(data = data_main)+
geom_bar(aes(x = y, y = price),stat = "summary",fun.y="mean",alpha=1,fill="purple")+
xlab("Y")+
ylab("Average Price")+
ggtitle("Y vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(y_price)
z_price <- ggplot(data = data_main)+
geom_bar(aes(x = z, y = price),stat = "summary",fun.y="mean",alpha=1,fill="purple")+
xlab("Z")+
ylab("Average Price")+
ggtitle("Z vs Price")
## Warning: Ignoring unknown parameters: fun.y
print(z_price)
grid.arrange(cut_price, color_price, clarity_price, carat_price, depth_price, table_price, x_price, y_price, z_price, ncol = 3, nrow = 3)
set.seed(503)
library(tidyverse)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>%
group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
diamonds_test, by = "diamond_id")
diamonds_train
## # A tibble: 43,143 x 11
## carat cut color clarity depth table price x y z diamond_id
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
## 1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 2
## 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 3
## 3 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 4
## 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 5
## 5 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 7
## 6 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 8
## 7 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 9
## 8 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 10
## 9 0.3 Good J SI1 64 55 339 4.25 4.28 2.73 11
## 10 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 12
## # ... with 43,133 more rows
diamonds_test
## # A tibble: 10,797 x 11
## carat cut color clarity depth table price x y z diamond_id
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
## 1 1.7 Fair D I1 64.7 56 5617 7.46 7.37 4.8 13776
## 2 0.580 Fair D SI2 65.1 58 1156 5.25 5.22 3.41 40690
## 3 1 Fair D SI2 65.6 66 3767 6.1 6.01 3.97 5152
## 4 1.05 Fair D SI2 65.4 59 3816 6.3 6.24 4.1 5359
## 5 0.7 Fair D SI2 64.5 60 2167 5.53 5.47 3.55 49828
## 6 0.9 Fair D SI2 64.6 59 3847 6.04 6.01 3.89 5510
## 7 1.01 Fair D SI2 64.6 56 3003 6.31 6.24 4.05 1555
## 8 1.1 Fair D SI2 64.6 54 4725 6.56 6.49 4.22 10151
## 9 0.62 Fair D SI2 64.6 57 1410 5.39 5.33 3.46 43397
## 10 1.5 Fair D SI2 68.8 57 7469 6.9 6.86 4.73 18359
## # ... with 10,787 more rows
diamonds_train <- diamonds_train %>%
select(-diamond_id)
diamonds_pca <- princomp(as.matrix(diamonds_train[,sapply(diamonds_train, class) == "numeric"]),cor=T)
summary(diamonds_pca,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.979800 1.1336225 0.8255050 0.2273068 0.218071172
## Proportion of Variance 0.653268 0.2141833 0.1135764 0.0086114 0.007925839
## Cumulative Proportion 0.653268 0.8674513 0.9810277 0.9896391 0.997564967
## Comp.6
## Standard deviation 0.120872645
## Proportion of Variance 0.002435033
## Cumulative Proportion 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## carat 0.496 0.638 0.442 0.385
## depth 0.734 -0.671
## table 0.122 -0.669 -0.733
## x 0.501 0.114 -0.855
## y 0.494 -0.760 0.361 0.201
## z 0.493 0.103 -0.817 0.279
From the summary of PCA model that we create, we can see that two can almost explain 87% of the data. We can choose to use two components for creating a model. Now, we need to add these components to both train and test datasets.
pca_results <- predict(diamonds_pca, newdata = diamonds_train)
diamonds_train$pca1 = pca_results[, 1]
diamonds_train$pca2 = pca_results[, 2]
pca_results_test <- predict(diamonds_pca, newdata = diamonds_test)
diamonds_test$pca1 = pca_results_test[, 1]
diamonds_test$pca2 = pca_results_test[, 2]
diamonds_fit <- rpart(price~. - pca1 - pca2, data=diamonds_train, method = "anova")
fancyRpartPlot(diamonds_fit)
Method argument of the rpart() function is “anova” because the price variable is numeric rather than class.
rpart.plot(diamonds_fit)
plotcp(diamonds_fit)
print(diamonds_fit$cptable)
## CP nsplit rel error xerror xstd
## 1 0.60913716 0 1.0000000 1.0000357 0.009817373
## 2 0.18517972 1 0.3908628 0.3909391 0.004372856
## 3 0.03361554 2 0.2056831 0.2061950 0.002303084
## 4 0.02661853 3 0.1720676 0.1726668 0.002299963
## 5 0.02583190 4 0.1454491 0.1525199 0.002060203
## 6 0.01000000 5 0.1196172 0.1209142 0.001742561
Since we deduce that as splitting number increases xerror
and rel error
are affected inversely from Complexity Parameter(CP) plot and table, we will try to improve our model by lowering Complexity Parameter.
diamonds_fit <- rpart(price~. - pca1 - pca2, data = diamonds_train, method = "anova", cp = 0.00001)
fancyRpartPlot(diamonds_fit)
rpart.plot(diamonds_fit)
plotcp(diamonds_fit)
print(tail(diamonds_fit$cptable))
## CP nsplit rel error xerror xstd
## 444 1.019450e-05 489 0.01704495 0.02576320 0.0005527474
## 445 1.014082e-05 490 0.01703476 0.02573855 0.0005525477
## 446 1.012304e-05 491 0.01702462 0.02572650 0.0005521675
## 447 1.001973e-05 492 0.01701450 0.02573169 0.0005528368
## 448 1.001469e-05 494 0.01699446 0.02573291 0.0005531837
## 449 1.000000e-05 495 0.01698444 0.02573252 0.0005531871
diamonds_prediction <- predict(diamonds_fit, newdata = diamonds_test, type = "vector")
diamonds_fit_r2 <- 1 - (sum((diamonds_prediction - diamonds_test$price) ^2)/sum((diamonds_test$price - mean(diamonds_train$price)) ^2))
diamonds_fit_r2
## [1] 0.9746056
Here, we have calculated Coefficient of Determination (R^2) to see how well the regression predictions approximate the real data points.
Finally, we can measure accuracy for continuous variables that we used by calculating RMSE and MAE.
Root Mean Squared Error (RMSE) and Mean Absolute (MAE) Error terms for predictions:
errors <- c(RMSE = Metrics::rmse(actual = diamonds_test$price, predicted = diamonds_prediction), MAE =
Metrics::mae(actual = diamonds_test$price, predicted = diamonds_prediction))
errors
## RMSE MAE
## 630.6222 341.7021