set.seed(503)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>%
group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
diamonds_test, by = "diamond_id")
data<-diamonds_train
testdata <- diamonds_test
data %>%
summary()
## carat cut color clarity depth
## Min. :0.2000 Fair : 1285 D:5416 SI1 :10449 Min. :43.00
## 1st Qu.:0.4000 Good : 3923 E:7835 VS2 : 9806 1st Qu.:61.00
## Median :0.7000 Very Good: 9662 F:7629 SI2 : 7354 Median :61.80
## Mean :0.7985 Premium :11036 G:9037 VS1 : 6538 Mean :61.75
## 3rd Qu.:1.0400 Ideal :17237 H:6646 VVS2 : 4052 3rd Qu.:62.50
## Max. :5.0100 I:4336 VVS1 : 2923 Max. :79.00
## J:2244 (Other): 2021
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2403 Median : 5.700 Median : 5.710
## Mean :57.45 Mean : 3939 Mean : 5.732 Mean : 5.736
## 3rd Qu.:59.00 3rd Qu.: 5352 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18818 Max. :10.740 Max. :58.900
##
## z diamond_id
## Min. : 0.000 Min. : 2
## 1st Qu.: 2.910 1st Qu.:13586
## Median : 3.520 Median :26991
## Mean : 3.539 Mean :26995
## 3rd Qu.: 4.040 3rd Qu.:40454
## Max. :31.800 Max. :53940
##
The x, y and z represent the shape of the diamond. According to the summary above, there are some outliers since the y’s and z’s max values are quite high and all of the x, y, z have the min values as zero which is not possible. Therefore we can eliminate them
data <- subset(data, x != 0 & y!= 0& z!=0)
data <- subset(data, y<15)
data <- subset(data, z<15)
sum(is.na(data))
## [1] 0
There is no missing value in data
unique(data$cut)
## [1] Premium Good Very Good Fair Ideal
## Levels: Fair < Good < Very Good < Premium < Ideal
The classification of cut is, as in order; Fair(Worst), Good, Very Good, Premium, Ideal(Best).
unique(data$color)
## [1] E I J H F G D
## Levels: D < E < F < G < H < I < J
Diamond colors have a grading system. According to that system, D is very rare and the most valuable one. E and F are more affordable and still accepted as colorless, which is very good quality. But they have high demand. G-H-I-J are graded as Near Colorless. They have some little spots that can be seen with naked eye. They are not as valuable as D, E or F
unique(data$clarity)
## [1] SI1 VS1 VS2 SI2 VVS1 VVS2 I1 IF
## Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF
Clarity also have a universal scale from VSS(Better) to I (Worse)
summary(data$carat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2000 0.4000 0.7000 0.7982 1.0400 5.0100
The carat actually means the weight of the gemstone. Generally 0.2 gram is 1 carat for diamonds. However, it depends of the density of the gemstone, therefore 1 carat of a diamond doesn’t mean the same as 1 carat any other gemstone by its volume or size.
On the other hand, the depth and table percentages actually are depended on the shape of diamond. There are no one ideal measure, but for further information the website at the reference section can be visited.
Under these circumstances, we can investigate the data graphically, if the conditions of what makes a diamond more precious, really effects the price or not
ggplot(data, aes(x=carat, y=price)) +
geom_point()
The graph shows us above, there is a linear dependency between carat and price as expected.
color_mean_price <- data %>%
group_by(color)%>%
summarize(mean_price=mean(price), n())
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(color_mean_price, aes(x=color, y=mean_price)) +
geom_col()
Just opposite of the description above, the D color supposed to be the most valuable diamond. However, it has the lowest average price. On the other hand, the J color which is the least quality one, has the highest mean price
cut_mean_price <- data %>%
group_by(cut)%>%
summarize(mean_price=mean(price), n())
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(cut_mean_price, aes(x=cut, y=mean_price )) +
geom_col()
According to the graph above, there is no significant difference can be seen between price and cut. The “ideal” cut has the lowest mean price, but the “fair” cut has the second highest mean price
clarity_mean_price <- data %>%
group_by(clarity)%>%
summarize(mean_price =mean(price), n())
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(clarity_mean_price, aes(x=clarity, y=mean_price )) +
geom_col()
Quite similar to the color and cutting of a diamond, clarity also don’t make a significant difference on price as well.
In the next step, we can check the relationship between variables. Instead of checking the x, y and z separately, I will find the volume of the diamond and examine if the size matters for a diamond by correlation matrix.
data <- data %>%
mutate(volume=x*y*z)
corr_data <- data[,c(1,5,6,7,8,9,10,12)]
round(cor(corr_data), 2)
## carat depth table price x y z volume
## carat 1.00 0.02 0.18 0.92 0.98 0.98 0.98 1.00
## depth 0.02 1.00 -0.30 -0.01 -0.03 -0.03 0.09 0.01
## table 0.18 -0.30 1.00 0.13 0.20 0.19 0.16 0.17
## price 0.92 -0.01 0.13 1.00 0.89 0.89 0.88 0.92
## x 0.98 -0.03 0.20 0.89 1.00 1.00 0.99 0.98
## y 0.98 -0.03 0.19 0.89 1.00 1.00 0.99 0.98
## z 0.98 0.09 0.16 0.88 0.99 0.99 1.00 0.98
## volume 1.00 0.01 0.17 0.92 0.98 0.98 0.98 1.00
The correlation table shows that, there are very strong positive relationships between price and carat, x, y, z and volume. But, as explained above, carat actually means the weight of the diamond and it depends on its shape as well. This can be a sign of the multicollinearity and including it to the model could give unrealistic results. So, I will only consider the carat and price for the regression model.
linearmod <- lm(price ~ carat, data = data)
print(linearmod)
##
## Call:
## lm(formula = price ~ carat, data = data)
##
## Coefficients:
## (Intercept) carat
## -2259 7762
summary(linearmod)
##
## Call:
## lm(formula = price ~ carat, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18610.7 -805.5 -19.9 538.0 12728.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2258.55 14.61 -154.6 <2e-16 ***
## carat 7761.93 15.73 493.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1550 on 43120 degrees of freedom
## Multiple R-squared: 0.8496, Adjusted R-squared: 0.8496
## F-statistic: 2.435e+05 on 1 and 43120 DF, p-value: < 2.2e-16
There are significant evidences that both intercept and carat are different than the zero by their t-tests. Additionally, R-squared result indicates that, the model explains 85% of the variability around its mean.
#https://www.dataquest.io/blog/statistical-learning-for-predictive-modeling-r/
ggplot(data=data, aes(linearmod$residuals)) +
geom_histogram(binwidth = 1, color = "black", fill = "purple4") +
theme(panel.background = element_rect(fill = "white"),
axis.line.x=element_line(),
axis.line.y=element_line()) +
ggtitle("Histogram for Model Residuals")
The most of the residuals are spread around zero, which can be concluded as, the model fits to the data quite well.
pred <-predict(linearmod, data.frame(carat=testdata$carat))
plot(pred-testdata$price)
If we test our prediction model with the test data, the graph above shows that, predicted values are quite close to the actual prices on our test data.
The basic analyzing exhibits, most of the times people don’t know the real value of the diamond that they bought. Generally the only variable that they consider, what is the carat of the diamond. Since the carat is basically the weight of the diamond, the regression is applied without the variables that represents the shape of a diamond, but only carat is taken into account. According to the residuals between predicted values and actual values, the regression model fits quite well and explains the 85% of the variability.
Diamonds Clarity Scale.&text=A%20diamond%20with%20a%20clarity,to%20see%20under%2010x%20magnification.)