set.seed(503)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>%
  group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
                            diamonds_test, by = "diamond_id")


data<-diamonds_train
testdata <- diamonds_test

Data Preprocessing

data %>%
  summary()
##      carat               cut        color       clarity          depth      
##  Min.   :0.2000   Fair     : 1285   D:5416   SI1    :10449   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 3923   E:7835   VS2    : 9806   1st Qu.:61.00  
##  Median :0.7000   Very Good: 9662   F:7629   SI2    : 7354   Median :61.80  
##  Mean   :0.7985   Premium  :11036   G:9037   VS1    : 6538   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :17237   H:6646   VVS2   : 4052   3rd Qu.:62.50  
##  Max.   :5.0100                     I:4336   VVS1   : 2923   Max.   :79.00  
##                                     J:2244   (Other): 2021                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2403   Median : 5.700   Median : 5.710  
##  Mean   :57.45   Mean   : 3939   Mean   : 5.732   Mean   : 5.736  
##  3rd Qu.:59.00   3rd Qu.: 5352   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18818   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z            diamond_id   
##  Min.   : 0.000   Min.   :    2  
##  1st Qu.: 2.910   1st Qu.:13586  
##  Median : 3.520   Median :26991  
##  Mean   : 3.539   Mean   :26995  
##  3rd Qu.: 4.040   3rd Qu.:40454  
##  Max.   :31.800   Max.   :53940  
## 

The x, y and z represent the shape of the diamond. According to the summary above, there are some outliers since the y’s and z’s max values are quite high and all of the x, y, z have the min values as zero which is not possible. Therefore we can eliminate them

data <- subset(data, x != 0 & y!= 0& z!=0)
data <- subset(data, y<15)
data <- subset(data, z<15)

sum(is.na(data))
## [1] 0

There is no missing value in data

The Summary of variables

unique(data$cut)
## [1] Premium   Good      Very Good Fair      Ideal    
## Levels: Fair < Good < Very Good < Premium < Ideal

The classification of cut is, as in order; Fair(Worst), Good, Very Good, Premium, Ideal(Best).

unique(data$color)
## [1] E I J H F G D
## Levels: D < E < F < G < H < I < J

Diamond colors have a grading system. According to that system, D is very rare and the most valuable one. E and F are more affordable and still accepted as colorless, which is very good quality. But they have high demand. G-H-I-J are graded as Near Colorless. They have some little spots that can be seen with naked eye. They are not as valuable as D, E or F

unique(data$clarity)
## [1] SI1  VS1  VS2  SI2  VVS1 VVS2 I1   IF  
## Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF

Clarity also have a universal scale from VSS(Better) to I (Worse)

summary(data$carat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2000  0.4000  0.7000  0.7982  1.0400  5.0100

The carat actually means the weight of the gemstone. Generally 0.2 gram is 1 carat for diamonds. However, it depends of the density of the gemstone, therefore 1 carat of a diamond doesn’t mean the same as 1 carat any other gemstone by its volume or size.

On the other hand, the depth and table percentages actually are depended on the shape of diamond. There are no one ideal measure, but for further information the website at the reference section can be visited.

Exploratory Data Analysis

Under these circumstances, we can investigate the data graphically, if the conditions of what makes a diamond more precious, really effects the price or not

ggplot(data, aes(x=carat, y=price)) +
  geom_point()

The graph shows us above, there is a linear dependency between carat and price as expected.

color_mean_price <- data %>%
  group_by(color)%>%
  summarize(mean_price=mean(price), n())
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(color_mean_price, aes(x=color, y=mean_price)) +
  geom_col()

Just opposite of the description above, the D color supposed to be the most valuable diamond. However, it has the lowest average price. On the other hand, the J color which is the least quality one, has the highest mean price

cut_mean_price <- data %>%
  group_by(cut)%>%
  summarize(mean_price=mean(price), n())
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(cut_mean_price, aes(x=cut, y=mean_price )) +
  geom_col()

According to the graph above, there is no significant difference can be seen between price and cut. The “ideal” cut has the lowest mean price, but the “fair” cut has the second highest mean price

clarity_mean_price <- data %>%
  group_by(clarity)%>%
  summarize(mean_price =mean(price), n())
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(clarity_mean_price, aes(x=clarity, y=mean_price )) +
  geom_col()

Quite similar to the color and cutting of a diamond, clarity also don’t make a significant difference on price as well.

In the next step, we can check the relationship between variables. Instead of checking the x, y and z separately, I will find the volume of the diamond and examine if the size matters for a diamond by correlation matrix.

data <- data %>%
  mutate(volume=x*y*z)
corr_data <- data[,c(1,5,6,7,8,9,10,12)]
round(cor(corr_data), 2)
##        carat depth table price     x     y    z volume
## carat   1.00  0.02  0.18  0.92  0.98  0.98 0.98   1.00
## depth   0.02  1.00 -0.30 -0.01 -0.03 -0.03 0.09   0.01
## table   0.18 -0.30  1.00  0.13  0.20  0.19 0.16   0.17
## price   0.92 -0.01  0.13  1.00  0.89  0.89 0.88   0.92
## x       0.98 -0.03  0.20  0.89  1.00  1.00 0.99   0.98
## y       0.98 -0.03  0.19  0.89  1.00  1.00 0.99   0.98
## z       0.98  0.09  0.16  0.88  0.99  0.99 1.00   0.98
## volume  1.00  0.01  0.17  0.92  0.98  0.98 0.98   1.00

The correlation table shows that, there are very strong positive relationships between price and carat, x, y, z and volume. But, as explained above, carat actually means the weight of the diamond and it depends on its shape as well. This can be a sign of the multicollinearity and including it to the model could give unrealistic results. So, I will only consider the carat and price for the regression model.

Prediction

linearmod <- lm(price ~ carat, data = data)
print(linearmod)
## 
## Call:
## lm(formula = price ~ carat, data = data)
## 
## Coefficients:
## (Intercept)        carat  
##       -2259         7762
summary(linearmod)
## 
## Call:
## lm(formula = price ~ carat, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18610.7   -805.5    -19.9    538.0  12728.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2258.55      14.61  -154.6   <2e-16 ***
## carat        7761.93      15.73   493.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1550 on 43120 degrees of freedom
## Multiple R-squared:  0.8496, Adjusted R-squared:  0.8496 
## F-statistic: 2.435e+05 on 1 and 43120 DF,  p-value: < 2.2e-16

There are significant evidences that both intercept and carat are different than the zero by their t-tests. Additionally, R-squared result indicates that, the model explains 85% of the variability around its mean.

#https://www.dataquest.io/blog/statistical-learning-for-predictive-modeling-r/
ggplot(data=data, aes(linearmod$residuals)) +
geom_histogram(binwidth = 1, color = "black", fill = "purple4") +
theme(panel.background = element_rect(fill = "white"),
axis.line.x=element_line(),
axis.line.y=element_line()) +
ggtitle("Histogram for Model Residuals")

The most of the residuals are spread around zero, which can be concluded as, the model fits to the data quite well.

pred <-predict(linearmod, data.frame(carat=testdata$carat))
plot(pred-testdata$price)

If we test our prediction model with the test data, the graph above shows that, predicted values are quite close to the actual prices on our test data.

Conclusion

The basic analyzing exhibits, most of the times people don’t know the real value of the diamond that they bought. Generally the only variable that they consider, what is the carat of the diamond. Since the carat is basically the weight of the diamond, the regression is applied without the variables that represents the shape of a diamond, but only carat is taken into account. According to the residuals between predicted values and actual values, the regression model fits quite well and explains the 85% of the variability.

Reference

Linear regression

Diamonds Coloring Scale

Diamonds Clarity Scale.&text=A%20diamond%20with%20a%20clarity,to%20see%20under%2010x%20magnification.)

Diamonds Depth and Table

Linear Regression Predictive Modelling