1. Introduction

1.1 Diamonds

Diamonds are the result of carbon exposed to extremely high heat and pressure, deep in the earth. This process results in a variety of both internal and external characteristics(i.e. size, shape, color, etc.) which makes every diamond unique. Therefore, jewelry industry uses a systematic way to evaluate and discus these factors. Diamond professionals use the grading system developed by GIA in the 1950s, which established the use of four important factors to describe and classify diamonds: Clarity, Color, Cut, and Carat Weight. These are known as the 4Cs. The value of a finished diamond is based on this combination. The explanation of 4C can be seen below:

  • Clarity: The clarity of a diamond is determined by the number, location and type of inclusions it contains. Inclusions can be microscopic cracks, mineral deposits or external markings. Clarity is rated using a scale which contains a combination of letters and numbers to signify the amount and type of inclusions. This scale ranges from FL to I3, FL being Flawless and the most valuable.

  • Color: Most commercially available diamonds are classified by color, or more appropriately, the most valuable diamonds are those classified as colorless. Color is graded on a letter scale from D to Z, with D representing a colorless diamond. Diamonds that are graded in the D-F range are the rarest and consequently most valuable. In reality, diamonds in the G-K range have such a small amount of color that the untrained eye can not see it.

  • Cut: Cut is the most important characteristic of the diamond. It determines how the light which enters into the diamond from the above will be reflected back to the eye of observer. A perfect cut diamond reflects light to its optimum.

  • Carat Weight: In addition to color, clarity and cut, weight provides a further basis in the valuation of a diamond. The weight of diamonds is measured in carats. One carat is equal to 1/5 of a gram. Smaller diamonds are more readily available than larger ones, which results in higher values based on weight. A 2 carat diamond will not be twice the cost of a 1 carat diamond, despite being twice the size. The larger the diamond, the rarer it becomes and as such the price increases exponentially.

1.2 Objectives

In this assignment, the diamonds dataset that we use consists of prices and quality information from about 53,940 diamonds, and is included in the ggplot2 package. The dataset contains information on prices of diamonds, as well as various attributes of diamonds, some of which are known to influence their price. These attributes consist of the 4C and some physical measurements such as depth, table, x, y, and z.

The processes during the assignment:

Firstly, the data will be investigated for preprocessing to improve its quality. Then an exploratory data analysis will be performed by data manipulation and data visualization steps. The main purpose is to identify which variables affect the price of diamonds mostly and come up with a conclusion for the relationship between variables. In addition to these, forecasting models will be studied in order to estimate the price of diamonds with given properties.

  1. Data Preprocessing
  2. Data Manipulation
  3. Data Visualization
  4. Forecasting

2. Data Preprocessing

2.1 Required Packages

The packages used during the assignment can be listed as below:

library(tidyverse)
library(knitr)
library(tinytex)
library(kableExtra)
library(corrplot)
library(RColorBrewer)
library(ggExtra)
library(rpart)
library(rpart.plot)
library(rattle)
library(data.table)
library(caret)
library(e1071)
library(patchwork) 
  • tidyverse
  • knitr
  • tinytex
  • kableExtra
  • corrplot
  • RColorBrewer
  • ggExtra
  • rpart
  • rpart.plot
  • rattle
  • data.table
  • caret
  • e1071

2.2 Variables

Before starting the analysis with the dataset, it will be useful to have information about the properties of diamonds. There are 10 columns regarding the properties of diamonds.

  1. carat: Weight of the diamond (numeric variable)
  2. cut: Quality of the cut, from Fair, Good, Very Good, Premium, Ideal (categoric variable)
  3. color: Color of the diamond, from D (best) to J (worst) (categoric variable)
  4. clarity: A measurement of how clear the diamond is, from I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (categoric variable)
  5. depth: A measurement of the diamond called total depth percentage equals to z / mean(x, y) = 2 * z / (x + y)
  6. table: width of top of diamond relative to widest point (numeric variable)
  7. price: Price of the diamond in US dollars ranging between $326 to $18,823 (numeric variable)
  8. x: Length of the diamond in mm (numeric variable)
  9. y: Width of the diamond in mm (numeric variable)
  10. z: Depth of the diamond in mm (numeric variable)

By using this glimpse() function, we obtain these variables.

diamonds%>% glimpse()
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...

Like glimpse() function, we can also get similar results by using structure function which is the function of Base R, i.e., str().

diamonds%>% str()
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Diamond color type D is known as the best color while J is the worst, however str() output shows that the order is wrong, therefore we should identify the levels of the color variable in diamonds dataset.

diamonds$color <- factor(diamonds$color, levels = c("J", "I", "H", "G", "F", "E", "D"))

2.3 Data Examination

To make pre processing, basic examinations will be made.Some of them are

  • NA values,
  • Duplicated values,
  • Control of the variables
  • Controlling of the negative values
sum(any(is.na(diamonds)))
## [1] 0

By using above function we can obtain number of NA values in this dataset. When we check the dataset, There are not any NA values.This means that there is no missing value in any row or column.

sum(as.numeric(duplicated(diamonds)))
## [1] 146

Another important control point is that checking of the duplicated values. There are 146 duplicated lines. Before the analysis, we need to remove these data from the dataset.

diamonds <- unique(diamonds)
sum(as.numeric(duplicated(diamonds)))
## [1] 0

According to the usage of unique function, there is no duplicated line, and we have 53794 rows and 10 columns.

diamonds %>% glimpse()
## Rows: 53,794
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Logically, price should be greater than zero. For this reason, we check dataset to control negative price values. According to the results there is no negative price value.

diamonds %>%
  filter(price <= 0) %>%
  summarise(NegativePrice = n())
## # A tibble: 1 x 1
##   NegativePrice
##           <int>
## 1             0

2.3.1 Accuracy of Values

Unrealistic price values:

The dataset must include realistic values, therefore it shouldn’t contain negative values for the numeric values which correspond to a specific characteristics of a diamond. Price should be greater than zero, therefore, we check dataset to control negative price values. According to the results there is no negative price value, as expected.

Missing x, y, z values:

These values correspond to length, width, and depth, therefore they should be positive. We can impute the wrongly filled zero values if we have other dimensions since depth is a value obtained by a formula which include x,y, and z. This explanation also can be obtain from the R package explanation by using ?diamonds.

The calculation is that

total depth percentage = z / mean(x, y) = 2 * z / (x + y)

diamonds %>%
  filter(x <= 0 & y > 0 & z > 0)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## #   depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
diamonds %>%
  filter(y <= 0 & x > 0 & z > 0)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## #   depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
diamonds %>%
  filter(z <= 0 & x > 0 & y > 0)
## # A tibble: 12 x 10
##    carat cut     color clarity depth table price     x     y     z
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  1    Premium G     SI2      59.1    59  3142  6.55  6.48     0
##  2  1.01 Premium H     I1       58.1    59  3167  6.66  6.6      0
##  3  1.1  Premium G     SI2      63      59  3696  6.5   6.47     0
##  4  1.01 Premium F     SI2      59.2    58  3837  6.5   6.47     0
##  5  1.5  Good    G     I1       64      61  4731  7.15  7.04     0
##  6  1.15 Ideal   G     VS2      59.2    56  5564  6.88  6.83     0
##  7  2.18 Premium H     SI2      59.4    61 12631  8.49  8.45     0
##  8  2.25 Premium I     SI1      61.3    58 15397  8.52  8.42     0
##  9  2.2  Premium H     SI1      61.2    59 17265  8.42  8.37     0
## 10  2.02 Premium H     VS2      62.7    53 18207  8.02  7.95     0
## 11  2.8  Good    G     SI2      63.8    58 18788  8.9   8.85     0
## 12  1.12 Premium G     I1       60.4    59  2383  6.71  6.67     0

All x and y values are positive, for now we don’t need to do anything. But, there are some missing values in the x,y, and z columns, to detect these values, we can use following codes. When we find the missing values, by using depth formulation we can fill some lines.

diamonds = diamonds %>%
  mutate(z = ifelse(z == 0 & x != 0 & y != 0,round(depth * mean(c(x, y)) / 100, 2), z)) 

However, we need to control one more thing in this data set that is whether the two of these variable are zero or not. In this situation, we do not calculate the values because there two unknown value and one formula. In this situation, we can remove these trials from the data set. For this purpose, the binary combinations are controlled with filter operation.

diamonds %>%
  filter(x <= 0 & y <= 0 ) 
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1    Very Good H     VS2      63.3    53  5139     0     0     0
## 2  1.14 Fair      G     VS1      57.5    67  6381     0     0     0
## 3  1.56 Ideal     G     VS2      62.2    54 12800     0     0     0
## 4  1.2  Premium   D     VVS1     62.1    59 15686     0     0     0
## 5  2.25 Premium   H     SI2      62.8    59 18034     0     0     0
## 6  0.71 Good      F     SI2      64.1    60  2130     0     0     0

When we control the data set we obtain trials which have both x and y are zero.We control other combinations.

diamonds %>%
  filter(x <= 0 & z <= 0)
## # A tibble: 7 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1.07 Ideal     F     SI2      61.6    56  4954     0  6.62     0
## 2  1    Very Good H     VS2      63.3    53  5139     0  0        0
## 3  1.14 Fair      G     VS1      57.5    67  6381     0  0        0
## 4  1.56 Ideal     G     VS2      62.2    54 12800     0  0        0
## 5  1.2  Premium   D     VVS1     62.1    59 15686     0  0        0
## 6  2.25 Premium   H     SI2      62.8    59 18034     0  0        0
## 7  0.71 Good      F     SI2      64.1    60  2130     0  0        0
diamonds %>%
  filter(y <= 0 & z <= 0)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1    Very Good H     VS2      63.3    53  5139     0     0     0
## 2  1.14 Fair      G     VS1      57.5    67  6381     0     0     0
## 3  1.56 Ideal     G     VS2      62.2    54 12800     0     0     0
## 4  1.2  Premium   D     VVS1     62.1    59 15686     0     0     0
## 5  2.25 Premium   H     SI2      62.8    59 18034     0     0     0
## 6  0.71 Good      F     SI2      64.1    60  2130     0     0     0

As a result there are 7 rows that both x and z values are 0. As we do not calculate values of these variables, we can remove from the data set. Because this number is too small when we compare the all data set.

diamonds <- diamonds %>%
  filter(!(x == 0 & z == 0))

diamonds %>%
  filter(x == 0 | y == 0 | z == 0)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## #   depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>

The last filter shows that, there is no zero values in the x,y, and z variables’ column.Now, we can have more clear data set. But,there is one more examination about the data set which is finding outliers of the data set. For this purpose, we can control numeric variables x,y, and z with range() function.

For x, y variables:

range(diamonds$x)  
diamonds$x %>% 
  unique() %>% 
  sort(decreasing = TRUE)

 range(diamonds$y) 
diamonds$y %>% 
  unique() %>% 
  sort(decreasing = TRUE)

When we control the output of the above lines, we obtain that there is huge difference between the highest value of the y and the second highest value of the y.The reason behind this huge different, the data can be entered wrong or the calculation may be wrong. For this reason, we can remove this line or we can calculate by using depth formulation.

diamonds %>%
  filter(y > 15) %>%
  mutate(new_y = (2 * z / depth) / 100 - x) %>%
  select(depth, x, z, y, new_y)
## # A tibble: 2 x 5
##   depth     x     z     y new_y
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  58.9  8.09  8.06  58.9 -8.09
## 2  61.8  5.15  5.12  31.8 -5.15

when we calculate this value, we can not obtain proper value. For this reason we can remove this data from the data set.

diamonds = diamonds %>%
  filter(y < 15)

Lastly, we control the z variable range.

range(diamonds$z) 
diamonds$z %>% 
  unique() %>% 
  sort(decreasing = TRUE)

Since, we also obtain outlier trial for the z variable. For this according to the manual calculation there is one variable, we can remove or calculate a new value for this trial.

#changing of the outlier with the calculated value.
diamonds$z[diamonds$z == 31.8] <- diamonds$z[diamonds$z == 31.8] / 10
#Now, we also need to check the depth column. Then, we need to change the depth value, with the new calculated depth value.
diamonds$calculated_depth <- 2 * diamonds$z / (diamonds$x + diamonds$y) * 100

By using calculated depth, which is obtained from the x,y and z values with using formulation, and depth that is given in the dataset, we can compare the results. It is expected that, both values have linear relationship.

diamonds %>%
  ggplot(., aes(x = calculated_depth, y = depth)) +
  geom_point() + 
  geom_abline(intercept = 0, slope = 1, color="blue", size=1.5) +
  theme_minimal() +
  labs(title = "The Depth vs Calculated Depth",
       x = "Calculated Depth",
       y = "Depth (given in the dataset)")

According to the results, we can say that almost all variables have linear relationship, but some of them is not. When we analyzed the information about the diamonds, the difference between 56.5 and 65, whereas the ideal range is between 59.5 and 62.9. Hence, we can remove data which does not have this property. According to the web-page,the interval selected as 9. But this criteria can be changed according to the other implementations and variables.

diamonds %>%
  filter(!(abs(calculated_depth - depth) <9)) %>%
  select(calculated_depth, depth, x, y, z)
## # A tibble: 24 x 5
##    calculated_depth depth     x     y     z
##               <dbl> <dbl> <dbl> <dbl> <dbl>
##  1             51.2  62.8  6.26  6.19  3.19
##  2             63.1  43    6.32  6.27  3.97
##  3             65.7  44    6.31  6.24  4.12
##  4             51.7  64    7.15  7.04  3.67
##  5             63.0  43    6.53  6.55  4.12
##  6             49.5  59.2  6.88  6.83  3.39
##  7             16.1  60.6  6.62  6.67  1.07
##  8             65.0  55.9  6.77  6.71  4.38
##  9             56.9  67.3  7.85  5.75  3.87
## 10             20.5  61.9  7.43  7.5   1.53
## # ... with 14 more rows

There are 24 observations. When we compare it with the number of all observations in the dataset, it is very small value. So, we can remove them from the dataset.

diamonds = diamonds %>%
  filter((abs(calculated_depth - depth) < 9))
diamonds = subset(diamonds, select = -c(calculated_depth)) 
diamonds%>%glimpse()
## Rows: 53,761
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...

According to the depth formulation and web information, for ideal diamonds, there should be relationship between variables x,y, and z. For this reason, as a final interpretation we can compare the x, y and z values with each other.

p1<- diamonds %>%
        ggplot(., aes(x = x, y = y)) + 
        geom_point() + 
        geom_smooth(method = "lm", color = "blue", size=2) + 
        theme_minimal() + theme(axis.text.x = element_text(angle=90, size=6, vjust=0.5,hjust=1))+
        labs(title = "Comparison of \n the x and y variable",
             x = "X variable",
             y = "Y variable")

p2 <- diamonds %>%
        ggplot(., aes(x = z, y = y)) + 
        geom_point() + 
        geom_smooth(method = "lm", color = "blue", size=2)+ 
        theme_minimal() + theme(axis.text.x = element_text(angle=90, size=6, vjust=0.5,hjust=1))+
        labs(title = "Comparison of \n the x and y variable",
             x = "Z variable",
             y = "Y variable")

p3 <- diamonds %>%
        ggplot(., aes(x = x, y = z)) +
        geom_point() + 
        geom_smooth(method = "lm", color = "blue", size=2)+
        theme_minimal() + theme(axis.text.x = element_text(angle=90, size=6, vjust=0.5,hjust=1))+
        labs(title = "Comparison of \n the x and y variable",
             x = "X variable",
             y = "Z variable")

(p1 | p2 | p3)

As we mentioned before, the relationship between variables show that there is high correlation. But, we can assume that these x, y and z values are valid values.

2.4 Summary of Data

After preprocessing of the data, the summary and glimpse of the data can be seen below.In this new data set, we have 53,761 rows.

summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1592   J: 2800   SI1    :13026   Min.   :50.80  
##  1st Qu.:0.4000   Good     : 4887   I: 5405   VS2    :12221   1st Qu.:61.00  
##  Median :0.7000   Very Good:12065   H: 8262   SI2    : 9140   Median :61.80  
##  Mean   :0.7974   Premium  :13738   G:11252   VS1    : 8151   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21479   F: 9517   VVS2   : 5055   3rd Qu.:62.50  
##  Max.   :5.0100                     E: 9772   VVS1   : 3645   Max.   :79.00  
##                                     D: 6753   (Other): 2523                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 3.730   Min.   : 3.680  
##  1st Qu.:56.00   1st Qu.:  951   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3930   Mean   : 5.731   Mean   : 5.733  
##  3rd Qu.:59.00   3rd Qu.: 5322   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :10.540  
##                                                                   
##        z        
##  Min.   :2.240  
##  1st Qu.:2.910  
##  Median :3.530  
##  Mean   :3.539  
##  3rd Qu.:4.030  
##  Max.   :6.980  
## 
glimpse(diamonds)
## Rows: 53,761
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...

3. Exploratory Data Analysis

3.1 Price

Since the mean is greater than median for price, the distribution is right-skewed.

diamonds %>%
  ggplot(aes(x=price)) +
  geom_histogram(aes(y=..density..), position="identity", alpha=0.8, fill = "paleturquoise4", color="paleturquoise4") +
  geom_density(alpha=1, size = 1)+
  theme_minimal() +
  labs(title = "Price Distribution with Histogram",
       x = "Price",
       y = "Count")

3.2 Price vs Categoric Variables

3.2.1. Price vs Cut

diamonds %>%
  group_by(cut) %>%
  summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
  mutate(percentage = 100*round(count/sum(count),3))%>%
  arrange((cut)) %>%
  select(cut, count,  percentage, Min_Price, Average_Price, Max_Price ) %>%
  kable(col.names = c("Diamond Cut","Count","Percentage",  "Minimum Price", "Average Price", "Maximum Price"))%>%
  kable_minimal(full_width = F)
Diamond Cut Count Percentage Minimum Price Average Price Maximum Price
Fair 1592 3.0 337 4330.16 18574
Good 4887 9.1 327 3916.91 18707
Very Good 12065 22.4 336 3981.30 18818
Premium 13738 25.6 326 4576.58 18823
Ideal 21479 40.0 326 3461.60 18806
  • The most of the diamonds belong to ideal cut type.
  • The percentage of premium and very good are very close.
  • There is a little fair cut type.
diamonds%>% ggplot(., aes(cut, price)) + 
  geom_boxplot(aes(fill=factor(cut))) + 
  scale_fill_brewer(palette="Pastel1")+
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  theme_minimal()+
  labs(title="Price vs Cut", 
       subtitle="Diamonds dataset",
       x="Diamond Cut",
       y="Price",
       fill="Diamond Cut")

diamonds%>%
  ggplot(., aes(x=price)) + 
  geom_histogram(binwidth=500, position="dodge", fill="pink2") +
  theme_minimal()+facet_wrap(~cut, ncol=5)+ 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  theme(strip.background = element_rect(fill="pink3"))+
    labs(title="Price Frequency vs Diamond Cut", 
       subtitle="Diamonds dataset",
       x="Price",
       y="Frequency",
       fill="Diamond Cut")

Although the best cut type is Ideal, its price is the lowest. According to the average prices, the most expensive diamonds belong to Premium and Fair cut types. These results present that cut is not enough to explain response variable price, since price does not increase while cut feature improves.

3.2.2 Price vs Color

diamonds %>%
  group_by(color) %>%
  summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
  mutate(percentage = 100*round(count/sum(count),3))%>%
  arrange((color)) %>%
  select(color, count,  percentage, Min_Price, Average_Price, Max_Price ) %>%
  kable(col.names = c("Diamond Color","Count","Percentage",  "Minimum Price", "Average Price", "Maximum Price"))%>%
  kable_minimal(full_width = F)
Diamond Color Count Percentage Minimum Price Average Price Maximum Price
J 2800 5.2 335 5327.34 18710
I 5405 10.1 334 5079.12 18823
H 8262 15.4 337 4474.01 18803
G 11252 20.9 354 3997.03 18818
F 9517 17.7 342 3726.56 18791
E 9772 18.2 326 3080.28 18731
D 6753 12.6 357 3172.01 18693
  • The most of the diamonds belong to G color type.
  • The percentage of E, F, and G are close.
  • There is a little J color type.
diamonds %>%
  group_by(color) %>%
  summarise(Average_Price = mean(price)) %>%
  
  ggplot(.,aes(x=color, y = Average_Price, fill= color)) +
    geom_col(color="black") +
    geom_text(aes(label = format(Average_Price, digits = 3)), position = position_dodge(0.9),vjust = 5) +
    geom_line(aes(y = Average_Price), color="black", group=1, size=0.8)+
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 65))+
     scale_fill_brewer(palette="Pastel2")+
   
    scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  labs(title="Price vs Color", 
       subtitle="Diamonds dataset",
       x="Diamond Color",
       y="Average Price",
       fill="Diamond color")

diamonds%>%
  ggplot(., aes(x=price)) + 
  geom_histogram(binwidth=500, position="dodge", fill="palegreen3") +
  theme_minimal()+facet_wrap(~color, ncol=7)+ 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  theme(strip.background = element_rect(fill="seagreen3"))+
    labs(title="Price Frequency vs Diamond Cut", 
       subtitle="Diamonds dataset",
       x="Price",
       y="Frequency",
       fill="Diamond Cut")

Although the best color type is D, its price is one of the lowest. According to the average prices, the most expensive diamonds belong to J and I cut types which are actually the worst two color type in this data set. These results present that color is not enough to explain response variable price, since price does not increase while color feature improves.

3.2.3 Price vs Clarity

diamonds %>%
  group_by(clarity) %>%
  summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
  mutate(percentage = 100*round(count/sum(count),3))%>%
  arrange((clarity)) %>%
  select(clarity, count,  percentage, Min_Price, Average_Price, Max_Price ) %>%
  kable(col.names = c("Diamond Clarity","Count","Percentage",  "Minimum Price", "Average Price", "Maximum Price"))%>%
  kable_minimal(full_width = F)
Diamond Clarity Count Percentage Minimum Price Average Price Maximum Price
I1 739 1.4 345 3924.19 18531
SI2 9140 17.0 326 5051.59 18804
SI1 13026 24.2 326 3992.93 18818
VS2 12221 22.7 334 3925.96 18823
VS1 8151 15.2 327 3839.87 18795
VVS2 5055 9.4 336 3286.97 18768
VVS1 3645 6.8 336 2523.09 18777
IF 1784 3.3 369 2870.57 18806
  • The most of the diamonds belong to SI1 clarity type.
  • The percentage of SI2 and SI1 are close.
  • I1, IF, VVS1, and VVS2 clarity types are less than 10%.
  • The majority of the data set consists of VS1, VS2, SI1, and SI2 clarity types.
diamonds %>%
  ggplot(., aes(x = clarity, y = price, color = clarity)) +
  scale_color_brewer(palette="Set3")+
  geom_jitter(alpha=0.7) +
  theme_minimal() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title = "Price vs Clarity",
       subtitle = "Diamonds dataset",
       x = "Diamond Clarity",
       y = "Price",
       color = "Diamond Clarity")

diamonds%>%
  ggplot(., aes(x=price)) + 
  geom_histogram(binwidth=500, position="dodge", fill="skyblue3") +
  theme_minimal()+facet_wrap(~clarity, ncol=8)+ 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  theme(strip.background = element_rect(fill="skyblue2"))+
    labs(title="Price Frequency vs Diamond  Clarity", 
       subtitle="Diamonds dataset",
       x="Price",
       y="Frequency",
       fill="Diamond Clarity")

Although the best clarity type is IF, its price is the lowest. According to the average prices, the most expensive diamonds belong to SI2 clarity types which is actually the second worst clarity type in this data set. These results present that clarity is not enough to explain response variable price, since price does not increase while clarity feature improves.

3.3 Price vs Numeric Variables

3.3.1 Price vs Carat

diamonds %>%
  group_by(carat) %>%
  summarise(carat_count = n(),
            minPrice = min(price),
            AveragePrice = mean(price),
            MaxPrice = max(price))%>%
  arrange(desc(carat_count)) %>%
  kable(col.names = c("Carat", "Count","Minimum Price", "Average Price", "Maximum Price")) %>%
  kable_styling("striped", full_width = T) %>%
  scroll_box(width = "100%", height = "300px")
Carat Count Minimum Price Average Price Maximum Price
0.30 2596 339 680.6367 2366
1.01 2240 2017 5507.1902 16234
0.31 2238 335 708.1997 1917
0.70 1981 945 2516.0717 5539
0.32 1827 345 719.1916 1715
1.00 1543 1681 5249.5846 16469
0.90 1484 1599 3939.5889 9182
0.41 1378 467 973.0530 1745
0.40 1291 484 933.0077 1908
0.71 1291 1305 2674.3106 5019
0.50 1254 584 1503.8453 3378
0.33 1179 366 765.1747 1389
0.51 1123 826 1623.5841 4368
0.34 908 413 755.7423 2346
1.02 883 2344 5598.4632 17100
0.52 813 870 1656.4785 3373
1.51 804 3497 10552.0211 18806
1.50 787 2964 10082.9835 18691
0.72 764 945 2705.7788 5086
0.53 707 844 1685.5248 3827
0.42 704 552 994.4545 1722
0.38 669 497 904.1973 1992
0.35 663 409 799.5098 1672
1.20 643 2360 6670.7387 16256
0.54 624 935 1704.3317 4042
0.36 570 410 781.4667 1718
0.91 570 1570 3981.1561 9636
1.03 522 1262 5583.8966 17590
0.55 495 959 1725.2283 3509
0.56 492 953 1831.9980 4632
0.73 492 1049 2791.2480 5154
0.43 486 452 995.2737 1827
1.04 474 2037 5774.6709 18542
1.21 470 2396 6905.4043 17353
2.01 433 5696 14772.7136 18741
0.57 429 929 1798.0443 4032
0.39 397 451 923.4761 1624
0.37 394 451 816.0711 1764
1.52 380 3105 10629.9868 17723
1.06 372 2080 5947.0081 15813
1.05 361 2066 5781.6814 13060
1.07 340 2260 6004.2118 18279
0.74 322 1632 2821.9006 5841
0.58 310 893 1798.6484 3741
1.11 308 3183 6005.5682 15464
1.22 299 2699 6971.5151 14414
0.23 293 326 486.1433 688
1.09 285 2708 6020.0947 18231
0.80 284 1232 3146.1901 6834
0.59 282 903 1831.4504 4916
1.23 277 3168 7290.8881 16253
1.10 276 3216 5913.0725 12748
2.00 261 5051 14107.1456 18818
0.24 254 336 505.1850 963
0.26 253 337 550.8972 814
0.76 251 1140 2889.5936 5056
0.77 251 1651 2884.3984 5251
1.12 251 2383 6044.4900 14407
0.75 249 1013 2763.4016 5288
1.08 246 2869 5774.3415 14032
1.13 246 2968 6193.4024 14525
1.24 235 2940 7147.1149 15026
0.27 233 361 574.7597 893
0.60 226 806 1956.2434 4209
0.92 226 2283 4035.2124 7544
1.53 219 5631 10703.4658 17223
1.70 215 5617 12110.3209 18718
0.25 212 357 550.9245 1186
0.44 212 648 1056.6226 2506
1.14 206 2327 6354.4078 18112
0.61 204 931 2069.2451 4279
0.81 200 1687 3094.4250 6338
0.28 198 360 580.1212 828
0.78 187 1720 2918.9091 4904
1.25 187 3276 7123.2513 16823
0.46 177 746 1240.2429 2139
2.02 176 6346 14767.4091 18731
1.54 174 4492 10865.0287 18416
1.16 172 2368 6274.2965 13595
0.79 151 1809 2993.2649 4858
1.15 148 3220 5962.5811 12895
1.26 146 3881 7490.7466 15247
0.93 142 2374 3994.2958 5992
0.82 140 2033 3083.6500 4931
0.62 135 933 1973.3704 3968
1.27 134 2845 7119.1194 15377
1.31 132 3697 7750.6364 17496
0.83 131 1863 3181.9924 7889
0.29 130 334 601.1923 1776
1.19 126 2892 6483.2381 15767
1.18 123 3219 6481.1057 14847
1.55 123 4965 10717.7642 18188
2.03 122 6002 14710.1148 18781
1.30 121 2512 7851.6612 15874
1.71 119 6320 12396.5546 18791
0.45 110 672 1093.3182 1833
1.17 110 2336 6224.8818 11886
1.56 108 5766 10205.9444 17455
1.28 106 3355 7711.8113 18700
1.57 105 5170 10444.5143 17548
0.96 103 1637 3623.4660 6807
0.63 102 1065 2244.7843 6607
1.29 101 2596 7033.0198 17932
0.47 99 719 1272.3131 2460
1.60 95 6272 11124.2526 18780
1.32 89 4140 7639.3708 15802
1.58 89 4459 11427.9888 18057
1.59 89 6010 10574.3596 17552
1.33 87 4634 7527.0805 18435
2.04 86 7403 14840.1744 18795
0.64 80 1080 2113.5875 4281
1.35 77 4082 7448.2857 14616
1.34 68 4352 7758.5000 17663
0.65 65 1214 2322.6154 5052
0.95 65 2195 4013.6615 6951
2.05 65 7026 15647.2462 18787
0.84 64 1749 3294.9844 6399
1.61 64 6915 10992.8438 18318
0.48 63 841 1322.0476 2677
0.85 62 1250 3204.1774 7230
1.62 61 5157 11705.8852 18281
0.94 59 2571 4195.7797 7120
0.97 59 1547 4026.2373 7415
2.06 59 5203 14348.6780 18779
1.72 57 5240 12040.3158 18730
1.73 52 6007 11875.9615 18377
2.10 51 6597 14464.3725 18648
1.36 50 4158 8591.5200 15386
1.40 50 4939 9108.5200 16808
1.63 50 7100 11556.3200 18179
1.75 50 7276 12673.7000 17343
2.07 50 6503 15384.2800 18804
0.66 48 1188 2235.7292 4311
0.67 48 1244 2067.8542 3271
2.14 48 5405 14905.0417 18528
1.37 46 5063 8044.0870 13439
0.49 45 900 1306.3333 2199
2.09 45 10042 15287.3111 18640
1.64 43 4849 10738.0000 17604
2.11 43 7019 14563.4186 18575
2.08 41 6150 15128.6829 18760
1.74 40 4677 12270.3250 18430
1.41 39 4145 9452.2051 17216
1.39 36 3914 8335.1111 13061
0.86 34 1828 3289.9706 5029
1.65 32 5914 10973.8750 17729
0.87 31 1827 3342.3226 5640
0.98 31 1712 3469.7742 4950
2.20 31 7934 15226.8710 18364
1.66 30 5310 11084.5000 16294
2.18 30 10340 14901.7000 18717
1.76 28 9053 12398.5714 18178
2.22 27 5607 14691.5185 18706
0.69 26 1636 2206.6154 3784
1.38 26 4598 8164.3077 17598
0.68 25 1341 2334.3200 4325
1.42 25 6069 9487.2000 18682
1.67 25 7911 12281.4800 17351
2.12 25 7508 15007.6800 18656
2.16 25 8709 15860.4800 18678
1.69 24 5182 11878.0833 17803
0.88 23 1964 3431.3478 5595
0.99 23 1789 4406.1739 6094
2.21 23 6535 15287.9565 18483
2.15 22 5430 14187.1818 18791
2.19 22 10179 15521.0909 18693
0.89 21 1334 3291.2381 4854
1.47 21 4898 7833.7143 12547
2.13 21 12356 15220.4286 18442
1.80 20 7862 12955.1500 17273
2.28 20 9068 15422.9500 18055
2.30 20 7226 14861.5500 18304
1.43 19 6086 9234.9474 14429
1.68 19 5765 11779.1053 16985
1.44 18 4064 8021.6111 12093
1.46 18 5421 8237.4444 12261
1.83 18 5083 11930.0000 18525
2.17 18 6817 14416.6111 17805
1.77 17 7771 11103.1765 15278
2.29 17 11502 15834.7647 18823
2.51 17 14209 16616.8824 18308
2.24 16 10913 15573.9375 18299
2.25 16 6653 13403.5625 18034
2.32 16 6860 16042.1875 18532
2.50 16 7854 14316.1875 18447
1.45 15 4320 8709.3333 17545
1.79 15 9122 13144.0667 18429
2.26 15 11226 15330.2000 18426
1.82 13 5993 12986.1538 15802
2.23 13 7006 13947.4615 18149
2.31 13 7257 15017.3846 17715
2.40 13 11988 16362.4615 18541
0.20 12 345 365.1667 367
1.78 12 8889 12504.5833 18128
1.91 12 6632 12365.8333 17509
2.27 12 5733 14198.4167 18343
3.01 12 8040 15521.6667 18710
1.49 11 5407 9174.2727 18614
0.21 9 326 380.2222 394
1.81 9 5859 11374.4444 14177
1.86 9 9791 12164.3333 17267
2.33 9 8220 14089.7778 18119
2.48 9 12883 16537.1111 18692
2.52 9 10076 16242.1111 18252
2.36 8 11238 15756.7500 18745
2.38 8 10716 15481.3750 18559
2.53 8 14659 16548.8750 18254
2.54 8 12095 15125.2500 17996
3.00 8 6512 12378.5000 16970
1.48 7 6216 8791.2857 15164
1.87 7 9955 12933.5714 17761
1.90 7 8576 11477.1429 13919
2.35 7 14185 16090.4286 17999
2.39 7 15917 17182.4286 17955
2.42 7 16198 16923.5714 18615
1.93 6 13278 16828.1667 18306
2.37 6 14837 16525.1667 18508
2.43 6 9716 15356.0000 18692
0.22 5 337 391.4000 470
1.98 5 12923 14667.8000 16171
2.34 5 8491 11922.0000 15818
2.41 5 13563 15808.2000 17923
1.84 4 7922 11108.5000 13905
1.88 4 8048 10946.2500 13607
1.89 4 10055 13969.0000 17553
1.96 4 5554 8642.2500 13099
1.97 4 14410 15966.0000 17535
2.44 4 13027 16365.5000 18430
2.45 4 11830 14593.7500 18113
1.85 3 5688 11436.0000 14375
1.94 3 9344 14189.3333 18735
1.95 3 5045 12340.6667 17374
1.99 3 11486 14179.0000 17713
2.46 3 10470 14227.0000 16466
2.47 3 14970 15644.0000 16532
2.49 3 6289 13843.0000 18325
2.55 3 14351 15964.0000 18766
2.56 3 15231 16723.3333 17753
2.57 3 17116 17841.6667 18485
2.58 3 12500 14481.3333 16195
2.60 3 17027 17535.0000 18369
2.61 3 13784 16583.0000 18756
2.63 3 10437 12659.6667 16914
2.72 3 6870 12088.3333 17801
2.74 3 8807 14385.0000 17184
1.92 2 15364 15418.0000 15472
2.66 2 16239 17367.0000 18495
2.68 2 8419 9042.0000 9665
2.75 2 13156 14285.5000 15415
3.04 2 15354 16956.5000 18559
4.01 2 15223 15223.0000 15223
2.59 1 16465 16465.0000 16465
2.64 1 17407 17407.0000 17407
2.65 1 16314 16314.0000 16314
2.67 1 18686 18686.0000 18686
2.70 1 14341 14341.0000 14341
2.71 1 17146 17146.0000 17146
2.77 1 10424 10424.0000 10424
2.80 1 15030 15030.0000 15030
3.02 1 10577 10577.0000 10577
3.05 1 10453 10453.0000 10453
3.11 1 9823 9823.0000 9823
3.22 1 12545 12545.0000 12545
3.24 1 12300 12300.0000 12300
3.40 1 15964 15964.0000 15964
3.50 1 12587 12587.0000 12587
3.51 1 18701 18701.0000 18701
3.65 1 11668 11668.0000 11668
3.67 1 16193 16193.0000 16193
4.00 1 15984 15984.0000 15984
4.13 1 17329 17329.0000 17329
4.50 1 18531 18531.0000 18531
5.01 1 18018 18018.0000 18018

The most frequent carat weight equals to 0.3 in this data set. From the histogram, we can see the distribution of carat. To see all carats according to their count, following table can be analyzed. It can be said that, most of the carat values in this data set are less than 1.

diamonds%>% ggplot(.,aes(x=carat))+
  geom_vline(aes(xintercept=mean(carat)),
            color="black", linetype="dashed", size=1)+
  geom_histogram(bins=30, color="tomato3", fill="salmon")+ 
  theme_minimal()+ 
  labs(title = "Distribution of Carat",
       subtitle = "Diamonds dataset",
       x = "Carat")

3.3.2 Price vs x, y, z, and Depth

Price is highly related with x, y, and, z which can be seen below scatter plots. Since depth is a formula obtained by using these variables, the plot belongs to depth is not presented in this part.

p1 <- ggplot(diamonds, aes(x, price,)) +
  geom_point(alpha = 0.5, color="seagreen3") +
  geom_smooth(method="loess", se=F,color="red", size=2, linetype=2 ) +
  theme_minimal()+
  labs(title = "Price vs Diamond Length",
       subtitle = "Diamonds dataset",
       x = "Diamond Length (x)",
       y = "Price")

p2 <- ggplot(diamonds, aes(y, price,)) +
  geom_point(alpha = 0.5, color="seagreen3") +
  geom_smooth(method="loess", se=F,color="red", size=2, linetype=2 ) +
  theme_minimal()+
  labs(title = "Price vs Diamond Width",
       subtitle = "Diamonds dataset",
       x = "Diamond Width (y)",
       y = "Price")

p3 <- ggplot(diamonds, aes(z, price,)) +
  geom_point(alpha = 0.5, color="seagreen3") +
  geom_smooth(method="loess", se=F,color="red", size=2, linetype=2 ) + 
  theme_minimal()+
  labs(title = "Price vs Diamond Depth",
       subtitle = "Diamonds dataset",
       x = "Diamond Depth (z)",
       y = "Price")

(p1 | p2 | p3)

4. Clustering

To make model and predict the price variable; first, we need to divide data set into two pieces: train data and test data. By using train data, we can create a model for the data set, and then by using test data we can control the accuracy of the model obtained by using train data. Following codes provide the production of the train and test data set. Creating of the train and test data is random process, for this reason we use set.seed() function to get same two data set.

set.seed(503)
#test data
diamonds_test <- diamonds %>% 
                    #to divide data set to, we add diamond id for each row
                    mutate(diamond_id = row_number()) %>% 
                    # we group data set according to the cut,color and clarity
                    group_by(cut, color, clarity) %>%   
                    # by using sample_frac() function, we get 20% of the data as test data
                    sample_frac(0.2) %>%                  
                    ungroup()    

#train data
#by using anti_join we get different rows from data set.
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),  
    diamonds_test, by = "diamond_id")

#then to make clear analysis we need to extract diamond_id column by using following codes.
diamonds_train <- diamonds_train[, c(-ncol(diamonds_train))]
diamonds_test  <- diamonds_test[, c(-ncol(diamonds_test))]

Train data set:

diamonds_train %>% glimpse()
## Rows: 43,000
## Columns: 10
## $ carat   <dbl> 0.21, 0.23, 0.31, 0.24, 0.26, 0.22, 0.30, 0.23, 0.31, 0.20,...
## $ cut     <ord> Premium, Good, Good, Very Good, Very Good, Fair, Good, Idea...
## $ color   <ord> E, E, J, J, H, E, J, J, J, E, I, J, I, E, J, J, G, I, D, F,...
## $ clarity <ord> SI1, VS1, SI2, VVS2, SI1, VS2, SI1, VS1, SI2, SI2, SI2, SI1...
## $ depth   <dbl> 59.8, 56.9, 63.3, 62.8, 61.9, 65.1, 64.0, 62.8, 62.2, 60.2,...
## $ table   <dbl> 61, 65, 58, 57, 55, 61, 55, 56, 54, 62, 54, 56, 56, 55, 62,...
## $ price   <int> 326, 327, 335, 336, 337, 337, 339, 340, 344, 345, 348, 351,...
## $ x       <dbl> 3.89, 4.05, 4.34, 3.94, 4.07, 3.87, 4.25, 3.93, 4.35, 3.79,...
## $ y       <dbl> 3.84, 4.07, 4.35, 3.96, 4.11, 3.78, 4.28, 3.90, 4.37, 3.75,...
## $ z       <dbl> 2.31, 2.31, 2.75, 2.48, 2.53, 2.49, 2.73, 2.46, 2.71, 2.27,...

Test data set:

diamonds_test %>% glimpse()
## Rows: 10,761
## Columns: 10
## $ carat   <dbl> 2.15, 2.00, 0.70, 1.00, 1.72, 1.45, 1.05, 1.02, 0.71, 2.02,...
## $ cut     <ord> Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair,...
## $ color   <ord> J, J, J, J, J, J, J, J, J, J, J, J, J, J, J, J, J, J, J, J,...
## $ clarity <ord> I1, I1, I1, I1, I1, SI2, SI2, SI2, SI2, SI2, SI1, SI1, SI1,...
## $ depth   <dbl> 65.5, 66.5, 64.7, 65.0, 68.5, 58.6, 65.8, 65.0, 65.5, 64.6,...
## $ table   <dbl> 57.0, 56.0, 59.0, 59.0, 59.0, 68.0, 59.0, 59.0, 61.0, 55.0,...
## $ price   <int> 5430, 6796, 1066, 1784, 5240, 4570, 2789, 3119, 1409, 9853,...
## $ x       <dbl> 8.01, 7.93, 5.59, 6.27, 7.31, 7.45, 6.41, 6.34, 5.54, 7.93,...
## $ y       <dbl> 7.95, 7.80, 5.50, 6.17, 7.24, 7.33, 6.27, 6.24, 5.49, 7.84,...
## $ z       <dbl> 5.23, 5.23, 3.59, 4.05, 4.98, 4.34, 4.18, 4.08, 3.61, 5.09,...

After the division of the data set into test and train, we extract the diamonds_id column to make clear analysis. After that point, we can transform factor variable to numeric variable by using as.numeric() function. That is, all the ordered columns, we will use them as numeric values.

#we transform our factor variable to numeric variable, 
#then we assign new variable to their own variables by using following codes.

diamonds_train$cut <- as.numeric(diamonds_train$cut)
diamonds_train$clarity <- as.numeric(diamonds_train$clarity)
diamonds_train$color <- as.numeric(diamonds_train$color)
diamonds_test$cut <- as.numeric(diamonds_test$cut)
diamonds_test$clarity <- as.numeric(diamonds_test$clarity)
diamonds_test$color <- as.numeric(diamonds_test$color)

4.1 Principal Component Analysis

Principal Component Analysis (PCA) is suitable for numeric variables, therefore we choose this type of variables from our data set to continue. It can be seen from summary output that 4 component is enough in order to explain 88% of the data. We should add these components to both train and test data set for the following price estimation models. To make PCCA, we use princomp() function which belongs to base R.

set.seed(157)
diamonds_pca <- princomp(as.matrix(diamonds_train[,sapply(diamonds_train, class) == "numeric"]),cor=T)
summary(diamonds_pca,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     2.0743446 1.1893677 1.1061659 0.9926647 0.83123137
## Proportion of Variance 0.4781006 0.1571773 0.1359559 0.1094870 0.07677173
## Cumulative Proportion  0.4781006 0.6352779 0.7712338 0.8807208 0.95749254
##                            Comp.6      Comp.7       Comp.8       Comp.9
## Standard deviation     0.59250776 0.172964901 0.0341800209 2.040935e-02
## Proportion of Variance 0.03900727 0.003324095 0.0001298082 4.628241e-05
## Cumulative Proportion  0.99649981 0.999823909 0.9999537176 1.000000e+00
## 
## Loadings:
##         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## carat    0.471                       0.126         0.868              
## cut     -0.109  0.595  0.452  0.240 -0.176  0.584                     
## color   -0.158 -0.203         0.803  0.537                            
## clarity -0.222  0.208  0.212 -0.506  0.778                            
## depth           0.209 -0.828         0.117  0.492                0.101
## table    0.129 -0.698  0.226 -0.182         0.642                     
## x        0.475                       0.115        -0.278 -0.699  0.425
## y        0.474                       0.120        -0.299  0.714  0.381
## z        0.474  0.103                0.132        -0.282        -0.815
#by using following function we can add this pca variables into data set
#since we select the four component, we add four variables into data set

#Train data set
PCAResults = predict(diamonds_pca, newdata = diamonds_train) # prediction by using data set
diamonds_train$PCA1 <- PCAResults[,1] 
diamonds_train$PCA2 <- PCAResults[,2]
diamonds_train$PCA3 <- PCAResults[,3]
diamonds_train$PCA4 <- PCAResults[,4]

#Test data set
PCAResultsTest = predict(diamonds_pca, newdata = diamonds_test) # prediction by using data set
diamonds_test$PCA1 <- PCAResultsTest[,1] 
diamonds_test$PCA2 <- PCAResultsTest[,2]
diamonds_test$PCA3 <- PCAResultsTest[,3]
diamonds_test$PCA4 <- PCAResultsTest[,4]

4.2 Clustering

There are different kinds of clustering methods to create a feature, in this assignment K-means is used. First of all, scaling should be applied since if a column have much higher values than the others, it may dominate the results. In order to scale the data, we should find the minimum and maximum values of the train set, and then we will scale both train and test set with the same values.

MinValues = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], min)
MaxValues = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], max)

We get each variables minimum and maximum variables, now we need to implement scaling process into our data sets. In this process, we try to subtract vector form each row of the matrix data set. To make it easy, we can use sweep() function. By using the sweep function we can make calculation with vector and matrix.

diamonds_train_scale <- sweep(sweep(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, MinValues), 2, (MaxValues-MinValues), "/")

For deciding the number of cluster, we can use a for loop from 1 to 15 center, and then select the number of center value.

errors = c()
for (i in (1:15)){
  set.seed(157) 
  cluster<-kmeans(diamonds_train_scale,centers=i) 
  errors = c(errors, 100*round(1 - (cluster$tot.withinss/cluster$totss), digits = 3))
  }

errors_df <- data.frame(x=c(1:15), y=errors) 

ggplot(errors_df, aes(x=x, y=y)) +
  geom_point(color = "firebrick2") +
  geom_line(color="firebrick3") +
  geom_text(aes(label = errors), size=3, color = "black", position = position_stack(vjust = 0.95))+
  theme_minimal() +
  labs(title = "Center Number vs R-Square",
       subtitle = "Diamonds dataset",
       x = "Cluster Number",
       y = "R-Square")

The decrease in errors are slowly changing after the cluster with 8 centers. So, we can say that we should select the model with center equals to 8.

set.seed(157)
best_cluster = kmeans(diamonds_train_scale,centers=8)
diamonds_train$cluster = as.factor(best_cluster$cluster)
diamonds_train %>%
  group_by(cluster)%>%
  summarise(cluster_count = n()) %>%
  
  ggplot(., aes(x=cluster_count, y = reorder(cluster, cluster_count))) +
  geom_col(fill = "paleturquoise4")+
  theme_minimal() +
  labs(title = "Number of Cluster in Data Set",
       st="Diamond dataset",
       x = "Number of cluster",
       y = "Cluster")

Now, we need to apply clustering process to the test data set.

diamonds_test_scale <- sweep(sweep(diamonds_test[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, MinValues), 2, (MaxValues-MinValues), "/")

closest.cluster <- function(x) {
  cluster.dist <- apply(best_cluster$centers, 1, function(y) sqrt(sum((x-y)^2)))
  return(which.min(cluster.dist)[1])
}
diamonds_test$cluster <- as.factor(apply(diamonds_test_scale, 1, closest.cluster))

Now, we have enough data to create models for this data sets. In the next section, by using all outputs and process in the previous sections, the best predicting model is prepared.

5. Price Estimation Models

Before we start to to prepare prediction model, we can control the correlation between variables.

diamonds_cor<-cor(diamonds_train[-c(11:15)])
corrplot(diamonds_cor, method="number")

5.1 Generalized Linear Model

There are four assumptions to be fulfilled in a linear model:

  1. Linearity Assumption
  2. Constant Variance Assumption
  3. Independence Assumption
  4. Normality Assumption

Since, our data set does not fulfill these assumptions we will construct a generalized linear model which is more flexible than standard linear model in the terms of assumptions. The response variable, price, is continuous therefore Gamma or Gaussian family may fit. The lowest AIC value is obtained by Gamma family with identity link function.

model_glm <- glm(price ~ carat+cut+color+clarity+depth+table+x+y+z+cluster, data = diamonds_train, family = Gamma(link = "identity"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5, 0.5,0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm)
## 
## Call:
## glm(formula = price ~ carat + cut + color + clarity + depth + 
##     table + x + y + z + cluster, family = Gamma(link = "identity"), 
##     data = diamonds_train, start = c(0.5, 0.5, 0.5, 0.5, 0.5, 
##         0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 
##         0.5))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0700  -0.1955  -0.0545   0.1173   6.5060  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20286.918    694.509 -29.210  < 2e-16 ***
## carat         6346.172     36.947 171.765  < 2e-16 ***
## cut            -22.296      3.901  -5.716 1.10e-08 ***
## color           86.507      1.669  51.825  < 2e-16 ***
## clarity         75.038      1.657  45.276  < 2e-16 ***
## depth          317.350     11.169  28.413  < 2e-16 ***
## table           -4.607      1.101  -4.184 2.87e-05 ***
## x             4202.185     90.019  46.681  < 2e-16 ***
## y              678.821     88.742   7.649 2.06e-14 ***
## z            -8267.158    242.533 -34.087  < 2e-16 ***
## cluster2        30.883      9.742   3.170 0.001525 ** 
## cluster3        56.930     15.625   3.644 0.000269 ***
## cluster4       723.762     25.527  28.353  < 2e-16 ***
## cluster5      -167.412      8.508 -19.677  < 2e-16 ***
## cluster6       105.534      9.531  11.073  < 2e-16 ***
## cluster7       611.267     34.613  17.660  < 2e-16 ***
## cluster8       -72.963      9.573  -7.622 2.56e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.09973595)
## 
##     Null deviance: 42124.3  on 42999  degrees of freedom
## Residual deviance:  3154.8  on 42983  degrees of freedom
## AIC: 679803
## 
## Number of Fisher Scoring iterations: 25

The model can be improved by some arrangement in the explanatory variables.

model_glm2 <- glm(price ~ carat*color+carat*clarity+I(carat^2)+cluster+y+depth, data = diamonds_train, family = Gamma(link = "identity"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5, 0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm2)
## 
## Call:
## glm(formula = price ~ carat * color + carat * clarity + I(carat^2) + 
##     cluster + y + depth, family = Gamma(link = "identity"), data = diamonds_train, 
##     start = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 
##         0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.34343  -0.11391  -0.00724   0.09756   2.20066  
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   -1447.7988    69.2925  -20.894  < 2e-16 ***
## carat         -7164.3182    41.3076 -173.438  < 2e-16 ***
## color           -66.6927     1.8990  -35.119  < 2e-16 ***
## clarity        -166.2208     2.0255  -82.064  < 2e-16 ***
## I(carat^2)     5000.0573    18.6769  267.714  < 2e-16 ***
## cluster2         31.4677     3.3334    9.440  < 2e-16 ***
## cluster3        263.7606    10.0305   26.296  < 2e-16 ***
## cluster4        490.0523    12.0700   40.601  < 2e-16 ***
## cluster5        -82.5500     4.6767  -17.651  < 2e-16 ***
## cluster6         61.7204     3.4631   17.822  < 2e-16 ***
## cluster7        462.1753    18.2967   25.260  < 2e-16 ***
## cluster8        -19.1216     3.5598   -5.372 7.85e-08 ***
## y               609.2037     5.0484  120.672  < 2e-16 ***
## depth            10.0415     0.8831   11.371  < 2e-16 ***
## carat:color     413.9402     4.8268   85.759  < 2e-16 ***
## carat:clarity   735.9830     5.0657  145.288  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.02846104)
## 
##     Null deviance: 42124.3  on 42999  degrees of freedom
## Residual deviance:  1206.6  on 42984  degrees of freedom
## AIC: 638149
## 
## Number of Fisher Scoring iterations: 25

To make an example with components, following model can be constructed.

model_glm_pca = glm(price ~ PCA1 + PCA2 + PCA3 + PCA4 + cluster, data = diamonds_train, family = Gamma(link = "sqrt"))
summary(model_glm_pca)
## 
## Call:
## glm(formula = price ~ PCA1 + PCA2 + PCA3 + PCA4 + cluster, family = Gamma(link = "sqrt"), 
##     data = diamonds_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.10222  -0.24752  -0.05761   0.15491   1.83183  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  60.64345    0.12940 468.647  < 2e-16 ***
## PCA1         12.08463    0.03760 321.403  < 2e-16 ***
## PCA2          3.00609    0.03860  77.880  < 2e-16 ***
## PCA3          2.24874    0.04084  55.058  < 2e-16 ***
## PCA4          0.70079    0.05198  13.481  < 2e-16 ***
## cluster2     -0.93044    0.14448  -6.440 1.21e-10 ***
## cluster3     -8.00277    0.23054 -34.714  < 2e-16 ***
## cluster4     -1.02700    0.23782  -4.318 1.58e-05 ***
## cluster5     -4.35581    0.16242 -26.819  < 2e-16 ***
## cluster6    -11.68687    0.12298 -95.034  < 2e-16 ***
## cluster7     -8.23586    0.28306 -29.096  < 2e-16 ***
## cluster8     -5.13711    0.14563 -35.276  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.1102179)
## 
##     Null deviance: 42124  on 42999  degrees of freedom
## Residual deviance:  4306  on 42988  degrees of freedom
## AIC: 693361
## 
## Number of Fisher Scoring iterations: 8

PCA helps us to use less feature in a model. But, as expected, with fewer feature we will get more AIC value in the model which means the explanatory power of the model decreases. Since we do not have so many features, we do not need to decrease the feature number.We may also use all the principal components instead of the variables to deal with multicollinearity however in this study this model is not investigated. For the best glm model, we can calculate the R^2 value. It can be calculated easily with:

model_glm2_prediction<-predict(model_glm2, newdata=diamonds_test)
model_glm2_r2 <- 1 - (sum((model_glm2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_glm2_r2
## [1] 0.8613199

To visually see the prediction vs actual price values, we can check the below plot.

pred_vs_real_glm <- as.data.frame(cbind(model_glm2_prediction, diamonds_test$price))
colnames(pred_vs_real_glm) = c("prediction", "actual")

pred_vs_real_glm %>%
  ggplot(aes(prediction, actual)) +
  geom_point(color="seagreen3")  +
  theme_minimal()+
  geom_abline(intercept = 0, slope = 1, color="red3", size=2, linetype=2) +
  labs(title = "Prediction vs Actual Values for the Generalized Linear Model",
       subtitle = "Diamonds dataset",
       x = "Predictions",
       y = "Real Values")

5.2 Classification and Regression Tree

Another supervised model is Classification and Regression Tree models, i.e., CART. It is a predicted model and it is a model that explains how an outcome variable’s values can be predicted based on other values. To create CART model, we use rpart() function which belongs to base R. To create a CARt model, again we use diamonds_train dataset. After the creating of the model, there are several visualization methods like prp(), fancyRpartPlot, and so on. In this example, fancyRpartPlot plot is used.

model_cart <- rpart(price~., data=diamonds_train)
fancyRpartPlot(model_cart, sub="CART Model")

By using obtain tree, we can examine nodes and define the divisions. According to the results, some variables are used as division parameters. In this output, we shows that y and clarity are main parameters or affecting values in the data set. In other words, these variables are better features to reduce the variance in the data set with default argument. This tree is obtain by using default variables, by changing the default values like cp, we can obtain different models and we can select the best of them.In this analysis, the default values are used.

Now we create two different models, one of them obtain from the glm function and the other one is obtained from the CART, tree analysis. However, we also need to test our models by using test data set for each model. First, We predict glm model, then we control the CART model. Then by using error, we can compare the result and select the best model.

#GLM prediction model
#To predict price we can use predict function with the best glm model
model_glm2Prediction <-predict(model_glm2, newdata=diamonds_test)

#PRS: Prediction R Square
GLModel2PRS <- 1 - (sum((model_glm2Prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
GLModel2PRS
## [1] 0.8613199

According to the results, we obtain R Square value as 0.86. This means that our model explains our data set with 81%. We need to make same calculation for CART model.

ModelCARTPrediction <-predict(model_cart, newdata=diamonds_test)

#PRS: Prediction R Square
ModelCARTPRS <- 1 - (sum((ModelCARTPrediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
ModelCARTPRS
## [1] 0.8768922

Now, we can compare the generalized linear model and CART. the result shows that the CART model R Square value is higher than the GLM. We can conclude by saying CART model is better than Generalized Linear Model. This means that, the CART model explains the data set instead of GLM.