1 Diamonds

Diamonds are one of the most expensive gemstones in the world. Diamonds are often said to be rare, and according to this article, the rarer the stone, the higher its price.

According to this site, the price of a diamond is related to its shape and the 4Cs.

  • Shape: It refers to the outline or external figure of the diamond.
    • There are many shapes, such as Round, Princess, Emerald, and Cushion.
    • Even if all other properties are equal, the shape of a diamond changes its price.
    • In our diamonds dataset, there is no shape column, so we will assume that all diamonds have the same shape.

Figure 1: Types of Shapes

  • Cut: It is how well a diamond is cut and polished, including how well-proportioned the stone is, its depth and symmetry.
    • A well cut diamond is luminous and reflects white and colored light back to your eyes.
    • A poorly cut diamond is dull instead of brilliant.
    • Cut types are explained below:
      • Excellent: Excellent Cut Diamonds provide the highest level of fire and brilliance. Because almost all of the incoming light is reflected through the table, the diamond radiates with magnificent sparkle.
      • Very Good: Very Good Cut Diamonds offer exceptional brilliance and fire. A large majority of the entering light reflects through the diamond’s table. To the naked eye, Very Good diamonds provide similar sparkle to those of Excellent grade.
      • Good: Good Cut Diamonds showcase brilliance and sparkle, with much of the light reflecting through the table to the viewer’s eye. These diamonds provide beauty at a lower price point.
      • Fair: Fair Cut Diamonds offer little brilliance, as light easily exits through the bottom and sides of the diamond. Diamonds of a Fair Cut may be a satisfactory choice for smaller carats and those acting as side stones.
      • Poor: Poor Cut Diamonds yield nearly no sparkle, brilliance or fire. Entering light escapes from the sides and bottom of the diamond.

In this link, you can find some diamond shapes, followed by a price table at the end of the shape descriptions. For any diamond, a better cut is more expensive than a worse one if all other properties are equal.

Figure 2: Looks of Cuts

  • Color: A diamond’s color refers to how clear or yellow it is.
    • Diamond color is measured using the Gemological Institute of America (GIA) color scale, which goes from D (colorless) all the way to Z (light yellow or brown).
    • In general, the highest quality diamonds are totally colorless, whereas lower quality diamonds can often have a slight yellow tint.
    • Diamond colors are graded from D to Z, with most diamonds used in jewelry falling into the D to M range; colors after M are not very popular.
    • The difference between adjacent color grades (for example, a G color diamond and an H color diamond) is very small, largely impossible to perceive with the naked eye. Compared with a diamond three or four grades above or below, however, the difference is easily seen.

Figure 3. Color Scales of Diamonds

  • Clarity: Diamond clarity is a qualitative metric that grades the visual appearance of each diamond.
    • The fewer inclusions and blemishes the diamond has, the better the clarity grade.
    • Five factors play a significant role in how the clarity grades are determined: the size, nature, number, location, and relief of the inclusions.

Figure 4. Classification of Diamonds by Clarity

  • Carat: The carat is the weight of the diamond.
    • One carat is equal to 200 milligrams.
    • It is said that the more carats a diamond has, the more expensive it is. Even though you see only the top of a diamond on your ring, the carat still affects the price.
    • Carat correlates with the diameter of a diamond: it is mathematically impossible for a 0.05 ct diamond to have more surface area than a 1 ct diamond.
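Since the rest of this report uses R, the weight definition above can be sketched as a one-line helper (carat_to_grams is our own illustrative name, not a standard function):

```r
#one carat equals 200 milligrams, i.e. 0.2 grams
carat_to_grams <- function(carat) carat * 0.2

carat_to_grams(c(0.3, 1, 5.01))  # 0.060 0.200 1.002 grams
```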

2 Data and Required Packages

2.1 Data Information

In this assignment, the objectives can be listed as follows:

  1. To provide useful exploratory data analysis (EDA)
  2. To create a model for estimation of the diamond price

To fulfill these objectives, (i) the data is analyzed and prepared for the analysis, and (ii) a meaningful EDA is presented using packages such as ggplot2, dplyr, and data.table.

#install any required packages that are not yet installed
pti <- c("data.table", "tidyverse", "kableExtra", "knitr", "RColorBrewer", "caret", "e1071", "rpart", "rpart.plot", "rattle", "corrplot")
pti <- pti[!(pti %in% rownames(installed.packages()))]
if(length(pti) > 0){
    install.packages(pti)
}

library(tidyverse)
library(data.table)
library(knitr)
library(kableExtra)
library(RColorBrewer)
library(caret)
library(e1071)
library(rpart)
library(rpart.plot)
library(rattle)
library(corrplot)

The dataset used in this analysis comes from the ggplot2 package, so it is available as soon as the package is loaded. By using ?diamonds we can obtain information about this dataset. It contains 53940 rows and 10 variables. As mentioned before, there is no shape column in this dataset. Using the glimpse() function, we can list the variables.

#variables of the dataset
diamonds %>%
  glimpse()
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
  • Price: price in US dollars ($326–$18,823)
  • Carat: weight of the diamond (0.2–5.01)
  • Cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  • Color: diamond color, from D (best) to J (worst)
  • Clarity: a measurement of how clear the diamond is: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)
  • x: length in mm (0–10.74)
  • y: width in mm (0–58.9)
  • z: depth in mm (0–31.8)
  • Depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
  • Table: width of top of diamond relative to widest point (43–95)
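The depth formula above ties z to x and y. As a quick sanity check, we can recompute it for the first two rows printed by glimpse() (values hard-coded below) and compare with the recorded depth column:

```r
#recompute total depth percentage = 2 * z / (x + y) * 100 for the
#first two rows of the dataset and compare with the recorded depth
x <- c(3.95, 3.89); y <- c(3.98, 3.84); z <- c(2.43, 2.31)
depth_recorded <- c(61.5, 59.8)
depth_computed <- 2 * z / (x + y) * 100

round(depth_computed, 1)                         # close to the recorded values
all(abs(depth_computed - depth_recorded) < 0.5)  # TRUE, up to rounding
```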

2.2 Data Preprocessing

In the previous section, detailed information about the investigated dataset was presented. In this section, the data is analyzed and pre-processed before the detailed analyses. The first step is to check for null values.

#count NA values in the dataset
sum(is.na(diamonds))
## [1] 0

When we check the dataset, there are 0 NA values, which means there is no missing value in any row or column. The second step is to check for duplicated rows.

sum(duplicated(diamonds)) #count exactly duplicated rows
## [1] 146

The function above counts rows that are exact copies of another row; there are 146 duplicated rows. Before the analysis, we need to remove them from the data frame, which we can do with the unique() function.

#taking only unique rows from the data set
diamonds <- unique(diamonds)
#check for duplicated rows after removing them
sum(duplicated(diamonds))
## [1] 0

After using the unique() function there are no duplicated rows, and the data has 53794 rows and 10 columns. The third step is to check the types of the variables (numeric, factor, etc.), which we can do with the str() function.

str(diamonds)
## tibble [53,794 x 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53794] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53794] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53794] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53794] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53794] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53794] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53794] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

When we look at the results, we see that the color levels run from D (best) down to J (worst). We reverse the order so that the levels go from the worst color (J) up to the best (D).

diamonds$color = factor(diamonds$color, levels = c("J", "I", "H", "G", "F", "E", "D"))

The data is now ready for further checks, such as the accuracy of the values. Before that, we can use the summary() and head() functions to get more information about the dataset.

summary(diamonds) # summary of each variable in the dataset
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1598   J: 2802   SI1    :13032   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4891   I: 5407   VS2    :12229   1st Qu.:61.00  
##  Median :0.7000   Very Good:12069   H: 8272   SI2    : 9150   Median :61.80  
##  Mean   :0.7978   Premium  :13748   G:11262   VS1    : 8156   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21488   F: 9520   VVS2   : 5056   3rd Qu.:62.50  
##  Max.   :5.0100                     E: 9776   VVS1   : 3647   Max.   :79.00  
##                                     D: 6755   (Other): 2524                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  951   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5327   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.030  
##  Max.   :31.800  
## 
head(diamonds) #first 6 rows of the data
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

2.3 Check Accuracy of Values

We also check the data set for non-positive price values.

diamonds %>%
  filter(price <= 0) %>%
  summarise(NegativePrice = n())
## # A tibble: 1 x 1
##   NegativePrice
##           <int>
## 1             0

Logically, the price cannot be less than or equal to zero. For this reason, we check the dataset for negative or zero price values. According to the results, all price values are positive, so there is nothing to do for the price column.

Now we can search for non-positive x, y and z values.

diamonds %>%
  filter(x <= 0 & y > 0 & z > 0)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## #   depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
diamonds %>%
  filter(y <= 0 & x > 0 & z > 0)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## #   depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
diamonds %>%
  filter(z <= 0 & x > 0 & y > 0)
## # A tibble: 12 x 10
##    carat cut     color clarity depth table price     x     y     z
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  1    Premium G     SI2      59.1    59  3142  6.55  6.48     0
##  2  1.01 Premium H     I1       58.1    59  3167  6.66  6.6      0
##  3  1.1  Premium G     SI2      63      59  3696  6.5   6.47     0
##  4  1.01 Premium F     SI2      59.2    58  3837  6.5   6.47     0
##  5  1.5  Good    G     I1       64      61  4731  7.15  7.04     0
##  6  1.15 Ideal   G     VS2      59.2    56  5564  6.88  6.83     0
##  7  2.18 Premium H     SI2      59.4    61 12631  8.49  8.45     0
##  8  2.25 Premium I     SI1      61.3    58 15397  8.52  8.42     0
##  9  2.2  Premium H     SI1      61.2    59 17265  8.42  8.37     0
## 10  2.02 Premium H     VS2      62.7    53 18207  8.02  7.95     0
## 11  2.8  Good    G     SI2      63.8    58 18788  8.9   8.85     0
## 12  1.12 Premium G     I1       60.4    59  2383  6.71  6.67     0

All x and y values are positive, so for now we don't need to do anything there. For z, however, there are 12 rows with a 0 value. We can fill these values using the x, y and depth columns, because depth = 2 * z / (x + y) * 100, which gives z = depth * (x + y) / 2 / 100.

#impute z rowwise from depth and the mean of x and y: z = depth * (x + y) / 2 / 100
diamonds = diamonds %>%
  mutate(z = ifelse(z == 0 & x != 0 & y != 0, round(depth * (x + y) / 2 / 100, 2), z))

So far, we have searched for rows in which exactly one of x, y and z is non-positive while the other two are positive; any such row can be recalculated as above. Now we need to check for rows in which two of these three columns are non-positive.

diamonds %>%
  filter(x <= 0 & y <= 0)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1    Very Good H     VS2      63.3    53  5139     0     0     0
## 2  1.14 Fair      G     VS1      57.5    67  6381     0     0     0
## 3  1.56 Ideal     G     VS2      62.2    54 12800     0     0     0
## 4  1.2  Premium   D     VVS1     62.1    59 15686     0     0     0
## 5  2.25 Premium   H     SI2      62.8    59 18034     0     0     0
## 6  0.71 Good      F     SI2      64.1    60  2130     0     0     0
diamonds %>%
  filter(x <= 0 & z <= 0)
## # A tibble: 7 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1.07 Ideal     F     SI2      61.6    56  4954     0  6.62     0
## 2  1    Very Good H     VS2      63.3    53  5139     0  0        0
## 3  1.14 Fair      G     VS1      57.5    67  6381     0  0        0
## 4  1.56 Ideal     G     VS2      62.2    54 12800     0  0        0
## 5  1.2  Premium   D     VVS1     62.1    59 15686     0  0        0
## 6  2.25 Premium   H     SI2      62.8    59 18034     0  0        0
## 7  0.71 Good      F     SI2      64.1    60  2130     0  0        0
diamonds %>%
  filter(y <= 0 & z <= 0)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1    Very Good H     VS2      63.3    53  5139     0     0     0
## 2  1.14 Fair      G     VS1      57.5    67  6381     0     0     0
## 3  1.56 Ideal     G     VS2      62.2    54 12800     0     0     0
## 4  1.2  Premium   D     VVS1     62.1    59 15686     0     0     0
## 5  2.25 Premium   H     SI2      62.8    59 18034     0     0     0
## 6  0.71 Good      F     SI2      64.1    60  2130     0     0     0

Looking at the results, there are 7 rows in which both x and z are 0 (in six of them, y is 0 as well). Removing these rows gives us more accurate data, so we apply the following code chunk.

diamonds = diamonds %>%
  filter(!(x == 0 & z == 0))

diamonds %>%
  filter(x == 0 | y == 0 | z == 0)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## #   depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>

After this process, we have all positive values in x, y and z columns. Now, we need to check the range of x, y and z columns for outliers.

range(diamonds$x)
## [1]  3.73 10.74
diamonds$x %>% unique() %>% sort(decreasing = TRUE) %>% head(20)
##  [1] 10.74 10.23 10.14 10.02 10.01 10.00  9.86  9.66  9.65  9.54  9.53  9.51
## [13]  9.49  9.44  9.42  9.41  9.38  9.36  9.35  9.32

There is nothing wrong with the lengths of these diamonds.
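Instead of eyeballing the sorted values, a boxplot-style IQR rule is a common way to flag extreme values. A minimal base-R sketch on a handful of made-up measurements (not the actual column):

```r
#flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
v <- c(3.73, 4.70, 5.70, 6.50, 10.74, 58.90)  # invented example values
q <- quantile(v, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
outliers <- v[v < q[1] - fence | v > q[2] + fence]
outliers  # only the extreme 58.90 is flagged
```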

range(diamonds$y)
## [1]  3.68 58.90
diamonds$y %>% unique() %>% sort(decreasing = TRUE) %>% head(20)
##  [1] 58.90 31.80 10.54 10.16 10.10  9.94  9.85  9.81  9.63  9.59  9.48  9.46
## [13]  9.42  9.40  9.38  9.37  9.34  9.32  9.31  9.26

There are 2 values (58.90 and 31.80) that are very large compared to the other values. So, we need to either remove these rows from the data or recalculate the y values. To avoid shrinking the data, we will try to recalculate them with the help of the depth, x and z values.

diamonds %>%
  filter(y > 15) %>%
  mutate(new_y = 2 * z / depth * 100 - x) %>%
  select(depth, x, z, y, new_y)
## # A tibble: 2 x 5
##   depth     x     z     y new_y
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  58.9  8.09  8.06  58.9  19.3
## 2  61.8  5.15  5.12  31.8  11.4

These recalculated values are still not plausible widths (y should be close to the length x), so we remove these rows from the data.

diamonds = diamonds %>%
  filter(y < 15)
range(diamonds$z)
## [1]  1.07 31.80
diamonds$z %>% unique() %>% sort(decreasing = TRUE) %>% head(20)
##  [1] 31.80  6.98  6.72  6.43  6.38  6.31  6.27  6.24  6.17  6.16  6.13  6.03
## [13]  5.98  5.97  5.92  5.91  5.90  5.86  5.85  5.79

For the z column, there is one row with a very high value compared to the other rows. As before, we recalculate its z value from the depth formula.

diamonds %>%
  filter(z == 31.8) %>%
  mutate(new_z = depth * mean(c(5.12, 5.15)) / 100) %>%
  select(z, new_z)
## # A tibble: 1 x 2
##       z new_z
##   <dbl> <dbl>
## 1  31.8  3.17

The recalculated value suggests that the recorded z should be divided by 10. We can verify this by comparing it with the z values of other diamonds that have similar x and y values.

diamonds[abs(diamonds$x - 5) < 1 & abs(diamonds$y - 5) < 1 & abs(diamonds$z - 3.18) < 1, "z"]
## # A tibble: 30,800 x 1
##        z
##    <dbl>
##  1  2.31
##  2  2.63
##  3  2.75
##  4  2.53
##  5  2.73
##  6  2.71
##  7  2.68
##  8  2.68
##  9  2.7 
## 10  2.71
## # ... with 30,790 more rows

There are many z values near 3.18, so we can be confident that this value should be divided by 10, which we do below.

diamonds$z[diamonds$z == 31.8] = diamonds$z[diamonds$z == 31.8] / 10

Now, we need to check the depth column. To do so, we can calculate a new column called new_depth and compare it with the depth column.

diamonds$new_depth = 2 * diamonds$z / (diamonds$x + diamonds$y) * 100

The depth and new_depth columns should be equal. To see this easily, we can use a scatter plot with a reference line.

diamonds %>%
  ggplot(., aes(x = new_depth, y = depth)) +
  geom_point() + 
  geom_abline(intercept = 0, slope = 1, color="red", size=1.5)

From the plot, we can say that most depth values are almost equal to new_depth, but in some rows the two are far apart. We assume their difference should be less than 11. (This assumption is based on the depth ranges given in this link, which states: "(For a princess cut diamond) A very good cut can be between 56 to 67 percent or 75 to 76 percent." So the maximum depth range among all shapes can be taken as 11.)

diamonds %>%
  filter(!(abs(new_depth - depth) < 11)) %>%
  select(new_depth, depth, x, y, z)
## # A tibble: 20 x 5
##    new_depth depth     x     y     z
##        <dbl> <dbl> <dbl> <dbl> <dbl>
##  1      51.2  62.8  6.26  6.19  3.19
##  2      63.1  43    6.32  6.27  3.97
##  3      65.7  44    6.31  6.24  4.12
##  4      51.7  64    7.15  7.04  3.67
##  5      63.0  43    6.53  6.55  4.12
##  6      16.1  60.6  6.62  6.67  1.07
##  7      20.5  61.9  7.43  7.5   1.53
##  8      19.4  60.7  7.31  7.22  1.41
##  9      40.3  59.4  8.49  8.45  3.41
## 10      78.3  59    6.16  6.15  4.82
## 11      41.4  61.3  8.52  8.42  3.51
## 12      78.3  65.6  7.89  7.84  6.16
## 13      79.1  60    6.29  6.25  4.96
## 14      41.8  61.2  8.42  8.37  3.51
## 15      45.0  62.7  8.02  7.95  3.59
## 16      41.2  63.8  8.9   8.85  3.66
## 17      84.3  61.2  4.51  6.02  4.44
## 18      43.9  60.9  4.71  4.68  2.06
## 19      42.3  61    5.31  5.34  2.25
## 20     101.   63.7  5.01  5.04  5.06

There are 20 such observations, a very small number compared to the total number of observations in the dataset, so we can remove them.

diamonds = diamonds %>%
  filter((abs(new_depth - depth) < 11))
diamonds = subset(diamonds, select = -c(new_depth))

Now we can compare the x, y and z values with each other; they should be highly correlated.

diamonds %>%
  ggplot(., aes(x = x, y = y)) +
  geom_point() + 
  geom_smooth(method = "lm")

diamonds %>%
  ggplot(., aes(x = z, y = y)) +
  geom_point() + 
  geom_smooth(method = "lm")

As expected, these values are highly correlated, so we can assume the x, y and z values are valid.
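The visual impression can be confirmed numerically with cor(); on the full data this is cor(diamonds[, c("x", "y", "z")]), and the already loaded corrplot package can draw the matrix. As a self-contained sketch, here are the first ten values of each column from the glimpse() output above:

```r
#Pearson correlations of the three dimension columns (first 10 rows only)
x <- c(3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00)
y <- c(3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05)
z <- c(2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39)

round(cor(cbind(x, y, z)), 2)  # x and y in particular track each other closely
# corrplot::corrplot(cor(cbind(x, y, z)))  # optional visualization
```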

3 Exploratory Data Analysis

After pre-processing the data, we perform EDA on the diamonds dataset using the different variables.

3.1 Cut

As mentioned before, cut is one of the most important features of a diamond. In this data set, the quality of the cut is given as a categorical variable: Fair is the lowest quality and Ideal the highest. To examine the dataset by this feature, we can look at the percentages of the cut types.

diamonds %>%
  group_by(cut) %>%
  summarise(count = n()) %>%
  mutate(percentage = 100*count/sum(count)) %>%
  ggplot(., aes(x = '', y = count, fill = cut)) + 
    geom_bar(width = 1, stat = "identity", alpha = 0.8) +
    coord_polar("y") +
    theme_void() +
    theme(plot.title = element_text(vjust = 0.5)) +
    geom_text(aes(label = paste(format(percentage,digits=2), "%")), size=4, color = "black", position = position_stack(vjust = 0.3)) +
    labs(title = "Percentage of Quality of Cut ",
        subtitle = st,
        fill = "Quality of the Cut")

The pie chart illustrates that most of the diamonds have an Ideal cut. This was expected, because an Ideal cut diamond is more expensive than other cut types (when all other properties are equal). Moreover, the percentages of Premium and Very Good are almost equal, while only a small share has a Fair cut. The count and the minimum, maximum and average prices can be seen in the table below.

diamonds %>%
  group_by(cut) %>%
  summarise(cut_count = n(),
            MinPrice = min(price),
            AveragePrice = mean(price),
            MaxPrice = max(price)) %>%
  kable(col.names = c("Cut", "Number of Cut Quality", "Minimum Price", "Average Price", "Maximum Price")) %>%
  kable_minimal(full_width = F)
Cut         Number of Cut Quality   Minimum Price   Average Price   Maximum Price
Fair                         1594             337        4334.275           18574
Good                         4888             327        3916.279           18707
Very Good                   12065             336        3981.305           18818
Premium                     13738             326        4576.581           18823
Ideal                       21480             326        3461.698           18806

While the pie chart presents the percentages of the cut types, the table above shows their counts together with the minimum, average and maximum prices. Although the most popular cut type is Ideal, its price is not the highest: by average price, the most expensive diamonds belong to the Premium cut if we look only at the cut variable. This means we cannot explain the response variable (price) with the cut type alone.
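One way to see why cut alone cannot explain price is to compare cuts within the same carat. A toy base-R illustration with invented prices (not taken from the dataset):

```r
#within one carat, the better cut is pricier; across carats, a heavy
#Fair stone easily outprices a light Ideal one (prices invented)
d <- data.frame(
  cut   = c("Fair", "Ideal", "Fair", "Ideal"),
  carat = c(0.3, 0.3, 1.2, 1.2),
  price = c(400, 700, 5200, 7400)
)
aggregate(price ~ cut + carat, data = d, FUN = mean)

#a heavy Fair stone costs more than a light Ideal one
d$price[d$cut == "Fair" & d$carat == 1.2] > d$price[d$cut == "Ideal" & d$carat == 0.3]
```

This is why carat has to enter the model alongside cut: averaging price over cut mixes stones of very different weights.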

3.2 Colors

Color is another important feature of diamonds, and it also affects the price. First, we group the data by color and count the number of diamonds in each group.

diamonds %>%
  group_by(color)%>%
  summarise(count = n()) %>%
  ggplot(., aes(x=color, y = count, fill = count)) +
    geom_col() +
    scale_fill_gradient("count", low="yellow", high="red") +
    geom_line(aes(y = count), size = 1.2, color="black", group = 1) +
    theme_minimal() +
    theme(legend.position = "none") +
    labs(title = "Classification Of Diamonds According to the Colors",
         subtitle = st,
         x = "Diamond Color",
         y = "Number of Color")

The bar chart shows that G is the most common color and J the least common, and the distribution looks roughly normal. The exact count and price values for each color can be obtained from the table below.

diamonds %>%
  group_by(color)%>%
  summarise(color_count = n(),
            MinPrice = min(price),
            AveragePrice = mean(price),
            MaxPrice = max(price)) %>%
  kable(col.names = c("Color", "Count","Minimum Price", "Average Price", "Maximum Price")) %>%
  kable_minimal(full_width = F)
Color   Count   Minimum Price   Average Price   Maximum Price
J        2800             335        5327.337           18710
I        5405             334        5079.123           18823
H        8263             337        4474.449           18803
G       11253             354        3997.174           18818
F        9517             342        3726.559           18791
E        9773             326        3080.044           18731
D        6754             357        3172.589           18693

According to the diamond information given above, the D–F range is the highest purity (actually colorless), followed by the near-colorless grades. The results show that G is the most common color. Moreover, from the table we can read the minimum, maximum and average prices and compare the diamonds by color. By average price, the most expensive diamonds belong to color J if we look only at the color variable, so again we cannot explain the response variable (price) with color alone.

3.3 Clarity

Clarity gives information about whether a diamond contains inclusions or not. To show the clarity distribution of the data set, a lollipop chart (points with segments) is used.

diamonds %>%
  group_by(clarity) %>%
  summarise(clarity_count = n()) %>%
  ggplot(.,aes(x=clarity, y = clarity_count, color= clarity_count)) +
    geom_point(size=9) +
    geom_segment(aes(x=clarity,
                     xend=clarity,
                     y=0,
                     yend=clarity_count))+
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90), legend.position = "none")+
        labs(title = "Classification Of Diamonds According to the Clarity",
         subtitle = st, 
         x = "Clarity",
         y = "Number of Diamonds by Clarity")

According to the results,

  • Most of the diamonds are of the SI1 clarity type, followed by VS2 and SI2. These clarity classes are among the worse ones.
  • The best clarity type, IF, is the second least common. This is expected, since it is hard to produce a diamond without any inclusions.

For a clearer analysis, the following table shows the number of diamonds for each clarity grade, along with the minimum, maximum and average prices.

diamonds %>%
  mutate(clarity = factor(clarity)) %>%
  group_by(clarity) %>%
  summarise(clarity_count = n(),
            MinPrice = min(price),
            AveragePrice = mean(price),
            MaxPrice = max(price)) %>%
  arrange(desc(clarity_count)) %>%
  kable(col.names = c("Clarity", "Count","Minimum Price", "Average Price", "Maximum Price")) %>%
  kable_minimal(full_width = F)
Clarity   Count   Minimum Price   Average Price   Maximum Price
SI1       13026             326        3992.932           18818
VS2       12223             334        3925.842           18823
SI2        9140             326        5051.592           18804
VS1        8153             327        3840.792           18795
VVS2       5055             336        3286.971           18768
VVS1       3645             336        2523.088           18777
IF         1784             369        2870.570           18806
I1          739             345        3924.185           18531

By average price, the most expensive diamonds belong to clarity SI2 if we look only at the clarity variable, so we cannot explain the response variable (price) with the clarity type alone.

3.4 Carat

Carat has two meanings: (i) a measure of purity (for gold) and (ii) the weight of a gemstone. In this data set, carat refers to the weight of the diamond. To find the most common carat, we compute the mode of the carat variable.

#mode of a vector: the value that occurs most often
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(diamonds$carat)
## [1] 0.3

According to the data set, the most common carat is 0.3.

diamonds %>%
  mutate(carat = factor(carat)) %>%
  group_by(carat) %>%
  summarise(carat_count = n())%>%
  arrange(desc(carat_count)) %>%
  head(10) %>%
  ggplot(., aes(y=carat_count, x = reorder(carat, -carat_count), fill = carat)) +
  geom_col() +
  geom_text(aes(label = carat_count), size=3, color = "black", position = position_stack(vjust = 0.95)) +
  theme_minimal() +
  scale_fill_brewer(palette = c("Paired")) +
  theme(legend.position = "none") +
  labs(title = "The 10 Most Popular Carats",
       subtitle = st,
       x = "Carat",
       y = "Count")

The bar chart above shows the ten most common carat values. The following table lists every carat value with its count, price interval and average price.

diamonds %>%
  group_by(carat) %>%
  summarise(carat_count = n(),
            MinPrice = min(price),
            AveragePrice = mean(price),
            MaxPrice = max(price)) %>%
  kable(col.names = c("Carat", "Number of Carats","Minimum Price", "Average Price", "Maximum Price")) %>%
  kable_styling("striped", full_width = T) %>%
  scroll_box(width = "100%", height = "300px")
Carat Number of Carats Minimum Price Average Price Maximum Price
0.20 12 345 365.1667 367
0.21 9 326 380.2222 394
0.22 5 337 391.4000 470
0.23 293 326 486.1433 688
0.24 254 336 505.1850 963
0.25 212 357 550.9245 1186
0.26 253 337 550.8972 814
0.27 233 361 574.7597 893
0.28 198 360 580.1212 828
0.29 130 334 601.1923 1776
0.30 2596 339 680.6367 2366
0.31 2238 335 708.1997 1917
0.32 1827 345 719.1916 1715
0.33 1179 366 765.1747 1389
0.34 909 413 755.8086 2346
0.35 663 409 799.5098 1672
0.36 570 410 781.4667 1718
0.37 394 451 816.0711 1764
0.38 669 497 904.1973 1992
0.39 397 451 923.4761 1624
0.40 1291 484 933.0077 1908
0.41 1378 467 973.0530 1745
0.42 704 552 994.4545 1722
0.43 486 452 995.2737 1827
0.44 212 648 1056.6226 2506
0.45 110 672 1093.3182 1833
0.46 177 746 1240.2429 2139
0.47 99 719 1272.3131 2460
0.48 63 841 1322.0476 2677
0.49 45 900 1306.3333 2199
0.50 1254 584 1503.8453 3378
0.51 1123 826 1623.5841 4368
0.52 813 870 1656.4785 3373
0.53 707 844 1685.5248 3827
0.54 624 935 1704.3317 4042
0.55 495 959 1725.2283 3509
0.56 492 953 1831.9980 4632
0.57 429 929 1798.0443 4032
0.58 310 893 1798.6484 3741
0.59 282 903 1831.4504 4916
0.60 226 806 1956.2434 4209
0.61 204 931 2069.2451 4279
0.62 135 933 1973.3704 3968
0.63 102 1065 2244.7843 6607
0.64 80 1080 2113.5875 4281
0.65 65 1214 2322.6154 5052
0.66 48 1188 2235.7292 4311
0.67 48 1244 2067.8542 3271
0.68 25 1341 2334.3200 4325
0.69 26 1636 2206.6154 3784
0.70 1981 945 2516.0717 5539
0.71 1291 1305 2674.3106 5019
0.72 764 945 2705.7788 5086
0.73 492 1049 2791.2480 5154
0.74 322 1632 2821.9006 5841
0.75 249 1013 2763.4016 5288
0.76 251 1140 2889.5936 5056
0.77 251 1651 2884.3984 5251
0.78 187 1720 2918.9091 4904
0.79 151 1809 2993.2649 4858
0.80 284 1232 3146.1901 6834
0.81 200 1687 3094.4250 6338
0.82 140 2033 3083.6500 4931
0.83 131 1863 3181.9924 7889
0.84 64 1749 3294.9844 6399
0.85 62 1250 3204.1774 7230
0.86 34 1828 3289.9706 5029
0.87 31 1827 3342.3226 5640
0.88 23 1964 3431.3478 5595
0.89 21 1334 3291.2381 4854
0.90 1484 1599 3939.5889 9182
0.91 570 1570 3981.1561 9636
0.92 226 2283 4035.2124 7544
0.93 142 2374 3994.2958 5992
0.94 59 2571 4195.7797 7120
0.95 65 2195 4013.6615 6951
0.96 103 1637 3623.4660 6807
0.97 59 1547 4026.2373 7415
0.98 31 1712 3469.7742 4950
0.99 23 1789 4406.1739 6094
1.00 1544 1681 5250.7720 16469
1.01 2240 2017 5507.1902 16234
1.02 883 2344 5598.4632 17100
1.03 522 1262 5583.8966 17590
1.04 474 2037 5774.6709 18542
1.05 361 2066 5781.6814 13060
1.06 372 2080 5947.0081 15813
1.07 340 2260 6004.2118 18279
1.08 246 2869 5774.3415 14032
1.09 285 2708 6020.0947 18231
1.10 276 3216 5913.0725 12748
1.11 308 3183 6005.5682 15464
1.12 251 2383 6044.4900 14407
1.13 246 2968 6193.4024 14525
1.14 206 2327 6354.4078 18112
1.15 149 3220 5959.9060 12895
1.16 172 2368 6274.2965 13595
1.17 110 2336 6224.8818 11886
1.18 123 3219 6481.1057 14847
1.19 126 2892 6483.2381 15767
1.20 643 2360 6670.7387 16256
1.21 470 2396 6905.4043 17353
1.22 299 2699 6971.5151 14414
1.23 277 3168 7290.8881 16253
1.24 235 2940 7147.1149 15026
1.25 187 3276 7123.2513 16823
1.26 146 3881 7490.7466 15247
1.27 134 2845 7119.1194 15377
1.28 106 3355 7711.8113 18700
1.29 101 2596 7033.0198 17932
1.30 121 2512 7851.6612 15874
1.31 132 3697 7750.6364 17496
1.32 89 4140 7639.3708 15802
1.33 87 4634 7527.0805 18435
1.34 68 4352 7758.5000 17663
1.35 77 4082 7448.2857 14616
1.36 50 4158 8591.5200 15386
1.37 46 5063 8044.0870 13439
1.38 26 4598 8164.3077 17598
1.39 36 3914 8335.1111 13061
1.40 50 4939 9108.5200 16808
1.41 39 4145 9452.2051 17216
1.42 25 6069 9487.2000 18682
1.43 19 6086 9234.9474 14429
1.44 18 4064 8021.6111 12093
1.45 15 4320 8709.3333 17545
1.46 18 5421 8237.4444 12261
1.47 21 4898 7833.7143 12547
1.48 7 6216 8791.2857 15164
1.49 11 5407 9174.2727 18614
1.50 787 2964 10082.9835 18691
1.51 804 3497 10552.0211 18806
1.52 380 3105 10629.9868 17723
1.53 219 5631 10703.4658 17223
1.54 174 4492 10865.0287 18416
1.55 123 4965 10717.7642 18188
1.56 108 5766 10205.9444 17455
1.57 106 5170 10422.7075 17548
1.58 89 4459 11427.9888 18057
1.59 89 6010 10574.3596 17552
1.60 95 6272 11124.2526 18780
1.61 64 6915 10992.8438 18318
1.62 61 5157 11705.8852 18281
1.63 50 7100 11556.3200 18179
1.64 43 4849 10738.0000 17604
1.65 32 5914 10973.8750 17729
1.66 30 5310 11084.5000 16294
1.67 25 7911 12281.4800 17351
1.68 19 5765 11779.1053 16985
1.69 24 5182 11878.0833 17803
1.70 215 5617 12110.3209 18718
1.71 119 6320 12396.5546 18791
1.72 57 5240 12040.3158 18730
1.73 52 6007 11875.9615 18377
1.74 40 4677 12270.3250 18430
1.75 50 7276 12673.7000 17343
1.76 28 9053 12398.5714 18178
1.77 17 7771 11103.1765 15278
1.78 12 8889 12504.5833 18128
1.79 15 9122 13144.0667 18429
1.80 20 7862 12955.1500 17273
1.81 9 5859 11374.4444 14177
1.82 13 5993 12986.1538 15802
1.83 18 5083 11930.0000 18525
1.84 4 7922 11108.5000 13905
1.85 3 5688 11436.0000 14375
1.86 9 9791 12164.3333 17267
1.87 7 9955 12933.5714 17761
1.88 4 8048 10946.2500 13607
1.89 4 10055 13969.0000 17553
1.90 7 8576 11477.1429 13919
1.91 12 6632 12365.8333 17509
1.92 2 15364 15418.0000 15472
1.93 6 13278 16828.1667 18306
1.94 3 9344 14189.3333 18735
1.95 3 5045 12340.6667 17374
1.96 4 5554 8642.2500 13099
1.97 4 14410 15966.0000 17535
1.98 5 12923 14667.8000 16171
1.99 3 11486 14179.0000 17713
2.00 261 5051 14107.1456 18818
2.01 433 5696 14772.7136 18741
2.02 176 6346 14767.4091 18731
2.03 122 6002 14710.1148 18781
2.04 86 7403 14840.1744 18795
2.05 65 7026 15647.2462 18787
2.06 59 5203 14348.6780 18779
2.07 50 6503 15384.2800 18804
2.08 41 6150 15128.6829 18760
2.09 45 10042 15287.3111 18640
2.10 51 6597 14464.3725 18648
2.11 43 7019 14563.4186 18575
2.12 25 7508 15007.6800 18656
2.13 21 12356 15220.4286 18442
2.14 48 5405 14905.0417 18528
2.15 22 5430 14187.1818 18791
2.16 25 8709 15860.4800 18678
2.17 18 6817 14416.6111 17805
2.18 30 10340 14901.7000 18717
2.19 22 10179 15521.0909 18693
2.20 31 7934 15226.8710 18364
2.21 23 6535 15287.9565 18483
2.22 27 5607 14691.5185 18706
2.23 13 7006 13947.4615 18149
2.24 16 10913 15573.9375 18299
2.25 16 6653 13403.5625 18034
2.26 15 11226 15330.2000 18426
2.27 12 5733 14198.4167 18343
2.28 20 9068 15422.9500 18055
2.29 17 11502 15834.7647 18823
2.30 20 7226 14861.5500 18304
2.31 13 7257 15017.3846 17715
2.32 16 6860 16042.1875 18532
2.33 9 8220 14089.7778 18119
2.34 5 8491 11922.0000 15818
2.35 7 14185 16090.4286 17999
2.36 8 11238 15756.7500 18745
2.37 6 14837 16525.1667 18508
2.38 8 10716 15481.3750 18559
2.39 7 15917 17182.4286 17955
2.40 13 11988 16362.4615 18541
2.41 5 13563 15808.2000 17923
2.42 7 16198 16923.5714 18615
2.43 6 9716 15356.0000 18692
2.44 4 13027 16365.5000 18430
2.45 4 11830 14593.7500 18113
2.46 3 10470 14227.0000 16466
2.47 3 14970 15644.0000 16532
2.48 9 12883 16537.1111 18692
2.49 3 6289 13843.0000 18325
2.50 16 7854 14316.1875 18447
2.51 17 14209 16616.8824 18308
2.52 9 10076 16242.1111 18252
2.53 8 14659 16548.8750 18254
2.54 8 12095 15125.2500 17996
2.55 3 14351 15964.0000 18766
2.56 3 15231 16723.3333 17753
2.57 3 17116 17841.6667 18485
2.58 3 12500 14481.3333 16195
2.59 1 16465 16465.0000 16465
2.60 3 17027 17535.0000 18369
2.61 3 13784 16583.0000 18756
2.63 3 10437 12659.6667 16914
2.64 1 17407 17407.0000 17407
2.65 1 16314 16314.0000 16314
2.66 2 16239 17367.0000 18495
2.67 1 18686 18686.0000 18686
2.68 2 8419 9042.0000 9665
2.70 1 14341 14341.0000 14341
2.71 1 17146 17146.0000 17146
2.72 3 6870 12088.3333 17801
2.74 3 8807 14385.0000 17184
2.75 2 13156 14285.5000 15415
2.77 1 10424 10424.0000 10424
2.80 1 15030 15030.0000 15030
3.00 8 6512 12378.5000 16970
3.01 12 8040 15521.6667 18710
3.02 1 10577 10577.0000 10577
3.04 2 15354 16956.5000 18559
3.05 1 10453 10453.0000 10453
3.11 1 9823 9823.0000 9823
3.22 1 12545 12545.0000 12545
3.24 1 12300 12300.0000 12300
3.40 1 15964 15964.0000 15964
3.50 1 12587 12587.0000 12587
3.51 1 18701 18701.0000 18701
3.65 1 11668 11668.0000 11668
3.67 1 16193 16193.0000 16193
4.00 1 15984 15984.0000 15984
4.01 2 15223 15223.0000 15223
4.13 1 17329 17329.0000 17329
4.50 1 18531 18531.0000 18531
5.01 1 18018 18018.0000 18018

Up until now, the most important features of the diamonds have been presented as count analyses. From this point on, the relationships between the variables and price are illustrated.

3.5 Price and Cut Analysis by Color

diamonds %>%
  group_by(cut, color) %>%
  summarise(avg_price = mean(price)) %>%
  ggplot(aes(x=cut, y= avg_price, fill = cut)) +
    geom_col() +
    facet_wrap(~color) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90), legend.position = "none") +
    labs(title = "Relationship Analysis Between Price and Cut by Color",
         subtitle = st, 
         y = "Average Price",
         x = "Quality of Cut")

When we look at the plots, we can see that these two variables alone cannot explain the price variable: within the best cut grades, the worst colors have the highest average prices.

3.6 Price and Clarity Analysis by Color

diamonds %>%
  group_by(clarity, color) %>%
  summarise(MeanPrice = mean(price)) %>%
  ggplot(aes(x=clarity, y = MeanPrice, fill = color)) +
    geom_col(alpha = 0.8) +
    theme_minimal() +
    facet_wrap(~color) +
    labs(title = "Relationship Analysis Between Price and Clarity by Color",
         subtitle = st,
         x = "Clarity",
         y = "Average Price",
         fill = "Color")

The best combination of these two variables has the highest average price among all groups. For the other combinations, however, average price decreases as clarity increases, for every color. So we still cannot explain price with these two variables.

3.7 Price and Clarity Analysis by Cut

diamonds %>%
  group_by(clarity, cut) %>%
  summarise(MeanPrice = mean(price)) %>%
  ggplot(aes(x=clarity, y = MeanPrice, fill = cut)) +
    geom_col(alpha = 0.8) +
    theme_minimal() +
    facet_wrap(~cut) +
    labs(title = "Relationship Analysis Between Price and Clarity by Cut",
         subtitle = st,
         x = "Clarity",
         y = "Average Price",
         fill = "Cut")

We get the same result as in the previous plot: SI2, the second worst clarity grade, has the highest average price in every cut type. So we cannot explain the price variable with these two variables either.

3.8 Price Group Analysis

quant = quantile(diamonds$price, seq(0, 1, 0.2))

diamonds_price_group <- diamonds %>%
  mutate(price_group = case_when(
    price < quant[2] ~ "Very Low",
    price < quant[3] ~ "Low",
    price < quant[4] ~ "Medium",
    price < quant[5] ~ "High",
    TRUE ~ "Very High"
  )) %>%
  mutate(price_group = factor(price_group, levels = c("Very Low", "Low", "Medium", "High", "Very High")))

diamonds_price_group %>%
  group_by(cut, price_group) %>%
  summarise(count=n()) %>%
  mutate(percentage = 100 * count / sum(count)) %>%
  ggplot(., aes(x = '', y = count, fill = price_group)) + 
  geom_bar(width = 1, stat = "identity", position = "fill") +
  coord_polar("y") +
  theme_void() +
  theme(plot.title = element_text(vjust = 0.5)) +
  facet_wrap(~cut) +
  labs(title = "Price Group Analyses of Cut",
       subtitle = st,
       fill = "Price Group")

The price groups are roughly equally distributed across all cut types except Fair. Fair cuts make up only about 3 percent of the data, so some clumping in its price groups is possible. The same results are shown in the table below.
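The quantile-based bucketing used for the price groups can be reproduced with base R's `cut()` (a toy sketch; the vector and labels here are illustrative, and `right = FALSE` mirrors the `price < quant[k]` comparisons in the `case_when()` above):

```r
# Toy sketch of the quintile bucketing: cut() with quantile breaks
# reproduces the case_when() logic (right = FALSE mirrors "<")
prices <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000)
quant  <- quantile(prices, seq(0, 1, 0.2))
labels <- c("Very Low", "Low", "Medium", "High", "Very High")
groups <- cut(prices, breaks = quant, labels = labels,
              right = FALSE, include.lowest = TRUE)
table(groups)  # two observations per group for this toy vector
```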

price_group_vs_cut = diamonds_price_group %>%
  group_by(cut, price_group) %>%
  summarise(count=n()) %>%
  mutate(percentage = 100 * count / sum(count)) %>%
  select(cut, price_group, count, percentage)%>%
  pivot_wider(id_cols = cut, names_from = price_group, values_from = count)
cut_percentage = diamonds_price_group %>%
  group_by(cut) %>%
  summarise(count=n()) %>%
  mutate(percentage = round(100 * count / sum(count),2)) %>%
  select(cut,percentage)
price_group_vs_cut %>%
  inner_join(cut_percentage, by = "cut") %>%
  kable(col.names = c("Cut", "Very Low","Low", "Medium", "High", "Very High", "Percentage"))%>% 
  kable_minimal(full_width = F) 
Cut Very Low Low Medium High Very High Percentage
Fair 68 221 546 463 296 2.96
Good 953 692 1032 1363 848 9.09
Very Good 2735 1782 2448 2697 2403 22.44
Premium 2246 2568 2297 3200 3427 25.55
Ideal 4745 5491 4435 3024 3785 39.95

3.9 Price Histogram

diamonds %>%
  ggplot(aes(x=price)) +
  geom_histogram(aes(y=..density..), position="identity", alpha=0.8, fill = "seashell3", color="seashell4") +
  geom_density(alpha=1, size = 1)+
  theme_minimal() +
  labs(title = "Histogram of Price",
       subtitle = st,
       x = "Price",
       y = "Density")

The histogram shows that the price variable in the diamonds dataset is right skewed.

diamonds %>%
  ggplot(aes(x=log(price))) +
  geom_histogram(aes(y=..density..), position="identity", alpha=0.8, fill = "seashell3", color="seashell4") +
  geom_density(alpha=1, size = 1)+
  theme_minimal() +
  labs(title = "Histogram of Log Price",
       subtitle = st,
       x = "Log of Price",
       y = "Density")

When we plot the log transformation of the prices, we still do not get the shape of a normal distribution. So plain linear models are not a good fit for this data; instead, we can try GLMs, decision trees, etc.
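The skewness claim can also be checked numerically (a sketch using a simple moment-based skewness function on simulated right-skewed data, standing in for the actual prices):

```r
# Moment-based sample skewness: positive = right-skewed
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(503)
x <- rexp(10000)   # toy right-skewed data standing in for price
skewness(x)        # clearly positive (theoretical value is 2)
skewness(log(x))   # the log transform can overshoot into left skew
```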

4 Feature Extraction / Dimensionality Reduction

First, we need to divide the data into train and test sets.

set.seed(503)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>% 
    group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()

diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()), 
    diamonds_test, by = "diamond_id")

diamonds_train = diamonds_train[, c(-ncol(diamonds_train))]
diamonds_test = diamonds_test[, c(-ncol(diamonds_test))]

We will treat all the ordered factor columns (cut, color, clarity) as numeric values.

diamonds_train$cut = as.numeric(diamonds_train$cut)
diamonds_train$clarity = as.numeric(diamonds_train$clarity)
diamonds_train$color = as.numeric(diamonds_train$color)
diamonds_test$cut = as.numeric(diamonds_test$cut)
diamonds_test$clarity = as.numeric(diamonds_test$clarity)
diamonds_test$color = as.numeric(diamonds_test$color)
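Note that `as.numeric()` on an ordered factor returns the underlying level index, so the quality ordering survives the conversion — a quick sketch:

```r
# as.numeric() on an ordered factor yields the level index
# (1 = Fair, the worst cut, up to 5 = Ideal, the best)
cut_example <- factor(c("Ideal", "Fair", "Good"),
                      levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                      ordered = TRUE)
as.numeric(cut_example)  # 5 1 2
```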

4.1 PCA

There are several numerical variables, so we can apply PCA to reduce the number of columns in this dataset.

diamonds_pca <- princomp(as.matrix(diamonds_train[,sapply(diamonds_train, class) == "numeric"]),cor=T)
summary(diamonds_pca,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     2.0743448 1.1892870 1.1060252 0.9927188 0.83129152
## Proportion of Variance 0.4781007 0.1571560 0.1359213 0.1094989 0.07678284
## Cumulative Proportion  0.4781007 0.6352566 0.7711779 0.8806769 0.95745974
##                            Comp.6      Comp.7       Comp.8       Comp.9
## Standard deviation     0.59265505 0.172985644 0.0353090908 2.125041e-02
## Proportion of Variance 0.03902667 0.003324893 0.0001385258 5.017554e-05
## Cumulative Proportion  0.99648641 0.999811299 0.9999498245 1.000000e+00
## 
## Loadings:
##         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## carat    0.471                       0.126         0.868              
## cut     -0.109  0.596  0.451  0.239 -0.175  0.585                     
## color   -0.158 -0.203         0.802  0.537                            
## clarity -0.222  0.208  0.212 -0.506  0.778                            
## depth           0.207 -0.828         0.118  0.492                0.101
## table    0.129 -0.697  0.226 -0.183         0.642                     
## x        0.475                       0.115        -0.277 -0.722  0.387
## y        0.474                       0.120        -0.299  0.692  0.420
## z        0.474  0.103                0.132        -0.282        -0.815

From the summary of the PCA model we created, we can see that the first four components explain almost 88% of the total variance, so we can choose four components for modeling. Now we need to add these components to both the train and test datasets.
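The "four components reach about 88%" selection can be computed directly from the component standard deviations (a sketch on toy data; in the report `princomp()` runs on the nine numeric training columns):

```r
# Choosing the number of components from cumulative explained variance
set.seed(503)
toy <- matrix(rnorm(100 * 4), ncol = 4)
toy[, 2] <- toy[, 1] + rnorm(100, sd = 0.1)   # introduce correlation
pca <- princomp(toy, cor = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(cum_var >= 0.88)[1]           # smallest count reaching 88%
```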

pca_results = predict(diamonds_pca, newdata = diamonds_train)
diamonds_train$pca1 = pca_results[,1]
diamonds_train$pca2 = pca_results[,2]
diamonds_train$pca3 = pca_results[,3]
diamonds_train$pca4 = pca_results[,4]

pca_results_test = predict(diamonds_pca, newdata = diamonds_test)
diamonds_test$pca1 = pca_results_test[,1]
diamonds_test$pca2 = pca_results_test[,2]
diamonds_test$pca3 = pca_results_test[,3]
diamonds_test$pca4 = pca_results_test[,4]

4.2 Clustering

We can create a feature using clustering methods; here we use the k-means algorithm. Before running k-means, we need to scale the data, because a column with a much larger range than the others would dominate the distance calculation. We take the minimum and maximum values from the train dataset and scale both the train and test sets with these same values.
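The min–max scaling described above, applied to a test column with the training minimum and maximum, looks like this in miniature (a toy sketch; note that test values may fall outside [0, 1]):

```r
# Scale with the TRAIN min/max, then reuse the same values on the test set
train_col <- c(2, 4, 6, 8, 10)
test_col  <- c(1, 5, 11)
mn <- min(train_col); mx <- max(train_col)
scaled_train <- (train_col - mn) / (mx - mn)
scaled_test  <- (test_col  - mn) / (mx - mn)
scaled_test  # first value < 0 and last > 1: outside the train range
```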

min_vals = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], min)
max_vals = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], max)
diamonds_train_scale <- sweep(sweep(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, min_vals), 2, max_vals - min_vals, "/")

To get the same clusters every time, we use a fixed seed. To select the number of clusters, we loop over 1 to 15 for the centers argument and choose a value based on the error plot.

errors = c()
for (i in (1:15)){
  set.seed(1234) #provide getting same results with random function 
  cluster<-kmeans(diamonds_train_scale,centers=i) # k-means with i clusters
  errors = c(errors, 100*round(1 - (cluster$tot.withinss/cluster$totss), digits = 3)) # percentage of total variance explained by the clustering
}

errors_df <- data.frame(x=c(1:15), y=errors) # creating data frame with errors.

ggplot(errors_df, aes(x=x, y=y)) +
  geom_point(color = "seashell4") +
  geom_line(color="seashell4") +
  geom_text(aes(label = errors), size=3, color = "black", position = position_stack(vjust = 0.95))+
  theme_minimal() +
  labs(title = "Classification of Observations",
       subtitle = st,
       x = "Number of Clusters",
       y = "Explained Variance (%)")

The decrease in error slows noticeably after 7 centers, so we select the model with centers equal to 7.
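The "decrease slows down" rule can be made concrete by looking at successive gains in explained variance and stopping at the first small one (a sketch with made-up percentages; the 5-point threshold is our assumption, not from the analysis):

```r
# Elbow rule: pick the first k whose marginal gain drops below a threshold
explained <- c(40, 62, 74, 81, 87, 92.5, 94.5, 95.5, 96.2)  # k = 1..9 (toy)
gains  <- diff(explained)          # improvement from k to k + 1
k_best <- which(gains < 5)[1] + 1  # first k where the gain is under 5 points
k_best
```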

set.seed(1234)
best_cluster = kmeans(diamonds_train_scale,centers=7)
diamonds_train$cluster = as.factor(best_cluster$cluster)

Now, we need to apply the clustering to the test dataset. To do this, I used the method in this link.

diamonds_test_scale <- sweep(sweep(diamonds_test[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, min_vals), 2, max_vals - min_vals, "/")

closest.cluster <- function(x) {
  cluster.dist <- apply(best_cluster$centers, 1, function(y) sqrt(sum((x-y)^2)))
  return(which.min(cluster.dist)[1])
}
diamonds_test$cluster <- as.factor(apply(diamonds_test_scale, 1, closest.cluster))

After these processes, we have enough data to create a model.

5 Create Model

First, we need to check the correlation of the features with the price column.

diamonds_cor<-cor(diamonds_train[-c(11:15)])
corrplot(diamonds_cor, method="number")

From the plot above, we can say that the carat, x, y and z columns are highly correlated with the price column, so these columns should be used in any model. Still, we will start by including all columns in the model.

5.1 Linear Model (LM)

For simplicity, we can start with linear regression models. At the beginning, we will not consider the pca columns in the dataset.

model_lm1 = lm(price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train)
summary(model_lm1)
## 
## Call:
## lm(formula = price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23380.8   -624.4   -111.3    487.6  10016.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -23516.947   1284.448 -18.309  < 2e-16 ***
## carat        11553.086     65.487 176.419  < 2e-16 ***
## cut            168.773     12.865  13.119  < 2e-16 ***
## color          283.070      5.670  49.920  < 2e-16 ***
## clarity        501.646      4.475 112.094  < 2e-16 ***
## depth          358.929     19.878  18.056  < 2e-16 ***
## table          -16.708      3.386  -4.934 8.07e-07 ***
## x             -347.272    147.355  -2.357   0.0184 *  
## y             3718.053    151.287  24.576  < 2e-16 ***
## z            -7446.634    320.787 -23.214  < 2e-16 ***
## cluster2       -48.540     25.392  -1.912   0.0559 .  
## cluster3        76.119     42.557   1.789   0.0737 .  
## cluster4       -42.650     35.186  -1.212   0.2255    
## cluster5       250.682     35.456   7.070 1.57e-12 ***
## cluster6       234.953     35.374   6.642 3.13e-11 ***
## cluster7       440.818     26.276  16.776  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1190 on 42987 degrees of freedom
## Multiple R-squared:  0.9106, Adjusted R-squared:  0.9106 
## F-statistic: 2.92e+04 on 15 and 42987 DF,  p-value: < 2.2e-16

We created a linear model. For now, we will not try to improve it; instead, we check the linear model assumptions. To do so, we plot the residuals.

plot(model_lm1,1)

plot(model_lm1,2)

In the first plot, we can see that the residual variance is not constant: it grows with the fitted values. This violates the homoscedasticity assumption, so the dataset is not well suited to plain linear regression. For better models, we can use generalized linear models (GLMs) instead.

5.2 Generalized Linear Model (GLM)

This data has a continuous, positive response variable, which means we can use the Gamma and Gaussian families.

For easy implementation, we can use the code chunks that we learned in IE508.

for(link in c("identity", "log", "inverse", "sqrt")){
  fitgamma = glm(price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train, family = Gamma(link = link), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
  print(AIC(fitgamma))
}
## [1] 673749.4
## [1] 1096312
## [1] 711941.7
## [1] 644408.7
for(link in c("identity", "log", "inverse")){
  fitgaussian = glm(price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train, family = gaussian(link = link), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
  print(AIC(fitgaussian))
}
## [1] 731124.9
## [1] 5602415
## [1] 864186

When we compare the AIC values of the glm models, we see that the best model uses the Gamma family with the sqrt link function, so we will continue with these arguments for glm.

model_glm = glm(price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train, family = Gamma(link = "sqrt"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm)
## 
## Call:
## glm(formula = price ~ . - pca1 - pca2 - pca3 - pca4, family = Gamma(link = "sqrt"), 
##     data = diamonds_train, start = c(0.5, 0.5, 0.5, 0.5, 0.5, 
##         0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.27202  -0.12157  -0.02088   0.09054   2.29577  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -75.03830    4.80955 -15.602  < 2e-16 ***
## carat        52.76661    0.34888 151.246  < 2e-16 ***
## cut           0.53402    0.04191  12.743  < 2e-16 ***
## color         1.19880    0.01677  71.493  < 2e-16 ***
## clarity       2.01002    0.01348 149.150  < 2e-16 ***
## depth         0.72617    0.07474   9.716  < 2e-16 ***
## table         0.01401    0.01082   1.295 0.195384    
## x            16.04449    0.61542  26.071  < 2e-16 ***
## y            -3.56428    0.62223  -5.728 1.02e-08 ***
## z           -12.25907    1.42896  -8.579  < 2e-16 ***
## cluster2     -0.12907    0.06921  -1.865 0.062195 .  
## cluster3     -1.58676    0.14697 -10.797  < 2e-16 ***
## cluster4     -2.87269    0.13135 -21.870  < 2e-16 ***
## cluster5     -0.10937    0.10869  -1.006 0.314332    
## cluster6      0.35828    0.10104   3.546 0.000392 ***
## cluster7      1.38220    0.10099  13.686  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.03404458)
## 
##     Null deviance: 42126.8  on 43002  degrees of freedom
## Residual deviance:  1393.5  on 42987  degrees of freedom
## AIC: 644409
## 
## Number of Fisher Scoring iterations: 8

The table variable is not significant in this model, so we can remove it from the model.

model_glm2 = glm(price ~ . - pca1 - pca2 - pca3 - pca4 - table, data = diamonds_train, family = Gamma(link = "sqrt"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm2)
## 
## Call:
## glm(formula = price ~ . - pca1 - pca2 - pca3 - pca4 - table, 
##     family = Gamma(link = "sqrt"), data = diamonds_train, start = c(0.5, 
##         0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 
##         0.5, 0.5, 0.5))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.27264  -0.12163  -0.02084   0.09062   2.28622  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -73.19219    4.56920 -16.019  < 2e-16 ***
## carat        52.81087    0.34721 152.102  < 2e-16 ***
## cut           0.50918    0.03704  13.748  < 2e-16 ***
## color         1.19915    0.01676  71.540  < 2e-16 ***
## clarity       2.00950    0.01347 149.222  < 2e-16 ***
## depth         0.71171    0.07371   9.655  < 2e-16 ***
## x            16.05383    0.61528  26.092  < 2e-16 ***
## y            -3.63697    0.61902  -5.875 4.25e-09 ***
## z           -12.17752    1.42666  -8.536  < 2e-16 ***
## cluster2     -0.12735    0.06919  -1.841 0.065677 .  
## cluster3     -1.61398    0.14540 -11.101  < 2e-16 ***
## cluster4     -2.87373    0.13132 -21.884  < 2e-16 ***
## cluster5     -0.14129    0.10604  -1.332 0.182730    
## cluster6      0.32846    0.09832   3.341 0.000836 ***
## cluster7      1.38189    0.10096  13.687  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.0340283)
## 
##     Null deviance: 42126.8  on 43002  degrees of freedom
## Residual deviance:  1393.6  on 42988  degrees of freedom
## AIC: 644408
## 
## Number of Fisher Scoring iterations: 8

We can also test removing the cluster column.

model_glm3 = glm(price ~ . - pca1 - pca2 - pca3 - pca4 - table - cluster, data = diamonds_train, family = Gamma(link = "sqrt"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm3)
## 
## Call:
## glm(formula = price ~ . - pca1 - pca2 - pca3 - pca4 - table - 
##     cluster, family = Gamma(link = "sqrt"), data = diamonds_train, 
##     start = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.25890  -0.12326  -0.02196   0.08701   2.25745  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -86.25420    4.62895 -18.634  < 2e-16 ***
## carat        50.94390    0.34313 148.469  < 2e-16 ***
## cut           0.47216    0.01724  27.390  < 2e-16 ***
## color         1.33121    0.01096 121.467  < 2e-16 ***
## clarity       2.04246    0.01187 172.057  < 2e-16 ***
## depth         0.89650    0.07468  12.005  < 2e-16 ***
## x            17.25907    0.60617  28.472  < 2e-16 ***
## y            -2.30449    0.60142  -3.832 0.000127 ***
## z           -15.66734    1.44552 -10.839  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.03557524)
## 
##     Null deviance: 42126.8  on 43002  degrees of freedom
## Residual deviance:  1441.6  on 42994  degrees of freedom
## AIC: 645863
## 
## Number of Fisher Scoring iterations: 9

Removing cluster increases the AIC, so we can say that the cluster feature makes our model better. Now, we need to check whether we can replace the numeric variables with the PCA components.

model_glm4 = glm(price ~ pca1 + pca2 + pca3 + pca4 + cluster, data = diamonds_train, family = Gamma(link = "sqrt"))
summary(model_glm4)
## 
## Call:
## glm(formula = price ~ pca1 + pca2 + pca3 + pca4 + cluster, family = Gamma(link = "sqrt"), 
##     data = diamonds_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.19335  -0.26935  -0.06091   0.16983   1.92149  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.58457    0.12355 457.991  < 2e-16 ***
## pca1        11.83082    0.04129 286.500  < 2e-16 ***
## pca2         4.07612    0.04255  95.803  < 2e-16 ***
## pca3         3.14036    0.04449  70.588  < 2e-16 ***
## pca4        -0.30761    0.06086  -5.054 4.34e-07 ***
## cluster2    -3.32685    0.14251 -23.345  < 2e-16 ***
## cluster3    -2.57743    0.27792  -9.274  < 2e-16 ***
## cluster4    -7.05281    0.27171 -25.957  < 2e-16 ***
## cluster5     2.92040    0.17373  16.810  < 2e-16 ***
## cluster6     3.86938    0.15565  24.859  < 2e-16 ***
## cluster7     1.72285    0.21070   8.177 2.99e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.1343288)
## 
##     Null deviance: 42126.8  on 43002  degrees of freedom
## Residual deviance:  5343.6  on 42992  degrees of freedom
## AIC: 702856
## 
## Number of Fisher Scoring iterations: 12

PCA lets us use fewer features in a model but, as expected, with fewer features we get a higher AIC. Since we have only 13 numeric features, we do not need to reduce their number.

To compare the glm model with other models, we need to predict on the test data and calculate the R squared value.
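The out-of-sample R squared used to compare the models can be wrapped in a small helper (a sketch; the helper name `r2` is ours, the report computes it inline):

```r
# Out-of-sample R^2: 1 - SSE / SST, with SST taken against the TRAIN mean
r2 <- function(pred, actual, train_mean) {
  1 - sum((pred - actual)^2) / sum((actual - train_mean)^2)
}

actual <- c(10, 20, 30)
r2(actual, actual, mean(actual))                # perfect prediction -> 1
r2(rep(mean(actual), 3), actual, mean(actual))  # predicting the mean -> 0
```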

model_glm2_prediction<-predict(model_glm2,newdata=diamonds_test)

model_glm2_r2 <- 1 - (sum((model_glm2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_glm2_r2
## [1] -0.9323168

The result is negative, which means the model performs worse than simply predicting the mean. We can change the family/link function of the glm to get a better model. The second best model had the Gamma family with the identity link function, so we create the same model with that family and link.
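One detail worth knowing here: `predict()` on a glm returns values on the link scale by default, so with the sqrt link the predictions above were square roots of prices; `type = "response"` applies the inverse link, and with the identity link the two scales coincide. A toy illustration (simulated data, not the diamonds):

```r
# predict() on a glm defaults to the link scale; type = "response"
# applies the inverse link (for the sqrt link: squaring)
set.seed(1)
x <- runif(200, 1, 3)
y <- rgamma(200, shape = 20, rate = 20 / (5 + 4 * x)^2)  # mean = (5 + 4x)^2
fit <- glm(y ~ x, family = Gamma(link = "sqrt"))
p_link <- predict(fit)                     # sqrt-of-response scale
p_resp <- predict(fit, type = "response")  # response scale
max(abs(p_resp - p_link^2))                # ~0: response = link squared
```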

model_glm5 = glm(price ~ . - pca1 - pca2 - pca3 - pca4 - table, data = diamonds_train, family = Gamma(link = "identity"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))

model_glm5_prediction<-predict(model_glm5,newdata=diamonds_test)

model_glm5_r2 <- 1 - (sum((model_glm5_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_glm5_r2
## [1] 0.8122059

This is a far better model. We can also plot the real values against the predictions.

pred_vs_real_glm = as.data.frame(cbind(model_glm5_prediction, diamonds_test$price))
colnames(pred_vs_real_glm) = c("predictions", "real")

pred_vs_real_glm %>%
  ggplot(aes(predictions, real)) +
  geom_point()  +
  geom_abline(intercept = 0, slope = 1, color="red", size=1.5) +
  labs(title = "Prediction vs Real Values For Best GLM Model",
       subtitle = st,
       x = "Predictions",
       y = "Real Values")

5.3 CART

model_rpart <- rpart(price~., data=diamonds_train)
prp(model_rpart)

From the tree plot, we can see that the nodes are split using only the carat, y and clarity columns, meaning that with the default arguments these are the best features for reducing variance in the dataset. We can cross-validate the cp argument; to do so, we apply the following code chunk (these lines are adapted from the Analytics Edge course):

# Number of folds
tr.control = trainControl(method = "cv", number = 10)

# cp values
cp.grid = expand.grid( .cp = (1:10)*0.001)

# Cross-validation
tr = train(price~. - pca1 - pca2 - pca3 - pca4, data = diamonds_train, method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
tr
## CART 
## 
## 43003 samples
##    14 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 38703, 38704, 38702, 38701, 38703, 38702, ... 
## Resampling results across tuning parameters:
## 
##   cp     RMSE       Rsquared   MAE     
##   0.001   944.8551  0.9436613  577.6079
##   0.002  1037.4473  0.9320865  628.6129
##   0.003  1079.4290  0.9264358  653.3161
##   0.004  1143.8428  0.9174002  689.0738
##   0.005  1146.8377  0.9169674  690.2158
##   0.006  1241.1488  0.9026995  809.9428
##   0.007  1277.7140  0.8968892  845.0753
##   0.008  1277.7140  0.8968892  845.0753
##   0.009  1277.7140  0.8968892  845.0753
##   0.010  1334.7119  0.8871967  870.7893
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.001.

Comparing the cp values, cp = 0.001 gives the lowest RMSE, so we use it.

model_rpart2 <- rpart(price~. - pca1 - pca2 - pca3 - pca4, data=diamonds_train, cp = 0.001)
prp(model_rpart2)

This is a more detailed tree. From its plot, we can say that price is predicted using the carat, y, clarity and color values. To compare the two trees, we first calculate predictions on the test data and then compute the R squared values.

model_rpart_prediction<-predict(model_rpart,newdata=diamonds_test)

model_rpart_r2 <- 1 - (sum((model_rpart_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_rpart_r2
## [1] 0.881602
model_rpart2_prediction<-predict(model_rpart2,newdata=diamonds_test)

model_rpart2_r2 <- 1 - (sum((model_rpart2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_rpart2_r2
## [1] 0.9456136

As we can see from the results, the second model is better than the first.

If we want to compare the best tree model with best glm model, we need to compare the R squared values for these models.

model_rpart2_r2 > model_glm5_r2
## [1] TRUE

So, our CART model is better than the glm model.

6 Conclusion

We prepared the data and checked its quality before building models. We then applied the PCA and k-means algorithms. As expected, the columns created by PCA gave worse results; PCA is better suited to data with many features. With k-means, after scaling the numerical columns (so that no single column dominates the distances), we created a cluster feature that improved every model. After building the best linear, generalized linear and decision tree models, the decision tree model performed better than the others.