1 Diamonds

Diamonds are one of the most expensive gemstones in the world. Diamonds are often said to be rare, and according to this article, the rarer the stone, the higher its price.

According to this site, the price of a diamond is related to its shape and the 4Cs.

  • Shape: It refers to the outline or external figure of the diamond.
    • There are many shapes, such as Round, Princess, Emerald, and Cushion.
    • Even if all other properties are equal, the shape of a diamond changes its price.
    • In our diamonds dataset, there is no shape column, so we will assume that all diamonds have the same shape.

Figure 1: Types of Shapes

  • Cut: It is how well a diamond is cut and polished, including how well-proportioned the stone is, its depth and symmetry.
    • A well cut diamond is luminous and reflects white and colored light back to your eyes.
    • A poorly cut diamond is dull instead of brilliant.
    • Cut types are explained below:
      • Excellent: Excellent Cut Diamonds provide the highest level of fire and brilliance. Because almost all of the incoming light is reflected through the table, the diamond radiates with magnificent sparkle.
      • Very Good: Very Good Cut Diamonds offer exceptional brilliance and fire. A large majority of the entering light reflects through the diamond’s table. To the naked eye, Very Good diamonds provide similar sparkle to those of Excellent grade.
      • Good: Good Cut Diamonds showcase brilliance and sparkle, with much of the light reflecting through the table to the viewer’s eye. These diamonds provide beauty at a lower price point.
      • Fair: Fair Cut Diamonds offer little brilliance, as light easily exits through the bottom and sides of the diamond. Diamonds of a Fair Cut may be a satisfactory choice for smaller carats and those acting as side stones.
      • Poor: Poor Cut Diamonds yield nearly no sparkle, brilliance or fire. Entering light escapes from the sides and bottom of the diamond.

In this link, you can find some diamond shapes, followed by a price table at the end of the shape descriptions. For any diamond, a better cut is more expensive than a worse one if all other properties are equal.

Figure 2: Looks of Cuts

  • Color: A diamond’s color refers to how clear or yellow it is.
    • Diamond color is measured using the Gemological Institute of America (GIA) color scale, which goes from D (colorless) all the way to Z (light yellow or brown).
    • In general, the highest quality diamonds are totally colorless, whereas lower quality diamonds can often have a slight yellow tint.
    • Diamond colors are graded from D to Z, with most diamonds used in jewelry falling into the D to M range; colors after M are not very popular.
    • The difference between adjacent color grades (for example, a G color diamond and an H color diamond) is very small, largely impossible to perceive with the naked eye. Compared with a diamond three or four grades above or below, however, the difference is easily seen.

Figure 3. Color Scales of Diamonds

  • Clarity: Diamond clarity is a qualitative metric that grades the visual appearance of each diamond.
    • The fewer inclusions and blemishes the diamond has, the better the clarity grade.
    • Five factors play a significant role in how the clarity grades are determined: the size, nature, number, location, and relief of the inclusions.

Figure 4. Classification of Diamonds by Clarity

  • Carat: The carat is the weight of the diamond.
    • One carat is equal to 200 milligrams.
    • It is said that the more carats a diamond has, the more expensive it is. Even though you see only the top of a diamond on your ring, the carat still affects the price.
    • Carat correlates with the diameter of a diamond: it is mathematically impossible for a 0.05 ct diamond to have more surface area than a 1 ct diamond.
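Since the rest of this report uses R, the weight definition above can be sketched as a one-line helper (carat_to_grams is our own illustrative name, not a standard function):

```r
#one carat equals 200 milligrams, i.e. 0.2 grams
carat_to_grams <- function(carat) carat * 0.2

carat_to_grams(c(0.3, 1, 5.01))  # 0.060 0.200 1.002 grams
```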

2 Data and Required Packages

2.1 Data Information

In this assignment, the objectives can be listed as follows:

  1. To provide useful exploratory data analysis (EDA)
  2. To create a model for estimation of the diamond price

To fulfill these objectives, (i) the data is analyzed and prepared for the analysis, and (ii) a meaningful EDA is presented using packages such as ggplot2, dplyr, and data.table.

#install any required packages that are not yet installed
pti <- c("data.table", "tidyverse", "kableExtra", "knitr", "RColorBrewer", "caret", "e1071", "rpart", "rpart.plot", "rattle", "corrplot")
pti <- pti[!(pti %in% rownames(installed.packages()))]
if(length(pti) > 0){
    install.packages(pti)
}

library(tidyverse)
library(data.table)
library(knitr)
library(kableExtra)
library(RColorBrewer)
library(caret)
library(e1071)
library(rpart)
library(rpart.plot)
library(rattle)
library(corrplot)

The dataset used in this analysis comes from the ggplot2 package, so it is available as soon as the package is loaded. By using ?diamonds we can obtain information about this dataset. It contains 53940 rows and 10 variables. As mentioned before, there is no shape column in this dataset. Using the glimpse() function, we can list the variables.

#variables of the dataset
diamonds %>%
  glimpse()
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
  • Price: price in US dollars ($326–$18,823)
  • Carat: weight of the diamond (0.2–5.01)
  • Cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  • Color: diamond color, from D (best) to J (worst)
  • Clarity: a measurement of how clear the diamond is: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)
  • x: length in mm (0–10.74)
  • y: width in mm (0–58.9)
  • z: depth in mm (0–31.8)
  • Depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
  • Table: width of top of diamond relative to widest point (43–95)
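The depth formula above ties z to x and y. As a quick sanity check, we can recompute it for the first two rows printed by glimpse() (values hard-coded below) and compare with the recorded depth column:

```r
#recompute total depth percentage = 2 * z / (x + y) * 100 for the
#first two rows of the dataset and compare with the recorded depth
x <- c(3.95, 3.89); y <- c(3.98, 3.84); z <- c(2.43, 2.31)
depth_recorded <- c(61.5, 59.8)
depth_computed <- 2 * z / (x + y) * 100

round(depth_computed, 1)                         # close to the recorded values
all(abs(depth_computed - depth_recorded) < 0.5)  # TRUE, up to rounding
```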

2.2 Data Preprocessing

In the previous section, detailed information about the investigated dataset was presented. In this section, the data is analyzed and pre-processed before the detailed analyses. The first step is to check for null values.

#count NA values in the dataset
sum(is.na(diamonds))
## [1] 0

When we check the dataset, there are 0 NA values, which means there is no missing value in any row or column. The second step is to check for duplicated rows.

sum(duplicated(diamonds)) #count exactly duplicated rows
## [1] 146

The function above counts rows that are exact copies of another row; there are 146 duplicated rows. Before the analysis, we need to remove them from the data frame, which we can do with the unique() function.

#taking only unique rows from the data set
diamonds <- unique(diamonds)
#check for duplicated rows after removing them
sum(duplicated(diamonds))
## [1] 0

After using the unique() function there are no duplicated rows, and the data has 53794 rows and 10 columns. The third step is to check the types of the variables (numeric, factor, etc.), which we can do with the str() function.

str(diamonds)
## tibble [53,794 x 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53794] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53794] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53794] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53794] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53794] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53794] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53794] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

When we look at the results, we see that the color levels run from D (best) down to J (worst). We reverse the order so that the levels go from the worst color (J) up to the best (D).

diamonds$color = factor(diamonds$color, levels = c("J", "I", "H", "G", "F", "E", "D"))

The data is now ready for further checks, such as the accuracy of the values. Before that, we can use the summary() and head() functions to get more information about the dataset.

summary(diamonds) # summary of each variable in the dataset
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1598   J: 2802   SI1    :13032   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4891   I: 5407   VS2    :12229   1st Qu.:61.00  
##  Median :0.7000   Very Good:12069   H: 8272   SI2    : 9150   Median :61.80  
##  Mean   :0.7978   Premium  :13748   G:11262   VS1    : 8156   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21488   F: 9520   VVS2   : 5056   3rd Qu.:62.50  
##  Max.   :5.0100                     E: 9776   VVS1   : 3647   Max.   :79.00  
##                                     D: 6755   (Other): 2524                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  951   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5327   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.030  
##  Max.   :31.800  
## 
head(diamonds) #first 6 rows of the data
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

2.3 Check Accuracy of Values

We also check the data set for non-positive price values.

diamonds %>%
  filter(price <= 0) %>%
  summarise(NegativePrice = n())
## # A tibble: 1 x 1
##   NegativePrice
##           <int>
## 1             0

Logically, the price cannot be less than or equal to zero. For this reason, we check the dataset for negative or zero price values. According to the results, all price values are positive, so there is nothing to do for the price column.

Now we can search for non-positive x, y and z values.

diamonds %>%
  filter(x <= 0 & y > 0 & z > 0)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## #   depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
diamonds %>%
  filter(y <= 0 & x > 0 & z > 0)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## #   depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
diamonds %>%
  filter(z <= 0 & x > 0 & y > 0)
## # A tibble: 12 x 10
##    carat cut     color clarity depth table price     x     y     z
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  1    Premium G     SI2      59.1    59  3142  6.55  6.48     0
##  2  1.01 Premium H     I1       58.1    59  3167  6.66  6.6      0
##  3  1.1  Premium G     SI2      63      59  3696  6.5   6.47     0
##  4  1.01 Premium F     SI2      59.2    58  3837  6.5   6.47     0
##  5  1.5  Good    G     I1       64      61  4731  7.15  7.04     0
##  6  1.15 Ideal   G     VS2      59.2    56  5564  6.88  6.83     0
##  7  2.18 Premium H     SI2      59.4    61 12631  8.49  8.45     0
##  8  2.25 Premium I     SI1      61.3    58 15397  8.52  8.42     0
##  9  2.2  Premium H     SI1      61.2    59 17265  8.42  8.37     0
## 10  2.02 Premium H     VS2      62.7    53 18207  8.02  7.95     0
## 11  2.8  Good    G     SI2      63.8    58 18788  8.9   8.85     0
## 12  1.12 Premium G     I1       60.4    59  2383  6.71  6.67     0

All x and y values are positive, so for now we don't need to do anything there. For z, however, there are 12 rows with a 0 value. We can fill these values using the x, y and depth columns, because depth = 2 * z / (x + y) * 100, which gives z = depth * (x + y) / 2 / 100.

#impute z rowwise from depth and the mean of x and y: z = depth * (x + y) / 2 / 100
diamonds = diamonds %>%
  mutate(z = ifelse(z == 0 & x != 0 & y != 0, round(depth * (x + y) / 2 / 100, 2), z))

So far, we have searched for rows in which exactly one of x, y and z is non-positive while the other two are positive; any such row can be recalculated as above. Now we need to check for rows in which two of these three columns are non-positive.

diamonds %>%
  filter(x <= 0 & y <= 0)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1    Very Good H     VS2      63.3    53  5139     0     0     0
## 2  1.14 Fair      G     VS1      57.5    67  6381     0     0     0
## 3  1.56 Ideal     G     VS2      62.2    54 12800     0     0     0
## 4  1.2  Premium   D     VVS1     62.1    59 15686     0     0     0
## 5  2.25 Premium   H     SI2      62.8    59 18034     0     0     0
## 6  0.71 Good      F     SI2      64.1    60  2130     0     0     0
diamonds %>%
  filter(x <= 0 & z <= 0)
## # A tibble: 7 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1.07 Ideal     F     SI2      61.6    56  4954     0  6.62     0
## 2  1    Very Good H     VS2      63.3    53  5139     0  0        0
## 3  1.14 Fair      G     VS1      57.5    67  6381     0  0        0
## 4  1.56 Ideal     G     VS2      62.2    54 12800     0  0        0
## 5  1.2  Premium   D     VVS1     62.1    59 15686     0  0        0
## 6  2.25 Premium   H     SI2      62.8    59 18034     0  0        0
## 7  0.71 Good      F     SI2      64.1    60  2130     0  0        0
diamonds %>%
  filter(y <= 0 & z <= 0)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1    Very Good H     VS2      63.3    53  5139     0     0     0
## 2  1.14 Fair      G     VS1      57.5    67  6381     0     0     0
## 3  1.56 Ideal     G     VS2      62.2    54 12800     0     0     0
## 4  1.2  Premium   D     VVS1     62.1    59 15686     0     0     0
## 5  2.25 Premium   H     SI2      62.8    59 18034     0     0     0
## 6  0.71 Good      F     SI2      64.1    60  2130     0     0     0

Looking at the results, there are 7 rows in which both x and z are 0 (in six of them, y is 0 as well). Removing these rows gives us more accurate data, so we apply the following code chunk.

diamonds = diamonds %>%
  filter(!(x == 0 & z == 0))

diamonds %>%
  filter(x == 0 | y == 0 | z == 0)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## #   depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>

After this process, we have all positive values in x, y and z columns. Now, we need to check the range of x, y and z columns for outliers.

range(diamonds$x)
## [1]  3.73 10.74
diamonds$x %>% unique() %>% sort(decreasing = TRUE) %>% head(20)
##  [1] 10.74 10.23 10.14 10.02 10.01 10.00  9.86  9.66  9.65  9.54  9.53  9.51
## [13]  9.49  9.44  9.42  9.41  9.38  9.36  9.35  9.32

There is nothing wrong with the lengths of these diamonds.
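Instead of eyeballing the sorted values, a boxplot-style IQR rule is a common way to flag extreme values. A minimal base-R sketch on a handful of made-up measurements (not the actual column):

```r
#flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
v <- c(3.73, 4.70, 5.70, 6.50, 10.74, 58.90)  # invented example values
q <- quantile(v, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
outliers <- v[v < q[1] - fence | v > q[2] + fence]
outliers  # only the extreme 58.90 is flagged
```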

range(diamonds$y)
## [1]  3.68 58.90
diamonds$y %>% unique() %>% sort(decreasing = TRUE) %>% head(20)
##  [1] 58.90 31.80 10.54 10.16 10.10  9.94  9.85  9.81  9.63  9.59  9.48  9.46
## [13]  9.42  9.40  9.38  9.37  9.34  9.32  9.31  9.26

There are 2 values (58.90 and 31.80) that are very large compared to the other values. So, we need to either remove these rows from the data or recalculate the y values. To avoid shrinking the data, we will try to recalculate them with the help of the depth, x and z values.

diamonds %>%
  filter(y > 15) %>%
  mutate(new_y = 2 * z / depth * 100 - x) %>%
  select(depth, x, z, y, new_y)
## # A tibble: 2 x 5
##   depth     x     z     y new_y
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  58.9  8.09  8.06  58.9  19.3
## 2  61.8  5.15  5.12  31.8  11.4

These recalculated values are still not plausible widths (y should be close to the length x), so we remove these rows from the data.

diamonds = diamonds %>%
  filter(y < 15)
range(diamonds$z)
## [1]  1.07 31.80
diamonds$z %>% unique() %>% sort(decreasing = TRUE) %>% head(20)
##  [1] 31.80  6.98  6.72  6.43  6.38  6.31  6.27  6.24  6.17  6.16  6.13  6.03
## [13]  5.98  5.97  5.92  5.91  5.90  5.86  5.85  5.79

For the z column, there is one row with a very high value compared to the other rows. As before, we recalculate its z value from the depth formula.

diamonds %>%
  filter(z == 31.8) %>%
  mutate(new_z = depth * mean(c(5.12, 5.15)) / 100) %>%
  select(z, new_z)
## # A tibble: 1 x 2
##       z new_z
##   <dbl> <dbl>
## 1  31.8  3.17

The recalculated value suggests that the recorded z should be divided by 10. We can verify this by comparing it with the z values of other diamonds that have similar x and y values.

diamonds[abs(diamonds$x - 5) < 1 & abs(diamonds$y - 5) < 1 & abs(diamonds$z - 3.18) < 1, "z"]
## # A tibble: 30,800 x 1
##        z
##    <dbl>
##  1  2.31
##  2  2.63
##  3  2.75
##  4  2.53
##  5  2.73
##  6  2.71
##  7  2.68
##  8  2.68
##  9  2.7 
## 10  2.71
## # ... with 30,790 more rows

There are many z values near 3.18, so we can be confident that this value should be divided by 10, which we do below.

diamonds$z[diamonds$z == 31.8] = diamonds$z[diamonds$z == 31.8] / 10

Now, we need to check the depth column. To do so, we can calculate a new column called new_depth and compare it with the depth column.

diamonds$new_depth = 2 * diamonds$z / (diamonds$x + diamonds$y) * 100

The depth and new_depth columns should be equal. To see this easily, we can use a scatter plot with a reference line.

diamonds %>%
  ggplot(., aes(x = new_depth, y = depth)) +
  geom_point() + 
  geom_abline(intercept = 0, slope = 1, color="red", size=1.5)

From the plot, we can say that most depth values are almost equal to new_depth, but in some rows the two are far apart. We assume their difference should be less than 11. (This assumption is based on the depth ranges given in this link, which states: "(For a princess cut diamond) A very good cut can be between 56 to 67 percent or 75 to 76 percent." So the maximum depth range among all shapes can be taken as 11.)

diamonds %>%
  filter(!(abs(new_depth - depth) < 11)) %>%
  select(new_depth, depth, x, y, z)
## # A tibble: 20 x 5
##    new_depth depth     x     y     z
##        <dbl> <dbl> <dbl> <dbl> <dbl>
##  1      51.2  62.8  6.26  6.19  3.19
##  2      63.1  43    6.32  6.27  3.97
##  3      65.7  44    6.31  6.24  4.12
##  4      51.7  64    7.15  7.04  3.67
##  5      63.0  43    6.53  6.55  4.12
##  6      16.1  60.6  6.62  6.67  1.07
##  7      20.5  61.9  7.43  7.5   1.53
##  8      19.4  60.7  7.31  7.22  1.41
##  9      40.3  59.4  8.49  8.45  3.41
## 10      78.3  59    6.16  6.15  4.82
## 11      41.4  61.3  8.52  8.42  3.51
## 12      78.3  65.6  7.89  7.84  6.16
## 13      79.1  60    6.29  6.25  4.96
## 14      41.8  61.2  8.42  8.37  3.51
## 15      45.0  62.7  8.02  7.95  3.59
## 16      41.2  63.8  8.9   8.85  3.66
## 17      84.3  61.2  4.51  6.02  4.44
## 18      43.9  60.9  4.71  4.68  2.06
## 19      42.3  61    5.31  5.34  2.25
## 20     101.   63.7  5.01  5.04  5.06

There are 20 such observations, a very small number compared to the total number of observations in the dataset, so we can remove them.

diamonds = diamonds %>%
  filter((abs(new_depth - depth) < 11))
diamonds = subset(diamonds, select = -c(new_depth))

Now we can compare the x, y and z values with each other; they should be highly correlated.

diamonds %>%
  ggplot(., aes(x = x, y = y)) +
  geom_point() + 
  geom_smooth(method = "lm")

diamonds %>%
  ggplot(., aes(x = z, y = y)) +
  geom_point() + 
  geom_smooth(method = "lm")

As expected, these values are highly correlated, so we can assume the x, y and z values are valid.
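The visual impression can be confirmed numerically with cor(); on the full data this is cor(diamonds[, c("x", "y", "z")]), and the already loaded corrplot package can draw the matrix. As a self-contained sketch, here are the first ten values of each column from the glimpse() output above:

```r
#Pearson correlations of the three dimension columns (first 10 rows only)
x <- c(3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00)
y <- c(3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05)
z <- c(2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39)

round(cor(cbind(x, y, z)), 2)  # x and y in particular track each other closely
# corrplot::corrplot(cor(cbind(x, y, z)))  # optional visualization
```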

3 Exploratory Data Analysis

After pre-processing the data, we perform EDA on the diamonds dataset using the different variables.

3.1 Cut

As mentioned before, cut is one of the most important features of a diamond. In this data set, the quality of the cut is given as a categorical variable: Fair is the lowest quality and Ideal the highest. To examine the dataset by this feature, we can look at the percentages of the cut types.

diamonds %>%
  group_by(cut) %>%
  summarise(count = n()) %>%
  mutate(percentage = 100*count/sum(count)) %>%
  ggplot(., aes(x = '', y = count, fill = cut)) + 
    geom_bar(width = 1, stat = "identity", alpha = 0.8) +
    coord_polar("y") +
    theme_void() +
    theme(plot.title = element_text(vjust = 0.5)) +
    geom_text(aes(label = paste(format(percentage,digits=2), "%")), size=4, color = "black", position = position_stack(vjust = 0.3)) +
    labs(title = "Percentage of Quality of Cut ",
        subtitle = st,
        fill = "Quality of the Cut")

The pie chart illustrates that most of the diamonds have an Ideal cut. This was expected, because an Ideal cut diamond is more expensive than other cut types (when all other properties are equal). Moreover, the percentages of Premium and Very Good are almost equal, while only a small share has a Fair cut. The count and the minimum, maximum and average prices can be seen in the table below.

diamonds %>%
  group_by(cut) %>%
  summarise(cut_count = n(),
            MinPrice = min(price),
            AveragePrice = mean(price),
            MaxPrice = max(price)) %>%
  kable(col.names = c("Cut", "Number of Cut Quality", "Minimum Price", "Average Price", "Maximum Price")) %>%
  kable_minimal(full_width = F)
Cut         Number of Cut Quality   Minimum Price   Average Price   Maximum Price
Fair                         1594             337        4334.275           18574
Good                         4888             327        3916.279           18707
Very Good                   12065             336        3981.305           18818
Premium                     13738             326        4576.581           18823
Ideal                       21480             326        3461.698           18806

While the pie chart presents the percentages of the cut types, the table above shows their counts together with the minimum, average and maximum prices. Although the most popular cut type is Ideal, its price is not the highest: by average price, the most expensive diamonds belong to the Premium cut if we look only at the cut variable. This means we cannot explain the response variable (price) with the cut type alone.
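One way to see why cut alone cannot explain price is to compare cuts within the same carat. A toy base-R illustration with invented prices (not taken from the dataset):

```r
#within one carat, the better cut is pricier; across carats, a heavy
#Fair stone easily outprices a light Ideal one (prices invented)
d <- data.frame(
  cut   = c("Fair", "Ideal", "Fair", "Ideal"),
  carat = c(0.3, 0.3, 1.2, 1.2),
  price = c(400, 700, 5200, 7400)
)
aggregate(price ~ cut + carat, data = d, FUN = mean)

#a heavy Fair stone costs more than a light Ideal one
d$price[d$cut == "Fair" & d$carat == 1.2] > d$price[d$cut == "Ideal" & d$carat == 0.3]
```

This is why carat has to enter the model alongside cut: averaging price over cut mixes stones of very different weights.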

3.2 Colors

Color is another important feature of diamonds, and it also affects the price. First, we group the data by color and count the number of diamonds in each group.

diamonds %>%
  group_by(color)%>%
  summarise(count = n()) %>%
  ggplot(., aes(x=color, y = count, fill = count)) +
    geom_col() +
    scale_fill_gradient("count", low="yellow", high="red") +
    geom_line(aes(y = count), size = 1.2, color="black", group = 1) +
    theme_minimal() +
    theme(legend.position = "none") +
    labs(title = "Classification Of Diamonds According to the Colors",
         subtitle = st,
         x = "Diamond Color",
         y = "Number of Color")

The bar chart shows that G is the most common color and J the least common, and the distribution looks roughly normal. The exact count and price values for each color can be obtained from the table below.

diamonds %>%
  group_by(color)%>%
  summarise(color_count = n(),
            MinPrice = min(price),
            AveragePrice = mean(price),
            MaxPrice = max(price)) %>%
  kable(col.names = c("Color", "Count","Minimum Price", "Average Price", "Maximum Price")) %>%
  kable_minimal(full_width = F)
Color   Count   Minimum Price   Average Price   Maximum Price
J        2800             335        5327.337           18710
I        5405             334        5079.123           18823
H        8263             337        4474.449           18803
G       11253             354        3997.174           18818
F        9517             342        3726.559           18791
E        9773             326        3080.044           18731
D        6754             357        3172.589           18693

According to the diamond information given above, the D–F range is the highest purity (actually colorless), followed by the near-colorless grades. The results show that G is the most common color. Moreover, from the table we can read the minimum, maximum and average prices and compare the diamonds by color. By average price, the most expensive diamonds belong to color J if we look only at the color variable, so again we cannot explain the response variable (price) with color alone.

3.3 Clarity

Clarity gives information about whether a diamond contains inclusions or not. To show the clarity distribution of the data set, a lollipop chart (points with segments) is used.

diamonds %>%
  group_by(clarity) %>%
  summarise(clarity_count = n()) %>%
  ggplot(.,aes(x=clarity, y = clarity_count, color= clarity_count)) +
    geom_point(size=9) +
    geom_segment(aes(x=clarity,
                     xend=clarity,
                     y=0,
                     yend=clarity_count))+
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90), legend.position = "none")+
        labs(title = "Classification Of Diamonds According to the Clarity",
         subtitle = st, 
         x = "Clarity",
         y = "Number of Diamonds by Clarity")

According to the results,

  • Most of the diamonds are of the SI1 clarity type, followed by VS2 and SI2. These clarity classes are among the worse ones.
  • The best clarity type, IF, is the second least common. This is expected, since it is hard to produce a diamond without any inclusions.

For a clearer analysis, the following table shows the number of diamonds for each clarity grade, along with the minimum, maximum and average prices.

diamonds %>%
  mutate(clarity = factor(clarity)) %>%
  group_by(clarity) %>%
  summarise(clarity_count = n(),
            MinPrice = min(price),
            AveragePrice = mean(price),
            MaxPrice = max(price)) %>%
  arrange(desc(clarity_count)) %>%
  kable(col.names = c("Clarity", "Count","Minimum Price", "Average Price", "Maximum Price")) %>%
  kable_minimal(full_width = F)
Clarity   Count   Minimum Price   Average Price   Maximum Price
SI1       13026             326        3992.932           18818
VS2       12223             334        3925.842           18823
SI2        9140             326        5051.592           18804
VS1        8153             327        3840.792           18795
VVS2       5055             336        3286.971           18768
VVS1       3645             336        2523.088           18777
IF         1784             369        2870.570           18806
I1          739             345        3924.185           18531

By average price, the most expensive diamonds belong to clarity SI2 if we look only at the clarity variable, so we cannot explain the response variable (price) with the clarity type alone.

3.4 Carat

Carat has two meanings: (i) a measure of purity (for gold) and (ii) the weight of a gemstone. In this data set, carat refers to the weight of the diamond. To find the most common carat, we compute the mode of the carat variable.

#mode of a vector: the value that occurs most often
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(diamonds$carat)
## [1] 0.3

According to the data set, the most common carat is 0.3.

diamonds %>%
  mutate(carat = factor(carat)) %>%
  group_by(carat) %>%
  summarise(carat_count = n())%>%
  arrange(desc(carat_count)) %>%
  head(10) %>%
  ggplot(., aes(y=carat_count, x = reorder(carat, -carat_count), fill = carat)) +
  geom_col() +
  geom_text(aes(label = carat_count), size=3, color = "black", position = position_stack(vjust = 0.95)) +
  theme_minimal() +
  scale_fill_brewer(palette = c("Paired")) +
  theme(legend.position = "none") +
  labs(title = "The 10 Most Popular Carats",
       subtitle = st,
       x = "Carat",
       y = "Count")

The bar chart above shows the ten most common carat values. The following table lists every carat value with its count, price interval and average price.

diamonds %>%
  group_by(carat) %>%
  summarise(carat_count = n(),
            MinPrice = min(price),
            AveragePrice = mean(price),
            MaxPrice = max(price)) %>%
  kable(col.names = c("Carat", "Number of Carats","Minimum Price", "Average Price", "Maximum Price")) %>%
  kable_styling("striped", full_width = T) %>%
  scroll_box(width = "100%", height = "300px")
Carat Number of Carats Minimum Price Average Price Maximum Price
0.20 12 345 365.1667 367
0.21 9 326 380.2222 394
0.22 5 337 391.4000 470
0.23 293 326 486.1433 688
0.24 254 336 505.1850 963
0.25 212 357 550.9245 1186
0.26 253 337 550.8972 814
0.27 233 361 574.7597 893
0.28 198 360 580.1212 828
0.29 130 334 601.1923 1776
0.30 2596 339 680.6367 2366
0.31 2238 335 708.1997 1917
0.32 1827 345 719.1916 1715
0.33 1179 366 765.1747 1389
0.34 909 413 755.8086 2346
0.35 663 409 799.5098 1672
0.36 570 410 781.4667 1718
0.37 394 451 816.0711 1764
0.38 669 497 904.1973 1992
0.39 397 451 923.4761 1624
0.40 1291 484 933.0077 1908
0.41 1378 467 973.0530 1745
0.42 704 552 994.4545 1722
0.43 486 452 995.2737 1827
0.44 212 648 1056.6226 2506
0.45 110 672 1093.3182 1833
0.46 177 746 1240.2429 2139
0.47 99 719 1272.3131 2460
0.48 63 841 1322.0476 2677
0.49 45 900 1306.3333 2199
0.50 1254 584 1503.8453 3378
0.51 1123 826 1623.5841 4368
0.52 813 870 1656.4785 3373
0.53 707 844 1685.5248 3827
0.54 624 935 1704.3317 4042
0.55 495 959 1725.2283 3509
0.56 492 953 1831.9980 4632
0.57 429 929 1798.0443 4032
0.58 310 893 1798.6484 3741
0.59 282 903 1831.4504 4916
0.60 226 806 1956.2434 4209
0.61 204 931 2069.2451 4279
0.62 135 933 1973.3704 3968
0.63 102 1065 2244.7843 6607
0.64 80 1080 2113.5875 4281
0.65 65 1214 2322.6154 5052
0.66 48 1188 2235.7292 4311
0.67 48 1244 2067.8542 3271
0.68 25 1341 2334.3200 4325
0.69 26 1636 2206.6154 3784
0.70 1981 945 2516.0717 5539
0.71 1291 1305 2674.3106 5019
0.72 764 945 2705.7788 5086
0.73 492 1049 2791.2480 5154
0.74 322 1632 2821.9006 5841
0.75 249 1013 2763.4016 5288
0.76 251 1140 2889.5936 5056
0.77 251 1651 2884.3984 5251
0.78 187 1720 2918.9091 4904
0.79 151 1809 2993.2649 4858
0.80 284 1232 3146.1901 6834
0.81 200 1687 3094.4250 6338
0.82 140 2033 3083.6500 4931
0.83 131 1863 3181.9924 7889
0.84 64 1749 3294.9844 6399
0.85 62 1250 3204.1774 7230
0.86 34 1828 3289.9706 5029
0.87 31 1827 3342.3226 5640
0.88 23 1964 3431.3478 5595
0.89 21 1334 3291.2381 4854
0.90 1484 1599 3939.5889 9182
0.91 570 1570 3981.1561 9636
0.92 226 2283 4035.2124 7544
0.93 142 2374 3994.2958 5992
0.94 59 2571 4195.7797 7120
0.95 65 2195 4013.6615 6951
0.96 103 1637 3623.4660 6807
0.97 59 1547 4026.2373 7415
0.98 31 1712 3469.7742 4950
0.99 23 1789 4406.1739 6094
1.00 1544 1681 5250.7720 16469
1.01 2240 2017 5507.1902 16234
1.02 883 2344 5598.4632 17100
1.03 522 1262 5583.8966 17590
1.04 474 2037 5774.6709 18542
1.05 361 2066 5781.6814 13060
1.06 372 2080 5947.0081 15813
1.07 340 2260 6004.2118 18279
1.08 246 2869 5774.3415 14032
1.09 285 2708 6020.0947 18231
1.10 276 3216 5913.0725 12748
1.11 308 3183 6005.5682 15464
1.12 251 2383 6044.4900 14407
1.13 246 2968 6193.4024 14525
1.14 206 2327 6354.4078 18112
1.15 149 3220 5959.9060 12895
1.16 172 2368 6274.2965 13595
1.17 110 2336 6224.8818 11886
1.18 123 3219 6481.1057 14847
1.19 126 2892 6483.2381 15767
1.20 643 2360 6670.7387 16256
1.21 470 2396 6905.4043 17353
1.22 299 2699 6971.5151 14414
1.23 277 3168 7290.8881 16253
1.24 235 2940 7147.1149 15026
1.25 187 3276 7123.2513 16823
1.26 146 3881 7490.7466 15247
1.27 134 2845 7119.1194 15377
1.28 106 3355 7711.8113 18700
1.29 101 2596 7033.0198 17932
1.30 121 2512 7851.6612 15874
1.31 132 3697 7750.6364 17496
1.32 89 4140 7639.3708 15802
1.33 87 4634 7527.0805 18435
1.34 68 4352 7758.5000 17663
1.35 77 4082 7448.2857 14616
1.36 50 4158 8591.5200 15386
1.37 46 5063 8044.0870 13439
1.38 26 4598 8164.3077 17598
1.39 36 3914 8335.1111 13061
1.40 50 4939 9108.5200 16808
1.41 39 4145 9452.2051 17216
1.42 25 6069 9487.2000 18682
1.43 19 6086 9234.9474 14429
1.44 18 4064 8021.6111 12093
1.45 15 4320 8709.3333 17545
1.46 18 5421 8237.4444 12261
1.47 21 4898 7833.7143 12547
1.48 7 6216 8791.2857 15164
1.49 11 5407 9174.2727 18614
1.50 787 2964 10082.9835 18691
1.51 804 3497 10552.0211 18806
1.52 380 3105 10629.9868 17723
1.53 219 5631 10703.4658 17223
1.54 174 4492 10865.0287 18416
1.55 123 4965 10717.7642 18188
1.56 108 5766 10205.9444 17455
1.57 106 5170 10422.7075 17548
1.58 89 4459 11427.9888 18057
1.59 89 6010 10574.3596 17552
1.60 95 6272 11124.2526 18780
1.61 64 6915 10992.8438 18318
1.62 61 5157 11705.8852 18281
1.63 50 7100 11556.3200 18179
1.64 43 4849 10738.0000 17604
1.65 32 5914 10973.8750 17729
1.66 30 5310 11084.5000 16294
1.67 25 7911 12281.4800 17351
1.68 19 5765 11779.1053 16985
1.69 24 5182 11878.0833 17803
1.70 215 5617 12110.3209 18718
1.71 119 6320 12396.5546 18791
1.72 57 5240 12040.3158 18730
1.73 52 6007 11875.9615 18377
1.74 40 4677 12270.3250 18430
1.75 50 7276 12673.7000 17343
1.76 28 9053 12398.5714 18178
1.77 17 7771 11103.1765 15278
1.78 12 8889 12504.5833 18128
1.79 15 9122 13144.0667 18429
1.80 20 7862 12955.1500 17273
1.81 9 5859 11374.4444 14177
1.82 13 5993 12986.1538 15802
1.83 18 5083 11930.0000 18525
1.84 4 7922 11108.5000 13905
1.85 3 5688 11436.0000 14375
1.86 9 9791 12164.3333 17267
1.87 7 9955 12933.5714 17761
1.88 4 8048 10946.2500 13607
1.89 4 10055 13969.0000 17553
1.90 7 8576 11477.1429 13919
1.91 12 6632 12365.8333 17509
1.92 2 15364 15418.0000 15472
1.93 6 13278 16828.1667 18306
1.94 3 9344 14189.3333 18735
1.95 3 5045 12340.6667 17374
1.96 4 5554 8642.2500 13099
1.97 4 14410 15966.0000 17535
1.98 5 12923 14667.8000 16171
1.99 3 11486 14179.0000 17713
2.00 261 5051 14107.1456 18818
2.01 433 5696 14772.7136 18741
2.02 176 6346 14767.4091 18731
2.03 122 6002 14710.1148 18781
2.04 86 7403 14840.1744 18795
2.05 65 7026 15647.2462 18787
2.06 59 5203 14348.6780 18779
2.07 50 6503 15384.2800 18804
2.08 41 6150 15128.6829 18760
2.09 45 10042 15287.3111 18640
2.10 51 6597 14464.3725 18648
2.11 43 7019 14563.4186 18575
2.12 25 7508 15007.6800 18656
2.13 21 12356 15220.4286 18442
2.14 48 5405 14905.0417 18528
2.15 22 5430 14187.1818 18791
2.16 25 8709 15860.4800 18678
2.17 18 6817 14416.6111 17805
2.18 30 10340 14901.7000 18717
2.19 22 10179 15521.0909 18693
2.20 31 7934 15226.8710 18364
2.21 23 6535 15287.9565 18483
2.22 27 5607 14691.5185 18706
2.23 13 7006 13947.4615 18149
2.24 16 10913 15573.9375 18299
2.25 16 6653 13403.5625 18034
2.26 15 11226 15330.2000 18426
2.27 12 5733 14198.4167 18343
2.28 20 9068 15422.9500 18055
2.29 17 11502 15834.7647 18823
2.30 20 7226 14861.5500 18304
2.31 13 7257 15017.3846 17715
2.32 16 6860 16042.1875 18532
2.33 9 8220 14089.7778 18119
2.34 5 8491 11922.0000 15818
2.35 7 14185 16090.4286 17999
2.36 8 11238 15756.7500 18745
2.37 6 14837 16525.1667 18508
2.38 8 10716 15481.3750 18559
2.39 7 15917 17182.4286 17955
2.40 13 11988 16362.4615 18541
2.41 5 13563 15808.2000 17923
2.42 7 16198 16923.5714 18615
2.43 6 9716 15356.0000 18692
2.44 4 13027 16365.5000 18430
2.45 4 11830 14593.7500 18113
2.46 3 10470 14227.0000 16466
2.47 3 14970 15644.0000 16532
2.48 9 12883 16537.1111 18692
2.49 3 6289 13843.0000 18325
2.50 16 7854 14316.1875 18447
2.51 17 14209 16616.8824 18308
2.52 9 10076 16242.1111 18252
2.53 8 14659 16548.8750 18254
2.54 8 12095 15125.2500 17996
2.55 3 14351 15964.0000 18766
2.56 3 15231 16723.3333 17753
2.57 3 17116 17841.6667 18485
2.58 3 12500 14481.3333 16195
2.59 1 16465 16465.0000 16465
2.60 3 17027 17535.0000 18369
2.61 3 13784 16583.0000 18756
2.63 3 10437 12659.6667 16914
2.64 1 17407 17407.0000 17407
2.65 1 16314 16314.0000 16314
2.66 2 16239 17367.0000 18495
2.67 1 18686 18686.0000 18686
2.68 2 8419 9042.0000 9665
2.70 1 14341 14341.0000 14341
2.71 1 17146 17146.0000 17146
2.72 3 6870 12088.3333 17801
2.74 3 8807 14385.0000 17184
2.75 2 13156 14285.5000 15415
2.77 1 10424 10424.0000 10424
2.80 1 15030 15030.0000 15030
3.00 8 6512 12378.5000 16970
3.01 12 8040 15521.6667 18710
3.02 1 10577 10577.0000 10577
3.04 2 15354 16956.5000 18559
3.05 1 10453 10453.0000 10453
3.11 1 9823 9823.0000 9823
3.22 1 12545 12545.0000 12545
3.24 1 12300 12300.0000 12300
3.40 1 15964 15964.0000 15964
3.50 1 12587 12587.0000 12587
3.51 1 18701 18701.0000 18701
3.65 1 11668 11668.0000 11668
3.67 1 16193 16193.0000 16193
4.00 1 15984 15984.0000 15984
4.01 2 15223 15223.0000 15223
4.13 1 17329 17329.0000 17329
4.50 1 18531 18531.0000 18531
5.01 1 18018 18018.0000 18018

Up until now, the most important features of the diamonds have been presented as count analyses. From this point on, the relationships between the variables and price are illustrated.

3.5 Price and Cut Analysis by Color

diamonds %>%
  group_by(cut, color) %>%
  summarise(avg_price = mean(price)) %>%
  ggplot(aes(x=cut, y= avg_price, fill = cut)) +
    geom_col() +
    facet_wrap(~color) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90), legend.position = "none") +
    labs(title = "Relationship Analysis Between Price and Cut by Color",
         subtitle = st, 
         y = "Average Price",
         x = "Quality of Cut")

When we look at the plots, we can see that these two variables alone cannot explain the price variable: within the best cut grades, the worst colors have the highest average prices.

3.6 Price and Clarity Analysis by Color

diamonds %>%
  group_by(clarity, color) %>%
  summarise(MeanPrice = mean(price)) %>%
  ggplot(aes(x=clarity, y = MeanPrice, fill = color)) +
    geom_col(alpha = 0.8) +
    theme_minimal() +
    facet_wrap(~color) +
    labs(title = "Relationship Analysis Between Price and Clarity by Color",
         subtitle = st,
         x = "Clarity",
         y = "Average Price",
         fill = "Color")

The best combination of these two variables has the highest average price among all groups. For the other combinations, however, average price decreases as clarity increases, for every color. So we still cannot explain price with these two variables.

3.7 Price and Clarity Analysis by Cut

diamonds %>%
  group_by(clarity, cut) %>%
  summarise(MeanPrice = mean(price)) %>%
  ggplot(aes(x=clarity, y = MeanPrice, fill = cut)) +
    geom_col(alpha = 0.8) +
    theme_minimal() +
    facet_wrap(~cut) +
    labs(title = "Relationship Analysis Between Price and Clarity by Cut",
         subtitle = st,
         x = "Clarity",
         y = "Average Price",
         fill = "Cut")

We get the same result as in the previous plot: SI2, the second worst clarity grade, has the highest average price in every cut type. So we cannot explain the price variable with these two variables either.

3.8 Price Group Analysis

quant = quantile(diamonds$price, seq(0, 1, 0.2))

diamonds_price_group <- diamonds %>%
  mutate(price_group = case_when(
    price < quant[2] ~ "Very Low",
    price < quant[3] ~ "Low",
    price < quant[4] ~ "Medium",
    price < quant[5] ~ "High",
    TRUE ~ "Very High"
  )) %>%
  mutate(price_group = factor(price_group, levels = c("Very Low", "Low", "Medium", "High", "Very High")))

diamonds_price_group %>%
  group_by(cut, price_group) %>%
  summarise(count=n()) %>%
  mutate(percentage = 100 * count / sum(count)) %>%
  ggplot(., aes(x = '', y = count, fill = price_group)) + 
  geom_bar(width = 1, stat = "identity", position = "fill") +
  coord_polar("y") +
  theme_void() +
  theme(plot.title = element_text(vjust = 0.5)) +
  facet_wrap(~cut) +
  labs(title = "Price Group Analyses of Cut",
       subtitle = st,
       fill = "Price Group")

The price groups are roughly equally distributed across all cut types except Fair. Fair cuts make up only about 3 percent of the data, so some clumping in its price groups is possible. The same results are shown in the table below.
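The quantile-based bucketing used for the price groups can be reproduced with base R's `cut()` (a toy sketch; the vector and labels here are illustrative, and `right = FALSE` mirrors the `price < quant[k]` comparisons in the `case_when()` above):

```r
# Toy sketch of the quintile bucketing: cut() with quantile breaks
# reproduces the case_when() logic (right = FALSE mirrors "<")
prices <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000)
quant  <- quantile(prices, seq(0, 1, 0.2))
labels <- c("Very Low", "Low", "Medium", "High", "Very High")
groups <- cut(prices, breaks = quant, labels = labels,
              right = FALSE, include.lowest = TRUE)
table(groups)  # two observations per group for this toy vector
```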

price_group_vs_cut = diamonds_price_group %>%
  group_by(cut, price_group) %>%
  summarise(count=n()) %>%
  mutate(percentage = 100 * count / sum(count)) %>%
  select(cut, price_group, count, percentage)%>%
  pivot_wider(id_cols = cut, names_from = price_group, values_from = count)
cut_percentage = diamonds_price_group %>%
  group_by(cut) %>%
  summarise(count=n()) %>%
  mutate(percentage = round(100 * count / sum(count),2)) %>%
  select(cut,percentage)
price_group_vs_cut %>%
  inner_join(cut_percentage, by = "cut") %>%
  kable(col.names = c("Cut", "Very Low","Low", "Medium", "High", "Very High", "Percentage"))%>% 
  kable_minimal(full_width = F) 
Cut Very Low Low Medium High Very High Percentage
Fair 68 221 546 463 296 2.96
Good 953 692 1032 1363 848 9.09
Very Good 2735 1782 2448 2697 2403 22.44
Premium 2246 2568 2297 3200 3427 25.55
Ideal 4745 5491 4435 3024 3785 39.95

3.9 Price Histogram

diamonds %>%
  ggplot(aes(x=price)) +
  geom_histogram(aes(y=..density..), position="identity", alpha=0.8, fill = "seashell3", color="seashell4") +
  geom_density(alpha=1, size = 1)+
  theme_minimal() +
  labs(title = "Histogram of Price",
       subtitle = st,
       x = "Price",
       y = "Density")

The histogram shows that the price variable in the diamonds dataset is right skewed.

diamonds %>%
  ggplot(aes(x=log(price))) +
  geom_histogram(aes(y=..density..), position="identity", alpha=0.8, fill = "seashell3", color="seashell4") +
  geom_density(alpha=1, size = 1)+
  theme_minimal() +
  labs(title = "Histogram of Log Price",
       subtitle = st,
       x = "Log of Price",
       y = "Density")

When we plot the log transformation of the prices, we still do not get the shape of a normal distribution. So plain linear models are not a good fit for this data; instead, we can try GLMs, decision trees, etc.
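The skewness claim can also be checked numerically (a sketch using a simple moment-based skewness function on simulated right-skewed data, standing in for the actual prices):

```r
# Moment-based sample skewness: positive = right-skewed
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(503)
x <- rexp(10000)   # toy right-skewed data standing in for price
skewness(x)        # clearly positive (theoretical value is 2)
skewness(log(x))   # the log transform can overshoot into left skew
```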

4 Feature Extraction / Dimensionality Reduction

First, we need to divide the data into train and test sets.

set.seed(503)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>% 
    group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()

diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()), 
    diamonds_test, by = "diamond_id")

diamonds_train = diamonds_train[, c(-ncol(diamonds_train))]
diamonds_test = diamonds_test[, c(-ncol(diamonds_test))]

We will treat all the ordered factor columns (cut, color, clarity) as numeric values.

diamonds_train$cut = as.numeric(diamonds_train$cut)
diamonds_train$clarity = as.numeric(diamonds_train$clarity)
diamonds_train$color = as.numeric(diamonds_train$color)
diamonds_test$cut = as.numeric(diamonds_test$cut)
diamonds_test$clarity = as.numeric(diamonds_test$clarity)
diamonds_test$color = as.numeric(diamonds_test$color)
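Note that `as.numeric()` on an ordered factor returns the underlying level index, so the quality ordering survives the conversion — a quick sketch:

```r
# as.numeric() on an ordered factor yields the level index
# (1 = Fair, the worst cut, up to 5 = Ideal, the best)
cut_example <- factor(c("Ideal", "Fair", "Good"),
                      levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                      ordered = TRUE)
as.numeric(cut_example)  # 5 1 2
```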

4.1 PCA

There are several numerical variables, so we can apply PCA to reduce the number of columns in this dataset.

diamonds_pca <- princomp(as.matrix(diamonds_train[,sapply(diamonds_train, class) == "numeric"]),cor=T)
summary(diamonds_pca,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     2.0743448 1.1892870 1.1060252 0.9927188 0.83129152
## Proportion of Variance 0.4781007 0.1571560 0.1359213 0.1094989 0.07678284
## Cumulative Proportion  0.4781007 0.6352566 0.7711779 0.8806769 0.95745974
##                            Comp.6      Comp.7       Comp.8       Comp.9
## Standard deviation     0.59265505 0.172985644 0.0353090908 2.125041e-02
## Proportion of Variance 0.03902667 0.003324893 0.0001385258 5.017554e-05
## Cumulative Proportion  0.99648641 0.999811299 0.9999498245 1.000000e+00
## 
## Loadings:
##         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## carat    0.471                       0.126         0.868              
## cut     -0.109  0.596  0.451  0.239 -0.175  0.585                     
## color   -0.158 -0.203         0.802  0.537                            
## clarity -0.222  0.208  0.212 -0.506  0.778                            
## depth           0.207 -0.828         0.118  0.492                0.101
## table    0.129 -0.697  0.226 -0.183         0.642                     
## x        0.475                       0.115        -0.277 -0.722  0.387
## y        0.474                       0.120        -0.299  0.692  0.420
## z        0.474  0.103                0.132        -0.282        -0.815

From the summary of the PCA model we created, we can see that the first four components explain almost 88% of the total variance, so we can choose four components for modeling. Now we need to add these components to both the train and test datasets.
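The "four components reach about 88%" selection can be computed directly from the component standard deviations (a sketch on toy data; in the report `princomp()` runs on the nine numeric training columns):

```r
# Choosing the number of components from cumulative explained variance
set.seed(503)
toy <- matrix(rnorm(100 * 4), ncol = 4)
toy[, 2] <- toy[, 1] + rnorm(100, sd = 0.1)   # introduce correlation
pca <- princomp(toy, cor = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(cum_var >= 0.88)[1]           # smallest count reaching 88%
```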

pca_results = predict(diamonds_pca, newdata = diamonds_train)
diamonds_train$pca1 = pca_results[,1]
diamonds_train$pca2 = pca_results[,2]
diamonds_train$pca3 = pca_results[,3]
diamonds_train$pca4 = pca_results[,4]

pca_results_test = predict(diamonds_pca, newdata = diamonds_test)
diamonds_test$pca1 = pca_results_test[,1]
diamonds_test$pca2 = pca_results_test[,2]
diamonds_test$pca3 = pca_results_test[,3]
diamonds_test$pca4 = pca_results_test[,4]

4.2 Clustering

We can create a feature using clustering methods; here we use the k-means algorithm. Before running k-means, we need to scale the data, because a column with a much larger range than the others would dominate the distance calculation. We take the minimum and maximum values from the train dataset and scale both the train and test sets with these same values.
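The min–max scaling described above, applied to a test column with the training minimum and maximum, looks like this in miniature (a toy sketch; note that test values may fall outside [0, 1]):

```r
# Scale with the TRAIN min/max, then reuse the same values on the test set
train_col <- c(2, 4, 6, 8, 10)
test_col  <- c(1, 5, 11)
mn <- min(train_col); mx <- max(train_col)
scaled_train <- (train_col - mn) / (mx - mn)
scaled_test  <- (test_col  - mn) / (mx - mn)
scaled_test  # first value < 0 and last > 1: outside the train range
```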

min_vals = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], min)
max_vals = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], max)
diamonds_train_scale <- sweep(sweep(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, min_vals), 2, max_vals - min_vals, "/")

To get the same clusters every time, we use a fixed seed. To select the number of clusters, we loop over 1 to 15 for the centers argument and choose a value based on the error plot.

errors = c()
for (i in (1:15)){
  set.seed(1234) #provide getting same results with random function 
  cluster<-kmeans(diamonds_train_scale,centers=i) # k-means with i clusters
  errors = c(errors, 100*round(1 - (cluster$tot.withinss/cluster$totss), digits = 3)) # percentage of total variance explained by the clustering
}

errors_df <- data.frame(x=c(1:15), y=errors) # creating data frame with errors.

ggplot(errors_df, aes(x=x, y=y)) +
  geom_point(color = "seashell4") +
  geom_line(color="seashell4") +
  geom_text(aes(label = errors), size=3, color = "black", position = position_stack(vjust = 0.95))+
  theme_minimal() +
  labs(title = "Classification of Observations",
       subtitle = st,
       x = "Number of Clusters",
       y = "Explained Variance (%)")

The decrease in error slows noticeably after 7 centers, so we select the model with centers equal to 7.
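The "decrease slows down" rule can be made concrete by looking at successive gains in explained variance and stopping at the first small one (a sketch with made-up percentages; the 5-point threshold is our assumption, not from the analysis):

```r
# Elbow rule: pick the first k whose marginal gain drops below a threshold
explained <- c(40, 62, 74, 81, 87, 92.5, 94.5, 95.5, 96.2)  # k = 1..9 (toy)
gains  <- diff(explained)          # improvement from k to k + 1
k_best <- which(gains < 5)[1] + 1  # first k where the gain is under 5 points
k_best
```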

set.seed(1234)
best_cluster = kmeans(diamonds_train_scale,centers=7)
diamonds_train$cluster = as.factor(best_cluster$cluster)

Now, we need to apply the clustering to the test dataset. To do this, I used the method in this link.

diamonds_test_scale <- sweep(sweep(diamonds_test[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, min_vals), 2, max_vals - min_vals, "/")

closest.cluster <- function(x) {
  cluster.dist <- apply(best_cluster$centers, 1, function(y) sqrt(sum((x-y)^2)))
  return(which.min(cluster.dist)[1])
}
diamonds_test$cluster <- as.factor(apply(diamonds_test_scale, 1, closest.cluster))

After these processes, we have enough data to create a model.

5 Create Model

First, we need to check the correlation of the features with the price column.

diamonds_cor<-cor(diamonds_train[-c(11:15)])
corrplot(diamonds_cor, method="number")

From the plot above, we can say that the carat, x, y and z columns are highly correlated with the price column, so these columns should be used in any model. Still, we will start by including all columns in the model.

5.1 Linear Model (LM)

For simplicity, we can start with linear regression models. At the beginning, we will not consider the pca columns in the dataset.

model_lm1 = lm(price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train)
summary(model_lm1)
## 
## Call:
## lm(formula = price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23380.8   -624.4   -111.3    487.6  10016.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -23516.947   1284.448 -18.309  < 2e-16 ***
## carat        11553.086     65.487 176.419  < 2e-16 ***
## cut            168.773     12.865  13.119  < 2e-16 ***
## color          283.070      5.670  49.920  < 2e-16 ***
## clarity        501.646      4.475 112.094  < 2e-16 ***
## depth          358.929     19.878  18.056  < 2e-16 ***
## table          -16.708      3.386  -4.934 8.07e-07 ***
## x             -347.272    147.355  -2.357   0.0184 *  
## y             3718.053    151.287  24.576  < 2e-16 ***
## z            -7446.634    320.787 -23.214  < 2e-16 ***
## cluster2       -48.540     25.392  -1.912   0.0559 .  
## cluster3        76.119     42.557   1.789   0.0737 .  
## cluster4       -42.650     35.186  -1.212   0.2255    
## cluster5       250.682     35.456   7.070 1.57e-12 ***
## cluster6       234.953     35.374   6.642 3.13e-11 ***
## cluster7       440.818     26.276  16.776  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1190 on 42987 degrees of freedom
## Multiple R-squared:  0.9106, Adjusted R-squared:  0.9106 
## F-statistic: 2.92e+04 on 15 and 42987 DF,  p-value: < 2.2e-16

We created a linear model. For now, we will not try to improve it; instead, we check the linear model assumptions. To do so, we plot the residuals.

plot(model_lm1,1)

plot(model_lm1,2)

In the first plot, we can see that the residual variance is not constant: it grows with the fitted values. This violates the homoscedasticity assumption, so the dataset is not well suited to plain linear regression. For better models, we can use generalized linear models (GLMs) instead.

5.2 Generalized Linear Model (GLM)

This data has a continuous, positive response variable, which means we can use the Gamma and Gaussian families.

For easy implementation, we can use the code chunks that we learned in IE508.

for(link in c("identity", "log", "inverse", "sqrt")){
  fitgamma = glm(price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train, family = Gamma(link = link), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
  print(AIC(fitgamma))
}
## [1] 673749.4
## [1] 1096312
## [1] 711941.7
## [1] 644408.7
for(link in c("identity", "log", "inverse")){
  fitgaussian = glm(price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train, family = gaussian(link = link), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
  print(AIC(fitgaussian))
}
## [1] 731124.9
## [1] 5602415
## [1] 864186

When we compare the AIC values of the glm models, we see that the best model uses the Gamma family with the sqrt link function, so we will continue with these arguments for glm.

model_glm = glm(price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train, family = Gamma(link = "sqrt"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm)
## 
## Call:
## glm(formula = price ~ . - pca1 - pca2 - pca3 - pca4, family = Gamma(link = "sqrt"), 
##     data = diamonds_train, start = c(0.5, 0.5, 0.5, 0.5, 0.5, 
##         0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.27202  -0.12157  -0.02088   0.09054   2.29577  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -75.03830    4.80955 -15.602  < 2e-16 ***
## carat        52.76661    0.34888 151.246  < 2e-16 ***
## cut           0.53402    0.04191  12.743  < 2e-16 ***
## color         1.19880    0.01677  71.493  < 2e-16 ***
## clarity       2.01002    0.01348 149.150  < 2e-16 ***
## depth         0.72617    0.07474   9.716  < 2e-16 ***
## table         0.01401    0.01082   1.295 0.195384    
## x            16.04449    0.61542  26.071  < 2e-16 ***
## y            -3.56428    0.62223  -5.728 1.02e-08 ***
## z           -12.25907    1.42896  -8.579  < 2e-16 ***
## cluster2     -0.12907    0.06921  -1.865 0.062195 .  
## cluster3     -1.58676    0.14697 -10.797  < 2e-16 ***
## cluster4     -2.87269    0.13135 -21.870  < 2e-16 ***
## cluster5     -0.10937    0.10869  -1.006 0.314332    
## cluster6      0.35828    0.10104   3.546 0.000392 ***
## cluster7      1.38220    0.10099  13.686  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.03404458)
## 
##     Null deviance: 42126.8  on 43002  degrees of freedom
## Residual deviance:  1393.5  on 42987  degrees of freedom
## AIC: 644409
## 
## Number of Fisher Scoring iterations: 8

The table variable is not significant in this model, so we can remove it from the model.

model_glm2 = glm(price ~ . - pca1 - pca2 - pca3 - pca4 - table, data = diamonds_train, family = Gamma(link = "sqrt"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm2)
## 
## Call:
## glm(formula = price ~ . - pca1 - pca2 - pca3 - pca4 - table, 
##     family = Gamma(link = "sqrt"), data = diamonds_train, start = c(0.5, 
##         0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 
##         0.5, 0.5, 0.5))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.27264  -0.12163  -0.02084   0.09062   2.28622  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -73.19219    4.56920 -16.019  < 2e-16 ***
## carat        52.81087    0.34721 152.102  < 2e-16 ***
## cut           0.50918    0.03704  13.748  < 2e-16 ***
## color         1.19915    0.01676  71.540  < 2e-16 ***
## clarity       2.00950    0.01347 149.222  < 2e-16 ***
## depth         0.71171    0.07371   9.655  < 2e-16 ***
## x            16.05383    0.61528  26.092  < 2e-16 ***
## y            -3.63697    0.61902  -5.875 4.25e-09 ***
## z           -12.17752    1.42666  -8.536  < 2e-16 ***
## cluster2     -0.12735    0.06919  -1.841 0.065677 .  
## cluster3     -1.61398    0.14540 -11.101  < 2e-16 ***
## cluster4     -2.87373    0.13132 -21.884  < 2e-16 ***
## cluster5     -0.14129    0.10604  -1.332 0.182730    
## cluster6      0.32846    0.09832   3.341 0.000836 ***
## cluster7      1.38189    0.10096  13.687  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.0340283)
## 
##     Null deviance: 42126.8  on 43002  degrees of freedom
## Residual deviance:  1393.6  on 42988  degrees of freedom
## AIC: 644408
## 
## Number of Fisher Scoring iterations: 8

We can also test removing the cluster column.

model_glm3 = glm(price ~ . - pca1 - pca2 - pca3 - pca4 - table - cluster, data = diamonds_train, family = Gamma(link = "sqrt"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm3)
## 
## Call:
## glm(formula = price ~ . - pca1 - pca2 - pca3 - pca4 - table - 
##     cluster, family = Gamma(link = "sqrt"), data = diamonds_train, 
##     start = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.25890  -0.12326  -0.02196   0.08701   2.25745  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -86.25420    4.62895 -18.634  < 2e-16 ***
## carat        50.94390    0.34313 148.469  < 2e-16 ***
## cut           0.47216    0.01724  27.390  < 2e-16 ***
## color         1.33121    0.01096 121.467  < 2e-16 ***
## clarity       2.04246    0.01187 172.057  < 2e-16 ***
## depth         0.89650    0.07468  12.005  < 2e-16 ***
## x            17.25907    0.60617  28.472  < 2e-16 ***
## y            -2.30449    0.60142  -3.832 0.000127 ***
## z           -15.66734    1.44552 -10.839  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.03557524)
## 
##     Null deviance: 42126.8  on 43002  degrees of freedom
## Residual deviance:  1441.6  on 42994  degrees of freedom
## AIC: 645863
## 
## Number of Fisher Scoring iterations: 9

Removing cluster increases the AIC, so we can say that the cluster feature makes our model better. Now, we need to check whether we can replace the numeric variables with the PCA components.

model_glm4 = glm(price ~ pca1 + pca2 + pca3 + pca4 + cluster, data = diamonds_train, family = Gamma(link = "sqrt"))
summary(model_glm4)
## 
## Call:
## glm(formula = price ~ pca1 + pca2 + pca3 + pca4 + cluster, family = Gamma(link = "sqrt"), 
##     data = diamonds_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.19335  -0.26935  -0.06091   0.16983   1.92149  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.58457    0.12355 457.991  < 2e-16 ***
## pca1        11.83082    0.04129 286.500  < 2e-16 ***
## pca2         4.07612    0.04255  95.803  < 2e-16 ***
## pca3         3.14036    0.04449  70.588  < 2e-16 ***
## pca4        -0.30761    0.06086  -5.054 4.34e-07 ***
## cluster2    -3.32685    0.14251 -23.345  < 2e-16 ***
## cluster3    -2.57743    0.27792  -9.274  < 2e-16 ***
## cluster4    -7.05281    0.27171 -25.957  < 2e-16 ***
## cluster5     2.92040    0.17373  16.810  < 2e-16 ***
## cluster6     3.86938    0.15565  24.859  < 2e-16 ***
## cluster7     1.72285    0.21070   8.177 2.99e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.1343288)
## 
##     Null deviance: 42126.8  on 43002  degrees of freedom
## Residual deviance:  5343.6  on 42992  degrees of freedom
## AIC: 702856
## 
## Number of Fisher Scoring iterations: 12

PCA lets us use fewer features in a model but, as expected, with fewer features we get a higher AIC. Since we have only 13 numeric features, we do not need to reduce their number.

To compare the glm model with other models, we need to predict on the test data and calculate the R squared value.
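The out-of-sample R squared used to compare the models can be wrapped in a small helper (a sketch; the helper name `r2` is ours, the report computes it inline):

```r
# Out-of-sample R^2: 1 - SSE / SST, with SST taken against the TRAIN mean
r2 <- function(pred, actual, train_mean) {
  1 - sum((pred - actual)^2) / sum((actual - train_mean)^2)
}

actual <- c(10, 20, 30)
r2(actual, actual, mean(actual))                # perfect prediction -> 1
r2(rep(mean(actual), 3), actual, mean(actual))  # predicting the mean -> 0
```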

model_glm2_prediction<-predict(model_glm2,newdata=diamonds_test)

model_glm2_r2 <- 1 - (sum((model_glm2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_glm2_r2
## [1] -0.9323168

The result is negative, which means the model performs worse than simply predicting the mean. We can change the family/link function of the glm to get a better model. The second best model had the Gamma family with the identity link function, so we create the same model with that family and link.
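One detail worth knowing here: `predict()` on a glm returns values on the link scale by default, so with the sqrt link the predictions above were square roots of prices; `type = "response"` applies the inverse link, and with the identity link the two scales coincide. A toy illustration (simulated data, not the diamonds):

```r
# predict() on a glm defaults to the link scale; type = "response"
# applies the inverse link (for the sqrt link: squaring)
set.seed(1)
x <- runif(200, 1, 3)
y <- rgamma(200, shape = 20, rate = 20 / (5 + 4 * x)^2)  # mean = (5 + 4x)^2
fit <- glm(y ~ x, family = Gamma(link = "sqrt"))
p_link <- predict(fit)                     # sqrt-of-response scale
p_resp <- predict(fit, type = "response")  # response scale
max(abs(p_resp - p_link^2))                # ~0: response = link squared
```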

model_glm5 = glm(price ~ . - pca1 - pca2 - pca3 - pca4 - table, data = diamonds_train, family = Gamma(link = "identity"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))

model_glm5_prediction<-predict(model_glm5,newdata=diamonds_test)

model_glm5_r2 <- 1 - (sum((model_glm5_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_glm5_r2
## [1] 0.8122059

This is a far better model. We can also plot the real values against the predictions.

pred_vs_real_glm = as.data.frame(cbind(model_glm5_prediction, diamonds_test$price))
colnames(pred_vs_real_glm) = c("predictions", "real")

pred_vs_real_glm %>%
  ggplot(aes(predictions, real)) +
  geom_point()  +
  geom_abline(intercept = 0, slope = 1, color="red", size=1.5) +
  labs(title = "Prediction vs Real Values For Best GLM Model",
       subtitle = st,
       x = "Predictions",
       y = "Real Values")

5.3 CART

model_rpart <- rpart(price~., data=diamonds_train)
prp(model_rpart)

From the tree plot, we can see that the nodes are split using only the carat, y and clarity columns, meaning that with the default arguments these are the best features for reducing variance in the dataset. We can cross-validate the cp argument; to do so, we apply the following code chunk (these lines are adapted from the Analytics Edge course):

# Number of folds
tr.control = trainControl(method = "cv", number = 10)

# cp values
cp.grid = expand.grid( .cp = (1:10)*0.001)

# Cross-validation
tr = train(price~. - pca1 - pca2 - pca3 - pca4, data = diamonds_train, method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
tr
## CART 
## 
## 43003 samples
##    14 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 38703, 38704, 38702, 38701, 38703, 38702, ... 
## Resampling results across tuning parameters:
## 
##   cp     RMSE       Rsquared   MAE     
##   0.001   944.8551  0.9436613  577.6079
##   0.002  1037.4473  0.9320865  628.6129
##   0.003  1079.4290  0.9264358  653.3161
##   0.004  1143.8428  0.9174002  689.0738
##   0.005  1146.8377  0.9169674  690.2158
##   0.006  1241.1488  0.9026995  809.9428
##   0.007  1277.7140  0.8968892  845.0753
##   0.008  1277.7140  0.8968892  845.0753
##   0.009  1277.7140  0.8968892  845.0753
##   0.010  1334.7119  0.8871967  870.7893
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.001.

Comparing the cp values, cp = 0.001 gives the lowest RMSE, so we use it.

model_rpart2 <- rpart(price~. - pca1 - pca2 - pca3 - pca4, data=diamonds_train, cp = 0.001)
prp(model_rpart2)

This is a more detailed tree. From its plot, we can say that price is predicted using the carat, y, clarity and color values. To compare the two trees, we first calculate predictions on the test data and then compute the R squared values.

model_rpart_prediction<-predict(model_rpart,newdata=diamonds_test)

model_rpart_r2 <- 1 - (sum((model_rpart_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_rpart_r2
## [1] 0.881602
model_rpart2_prediction<-predict(model_rpart2,newdata=diamonds_test)

model_rpart2_r2 <- 1 - (sum((model_rpart2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_rpart2_r2
## [1] 0.9456136

As we can see from the results, the second model is better than the first.

If we want to compare the best tree model with best glm model, we need to compare the R squared values for these models.

model_rpart2_r2 > model_glm5_r2
## [1] TRUE

So, our CART model is better than the glm model.

6 Conclusion

We prepared the data and checked its quality before building models. We then applied the PCA and k-means algorithms. As expected, the columns created by PCA gave worse results; PCA is better suited to data with many features. With k-means, after scaling the numerical columns (so that no single column dominates the distances), we created a cluster feature that improved every model. After building the best linear, generalized linear and decision tree models, the decision tree model performed better than the others.