Exploratory Data Analysis and Price Estimation

1. Introduction

1.1. Diamonds

Diamonds are the result of carbon exposed to extremely high heat and pressure, deep in the earth. This process results in a variety of both internal and external characteristics (i.e. size, shape, color, etc.) which makes every diamond unique. Therefore, the jewelry industry uses a systematic way to evaluate and discuss these factors. Diamond professionals use the grading system developed by GIA in the 1950s, which established the use of four important factors to describe and classify diamonds: Clarity, Color, Cut, and Carat Weight. These are known as the 4Cs. When used together, they describe the quality of a finished diamond, and the value of a finished diamond is based on this combination. The explanation of the 4Cs can be seen below:

  • Color: Most commercially available diamonds are classified by color, or more appropriately, the most valuable diamonds are those classified as colorless. Color is graded on a letter scale from D to Z, with D representing a colorless diamond. Diamonds that are graded in the D-F range are the rarest and consequently most valuable. In reality, diamonds in the G-K range have such a small amount of color that the untrained eye can not see it.
  • Cut: The beauty of a diamond resides not only in a favorable body color, but also the high refractive index and color dispersion. The way a diamond is cut, its width, depth, roundness, size and position of the facets determine the brilliance of the stone. Even if the color and clarity are perfect, if the diamond is not cut to good proportions, it will be less impressive to the eye.
  • Clarity: The clarity of a diamond is determined by the number, location and type of inclusions it contains. Inclusions can be microscopic cracks, mineral deposits or external markings. Clarity is rated using a scale which contains a combination of letters and numbers to signify the amount and type of inclusions. This scale ranges from FL to I3, FL being Flawless and the most valuable.
  • Carat Weight: In addition to color, clarity, and cut, weight provides a further basis for the valuation of a diamond. The weight of diamonds is measured in carats; one carat is equal to 1/5 of a gram. Smaller diamonds are more readily available than larger ones, so value per carat rises with size. A 2 carat diamond will therefore cost more than twice as much as a comparable 1 carat diamond: the larger the diamond, the rarer it becomes, and the price increases disproportionately.

The diamonds dataset that we use in this report consists of prices and quality information for about 54,000 diamonds and is included in the ggplot2 package. The dataset contains the price of each diamond (in 2008) as well as various attributes, some of which are known to influence price. These attributes consist of the 4Cs and some physical measurements such as depth, table, x, y, and z.
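Since the dataset ships with the ggplot2 package (loaded later as part of the tidyverse), it can be accessed directly; a minimal sketch:

library(ggplot2)
data("diamonds", package = "ggplot2")
dim(diamonds)   # 53940 rows and 10 columns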

1.2. Objectives

In this assignment, the data will first be preprocessed to improve its quality. Then an exploratory data analysis will be performed through data manipulation and data visualization steps. The main purpose is to identify which variables affect the price of diamonds the most and to draw conclusions about the relationships between variables. In addition, forecasting models will be studied in order to estimate the price of diamonds with given properties. The steps of the assignment can be listed as below:

  1. Data Preprocessing
  2. Data Manipulation
  3. Data Visualization
  4. Forecasting

2. Data Preprocessing

2.1. Used Packages

library(tidyverse)
library(knitr)
library(tinytex)
library(kableExtra)
library(corrplot)
library(RColorBrewer)
library(ggExtra)
library(rpart)
library(rpart.plot)
library(rattle)
library(data.table)
library(caret)
library(e1071)

The packages used during the assignment can be listed as below:

  1. tidyverse
  2. knitr
  3. tinytex
  4. kableExtra
  5. corrplot
  6. RColorBrewer
  7. ggExtra
  8. rpart
  9. rpart.plot
  10. rattle
  11. data.table
  12. caret
  13. e1071

2.2. Variables

Before starting the analysis with the dataset, it will be useful to have information about the properties of diamonds. There are 10 columns regarding the properties of diamonds.

  1. carat: Weight of the diamond (numeric variable)
  2. cut: Quality of the cut: Fair, Good, Very Good, Premium, Ideal (categorical variable)
  3. color: Color of the diamond, from D (best) to J (worst) (categorical variable)
  4. clarity: A measurement of how clear the diamond is: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best) (categorical variable)
  5. depth: Total depth percentage, i.e. z / mean(x, y) = 2 * z / (x + y), expressed as a percentage (numeric variable)
  6. table: Width of the top of the diamond relative to its widest point (numeric variable)
  7. price: Price of the diamond in US dollars, ranging from $326 to $18,823 (numeric variable)
  8. x: Length of the diamond in mm (numeric variable)
  9. y: Width of the diamond in mm (numeric variable)
  10. z: Depth of the diamond in mm (numeric variable)

Diamond color D is known as the best color and J as the worst; however, the str() output shows that the factor levels do not reflect this order, therefore we should redefine the levels of the color variable in the diamonds dataset.

str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
diamonds$color <- factor(diamonds$color, levels = c("J", "I", "H", "G", "F", "E", "D"))

2.3. Data Examination

2.3.1. NA, and Duplicated Values

The usefulness of the results obtained from the analyses depends on the quality of the data. For this reason, it is important to investigate NA values and duplicated rows in the diamonds dataset.

  • There are no NA values.
  • There are no negative values for the price, carat, depth, table, x, y, and z variables.
  • There are 146 duplicated rows, so they should be removed from the dataset. After removing them, no duplicated rows remain and the dataset has 53,794 rows and 10 columns.
# Check for NA values
any(is.na(diamonds))
# Check for negative values in the numeric columns
diamonds %>% filter(price < 0 | carat < 0 | depth < 0 | table < 0 | x < 0 | y < 0 | z < 0) %>% nrow()
# Count duplicated rows, remove them, and verify that none remain
sum(duplicated(diamonds))
diamonds <- unique(diamonds)
sum(duplicated(diamonds))

2.3.2. Accuracy of Values

Unrealistic price values:
The dataset must contain realistic values; in particular, the numeric variables that describe physical characteristics of a diamond should not be negative. Price should be greater than zero, therefore we check the dataset for negative price values. According to the results there are no negative prices, as expected.

Missing x, y, z values:
These values correspond to length, width, and depth, therefore they should be positive. We can impute wrongly filled zero values when the other dimensions are available, since depth is obtained from a formula that includes x, y, and z.

  • There is no row with x equal to 0, where y and z are not equal to 0.
  • There is no row with y equal to 0, where x and z are not equal to 0.
  • However, there are 12 rows with z equal to 0 where x and y are not equal to 0. In other words, we can calculate the z values for these rows from the depth, x, and y values. As a reminder,
     Depth(%) = z / mean(x, y) * 100 = 2 * z / (x + y) * 100
diamonds %>%
  filter(x == 0 & y != 0 & z != 0)
diamonds %>%
  filter(y == 0 & x != 0 & z != 0)
diamonds %>%
  filter(z == 0 & x != 0 & y != 0)
# Impute z row-wise from the depth formula: z = depth * (x + y) / 2 / 100
diamonds <- diamonds %>%
  mutate(z = ifelse(z == 0 & x != 0 & y != 0, round(depth * (x + y) / 2 / 100, 2), z))

We can remove the rows with a 0 value in x, y, or z that cannot be filled by calculation. It can be seen that there are 7 such rows in the dataset.

diamonds %>%
  filter(x == 0 | y == 0 | z == 0)
## # A tibble: 7 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1.07 Ideal     F     SI2      61.6    56  4954     0  6.62     0
## 2  1    Very Good H     VS2      63.3    53  5139     0  0        0
## 3  1.14 Fair      G     VS1      57.5    67  6381     0  0        0
## 4  1.56 Ideal     G     VS2      62.2    54 12800     0  0        0
## 5  1.2  Premium   D     VVS1     62.1    59 15686     0  0        0
## 6  2.25 Premium   H     SI2      62.8    59 18034     0  0        0
## 7  0.71 Good      F     SI2      64.1    60  2130     0  0        0
diamonds<-diamonds %>%
  filter(!(x == 0 | y == 0 | z == 0))

Unrealistic x, y, z values:
In the previous part, we made the possible corrections for dimensions with 0 values. In this part, we will check for unrealistic values with the help of the depth variable.

  • The range of x, [3.73, 10.74], seems normal.
  • The range of y, [3.68, 58.90], does not seem normal: when sorted in decreasing order, the first two values are far greater than the rest.
  • The range of z, [1.07, 31.80], does not seem normal: when sorted in decreasing order, the first value is far greater than the rest.
range(diamonds$x)
range(diamonds$y)
range(diamonds$z)
sort(unique(diamonds$x), decreasing = TRUE)  %>% head(10)
##  [1] 10.74 10.23 10.14 10.02 10.01 10.00  9.86  9.66  9.65  9.54
sort(unique(diamonds$y), decreasing = TRUE)  %>% head(10)
##  [1] 58.90 31.80 10.54 10.16 10.10  9.94  9.85  9.81  9.63  9.59
sort(unique(diamonds$z), decreasing = TRUE)  %>% head(10)
##  [1] 31.80  8.06  6.98  6.72  6.43  6.38  6.31  6.27  6.24  6.17

We can recalculate these abnormal y and z values using the depth formula to decide whether there are any typos with decimals.

  • The calculated y values for these rows are less than 0, which makes no sense, therefore it is suitable to remove them from the dataset.
  • The calculated z value is 1/10 of the value in the dataset, which means there is an error in the decimal writing. Also, the z values of similar rows are around the calculated z value, therefore the z value can be corrected by dividing by 10.
diamonds %>%
  filter(y > 20) %>%
  mutate(calculated_y = (2 * z / depth) / 100 - x) %>%
  select(depth, x, z, y, calculated_y)
## # A tibble: 2 x 5
##   depth     x     z     y calculated_y
##   <dbl> <dbl> <dbl> <dbl>        <dbl>
## 1  58.9  8.09  8.06  58.9        -8.09
## 2  61.8  5.15  5.12  31.8        -5.15
diamonds <- diamonds %>%
  filter(y < 20)
diamonds %>%
  filter(z > 10) %>%
  mutate(calculated_z = depth * (x + y) / 2 / 100) %>%
  select(depth, x, y, z, calculated_z)
## # A tibble: 1 x 5
##   depth     x     y     z calculated_z
##   <dbl> <dbl> <dbl> <dbl>        <dbl>
## 1  61.8  5.12  5.15  31.8         3.17
diamonds$z[diamonds$z == 31.8] = diamonds$z[diamonds$z == 31.8] / 10

Depth deviation from calculated depth:
Since we know the formula for depth (%), we can calculate it and compare it with the depth column in the dataset. While small differences can be ignored, rows with large differences indicate a measurement problem with those diamonds, so they can be removed from the dataset. We can plot depth against calculated depth in order to see the differences.

diamonds$calculated_depth = 2 * diamonds$z / (diamonds$x + diamonds$y) * 100
diamonds %>%
  ggplot(., aes(x = calculated_depth, y = depth)) +
  geom_point(color="seagreen3") + 
  geom_abline(intercept = 0, slope = 1, color="red3", size=2, linetype=2)+
  theme_minimal()+
  labs(title="Calculated Depth vs Dataset Depth",
       subtitle="Diamonds dataset",
       x= "Calculated Depth",
       y="Dataset Depth")

Although most of the points lie close to the line, there are still points far away from it, which we should investigate. We assume that if a point's depth deviates from the calculated depth by more than 5 percentage points, it can be removed from the dataset. This threshold was chosen after reading this link, which shows the depth intervals for different types of diamonds; a depth error of more than about 5 percentage points would lead to a wrong classification. Such points are visible in the plot above.

diamonds <- diamonds %>%
  filter((abs(calculated_depth - depth) < 5))

diamonds <- subset(diamonds, select = -c(calculated_depth))

x, y, and z relationship:
The x, y, and z values correspond to length, width, and depth, respectively. For a physical object these dimensions are generally highly correlated, so we can check the pairwise relationships with scatter plots to see whether there are any unrealistic points. As expected, the values are highly correlated, so we can assume that the remaining x, y, and z values are valid.

diamonds %>%
  ggplot(., aes(x = x, y = y)) +
  geom_point(color="royalblue2") + 
  geom_smooth(color="salmon", size=2, linetype=2)+
  theme_minimal()+
  labs(title="Diamond Length vs Diamond Width",
       subtitle="Diamonds dataset",
       x= "Diamond Length",
       y="Diamond Width")

diamonds %>%
  ggplot(., aes(x = z, y = y)) +
  geom_point(color="royalblue2") + 
  geom_smooth(color="salmon", size=2, linetype=2)+
  theme_minimal()+
  labs(title="Diamond Depth vs Diamond Width",
       subtitle="Diamonds dataset",
       x= "Diamond Depth",
       y="Diamond Width")

diamonds %>%
  ggplot(., aes(x = x, y = z)) +
  geom_point(color="royalblue2") + 
  geom_smooth(color="salmon", size=2, linetype=2)+
  theme_minimal()+
  labs(title="Diamond Length vs Diamond Depth",
       subtitle="Diamonds dataset",
       x= "Diamond Length",
       y="Diamond Depth")

2.4. Summary of Data

After preprocessing, the summary and glimpse of the data can be seen below. The cleaned dataset has 53,749 rows.

summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1592   J: 2799   SI1    :13024   Min.   :50.80  
##  1st Qu.:0.4000   Good     : 4887   I: 5405   VS2    :12220   1st Qu.:61.00  
##  Median :0.7000   Very Good:12065   H: 8258   SI2    : 9136   Median :61.80  
##  Mean   :0.7974   Premium  :13728   G:11249   VS1    : 8150   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21477   F: 9515   VVS2   : 5054   3rd Qu.:62.50  
##  Max.   :5.0100                     E: 9770   VVS1   : 3644   Max.   :79.00  
##                                     D: 6753   (Other): 2521                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 3.730   Min.   : 3.680  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3930   Mean   : 5.731   Mean   : 5.733  
##  3rd Qu.:59.00   3rd Qu.: 5323   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :10.540  
##                                                                   
##        z        
##  Min.   :2.240  
##  1st Qu.:2.910  
##  Median :3.530  
##  Mean   :3.539  
##  3rd Qu.:4.030  
##  Max.   :6.980  
## 
glimpse(diamonds)
## Rows: 53,749
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...

3. Exploratory Data Analysis

3.1. Price

Since the mean price is greater than the median, the distribution of price is right-skewed.
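This can be verified with a quick numeric check (using the cleaned dataset from Section 2); the histogram below then shows the full distribution.

mean(diamonds$price)     # about 3930
median(diamonds$price)   # about 2401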

diamonds%>% ggplot(.,aes(x=price))+
  geom_vline(aes(xintercept=mean(price)),
            color="black", linetype="dashed", size=1)+
  geom_histogram(bins=30, color="tomato3", fill="salmon")+ 
  theme_minimal()+ 
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
  labs(title = "Distribution of Price",
       subtitle = "Diamonds dataset",
       x = "Price")

3.2. Price vs Categoric Variables

3.2.1. Price vs Cut

diamonds %>%
  group_by(cut) %>%
  summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
  mutate(percentage = 100*round(count/sum(count),3))%>%
  arrange((cut)) %>%
  select(cut, count,  percentage, Min_Price, Average_Price, Max_Price ) %>%
  kable(col.names = c("Diamond Cut","Count","Percentage",  "Minimum Price", "Average Price", "Maximum Price"))%>%
  kable_minimal(full_width = F)
Diamond Cut   Count   Percentage   Minimum Price   Average Price   Maximum Price
Fair           1592          3.0             337         4330.16           18574
Good           4887          9.1             327         3916.91           18707
Very Good     12065         22.4             336         3981.30           18818
Premium       13728         25.5             326         4577.45           18823
Ideal         21477         40.0             326         3461.78           18806
  • Most of the diamonds are of the Ideal cut type.
  • The percentages of Premium and Very Good are very close.
  • Only a small fraction of the diamonds are of the Fair cut type.
diamonds%>% ggplot(., aes(cut, price)) + 
  geom_boxplot(aes(fill=factor(cut))) + 
  scale_fill_brewer(palette="Pastel1")+
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  theme_minimal()+
  labs(title="Price vs Cut", 
       subtitle="Diamonds dataset",
       x="Diamond Cut",
       y="Price",
       fill="Diamond Cut")
diamonds%>%
  ggplot(., aes(x=price)) + 
  geom_histogram(binwidth=500, position="dodge", fill="pink2") +
  theme_minimal()+facet_wrap(~cut, ncol=5)+ 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  theme(strip.background = element_rect(fill="pink3"))+
    labs(title="Price Frequency vs Diamond Cut", 
       subtitle="Diamonds dataset",
       x="Price",
       y="Frequency",
       fill="Diamond Cut")

Although the best cut type is Ideal, its average price is the lowest, while according to the average prices the most expensive diamonds belong to the Premium and Fair cut types. These results show that cut alone is not enough to explain the response variable price, since price does not increase as cut quality improves; a quick check of carat weight by cut group (see the sketch below) suggests why.
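A minimal sketch of this check, using the cleaned dataset, compares the average carat weight and average price per cut group; if Ideal stones tend to be lighter and Fair stones heavier, this would explain the pattern in the average prices.

diamonds %>%
  group_by(cut) %>%
  summarise(Average_Carat = round(mean(carat), 2),
            Average_Price = round(mean(price), 2)) %>%
  arrange(cut)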

3.2.2. Price vs Color

diamonds %>%
  group_by(color) %>%
  summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
  mutate(percentage = 100*round(count/sum(count),3))%>%
  arrange((color)) %>%
  select(color, count,  percentage, Min_Price, Average_Price, Max_Price ) %>%
  kable(col.names = c("Diamond Color","Count","Percentage",  "Minimum Price", "Average Price", "Maximum Price"))%>%
  kable_minimal(full_width = F)
Diamond Color   Count   Percentage   Minimum Price   Average Price   Maximum Price
J                2799          5.2             335         5328.11           18710
I                5405         10.1             334         5079.12           18823
H                8258         15.4             337         4474.49           18803
G               11249         20.9             354         3997.28           18818
F                9515         17.7             342         3726.73           18791
E                9770         18.2             326         3080.42           18731
D                6753         12.6             357         3172.01           18693
  • Most of the diamonds are of color G.
  • The percentages of E, F, and G are close.
  • Only a small fraction of the diamonds are of color J.
diamonds %>%
  group_by(color) %>%
  summarise(Average_Price = mean(price)) %>%
  
  ggplot(.,aes(x=color, y = Average_Price, fill= color)) +
    geom_col(color="black") +
    geom_text(aes(label = format(Average_Price, digits = 3)), position = position_dodge(0.9),vjust = 5) +
    geom_line(aes(y = Average_Price), color="black", group=1, size=0.8)+
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 65))+
     scale_fill_brewer(palette="Pastel2")+
   
    scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  labs(title="Price vs Color", 
       subtitle="Diamonds dataset",
       x="Diamond Color",
       y="Average Price",
       fill="Diamond color")
diamonds%>%
  ggplot(., aes(x=price)) + 
  geom_histogram(binwidth=500, position="dodge", fill="palegreen3") +
  theme_minimal()+facet_wrap(~color, ncol=7)+ 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  theme(strip.background = element_rect(fill="seagreen3"))+
    labs(title="Price Frequency vs Diamond Cut", 
       subtitle="Diamonds dataset",
       x="Price",
       y="Frequency",
       fill="Diamond Cut")

Although the best color type is D, its average price is one of the lowest. According to the average prices, the most expensive diamonds belong to the J and I color types, which are actually the two worst color types in this dataset. These results show that color alone is not enough to explain the response variable price, since price does not increase as the color grade improves.

3.2.3. Price vs Clarity

diamonds %>%
  group_by(clarity) %>%
  summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
  mutate(percentage = 100*round(count/sum(count),3))%>%
  arrange((clarity)) %>%
  select(clarity, count,  percentage, Min_Price, Average_Price, Max_Price ) %>%
  kable(col.names = c("Diamond Clarity","Count","Percentage",  "Minimum Price", "Average Price", "Maximum Price"))%>%
  kable_minimal(full_width = F)
Diamond Clarity   Count   Percentage   Minimum Price   Average Price   Maximum Price
I1                  737          1.4             345         3927.30           18531
SI2                9136         17.0             326         5052.29           18804
SI1               13024         24.2             326         3993.19           18818
VS2               12220         22.7             334         3925.98           18823
VS1                8150         15.2             327         3839.59           18795
VVS2               5054          9.4             336         3287.22           18768
VVS1               3644          6.8             336         2523.49           18777
IF                 1784          3.3             369         2870.57           18806
  • Most of the diamonds are of the SI1 clarity type.
  • The percentages of SI2 and SI1 are close.
  • The I1, IF, VVS1, and VVS2 clarity types each account for less than 10% of the diamonds.
  • The majority of the dataset consists of the VS1, VS2, SI1, and SI2 clarity types.
diamonds %>%
  ggplot(., aes(x = clarity, y = price, color = clarity)) +
  scale_color_brewer(palette="Set3")+
  geom_jitter(alpha=0.7) +
  theme_minimal() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title = "Price vs Clarity",
       subtitle = "Diamonds dataset",
       x = "Diamond Clarity",
       y = "Price",
       color = "Diamond Clarity")
diamonds%>%
  ggplot(., aes(x=price)) + 
  geom_histogram(binwidth=500, position="dodge", fill="skyblue3") +
  theme_minimal()+facet_wrap(~clarity, ncol=8)+ 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  theme(strip.background = element_rect(fill="skyblue2"))+
    labs(title="Price Frequency vs Diamond  Clarity", 
       subtitle="Diamonds dataset",
       x="Price",
       y="Frequency",
       fill="Diamond Clarity")

Although the best clarity type is IF, its average price is among the lowest. According to the average prices, the most expensive diamonds belong to the SI2 clarity type, which is actually the second worst clarity type in this dataset. These results show that clarity alone is not enough to explain the response variable price, since price does not increase as clarity improves.

3.3. Price vs Numeric Variables

3.3.1. Price vs Carat

The most frequent carat weight in this dataset is 0.3. The histogram shows the distribution of carat, and the table that follows lists every carat value with its count and price statistics. It can be seen that most of the carat values in this dataset are less than 1.

# Returns the mode (most frequent value) of a vector
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(diamonds$carat)
## [1] 0.3
diamonds%>% ggplot(.,aes(x=carat))+
  geom_vline(aes(xintercept=mean(carat)),
            color="black", linetype="dashed", size=1)+
  geom_histogram(bins=30, color="tomato3", fill="salmon")+ 
  theme_minimal()+ 
  labs(title = "Distribution of Carat",
       subtitle = "Diamonds dataset",
       x = "Carat")

diamonds %>%
  group_by(carat) %>%
  summarise(carat_count = n(),
            minPrice = min(price),
            AveragePrice = mean(price),
            MaxPrice = max(price))%>%
  arrange(desc(carat_count)) %>%
  kable(col.names = c("Carat", "Count","Minimum Price", "Average Price", "Maximum Price")) %>%
  kable_styling("striped", full_width = T) %>%
  scroll_box(width = "100%", height = "300px")
Carat Count Minimum Price Average Price Maximum Price
0.30 2596 339 680.6367 2366
0.31 2238 335 708.1997 1917
1.01 2238 2017 5508.9821 16234
0.70 1980 945 2516.2434 5539
0.32 1827 345 719.1916 1715
1.00 1542 1681 5250.9514 16469
0.90 1484 1599 3939.5889 9182
0.41 1378 467 973.0530 1745
0.40 1291 484 933.0077 1908
0.71 1291 1305 2674.3106 5019
0.50 1254 584 1503.8453 3378
0.33 1179 366 765.1747 1389
0.51 1123 826 1623.5841 4368
0.34 908 413 755.7423 2346
1.02 883 2344 5598.4632 17100
0.52 813 870 1656.4785 3373
1.51 804 3497 10552.0211 18806
1.50 787 2964 10082.9835 18691
0.72 764 945 2705.7788 5086
0.53 706 844 1685.0368 3827
0.42 704 552 994.4545 1722
0.38 669 497 904.1973 1992
0.35 663 409 799.5098 1672
1.20 643 2360 6670.7387 16256
0.54 624 935 1704.3317 4042
0.91 570 1570 3981.1561 9636
0.36 569 410 781.0070 1718
1.03 522 1262 5583.8966 17590
0.55 495 959 1725.2283 3509
0.56 492 953 1831.9980 4632
0.73 492 1049 2791.2480 5154
0.43 486 452 995.2737 1827
1.04 474 2037 5774.6709 18542
1.21 470 2396 6905.4043 17353
2.01 433 5696 14772.7136 18741
0.57 429 929 1798.0443 4032
0.39 397 451 923.4761 1624
0.37 394 451 816.0711 1764
1.52 380 3105 10629.9868 17723
1.06 372 2080 5947.0081 15813
1.05 361 2066 5781.6814 13060
1.07 340 2260 6004.2118 18279
0.74 322 1632 2821.9006 5841
0.58 310 893 1798.6484 3741
1.11 308 3183 6005.5682 15464
1.22 298 2699 6984.3188 14414
0.23 293 326 486.1433 688
1.09 285 2708 6020.0947 18231
0.80 284 1232 3146.1901 6834
0.59 282 903 1831.4504 4916
1.23 277 3168 7290.8881 16253
1.10 275 3216 5921.1345 12748
2.00 261 5051 14107.1456 18818
0.24 254 336 505.1850 963
0.26 253 337 550.8972 814
0.77 251 1651 2884.3984 5251
0.76 250 1140 2891.4520 5056
0.75 249 1013 2763.4016 5288
1.12 249 2854 6058.9116 14407
1.08 246 2869 5774.3415 14032
1.13 246 2968 6193.4024 14525
1.24 235 2940 7147.1149 15026
0.27 233 361 574.7597 893
0.60 226 806 1956.2434 4209
0.92 226 2283 4035.2124 7544
1.53 219 5631 10703.4658 17223
1.70 215 5617 12110.3209 18718
0.25 212 357 550.9245 1186
0.44 212 648 1056.6226 2506
1.14 206 2327 6354.4078 18112
0.61 204 931 2069.2451 4279
0.81 199 1687 3091.5126 6338
0.28 198 360 580.1212 828
0.78 187 1720 2918.9091 4904
1.25 187 3276 7123.2513 16823
0.46 177 746 1240.2429 2139
2.02 176 6346 14767.4091 18731
1.54 174 4492 10865.0287 18416
1.16 172 2368 6274.2965 13595
0.79 151 1809 2993.2649 4858
1.15 148 3220 5962.5811 12895
1.26 146 3881 7490.7466 15247
0.93 142 2374 3994.2958 5992
0.82 140 2033 3083.6500 4931
0.62 135 933 1973.3704 3968
1.27 134 2845 7119.1194 15377
1.31 132 3697 7750.6364 17496
0.83 131 1863 3181.9924 7889
0.29 130 334 601.1923 1776
1.19 126 2892 6483.2381 15767
1.18 123 3219 6481.1057 14847
1.55 123 4965 10717.7642 18188
2.03 122 6002 14710.1148 18781
1.30 121 2512 7851.6612 15874
1.71 119 6320 12396.5546 18791
0.45 110 672 1093.3182 1833
1.17 110 2336 6224.8818 11886
1.56 108 5766 10205.9444 17455
1.28 106 3355 7711.8113 18700
1.57 105 5170 10444.5143 17548
0.96 103 1637 3623.4660 6807
0.63 102 1065 2244.7843 6607
1.29 101 2596 7033.0198 17932
0.47 99 719 1272.3131 2460
1.60 95 6272 11124.2526 18780
1.32 89 4140 7639.3708 15802
1.58 89 4459 11427.9888 18057
1.59 89 6010 10574.3596 17552
1.33 87 4634 7527.0805 18435
2.04 86 7403 14840.1744 18795
0.64 80 1080 2113.5875 4281
1.35 77 4082 7448.2857 14616
1.34 68 4352 7758.5000 17663
0.65 65 1214 2322.6154 5052
0.95 65 2195 4013.6615 6951
2.05 65 7026 15647.2462 18787
0.84 64 1749 3294.9844 6399
1.61 64 6915 10992.8438 18318
0.48 63 841 1322.0476 2677
0.85 62 1250 3204.1774 7230
1.62 61 5157 11705.8852 18281
0.94 59 2571 4195.7797 7120
0.97 59 1547 4026.2373 7415
2.06 59 5203 14348.6780 18779
1.72 57 5240 12040.3158 18730
1.73 52 6007 11875.9615 18377
2.10 51 6597 14464.3725 18648
1.36 50 4158 8591.5200 15386
1.40 50 4939 9108.5200 16808
1.63 50 7100 11556.3200 18179
1.75 50 7276 12673.7000 17343
2.07 50 6503 15384.2800 18804
0.66 48 1188 2235.7292 4311
0.67 48 1244 2067.8542 3271
2.14 48 5405 14905.0417 18528
1.37 46 5063 8044.0870 13439
0.49 45 900 1306.3333 2199
2.09 45 10042 15287.3111 18640
1.64 43 4849 10738.0000 17604
2.11 43 7019 14563.4186 18575
2.08 41 6150 15128.6829 18760
1.74 40 4677 12270.3250 18430
1.41 39 4145 9452.2051 17216
1.39 36 3914 8335.1111 13061
0.86 34 1828 3289.9706 5029
1.65 32 5914 10973.8750 17729
0.87 31 1827 3342.3226 5640
0.98 31 1712 3469.7742 4950
2.20 31 7934 15226.8710 18364
1.66 30 5310 11084.5000 16294
2.18 30 10340 14901.7000 18717
1.76 28 9053 12398.5714 18178
2.22 27 5607 14691.5185 18706
0.69 26 1636 2206.6154 3784
1.38 26 4598 8164.3077 17598
0.68 25 1341 2334.3200 4325
1.42 25 6069 9487.2000 18682
1.67 25 7911 12281.4800 17351
2.12 25 7508 15007.6800 18656
2.16 25 8709 15860.4800 18678
1.69 24 5182 11878.0833 17803
0.88 23 1964 3431.3478 5595
0.99 23 1789 4406.1739 6094
2.21 23 6535 15287.9565 18483
2.15 22 5430 14187.1818 18791
2.19 22 10179 15521.0909 18693
0.89 21 1334 3291.2381 4854
1.47 21 4898 7833.7143 12547
2.13 21 12356 15220.4286 18442
1.80 20 7862 12955.1500 17273
2.28 20 9068 15422.9500 18055
2.30 20 7226 14861.5500 18304
1.43 19 6086 9234.9474 14429
1.68 19 5765 11779.1053 16985
1.44 18 4064 8021.6111 12093
1.46 18 5421 8237.4444 12261
1.83 18 5083 11930.0000 18525
2.17 18 6817 14416.6111 17805
1.77 17 7771 11103.1765 15278
2.29 17 11502 15834.7647 18823
2.51 17 14209 16616.8824 18308
2.24 16 10913 15573.9375 18299
2.25 16 6653 13403.5625 18034
2.32 16 6860 16042.1875 18532
2.50 16 7854 14316.1875 18447
1.45 15 4320 8709.3333 17545
1.79 15 9122 13144.0667 18429
2.26 15 11226 15330.2000 18426
1.82 13 5993 12986.1538 15802
2.23 13 7006 13947.4615 18149
2.31 13 7257 15017.3846 17715
2.40 13 11988 16362.4615 18541
0.20 12 345 365.1667 367
1.78 12 8889 12504.5833 18128
1.91 12 6632 12365.8333 17509
2.27 12 5733 14198.4167 18343
3.01 12 8040 15521.6667 18710
1.49 11 5407 9174.2727 18614
0.21 9 326 380.2222 394
1.81 9 5859 11374.4444 14177
1.86 9 9791 12164.3333 17267
2.33 9 8220 14089.7778 18119
2.48 9 12883 16537.1111 18692
2.52 9 10076 16242.1111 18252
2.36 8 11238 15756.7500 18745
2.38 8 10716 15481.3750 18559
2.53 8 14659 16548.8750 18254
2.54 8 12095 15125.2500 17996
3.00 8 6512 12378.5000 16970
1.48 7 6216 8791.2857 15164
1.87 7 9955 12933.5714 17761
1.90 7 8576 11477.1429 13919
2.35 7 14185 16090.4286 17999
2.39 7 15917 17182.4286 17955
2.42 7 16198 16923.5714 18615
1.93 6 13278 16828.1667 18306
2.37 6 14837 16525.1667 18508
2.43 6 9716 15356.0000 18692
0.22 5 337 391.4000 470
1.98 5 12923 14667.8000 16171
2.34 5 8491 11922.0000 15818
2.41 5 13563 15808.2000 17923
1.84 4 7922 11108.5000 13905
1.88 4 8048 10946.2500 13607
1.89 4 10055 13969.0000 17553
1.96 4 5554 8642.2500 13099
1.97 4 14410 15966.0000 17535
2.44 4 13027 16365.5000 18430
2.45 4 11830 14593.7500 18113
1.85 3 5688 11436.0000 14375
1.94 3 9344 14189.3333 18735
1.95 3 5045 12340.6667 17374
1.99 3 11486 14179.0000 17713
2.46 3 10470 14227.0000 16466
2.47 3 14970 15644.0000 16532
2.49 3 6289 13843.0000 18325
2.55 3 14351 15964.0000 18766
2.56 3 15231 16723.3333 17753
2.57 3 17116 17841.6667 18485
2.58 3 12500 14481.3333 16195
2.60 3 17027 17535.0000 18369
2.61 3 13784 16583.0000 18756
2.63 3 10437 12659.6667 16914
2.72 3 6870 12088.3333 17801
2.74 3 8807 14385.0000 17184
1.92 2 15364 15418.0000 15472
2.66 2 16239 17367.0000 18495
2.68 2 8419 9042.0000 9665
2.75 2 13156 14285.5000 15415
3.04 2 15354 16956.5000 18559
4.01 2 15223 15223.0000 15223
2.59 1 16465 16465.0000 16465
2.64 1 17407 17407.0000 17407
2.65 1 16314 16314.0000 16314
2.67 1 18686 18686.0000 18686
2.70 1 14341 14341.0000 14341
2.71 1 17146 17146.0000 17146
2.77 1 10424 10424.0000 10424
2.80 1 15030 15030.0000 15030
3.02 1 10577 10577.0000 10577
3.05 1 10453 10453.0000 10453
3.11 1 9823 9823.0000 9823
3.22 1 12545 12545.0000 12545
3.24 1 12300 12300.0000 12300
3.40 1 15964 15964.0000 15964
3.50 1 12587 12587.0000 12587
3.51 1 18701 18701.0000 18701
3.65 1 11668 11668.0000 11668
3.67 1 16193 16193.0000 16193
4.00 1 15984 15984.0000 15984
4.13 1 17329 17329.0000 17329
4.50 1 18531 18531.0000 18531
5.01 1 18018 18018.0000 18018

3.3.2. Price vs x, y, z, and Depth

Price is highly related to x, y, and z, as can be seen in the scatter plots below. Since depth is derived from these variables by a formula, a separate plot for depth is not presented in this part.
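The strength of these relationships can also be quantified numerically; a short sketch on the cleaned dataset:

round(cor(diamonds[, c("price", "carat", "x", "y", "z")]), 2)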

ggplot(diamonds, aes(x, price,)) +
  geom_point(alpha = 0.5, color="seagreen3") +
  geom_smooth(method="loess", se=F,color="red3", size=2, linetype=2 ) +
  theme_minimal()+
  labs(title = "Price vs Diamond Length",
       subtitle = "Diamonds dataset",
       x = "Diamond Length",
       y = "Price")

ggplot(diamonds, aes(y, price,)) +
  geom_point(alpha = 0.5, color="seagreen3") +
  geom_smooth(method="loess", se=F,color="red3", size=2, linetype=2 ) +
  theme_minimal()+
  labs(title = "Price vs Diamond Width",
       subtitle = "Diamonds dataset",
       x = "Diamond Width",
       y = "Price")

ggplot(diamonds, aes(z, price,)) +
  geom_point(alpha = 0.5, color="seagreen3") +
  geom_smooth(method="loess", se=F,color="red3", size=2, linetype=2 ) + 
  theme_minimal()+
  labs(title = "Price vs Diamond Depth",
       subtitle = "Diamonds dataset",
       x = "Diamond Depth",
       y = "Price")

3.4. Between Variables

In order to explore the relationship between the 4Cs and price, the following plots can be examined.

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +scale_color_brewer(palette="Set1")+
  geom_smooth(method="loess", se=F) + 
  labs(title = "Price vs Carat Grouped by Clarity",
       subtitle = "Diamonds dataset",
       x = "Diamond Carat",
       y = "Price",
       color = "Diamond Clarity")

ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +scale_color_brewer(palette="Set1")+
  geom_smooth(method="loess", se=F) + 
  labs(title = "Price vs Carat Grouped by Cut",
       subtitle = "Diamonds dataset",
       x = "Diamond Carat",
       y = "Price",
       color = "Diamond Cut")

ggplot(diamonds, aes(carat, price, color = color)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +scale_color_brewer(palette="Set1")+
  geom_smooth(method="loess", se=F) + 
  labs(title = "Price vs Carat Grouped by Color",
       subtitle = "Diamonds dataset",
       x = "Diamond Carat",
       y = "Price",
       color = "Diamond Color")

  • When comparing diamonds of equal carat weight, the IF clarity type is the most expensive, as expected.
  • When comparing diamonds of equal carat weight, the Ideal cut type is the most expensive, as expected.
  • When comparing diamonds of equal carat weight, the D color group is the most expensive, as expected.
  • It can be said that while color, cut, and clarity alone are not meaningful for estimating price, they gain meaning together with carat information (see the sketch below).
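As a small check of the last point, we can hold carat roughly constant and look at average prices per color grade; within a narrow carat band, the better colors are expected to be the more expensive ones. A minimal sketch, assuming the cleaned dataset from Section 2:

diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%   # hold carat roughly constant
  group_by(color) %>%
  summarise(count = n(), Average_Price = round(mean(price), 2))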

4. Further Analyses

Some possible analyses may include Clustering and Principal Component Analysis for our dataset.

  • By clustering, we can add a new column identifying the cluster of each row.
  • By PCA, we can reduce the dimensionality of the data and use fewer variables. This method is also useful for dealing with multicollinearity.

Before proceeding with these methods, we should divide the dataset into a training set and a test set. Since our factor variables, namely cut, clarity, and color, are ordered, we can treat them as numeric values.

set.seed(503)
# Stratified 20% sample (by cut, color, clarity) as the test set
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>% 
    group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()

# Remaining rows form the training set
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()), 
    diamonds_test, by = "diamond_id")

# Drop the helper id column and treat the ordered factors as numeric
diamonds_train = diamonds_train[, c(-ncol(diamonds_train))]
diamonds_test = diamonds_test[, c(-ncol(diamonds_test))]
diamonds_train$cut <- as.numeric(diamonds_train$cut)
diamonds_train$clarity <- as.numeric(diamonds_train$clarity)
diamonds_train$color <- as.numeric(diamonds_train$color)
diamonds_test$cut <- as.numeric(diamonds_test$cut)
diamonds_test$clarity <- as.numeric(diamonds_test$clarity)
diamonds_test$color <- as.numeric(diamonds_test$color)

4.1. Principal Component Analysis(PCA)

Principal Component Analysis is suitable for numeric variables, therefore we select this type of variable from the dataset. It can be seen from the summary output that four components are enough to explain about 88% of the variance. We add these components to both the train and test datasets for the following price estimation models.

diamonds_pca <- princomp(as.matrix(diamonds_train[,sapply(diamonds_train, class) == "numeric"]),cor=T)
summary(diamonds_pca,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     2.0742952 1.1892966 1.1065552 0.9926732 0.83121848
## Proportion of Variance 0.4780779 0.1571585 0.1360516 0.1094889 0.07676935
## Cumulative Proportion  0.4780779 0.6352364 0.7712880 0.8807769 0.95754622
##                            Comp.6      Comp.7       Comp.8       Comp.9
## Standard deviation     0.59218546 0.173206533 0.0320924547 1.923472e-02
## Proportion of Variance 0.03896485 0.003333389 0.0001144362 4.110829e-05
## Cumulative Proportion  0.99651107 0.999844456 0.9999588917 1.000000e+00
## 
## Loadings:
##         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## carat    0.471                       0.126         0.868              
## cut     -0.109  0.595  0.451  0.241 -0.175  0.584                     
## color   -0.158 -0.202         0.802  0.538                            
## clarity -0.222  0.208  0.215 -0.506  0.778                            
## depth           0.209 -0.827         0.118  0.492                0.101
## table    0.129 -0.698  0.227 -0.181         0.641                     
## x        0.475                       0.115        -0.279 -0.715  0.397
## y        0.474                       0.120        -0.297  0.699  0.410
## z        0.474  0.103                0.132        -0.283        -0.815
pca_results <- predict(diamonds_pca, newdata = diamonds_train)
diamonds_train$pca1 = pca_results[,1]
diamonds_train$pca2 = pca_results[,2]
diamonds_train$pca3 = pca_results[,3]
diamonds_train$pca4 = pca_results[,4]

pca_results_test = predict(diamonds_pca, newdata = diamonds_test)
diamonds_test$pca1 = pca_results_test[,1]
diamonds_test$pca2 = pca_results_test[,2]
diamonds_test$pca3 = pca_results_test[,3]
diamonds_test$pca4 = pca_results_test[,4]

4.2. Clustering

There are different kinds of clustering methods that could be used to create a new feature; in this assignment K-means is used. First of all, scaling should be applied, since a column with much larger values than the others may dominate the results. To scale the data, we find the minimum and maximum values on the training set and then scale both the training and test sets with the same values.

# Min-max scaling: compute the minimum and maximum on the training set only
min_vals = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], min)
max_vals = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], max)
diamonds_train_scale <- sweep(sweep(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, min_vals), 2, max_vals - min_vals, "/")

To decide on the number of clusters, we can run K-means in a loop for 1 to 15 centers and then select a suitable number of centers from the results.

errors = c()
for (i in (1:15)){
  set.seed(1357) 
  cluster<-kmeans(diamonds_train_scale,centers=i) 
  # store the percentage of variance explained (between_SS / total_SS)
  errors = c(errors, 100*round(1 - (cluster$tot.withinss/cluster$totss), digits = 3))
  }

errors_df <- data.frame(x=c(1:15), y=errors) 

ggplot(errors_df, aes(x=x, y=y)) +
  geom_point(color = "firebrick2") +
  geom_line(color="firebrick3") +
  geom_text(aes(label = errors), size=3, color = "black", position = position_stack(vjust = 0.95))+
  theme_minimal() +
  labs(title = "Center Number vs R-Square",
       subtitle = "Diamonds dataset",
       x = "Cluster Number",
       y = "R-Square")

The increase in R-square slows down noticeably after 8 centers, so we select the model with 8 centers.

set.seed(1357)
best_cluster = kmeans(diamonds_train_scale,centers=8)
diamonds_train$cluster = as.factor(best_cluster$cluster)

Now, we need to apply clustering process to the test dataset.

diamonds_test_scale <- sweep(sweep(diamonds_test[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, min_vals), 2, max_vals - min_vals, "/")

# Assign each (scaled) test observation to the nearest trained K-means center,
# measured by Euclidean distance
closest.cluster <- function(x) {
  cluster.dist <- apply(best_cluster$centers, 1, function(y) sqrt(sum((x-y)^2)))
  return(which.min(cluster.dist)[1])
}
diamonds_test$cluster <- as.factor(apply(diamonds_test_scale, 1, closest.cluster))
diamonds_train %>%
  ggplot(., aes(x = cluster, y = price, color = cluster)) +
  scale_color_brewer(palette="Set2")+
  geom_jitter(alpha=0.7) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title = "Price vs Clusters",
       subtitle = "Diamonds train dataset",
       x = "Diamond Cluster",
       y = "Price",
       color = "Diamond Cluster")

diamonds_test %>%
  ggplot(., aes(x = cluster, y = price, color = cluster)) +
  scale_color_brewer(palette="Set2")+
  geom_jitter(alpha=0.7) +
  theme_minimal() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
  labs(title = "Price vs Clusters",
       subtitle = "Diamonds test dataset",
       x = "Diamond Cluster",
       y = "Price",
       color = "Diamond Cluster")

5. Price Estimation Models

A correlation matrix can be examined to get an idea about the relationships between the numeric variables, therefore a correlogram is illustrated below. We can say that the carat, x, y, and z columns are highly correlated with the price column.

# Drop the pca1-pca4 and cluster columns (11:15) and keep the original numeric variables
diamonds_cor<-cor(diamonds_train[-c(11:15)])
corrplot(diamonds_cor, method="number")

5.1. Generalized Linear Model

There are four assumptions to be fulfilled in a linear model:

  1. Linearity Assumption
  2. Constant Variance Assumption
  3. Independence Assumption
  4. Normality Assumption

Since our dataset does not fulfill these assumptions, we will construct a generalized linear model, which is more flexible than the standard linear model in terms of its assumptions. The response variable, price, is continuous, so a Gamma or Gaussian family may fit. The lowest AIC value is obtained with the Gamma family and the identity link function; a sketch of how such checks and comparisons can be made is given below.
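The following is an illustrative sketch only, not the exact code behind the reported result: plot() on an lm object gives the standard diagnostic plots for the linearity, constant variance, and normality assumptions, and AIC() compares candidate GLM families. The log link is used for the Gamma fit here because it converges without start values; the reported model (Gamma with identity link) is fit with explicit start values in the next chunk.

# Diagnostic plots for an ordinary linear model: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
model_lm <- lm(price ~ carat + cut + color + clarity + depth + table + x + y + z,
               data = diamonds_train)
par(mfrow = c(2, 2))
plot(model_lm)

# Compare candidate GLM families by AIC (illustrative sketch)
model_gaussian <- glm(price ~ carat + cut + color + clarity + depth + table + x + y + z,
                      data = diamonds_train, family = gaussian(link = "identity"))
model_gamma_log <- glm(price ~ carat + cut + color + clarity + depth + table + x + y + z,
                       data = diamonds_train, family = Gamma(link = "log"))
AIC(model_gaussian, model_gamma_log)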

# Explicit start values are supplied to help the Gamma family with the identity link converge
model_glm <- glm(price ~ carat+cut+color+clarity+depth+table+x+y+z+cluster, data = diamonds_train, family = Gamma(link = "identity"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5, 0.5,0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm)
## 
## Call:
## glm(formula = price ~ carat + cut + color + clarity + depth + 
##     table + x + y + z + cluster, family = Gamma(link = "identity"), 
##     data = diamonds_train, start = c(0.5, 0.5, 0.5, 0.5, 0.5, 
##         0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 
##         0.5))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0649  -0.1433  -0.0057   0.1532  10.2150  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.052e+04  6.219e+02 -33.003  < 2e-16 ***
## carat        7.241e+03  4.341e+01 166.797  < 2e-16 ***
## cut         -6.238e+00  3.367e+00  -1.853 0.063928 .  
## color        3.589e+01  1.661e+00  21.612  < 2e-16 ***
## clarity      8.070e+01  1.300e+00  62.065  < 2e-16 ***
## depth        3.433e+02  9.938e+00  34.548  < 2e-16 ***
## table       -6.750e+00  9.723e-01  -6.942 3.92e-12 ***
## x            3.884e+03  8.192e+01  47.409  < 2e-16 ***
## y            1.082e+03  7.990e+01  13.543  < 2e-16 ***
## z           -8.913e+03  2.165e+02 -41.174  < 2e-16 ***
## cluster2    -2.341e+01  6.809e+00  -3.438 0.000586 ***
## cluster3    -7.069e+01  6.950e+00 -10.171  < 2e-16 ***
## cluster4    -8.468e+01  8.813e+00  -9.609  < 2e-16 ***
## cluster5    -2.753e+02  2.304e+01 -11.950  < 2e-16 ***
## cluster6    -6.137e+01  9.056e+00  -6.777 1.24e-11 ***
## cluster7    -2.242e+02  2.195e+01 -10.216  < 2e-16 ***
## cluster8     8.039e+02  2.382e+01  33.742  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.0868286)
## 
##     Null deviance: 42126.5  on 42990  degrees of freedom
## Residual deviance:  2908.6  on 42974  degrees of freedom
## AIC: 676132
## 
## Number of Fisher Scoring iterations: 25

The model can be improved by rearranging the explanatory variables, for example by adding interaction and polynomial terms.

model_glm2 <- glm(price ~ carat*color+carat*clarity+I(carat^2)+cluster+y+depth, data = diamonds_train, family = Gamma(link = "identity"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5, 0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm2)
## 
## Call:
## glm(formula = price ~ carat * color + carat * clarity + I(carat^2) + 
##     cluster + y + depth, family = Gamma(link = "identity"), data = diamonds_train, 
##     start = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 
##         0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.23761  -0.10711  -0.00628   0.09001   2.68079  
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    2860.8797   106.3588   26.898  < 2e-16 ***
## carat         -3616.5228   101.2272  -35.727  < 2e-16 ***
## color          -112.0154     1.9102  -58.642  < 2e-16 ***
## clarity        -184.5964     1.7927 -102.970  < 2e-16 ***
## I(carat^2)     3886.1793    31.1401  124.797  < 2e-16 ***
## cluster2         21.9462     3.6092    6.081 1.21e-09 ***
## cluster3        -17.1453     3.8160   -4.493 7.04e-06 ***
## cluster4        -78.7454     3.3094  -23.795  < 2e-16 ***
## cluster5        459.0441    13.3378   34.417  < 2e-16 ***
## cluster6        -43.3751     3.5333  -12.276  < 2e-16 ***
## cluster7        329.4331    11.7776   27.971  < 2e-16 ***
## cluster8        511.6552    11.7190   43.660  < 2e-16 ***
## y              -245.2199    18.9533  -12.938  < 2e-16 ***
## depth           -13.3575     0.9476  -14.096  < 2e-16 ***
## carat:color     486.7423     4.6738  104.142  < 2e-16 ***
## carat:clarity   794.7105     4.6681  170.243  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.02385886)
## 
##     Null deviance: 42126.5  on 42990  degrees of freedom
## Residual deviance:  1022.2  on 42975  degrees of freedom
## AIC: 630859
## 
## Number of Fisher Scoring iterations: 25

To give an example using the principal components, the following model can be constructed.

model_glm_pca = glm(price ~ pca1 + pca2 + pca3 + pca4 + cluster, data = diamonds_train, family = Gamma(link = "sqrt"))
summary(model_glm_pca)
## 
## Call:
## glm(formula = price ~ pca1 + pca2 + pca3 + pca4 + cluster, family = Gamma(link = "sqrt"), 
##     data = diamonds_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.12715  -0.24877  -0.04553   0.16452   1.95690  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 55.02880    0.11968  459.78   <2e-16 ***
## pca1        11.87183    0.03733  318.03   <2e-16 ***
## pca2         3.37035    0.03747   89.96   <2e-16 ***
## pca3         2.25217    0.04057   55.51   <2e-16 ***
## pca4        -0.86510    0.05872  -14.73   <2e-16 ***
## cluster2     6.62809    0.12793   51.81   <2e-16 ***
## cluster3    -3.24163    0.15406  -21.04   <2e-16 ***
## cluster4     3.01052    0.14841   20.29   <2e-16 ***
## cluster5    -6.77135    0.25980  -26.06   <2e-16 ***
## cluster6     1.73995    0.16787   10.37   <2e-16 ***
## cluster7    -2.82795    0.26942  -10.50   <2e-16 ***
## cluster8     7.34767    0.21746   33.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.1129233)
## 
##     Null deviance: 42126.5  on 42990  degrees of freedom
## Residual deviance:  4544.4  on 42979  degrees of freedom
## AIC: 695577
## 
## Number of Fisher Scoring iterations: 11

PCA helps us use fewer features in a model. As expected, however, with fewer features the AIC of the model increases, which means its explanatory power decreases. Since we do not have very many features, we do not need to reduce their number. We could also use all of the principal components instead of the original variables to deal with multicollinearity; however, that model is not investigated in this study. For the best GLM, we can calculate the R^2 value on the test set as follows:

model_glm2_prediction<-predict(model_glm2, newdata=diamonds_test)
model_glm2_r2 <- 1 - (sum((model_glm2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_glm2_r2
## [1] 0.9053068

To visually see the prediction vs actual price values, we can check the below plot.

pred_vs_real_glm <- as.data.frame(cbind(model_glm2_prediction, diamonds_test$price))
colnames(pred_vs_real_glm) = c("prediction", "actual")

pred_vs_real_glm %>%
  ggplot(aes(prediction, actual)) +
  geom_point(color="seagreen3")  +
  theme_minimal()+
  geom_abline(intercept = 0, slope = 1, color="red3", size=2, linetype=2) +
  labs(title = "Prediction vs Actual Values for the Generalized Linear Model",
       subtitle = "Diamonds dataset",
       x = "Predictions",
       y = "Real Values")

5.2. Classification and Regression Tree

model_rpart <- rpart(price~., data=diamonds_train)
fancyRpartPlot(model_rpart, sub="CART Model")

From the plot of the tree, we can see that the nodes are split using only the carat, y, and clarity columns, which means these features are the most effective at reducing the variance in the dataset. We can cross-validate the cp (complexity parameter) argument with the following code:

tr.control = trainControl(method = "cv", number = 10)
cp.grid = expand.grid( .cp = (1:10)*0.001)
tr = train(price~. - pca1 - pca2 - pca3 - pca4, data = diamonds_train, method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
tr
## CART 
## 
## 42991 samples
##    14 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 38691, 38693, 38694, 38691, 38692, 38691, ... 
## Resampling results across tuning parameters:
## 
##   cp     RMSE      Rsquared   MAE     
##   0.001   950.902  0.9430052  580.0119
##   0.002  1036.373  0.9322755  628.2955
##   0.003  1087.797  0.9253495  656.4300
##   0.004  1137.831  0.9182846  685.5459
##   0.005  1145.996  0.9171573  688.0164
##   0.006  1245.355  0.9021746  811.5315
##   0.007  1279.303  0.8967777  846.7474
##   0.008  1279.303  0.8967777  846.7474
##   0.009  1279.303  0.8967777  846.7474
##   0.010  1362.123  0.8829236  883.8798
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.001.

Comparing the cp values, cp = 0.001 gives the lowest RMSE, so we use it.

model_rpart2 <- rpart(price~. - pca1 - pca2 - pca3 - pca4, data=diamonds_train, cp = 0.001)
fancyRpartPlot(model_rpart2, cex=0.4, sub="CART Model")

This is a more detailed tree. From its plot, we can say that the price is predicted from the carat, y, clarity, and color values. To compare the two trees, we need to calculate their R^2 values on the test set.
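For the first tree, the same calculation can be sketched as follows; the corresponding value for the detailed tree is computed in the chunk after it.

model_rpart_prediction <- predict(model_rpart, newdata = diamonds_test)
model_rpart_r2 <- 1 - (sum((model_rpart_prediction - diamonds_test$price)^2) /
                         sum((diamonds_test$price - mean(diamonds_train$price))^2))
model_rpart_r2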

model_rpart2_prediction<-predict(model_rpart2,newdata=diamonds_test)
model_rpart2_r2 <- 1 - (sum((model_rpart2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_rpart2_r2
## [1] 0.944111

The R^2 value for the detailed tree is calculated above; it is higher than the 0.905 obtained by the best generalized linear model. We can therefore conclude that the CART model performs better than the generalized linear model on this dataset.

6. Conclusion

The steps during the assignment can be summarized as below:

  • Preprocessing of the data by checking the accuracy of the variables, NA values, and duplicated rows
  • Exploratory data analysis and visualization of the data
  • Principal component analysis and the K-means algorithm
  • Generalized linear model construction
  • Classification and regression tree model construction
  • Comparison between the models

Important results:

  • Data preprocessing is important, since the quality of the data may affect the estimation of price and mislead us.
  • Exploratory data analysis is useful for understanding the data before constructing models; however, relying on EDA alone may mislead us about the relationships between variables.
  • K-means is a useful algorithm for creating new features.
  • Principal component analysis helps us reduce the number of variables when there are very many of them; PCA is also useful for handling multicollinearity in models.
  • Linear models are widely used, but when a dataset does not fulfill their assumptions we cannot use them; in those cases generalized linear models, with their more flexible assumptions, are convenient.
  • The best CART tree can be found by tuning the cp value.