Exploratory Data Analysis and Price Estimation

1. Introduction

1.1. Diamonds

Diamonds are the result of carbon exposed to extremely high heat and pressure, deep in the earth. This process results in a variety of both internal and external characteristics (i.e. size, shape, color, etc.) which makes every diamond unique. Therefore, the jewelry industry uses a systematic way to evaluate and discuss these factors. Diamond professionals use the grading system developed by GIA in the 1950s, which established the use of four important factors to describe and classify diamonds: Clarity, Color, Cut, and Carat Weight. These are known as the 4Cs. When used together, they describe the quality of a finished diamond, and the value of a finished diamond is based on this combination. The explanation of the 4Cs can be seen below:

  • Color: Most commercially available diamonds are classified by color, or more appropriately, the most valuable diamonds are those classified as colorless. Color is graded on a letter scale from D to Z, with D representing a colorless diamond. Diamonds that are graded in the D-F range are the rarest and consequently most valuable. In reality, diamonds in the G-K range have such a small amount of color that the untrained eye can not see it.
  • Cut: The beauty of a diamond resides not only in a favorable body color, but also the high refractive index and color dispersion. The way a diamond is cut, its width, depth, roundness, size and position of the facets determine the brilliance of the stone. Even if the color and clarity are perfect, if the diamond is not cut to good proportions, it will be less impressive to the eye.
  • Clarity: The clarity of a diamond is determined by the number, location and type of inclusions it contains. Inclusions can be microscopic cracks, mineral deposits or external markings. Clarity is rated using a scale which contains a combination of letters and numbers to signify the amount and type of inclusions. This scale ranges from FL to I3, FL being Flawless and the most valuable.
  • Carat Weight: In addition to color, clarity, and cut, weight provides a further basis for the valuation of a diamond. The weight of diamonds is measured in carats; one carat is equal to 1/5 of a gram. Smaller diamonds are more readily available than larger ones, so value per carat rises with size. A 2 carat diamond will therefore cost more than twice as much as a comparable 1 carat diamond: the larger the diamond, the rarer it becomes, and the price increases disproportionately.

The diamonds dataset that we use in this report consists of prices and quality information for about 54,000 diamonds and is included in the ggplot2 package. The dataset contains the price of each diamond (in 2008) as well as various attributes, some of which are known to influence price. These attributes consist of the 4Cs and some physical measurements such as depth, table, x, y, and z.
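Since the dataset ships with the ggplot2 package (loaded later as part of the tidyverse), it can be accessed directly; a minimal sketch:

library(ggplot2)
data("diamonds", package = "ggplot2")
dim(diamonds)   # 53940 rows and 10 columns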

1.2. Objectives

In this assignment, the data will first be preprocessed to improve its quality. Then an exploratory data analysis will be performed through data manipulation and data visualization steps. The main purpose is to identify which variables affect the price of diamonds the most and to draw conclusions about the relationships between variables. In addition, forecasting models will be studied in order to estimate the price of diamonds with given properties. The steps of the assignment can be listed as below:

  1. Data Preprocessing
  2. Data Manipulation
  3. Data Visualization
  4. Forecasting

2. Data Preprocessing

2.1. Used Packages

library(tidyverse)
library(knitr)
library(tinytex)
library(kableExtra)
library(corrplot)
library(RColorBrewer)
library(ggExtra)
library(rpart)
library(rpart.plot)
library(rattle)
library(data.table)
library(caret)
library(e1071)

The packages used during the assignment can be listed as below:

  1. tidyverse
  2. knitr
  3. tinytex
  4. kableExtra
  5. corrplot
  6. RColorBrewer
  7. ggExtra
  8. rpart
  9. rpart.plot
  10. rattle
  11. data.table
  12. caret
  13. e1071

2.2. Variables

Before starting the analysis with the dataset, it will be useful to have information about the properties of diamonds. There are 10 columns regarding the properties of diamonds.

  1. carat: Weight of the diamond (numeric variable)
  2. cut: Quality of the cut: Fair, Good, Very Good, Premium, Ideal (categorical variable)
  3. color: Color of the diamond, from D (best) to J (worst) (categorical variable)
  4. clarity: A measurement of how clear the diamond is: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best) (categorical variable)
  5. depth: Total depth percentage, i.e. z / mean(x, y) = 2 * z / (x + y), expressed as a percentage (numeric variable)
  6. table: Width of the top of the diamond relative to its widest point (numeric variable)
  7. price: Price of the diamond in US dollars, ranging from $326 to $18,823 (numeric variable)
  8. x: Length of the diamond in mm (numeric variable)
  9. y: Width of the diamond in mm (numeric variable)
  10. z: Depth of the diamond in mm (numeric variable)

Diamond color D is known as the best color and J as the worst; however, the str() output shows that the factor levels do not reflect this order, therefore we should redefine the levels of the color variable in the diamonds dataset.

str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
diamonds$color <- factor(diamonds$color, levels = c("J", "I", "H", "G", "F", "E", "D"))

2.3. Data Examination

2.3.1. NA, and Duplicated Values

The usefulness of the results obtained from the analyses depends on the quality of the data. For this reason, it is important to investigate NA values and duplicated rows in the diamonds dataset.

  • There are no NA values.
  • There are no negative values for the price, carat, depth, table, x, y, and z variables.
  • There are 146 duplicated rows, so they should be removed from the dataset. After removing them, no duplicated rows remain and the dataset has 53,794 rows and 10 columns.
# Check for NA values
any(is.na(diamonds))
# Check for negative values in the numeric columns
diamonds %>% filter(price < 0 | carat < 0 | depth < 0 | table < 0 | x < 0 | y < 0 | z < 0) %>% nrow()
# Count duplicated rows, remove them, and verify that none remain
sum(duplicated(diamonds))
diamonds <- unique(diamonds)
sum(duplicated(diamonds))

2.3.2. Accuracy of Values

Unrealistic price values:
The dataset must contain realistic values; in particular, the numeric variables that describe physical characteristics of a diamond should not be negative. Price should be greater than zero, therefore we check the dataset for negative price values. According to the results there are no negative prices, as expected.

Missing x, y, z values:
These values correspond to length, width, and depth, therefore they should be positive. We can impute wrongly filled zero values when the other dimensions are available, since depth is obtained from a formula that includes x, y, and z.

  • There is no row with x equal to 0, where y and z are not equal to 0.
  • There is no row with y equal to 0, where x and z are not equal to 0.
  • However, there are 12 rows with z equal to 0 where x and y are not equal to 0. In other words, we can calculate the z values for these rows from the depth, x, and y values. As a reminder,
     Depth(%) = z / mean(x, y) * 100 = 2 * z / (x + y) * 100
diamonds %>%
  filter(x == 0 & y != 0 & z != 0)
diamonds %>%
  filter(y == 0 & x != 0 & z != 0)
diamonds %>%
  filter(z == 0 & x != 0 & y != 0)
# Impute z row-wise from the depth formula: z = depth * (x + y) / 2 / 100
diamonds <- diamonds %>%
  mutate(z = ifelse(z == 0 & x != 0 & y != 0, round(depth * (x + y) / 2 / 100, 2), z))

We can remove the rows with a 0 value in x, y, or z that cannot be filled by calculation. It can be seen that there are 7 such rows in the dataset.

diamonds %>%
  filter(x == 0 | y == 0 | z == 0)
## # A tibble: 7 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1.07 Ideal     F     SI2      61.6    56  4954     0  6.62     0
## 2  1    Very Good H     VS2      63.3    53  5139     0  0        0
## 3  1.14 Fair      G     VS1      57.5    67  6381     0  0        0
## 4  1.56 Ideal     G     VS2      62.2    54 12800     0  0        0
## 5  1.2  Premium   D     VVS1     62.1    59 15686     0  0        0
## 6  2.25 Premium   H     SI2      62.8    59 18034     0  0        0
## 7  0.71 Good      F     SI2      64.1    60  2130     0  0        0
diamonds<-diamonds %>%
  filter(!(x == 0 | y == 0 | z == 0))

Unrealistic x, y, z values:
In the previous part, we made the possible corrections for dimensions with 0 values. In this part, we will check for unrealistic values with the help of the depth variable.

  • The range of x, [3.73, 10.74], seems normal.
  • The range of y, [3.68, 58.90], does not seem normal: when sorted in decreasing order, the first two values are far greater than the rest.
  • The range of z, [1.07, 31.80], does not seem normal: when sorted in decreasing order, the first value is far greater than the rest.
range(diamonds$x)
range(diamonds$y)
range(diamonds$z)
sort(unique(diamonds$x), decreasing = TRUE)  %>% head(10)
##  [1] 10.74 10.23 10.14 10.02 10.01 10.00  9.86  9.66  9.65  9.54
sort(unique(diamonds$y), decreasing = TRUE)  %>% head(10)
##  [1] 58.90 31.80 10.54 10.16 10.10  9.94  9.85  9.81  9.63  9.59
sort(unique(diamonds$z), decreasing = TRUE)  %>% head(10)
##  [1] 31.80  8.06  6.98  6.72  6.43  6.38  6.31  6.27  6.24  6.17

We can recalculate these abnormal y and z values using the depth formula to decide whether there are any typos with decimals.

  • The calculated y values for these rows are less than 0, which makes no sense, therefore it is suitable to remove them from the dataset.
  • The calculated z value is 1/10 of the value in the dataset, which means there is an error in the decimal writing. Also, the z values of similar rows are around the calculated z value, therefore the z value can be corrected by dividing by 10.
diamonds %>%
  filter(y > 20) %>%
  mutate(calculated_y = (2 * z / depth) / 100 - x) %>%
  select(depth, x, z, y, calculated_y)
## # A tibble: 2 x 5
##   depth     x     z     y calculated_y
##   <dbl> <dbl> <dbl> <dbl>        <dbl>
## 1  58.9  8.09  8.06  58.9        -8.09
## 2  61.8  5.15  5.12  31.8        -5.15
diamonds <- diamonds %>%
  filter(y < 20)
diamonds %>%
  filter(z > 10) %>%
  mutate(calculated_z = depth * (x + y) / 2 / 100) %>%
  select(depth, x, y, z, calculated_z)
## # A tibble: 1 x 5
##   depth     x     y     z calculated_z
##   <dbl> <dbl> <dbl> <dbl>        <dbl>
## 1  61.8  5.12  5.15  31.8         3.17
diamonds$z[diamonds$z == 31.8] = diamonds$z[diamonds$z == 31.8] / 10

Depth deviation from calculated depth:
Since we know the formula for depth (%), we can calculate it and compare it with the depth column in the dataset. While small differences can be ignored, rows with large differences indicate a measurement problem with those diamonds, so they can be removed from the dataset. We can plot depth against calculated depth in order to see the differences.

diamonds$calculated_depth = 2 * diamonds$z / (diamonds$x + diamonds$y) * 100
diamonds %>%
  ggplot(., aes(x = calculated_depth, y = depth)) +
  geom_point(color="seagreen3") + 
  geom_abline(intercept = 0, slope = 1, color="red3", size=2, linetype=2)+
  theme_minimal()+
  labs(title="Calculated Depth vs Dataset Depth",
       subtitle="Diamonds dataset",
       x= "Calculated Depth",
       y="Dataset Depth")

Although most of the points lie close to the line, there are still points far away from it, which we should investigate. We assume that if a point's depth deviates from the calculated depth by more than 5 percentage points, it can be removed from the dataset. This threshold was chosen after reading this link, which shows the depth intervals for different types of diamonds; a depth error of more than about 5 percentage points would lead to a wrong classification. Such points are visible in the plot above.

diamonds <- diamonds %>%
  filter((abs(calculated_depth - depth) < 5))

diamonds <- subset(diamonds, select = -c(calculated_depth))

x, y, and z relationship:
The x, y, and z values correspond to length, width, and depth, respectively. For a physical object these dimensions are generally highly correlated, so we can check the pairwise relationships with scatter plots to see whether there are any unrealistic points. As expected, the values are highly correlated, so we can assume that the remaining x, y, and z values are valid.

diamonds %>%
  ggplot(., aes(x = x, y = y)) +
  geom_point(color="royalblue2") + 
  geom_smooth(color="salmon", size=2, linetype=2)+
  theme_minimal()+
  labs(title="Diamond Length vs Diamond Width",
       subtitle="Diamonds dataset",
       x= "Diamond Length",
       y="Diamond Width")

diamonds %>%
  ggplot(., aes(x = z, y = y)) +
  geom_point(color="royalblue2") + 
  geom_smooth(color="salmon", size=2, linetype=2)+
  theme_minimal()+
  labs(title="Diamond Depth vs Diamond Width",
       subtitle="Diamonds dataset",
       x= "Diamond Depth",
       y="Diamond Width")

diamonds %>%
  ggplot(., aes(x = x, y = z)) +
  geom_point(color="royalblue2") + 
  geom_smooth(color="salmon", size=2, linetype=2)+
  theme_minimal()+
  labs(title="Diamond Length vs Diamond Depth",
       subtitle="Diamonds dataset",
       x= "Diamond Length",
       y="Diamond Depth")

2.4. Summary of Data

After preprocessing, the summary and glimpse of the data can be seen below. The cleaned dataset has 53,749 rows.

summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1592   J: 2799   SI1    :13024   Min.   :50.80  
##  1st Qu.:0.4000   Good     : 4887   I: 5405   VS2    :12220   1st Qu.:61.00  
##  Median :0.7000   Very Good:12065   H: 8258   SI2    : 9136   Median :61.80  
##  Mean   :0.7974   Premium  :13728   G:11249   VS1    : 8150   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21477   F: 9515   VVS2   : 5054   3rd Qu.:62.50  
##  Max.   :5.0100                     E: 9770   VVS1   : 3644   Max.   :79.00  
##                                     D: 6753   (Other): 2521                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 3.730   Min.   : 3.680  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3930   Mean   : 5.731   Mean   : 5.733  
##  3rd Qu.:59.00   3rd Qu.: 5323   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :10.540  
##                                                                   
##        z        
##  Min.   :2.240  
##  1st Qu.:2.910  
##  Median :3.530  
##  Mean   :3.539  
##  3rd Qu.:4.030  
##  Max.   :6.980  
## 
glimpse(diamonds)
## Rows: 53,749
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...

3. Exploratory Data Analysis

3.1. Price

Since the mean price is greater than the median, the distribution of price is right-skewed.
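This can be verified with a quick numeric check (using the cleaned dataset from Section 2); the histogram below then shows the full distribution.

mean(diamonds$price)     # about 3930
median(diamonds$price)   # about 2401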

diamonds%>% ggplot(.,aes(x=price))+
  geom_vline(aes(xintercept=mean(price)),
            color="black", linetype="dashed", size=1)+
  geom_histogram(bins=30, color="tomato3", fill="salmon")+ 
  theme_minimal()+ 
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
  labs(title = "Distribution of Price",
       subtitle = "Diamonds dataset",
       x = "Price")

3.2. Price vs Categoric Variables

3.2.1. Price vs Cut

diamonds %>%
  group_by(cut) %>%
  summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
  mutate(percentage = 100*round(count/sum(count),3))%>%
  arrange((cut)) %>%
  select(cut, count,  percentage, Min_Price, Average_Price, Max_Price ) %>%
  kable(col.names = c("Diamond Cut","Count","Percentage",  "Minimum Price", "Average Price", "Maximum Price"))%>%
  kable_minimal(full_width = F)
Diamond Cut   Count   Percentage   Minimum Price   Average Price   Maximum Price
Fair           1592          3.0             337         4330.16           18574
Good           4887          9.1             327         3916.91           18707
Very Good     12065         22.4             336         3981.30           18818
Premium       13728         25.5             326         4577.45           18823
Ideal         21477         40.0             326         3461.78           18806
  • Most of the diamonds are of the Ideal cut type.
  • The percentages of Premium and Very Good are very close.
  • Only a small fraction of the diamonds are of the Fair cut type.
diamonds%>% ggplot(., aes(cut, price)) + 
  geom_boxplot(aes(fill=factor(cut))) + 
  scale_fill_brewer(palette="Pastel1")+
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  theme_minimal()+
  labs(title="Price vs Cut", 
       subtitle="Diamonds dataset",
       x="Diamond Cut",
       y="Price",
       fill="Diamond Cut")
diamonds%>%
  ggplot(., aes(x=price)) + 
  geom_histogram(binwidth=500, position="dodge", fill="pink2") +
  theme_minimal()+facet_wrap(~cut, ncol=5)+ 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  theme(strip.background = element_rect(fill="pink3"))+
    labs(title="Price Frequency vs Diamond Cut", 
       subtitle="Diamonds dataset",
       x="Price",
       y="Frequency",
       fill="Diamond Cut")

Although the best cut type is Ideal, its average price is the lowest, while according to the average prices the most expensive diamonds belong to the Premium and Fair cut types. These results show that cut alone is not enough to explain the response variable price, since price does not increase as cut quality improves; a quick check of carat weight by cut group (see the sketch below) suggests why.
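A minimal sketch of this check, using the cleaned dataset, compares the average carat weight and average price per cut group; if Ideal stones tend to be lighter and Fair stones heavier, this would explain the pattern in the average prices.

diamonds %>%
  group_by(cut) %>%
  summarise(Average_Carat = round(mean(carat), 2),
            Average_Price = round(mean(price), 2)) %>%
  arrange(cut)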

3.2.2. Price vs Color

diamonds %>%
  group_by(color) %>%
  summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
  mutate(percentage = 100*round(count/sum(count),3))%>%
  arrange((color)) %>%
  select(color, count,  percentage, Min_Price, Average_Price, Max_Price ) %>%
  kable(col.names = c("Diamond Color","Count","Percentage",  "Minimum Price", "Average Price", "Maximum Price"))%>%
  kable_minimal(full_width = F)
Diamond Color   Count   Percentage   Minimum Price   Average Price   Maximum Price
J                2799          5.2             335         5328.11           18710
I                5405         10.1             334         5079.12           18823
H                8258         15.4             337         4474.49           18803
G               11249         20.9             354         3997.28           18818
F                9515         17.7             342         3726.73           18791
E                9770         18.2             326         3080.42           18731
D                6753         12.6             357         3172.01           18693
  • Most of the diamonds are of color G.
  • The percentages of E, F, and G are close.
  • Only a small fraction of the diamonds are of color J.
diamonds %>%
  group_by(color) %>%
  summarise(Average_Price = mean(price)) %>%
  
  ggplot(.,aes(x=color, y = Average_Price, fill= color)) +
    geom_col(color="black") +
    geom_text(aes(label = format(Average_Price, digits = 3)), position = position_dodge(0.9),vjust = 5) +
    geom_line(aes(y = Average_Price), color="black", group=1, size=0.8)+
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 65))+
     scale_fill_brewer(palette="Pastel2")+
   
    scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  labs(title="Price vs Color", 
       subtitle="Diamonds dataset",
       x="Diamond Color",
       y="Average Price",
       fill="Diamond color")
diamonds%>%
  ggplot(., aes(x=price)) + 
  geom_histogram(binwidth=500, position="dodge", fill="palegreen3") +
  theme_minimal()+facet_wrap(~color, ncol=7)+ 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  theme(strip.background = element_rect(fill="seagreen3"))+
    labs(title="Price Frequency vs Diamond Cut", 
       subtitle="Diamonds dataset",
       x="Price",
       y="Frequency",
       fill="Diamond Cut")

Although the best color type is D, its average price is one of the lowest. According to the average prices, the most expensive diamonds belong to the J and I color types, which are actually the two worst color types in this dataset. These results show that color alone is not enough to explain the response variable price, since price does not increase as the color grade improves.

3.2.3. Price vs Clarity

diamonds %>%
  group_by(clarity) %>%
  summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
  mutate(percentage = 100*round(count/sum(count),3))%>%
  arrange((clarity)) %>%
  select(clarity, count,  percentage, Min_Price, Average_Price, Max_Price ) %>%
  kable(col.names = c("Diamond Clarity","Count","Percentage",  "Minimum Price", "Average Price", "Maximum Price"))%>%
  kable_minimal(full_width = F)
Diamond Clarity   Count   Percentage   Minimum Price   Average Price   Maximum Price
I1                  737          1.4             345         3927.30           18531
SI2                9136         17.0             326         5052.29           18804
SI1               13024         24.2             326         3993.19           18818
VS2               12220         22.7             334         3925.98           18823
VS1                8150         15.2             327         3839.59           18795
VVS2               5054          9.4             336         3287.22           18768
VVS1               3644          6.8             336         2523.49           18777
IF                 1784          3.3             369         2870.57           18806
  • Most of the diamonds are of the SI1 clarity type.
  • The percentages of SI2 and SI1 are close.
  • The I1, IF, VVS1, and VVS2 clarity types each account for less than 10% of the diamonds.
  • The majority of the dataset consists of the VS1, VS2, SI1, and SI2 clarity types.
diamonds %>%
  ggplot(., aes(x = clarity, y = price, color = clarity)) +
  scale_color_brewer(palette="Set3")+
  geom_jitter(alpha=0.7) +
  theme_minimal() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title = "Price vs Clarity",
       subtitle = "Diamonds dataset",
       x = "Diamond Clarity",
       y = "Price",
       color = "Diamond Clarity")
diamonds%>%
  ggplot(., aes(x=price)) + 
  geom_histogram(binwidth=500, position="dodge", fill="skyblue3") +
  theme_minimal()+facet_wrap(~clarity, ncol=8)+ 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  theme(strip.background = element_rect(fill="skyblue2"))+
    labs(title="Price Frequency vs Diamond  Clarity", 
       subtitle="Diamonds dataset",
       x="Price",
       y="Frequency",
       fill="Diamond Clarity")

Although the best clarity type is IF, its average price is among the lowest. According to the average prices, the most expensive diamonds belong to the SI2 clarity type, which is actually the second worst clarity type in this dataset. These results show that clarity alone is not enough to explain the response variable price, since price does not increase as clarity improves.

3.3. Price vs Numeric Variables

3.3.1. Price vs Carat

The most frequent carat weight in this dataset is 0.3. The histogram shows the distribution of carat, and the table that follows lists every carat value with its count and price statistics. It can be seen that most of the carat values in this dataset are less than 1.

# Returns the mode (most frequent value) of a vector
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(diamonds$carat)
## [1] 0.3
diamonds%>% ggplot(.,aes(x=carat))+
  geom_vline(aes(xintercept=mean(carat)),
            color="black", linetype="dashed", size=1)+
  geom_histogram(bins=30, color="tomato3", fill="salmon")+ 
  theme_minimal()+ 
  labs(title = "Distribution of Carat",
       subtitle = "Diamonds dataset",
       x = "Carat")

diamonds %>%
  group_by(carat) %>%
  summarise(carat_count = n(),
            minPrice = min(price),
            AveragePrice = mean(price),
            MaxPrice = max(price))%>%
  arrange(desc(carat_count)) %>%
  kable(col.names = c("Carat", "Count","Minimum Price", "Average Price", "Maximum Price")) %>%
  kable_styling("striped", full_width = T) %>%
  scroll_box(width = "100%", height = "300px")
Carat Count Minimum Price Average Price Maximum Price
0.30 2596 339 680.6367 2366
0.31 2238 335 708.1997 1917
1.01 2238 2017 5508.9821 16234
0.70 1980 945 2516.2434 5539
0.32 1827 345 719.1916 1715
1.00 1542 1681 5250.9514 16469
0.90 1484 1599 3939.5889 9182
0.41 1378 467 973.0530 1745
0.40 1291 484 933.0077 1908
0.71 1291 1305 2674.3106 5019
0.50 1254 584 1503.8453 3378
0.33 1179 366 765.1747 1389
0.51 1123 826 1623.5841 4368
0.34 908 413 755.7423 2346
1.02 883 2344 5598.4632 17100
0.52 813 870 1656.4785 3373
1.51 804 3497 10552.0211 18806
1.50 787 2964 10082.9835 18691
0.72 764 945 2705.7788 5086
0.53 706 844 1685.0368 3827
0.42 704 552 994.4545 1722
0.38 669 497 904.1973 1992
0.35 663 409 799.5098 1672
1.20 643 2360 6670.7387 16256
0.54 624 935 1704.3317 4042
0.91 570 1570 3981.1561 9636
0.36 569 410 781.0070 1718
1.03 522 1262 5583.8966 17590
0.55 495 959 1725.2283 3509
0.56 492 953 1831.9980 4632
0.73 492 1049 2791.2480 5154
0.43 486 452 995.2737 1827
1.04 474 2037 5774.6709 18542
1.21 470 2396 6905.4043 17353
2.01 433 5696 14772.7136 18741
0.57 429 929 1798.0443 4032
0.39 397 451 923.4761 1624
0.37 394 451 816.0711 1764
1.52 380 3105 10629.9868 17723
1.06 372 2080 5947.0081 15813
1.05 361 2066 5781.6814 13060
1.07 340 2260 6004.2118 18279
0.74 322 1632 2821.9006 5841
0.58 310 893 1798.6484 3741
1.11 308 3183 6005.5682 15464
1.22 298 2699 6984.3188 14414
0.23 293 326 486.1433 688
1.09 285 2708 6020.0947 18231
0.80 284 1232 3146.1901 6834
0.59 282 903 1831.4504 4916
1.23 277 3168 7290.8881 16253
1.10 275 3216 5921.1345 12748
2.00 261 5051 14107.1456 18818
0.24 254 336 505.1850 963
0.26 253 337 550.8972 814
0.77 251 1651 2884.3984 5251
0.76 250 1140 2891.4520 5056
0.75 249 1013 2763.4016 5288
1.12 249 2854 6058.9116 14407
1.08 246 2869 5774.3415 14032
1.13 246 2968 6193.4024 14525
1.24 235 2940 7147.1149 15026
0.27 233 361 574.7597 893
0.60 226 806 1956.2434 4209
0.92 226 2283 4035.2124 7544
1.53 219 5631 10703.4658 17223
1.70 215 5617 12110.3209 18718
0.25 212 357 550.9245 1186
0.44 212 648 1056.6226 2506
1.14 206 2327 6354.4078 18112
0.61 204 931 2069.2451 4279
0.81 199 1687 3091.5126 6338
0.28 198 360 580.1212 828
0.78 187 1720 2918.9091 4904
1.25 187 3276 7123.2513 16823
0.46 177 746 1240.2429 2139
2.02 176 6346 14767.4091 18731
1.54 174 4492 10865.0287 18416
1.16 172 2368 6274.2965 13595
0.79 151 1809 2993.2649 4858
1.15 148 3220 5962.5811 12895
1.26 146 3881 7490.7466 15247
0.93 142 2374 3994.2958 5992
0.82 140 2033 3083.6500 4931
0.62 135 933 1973.3704 3968
1.27 134 2845 7119.1194 15377
1.31 132 3697 7750.6364 17496
0.83 131 1863 3181.9924 7889
0.29 130 334 601.1923 1776
1.19 126 2892 6483.2381 15767
1.18 123 3219 6481.1057 14847
1.55 123 4965 10717.7642 18188
2.03 122 6002 14710.1148 18781
1.30 121 2512 7851.6612 15874
1.71 119 6320 12396.5546 18791
0.45 110 672 1093.3182 1833
1.17 110 2336 6224.8818 11886
1.56 108 5766 10205.9444 17455
1.28 106 3355 7711.8113 18700
1.57 105 5170 10444.5143 17548
0.96 103 1637 3623.4660 6807
0.63 102 1065 2244.7843 6607
1.29 101 2596 7033.0198 17932
0.47 99 719 1272.3131 2460
1.60 95 6272 11124.2526 18780
1.32 89 4140 7639.3708 15802
1.58 89 4459 11427.9888 18057
1.59 89 6010 10574.3596 17552
1.33 87 4634 7527.0805 18435
2.04 86 7403 14840.1744 18795
0.64 80 1080 2113.5875 4281
1.35 77 4082 7448.2857 14616
1.34 68 4352 7758.5000 17663
0.65 65 1214 2322.6154 5052
0.95 65 2195 4013.6615 6951
2.05 65 7026 15647.2462 18787
0.84 64 1749 3294.9844 6399
1.61 64 6915 10992.8438 18318
0.48 63 841 1322.0476 2677
0.85 62 1250 3204.1774 7230
1.62 61 5157 11705.8852 18281
0.94 59 2571 4195.7797 7120
0.97 59 1547 4026.2373 7415
2.06 59 5203 14348.6780 18779
1.72 57 5240 12040.3158 18730
1.73 52 6007 11875.9615 18377
2.10 51 6597 14464.3725 18648
1.36 50 4158 8591.5200 15386
1.40 50 4939 9108.5200 16808
1.63 50 7100 11556.3200 18179
1.75 50 7276 12673.7000 17343
2.07 50 6503 15384.2800 18804
0.66 48 1188 2235.7292 4311
0.67 48 1244 2067.8542 3271
2.14 48 5405 14905.0417 18528
1.37 46 5063 8044.0870 13439
0.49 45 900 1306.3333 2199
2.09 45 10042 15287.3111 18640
1.64 43 4849 10738.0000 17604
2.11 43 7019 14563.4186 18575
2.08 41 6150 15128.6829 18760
1.74 40 4677 12270.3250 18430
1.41 39 4145 9452.2051 17216
1.39 36 3914 8335.1111 13061
0.86 34 1828 3289.9706 5029
1.65 32 5914 10973.8750 17729
0.87 31 1827 3342.3226 5640
0.98 31 1712 3469.7742 4950
2.20 31 7934 15226.8710 18364
1.66 30 5310 11084.5000 16294
2.18 30 10340 14901.7000 18717
1.76 28 9053 12398.5714 18178
2.22 27 5607 14691.5185 18706
0.69 26 1636 2206.6154 3784
1.38 26 4598 8164.3077 17598
0.68 25 1341 2334.3200 4325
1.42 25 6069 9487.2000 18682
1.67 25 7911 12281.4800 17351
2.12 25 7508 15007.6800 18656
2.16 25 8709 15860.4800 18678
1.69 24 5182 11878.0833 17803
0.88 23 1964 3431.3478 5595
0.99 23 1789 4406.1739 6094
2.21 23 6535 15287.9565 18483
2.15 22 5430 14187.1818 18791
2.19 22 10179 15521.0909 18693
0.89 21 1334 3291.2381 4854
1.47 21 4898 7833.7143 12547
2.13 21 12356 15220.4286 18442
1.80 20 7862 12955.1500 17273
2.28 20 9068 15422.9500 18055
2.30 20 7226 14861.5500 18304
1.43 19 6086 9234.9474 14429
1.68 19 5765 11779.1053 16985
1.44 18 4064 8021.6111 12093
1.46 18 5421 8237.4444 12261
1.83 18 5083 11930.0000 18525
2.17 18 6817 14416.6111 17805
1.77 17 7771 11103.1765 15278
2.29 17 11502 15834.7647 18823
2.51 17 14209 16616.8824 18308
2.24 16 10913 15573.9375 18299
2.25 16 6653 13403.5625 18034
2.32 16 6860 16042.1875 18532
2.50 16 7854 14316.1875 18447
1.45 15 4320 8709.3333 17545
1.79 15 9122 13144.0667 18429
2.26 15 11226 15330.2000 18426
1.82 13 5993 12986.1538 15802
2.23 13 7006 13947.4615 18149
2.31 13 7257 15017.3846 17715
2.40 13 11988 16362.4615 18541
0.20 12 345 365.1667 367
1.78 12 8889 12504.5833 18128
1.91 12 6632 12365.8333 17509
2.27 12 5733 14198.4167 18343
3.01 12 8040 15521.6667 18710
1.49 11 5407 9174.2727 18614
0.21 9 326 380.2222 394
1.81 9 5859 11374.4444 14177
1.86 9 9791 12164.3333 17267
2.33 9 8220 14089.7778 18119
2.48 9 12883 16537.1111 18692
2.52 9 10076 16242.1111 18252
2.36 8 11238 15756.7500 18745
2.38 8 10716 15481.3750 18559
2.53 8 14659 16548.8750 18254
2.54 8 12095 15125.2500 17996
3.00 8 6512 12378.5000 16970
1.48 7 6216 8791.2857 15164
1.87 7 9955 12933.5714 17761
1.90 7 8576 11477.1429 13919
2.35 7 14185 16090.4286 17999
2.39 7 15917 17182.4286 17955
2.42 7 16198 16923.5714 18615
1.93 6 13278 16828.1667 18306
2.37 6 14837 16525.1667 18508
2.43 6 9716 15356.0000 18692
0.22 5 337 391.4000 470
1.98 5 12923 14667.8000 16171
2.34 5 8491 11922.0000 15818
2.41 5 13563 15808.2000 17923
1.84 4 7922 11108.5000 13905
1.88 4 8048 10946.2500 13607
1.89 4 10055 13969.0000 17553
1.96 4 5554 8642.2500 13099
1.97 4 14410 15966.0000 17535
2.44 4 13027 16365.5000 18430
2.45 4 11830 14593.7500 18113
1.85 3 5688 11436.0000 14375
1.94 3 9344 14189.3333 18735
1.95 3 5045 12340.6667 17374
1.99 3 11486 14179.0000 17713
2.46 3 10470 14227.0000 16466
2.47 3 14970 15644.0000 16532
2.49 3 6289 13843.0000 18325
2.55 3 14351 15964.0000 18766
2.56 3 15231 16723.3333 17753
2.57 3 17116 17841.6667 18485
2.58 3 12500 14481.3333 16195
2.60 3 17027 17535.0000 18369
2.61 3 13784 16583.0000 18756
2.63 3 10437 12659.6667 16914
2.72 3 6870 12088.3333 17801
2.74 3 8807 14385.0000 17184
1.92 2 15364 15418.0000 15472
2.66 2 16239 17367.0000 18495
2.68 2 8419 9042.0000 9665
2.75 2 13156 14285.5000 15415
3.04 2 15354 16956.5000 18559
4.01 2 15223 15223.0000 15223
2.59 1 16465 16465.0000 16465
2.64 1 17407 17407.0000 17407
2.65 1 16314 16314.0000 16314
2.67 1 18686 18686.0000 18686
2.70 1 14341 14341.0000 14341
2.71 1 17146 17146.0000 17146
2.77 1 10424 10424.0000 10424
2.80 1 15030 15030.0000 15030
3.02 1 10577 10577.0000 10577
3.05 1 10453 10453.0000 10453
3.11 1 9823 9823.0000 9823
3.22 1 12545 12545.0000 12545
3.24 1 12300 12300.0000 12300
3.40 1 15964 15964.0000 15964
3.50 1 12587 12587.0000 12587
3.51 1 18701 18701.0000 18701
3.65 1 11668 11668.0000 11668
3.67 1 16193 16193.0000 16193
4.00 1 15984 15984.0000 15984
4.13 1 17329 17329.0000 17329
4.50 1 18531 18531.0000 18531
5.01 1 18018 18018.0000 18018

3.3.2. Price vs x, y, z, and Depth

Price is highly related to x, y, and z, as can be seen in the scatter plots below. Since depth is derived from these variables by a formula, a separate plot for depth is not presented in this part.
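The strength of these relationships can also be quantified numerically; a short sketch on the cleaned dataset:

round(cor(diamonds[, c("price", "carat", "x", "y", "z")]), 2)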

ggplot(diamonds, aes(x, price,)) +
  geom_point(alpha = 0.5, color="seagreen3") +
  geom_smooth(method="loess", se=F,color="red3", size=2, linetype=2 ) +
  theme_minimal()+
  labs(title = "Price vs Diamond Length",
       subtitle = "Diamonds dataset",
       x = "Diamond Length",
       y = "Price")

ggplot(diamonds, aes(y, price,)) +
  geom_point(alpha = 0.5, color="seagreen3") +
  geom_smooth(method="loess", se=F,color="red3", size=2, linetype=2 ) +
  theme_minimal()+
  labs(title = "Price vs Diamond Width",
       subtitle = "Diamonds dataset",
       x = "Diamond Width",
       y = "Price")

ggplot(diamonds, aes(z, price,)) +
  geom_point(alpha = 0.5, color="seagreen3") +
  geom_smooth(method="loess", se=F,color="red3", size=2, linetype=2 ) + 
  theme_minimal()+
  labs(title = "Price vs Diamond Depth",
       subtitle = "Diamonds dataset",
       x = "Diamond Depth",
       y = "Price")

3.4. Between Variables

In order to explore the relationship between the 4Cs and price, the following plots can be examined.

ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +scale_color_brewer(palette="Set1")+
  geom_smooth(method="loess", se=F) + 
  labs(title = "Price vs Carat Grouped by Clarity",
       subtitle = "Diamonds dataset",
       x = "Diamond Carat",
       y = "Price",
       color = "Diamond Clarity")

ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +scale_color_brewer(palette="Set1")+
  geom_smooth(method="loess", se=F) + 
  labs(title = "Price vs Carat Grouped by Cut",
       subtitle = "Diamonds dataset",
       x = "Diamond Carat",
       y = "Price",
       color = "Diamond Cut")

ggplot(diamonds, aes(carat, price, color = color)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +scale_color_brewer(palette="Set1")+
  geom_smooth(method="loess", se=F) + 
  labs(title = "Price vs Carat Grouped by Color",
       subtitle = "Diamonds dataset",
       x = "Diamond Carat",
       y = "Price",
       color = "Diamond Color")

  • When comparing diamonds of equal carat weight, the IF clarity type is the most expensive, as expected.
  • When comparing diamonds of equal carat weight, the Ideal cut type is the most expensive, as expected.
  • When comparing diamonds of equal carat weight, the D color group is the most expensive, as expected.
  • It can be said that while color, cut, and clarity alone are not meaningful for estimating price, they gain meaning together with carat information (see the sketch below).
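As a small check of the last point, we can hold carat roughly constant and look at average prices per color grade; within a narrow carat band, the better colors are expected to be the more expensive ones. A minimal sketch, assuming the cleaned dataset from Section 2:

diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%   # hold carat roughly constant
  group_by(color) %>%
  summarise(count = n(), Average_Price = round(mean(price), 2))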

4. Further Analyses

Some possible analyses may include Clustering and Principal Component Analysis for our dataset.

  • By clustering, we can add a new column identifying the cluster of each row.
  • By PCA, we can reduce the dimensionality of the data and use fewer variables. This method is also useful for dealing with multicollinearity.

Before proceeding with these methods, we should divide the dataset into a training set and a test set. Since our factor variables, namely cut, clarity, and color, are ordered, we can treat them as numeric values.

set.seed(503)
# Stratified 20% sample (by cut, color, clarity) as the test set
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>% 
    group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()

# Remaining rows form the training set
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()), 
    diamonds_test, by = "diamond_id")

# Drop the helper id column and treat the ordered factors as numeric
diamonds_train = diamonds_train[, c(-ncol(diamonds_train))]
diamonds_test = diamonds_test[, c(-ncol(diamonds_test))]
diamonds_train$cut <- as.numeric(diamonds_train$cut)
diamonds_train$clarity <- as.numeric(diamonds_train$clarity)
diamonds_train$color <- as.numeric(diamonds_train$color)
diamonds_test$cut <- as.numeric(diamonds_test$cut)
diamonds_test$clarity <- as.numeric(diamonds_test$clarity)
diamonds_test$color <- as.numeric(diamonds_test$color)

4.1. Principal Component Analysis(PCA)

Principal Component Analysis is suitable for numeric variables, therefore we select this type of variable from the dataset. It can be seen from the summary output that four components are enough to explain about 88% of the variance. We add these components to both the train and test datasets for the following price estimation models.

diamonds_pca <- princomp(as.matrix(diamonds_train[,sapply(diamonds_train, class) == "numeric"]),cor=T)
summary(diamonds_pca,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     2.0742952 1.1892966 1.1065552 0.9926732 0.83121848
## Proportion of Variance 0.4780779 0.1571585 0.1360516 0.1094889 0.07676935
## Cumulative Proportion  0.4780779 0.6352364 0.7712880 0.8807769 0.95754622
##                            Comp.6      Comp.7       Comp.8       Comp.9
## Standard deviation     0.59218546 0.173206533 0.0320924547 1.923472e-02
## Proportion of Variance 0.03896485 0.003333389 0.0001144362 4.110829e-05
## Cumulative Proportion  0.99651107 0.999844456 0.9999588917 1.000000e+00
## 
## Loadings:
##         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## carat    0.471                       0.126         0.868              
## cut     -0.109  0.595  0.451  0.241 -0.175  0.584                     
## color   -0.158 -0.202         0.802  0.538                            
## clarity -0.222  0.208  0.215 -0.506  0.778                            
## depth           0.209 -0.827         0.118  0.492                0.101
## table    0.129 -0.698  0.227 -0.181         0.641                     
## x        0.475                       0.115        -0.279 -0.715  0.397
## y        0.474                       0.120        -0.297  0.699  0.410
## z        0.474  0.103                0.132        -0.283        -0.815
pca_results <- predict(diamonds_pca, newdata = diamonds_train)
diamonds_train$pca1 = pca_results[,1]
diamonds_train$pca2 = pca_results[,2]
diamonds_train$pca3 = pca_results[,3]
diamonds_train$pca4 = pca_results[,4]

pca_results_test = predict(diamonds_pca, newdata = diamonds_test)
diamonds_test$pca1 = pca_results_test[,1]
diamonds_test$pca2 = pca_results_test[,2]
diamonds_test$pca3 = pca_results_test[,3]
diamonds_test$pca4 = pca_results_test[,4]

4.2. Clustering

There are different kinds of clustering methods that could be used to create a new feature; in this assignment K-means is used. First of all, scaling should be applied, since a column with much larger values than the others may dominate the results. To scale the data, we find the minimum and maximum values on the training set and then scale both the training and test sets with the same values.

# Min-max scaling: compute the minimum and maximum on the training set only
min_vals = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], min)
max_vals = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], max)
diamonds_train_scale <- sweep(sweep(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, min_vals), 2, max_vals - min_vals, "/")

To decide on the number of clusters, we can run K-means in a loop for 1 to 15 centers and then select a suitable number of centers from the results.

errors = c()
for (i in (1:15)){
  set.seed(1357) 
  cluster<-kmeans(diamonds_train_scale,centers=i) 
  # store the percentage of variance explained (between_SS / total_SS)
  errors = c(errors, 100*round(1 - (cluster$tot.withinss/cluster$totss), digits = 3))
  }

errors_df <- data.frame(x=c(1:15), y=errors) 

ggplot(errors_df, aes(x=x, y=y)) +
  geom_point(color = "firebrick2") +
  geom_line(color="firebrick3") +
  geom_text(aes(label = errors), size=3, color = "black", position = position_stack(vjust = 0.95))+
  theme_minimal() +
  labs(title = "Center Number vs R-Square",
       subtitle = "Diamonds dataset",
       x = "Cluster Number",
       y = "R-Square")

The increase in R-square slows down noticeably after 8 centers, so we select the model with 8 centers.

set.seed(1357)
best_cluster = kmeans(diamonds_train_scale,centers=8)
diamonds_train$cluster = as.factor(best_cluster$cluster)

Now, we need to apply clustering process to the test dataset.

diamonds_test_scale <- sweep(sweep(diamonds_test[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, min_vals), 2, max_vals - min_vals, "/")

# Assign each (scaled) test observation to the nearest trained K-means center,
# measured by Euclidean distance
closest.cluster <- function(x) {
  cluster.dist <- apply(best_cluster$centers, 1, function(y) sqrt(sum((x-y)^2)))
  return(which.min(cluster.dist)[1])
}
diamonds_test$cluster <- as.factor(apply(diamonds_test_scale, 1, closest.cluster))
diamonds_train %>%
  ggplot(., aes(x = cluster, y = price, color = cluster)) +
  scale_color_brewer(palette="Set2")+
  geom_jitter(alpha=0.7) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title = "Price vs Clusters",
       subtitle = "Diamonds train dataset",
       x = "Diamond Cluster",
       y = "Price",
       color = "Diamond Cluster")

diamonds_test %>%
  ggplot(., aes(x = cluster, y = price, color = cluster)) +
  scale_color_brewer(palette="Set2")+
  geom_jitter(alpha=0.7) +
  theme_minimal() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
  labs(title = "Price vs Clusters",
       subtitle = "Diamonds test dataset",
       x = "Diamond Cluster",
       y = "Price",
       color = "Diamond Cluster")

5. Price Estimation Models

A correlation matrix can be examined to get an idea about the relationships between the numeric variables, therefore a correlogram is illustrated below. We can say that the carat, x, y, and z columns are highly correlated with the price column.

# Drop the pca1-pca4 and cluster columns (11:15) and keep the original numeric variables
diamonds_cor<-cor(diamonds_train[-c(11:15)])
corrplot(diamonds_cor, method="number")

5.1. Generalized Linear Model

There are four assumptions to be fulfilled in a linear model:

  1. Linearity Assumption
  2. Constant Variance Assumption
  3. Independence Assumption
  4. Normality Assumption

Since our dataset does not fulfill these assumptions, we will construct a generalized linear model, which is more flexible than the standard linear model in terms of its assumptions. The response variable, price, is continuous, so a Gamma or Gaussian family may fit. The lowest AIC value is obtained with the Gamma family and the identity link function; a sketch of how such checks and comparisons can be made is given below.
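The following is an illustrative sketch only, not the exact code behind the reported result: plot() on an lm object gives the standard diagnostic plots for the linearity, constant variance, and normality assumptions, and AIC() compares candidate GLM families. The log link is used for the Gamma fit here because it converges without start values; the reported model (Gamma with identity link) is fit with explicit start values in the next chunk.

# Diagnostic plots for an ordinary linear model: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
model_lm <- lm(price ~ carat + cut + color + clarity + depth + table + x + y + z,
               data = diamonds_train)
par(mfrow = c(2, 2))
plot(model_lm)

# Compare candidate GLM families by AIC (illustrative sketch)
model_gaussian <- glm(price ~ carat + cut + color + clarity + depth + table + x + y + z,
                      data = diamonds_train, family = gaussian(link = "identity"))
model_gamma_log <- glm(price ~ carat + cut + color + clarity + depth + table + x + y + z,
                       data = diamonds_train, family = Gamma(link = "log"))
AIC(model_gaussian, model_gamma_log)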

# Explicit start values are supplied to help the Gamma family with the identity link converge
model_glm <- glm(price ~ carat+cut+color+clarity+depth+table+x+y+z+cluster, data = diamonds_train, family = Gamma(link = "identity"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5, 0.5,0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm)
## 
## Call:
## glm(formula = price ~ carat + cut + color + clarity + depth + 
##     table + x + y + z + cluster, family = Gamma(link = "identity"), 
##     data = diamonds_train, start = c(0.5, 0.5, 0.5, 0.5, 0.5, 
##         0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 
##         0.5))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0649  -0.1433  -0.0057   0.1532  10.2150  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.052e+04  6.219e+02 -33.003  < 2e-16 ***
## carat        7.241e+03  4.341e+01 166.797  < 2e-16 ***
## cut         -6.238e+00  3.367e+00  -1.853 0.063928 .  
## color        3.589e+01  1.661e+00  21.612  < 2e-16 ***
## clarity      8.070e+01  1.300e+00  62.065  < 2e-16 ***
## depth        3.433e+02  9.938e+00  34.548  < 2e-16 ***
## table       -6.750e+00  9.723e-01  -6.942 3.92e-12 ***
## x            3.884e+03  8.192e+01  47.409  < 2e-16 ***
## y            1.082e+03  7.990e+01  13.543  < 2e-16 ***
## z           -8.913e+03  2.165e+02 -41.174  < 2e-16 ***
## cluster2    -2.341e+01  6.809e+00  -3.438 0.000586 ***
## cluster3    -7.069e+01  6.950e+00 -10.171  < 2e-16 ***
## cluster4    -8.468e+01  8.813e+00  -9.609  < 2e-16 ***
## cluster5    -2.753e+02  2.304e+01 -11.950  < 2e-16 ***
## cluster6    -6.137e+01  9.056e+00  -6.777 1.24e-11 ***
## cluster7    -2.242e+02  2.195e+01 -10.216  < 2e-16 ***
## cluster8     8.039e+02  2.382e+01  33.742  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.0868286)
## 
##     Null deviance: 42126.5  on 42990  degrees of freedom
## Residual deviance:  2908.6  on 42974  degrees of freedom
## AIC: 676132
## 
## Number of Fisher Scoring iterations: 25

The model can be improved by rearranging the explanatory variables, for example by adding interaction and polynomial terms.

model_glm2 <- glm(price ~ carat*color+carat*clarity+I(carat^2)+cluster+y+depth, data = diamonds_train, family = Gamma(link = "identity"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5, 0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm2)
## 
## Call:
## glm(formula = price ~ carat * color + carat * clarity + I(carat^2) + 
##     cluster + y + depth, family = Gamma(link = "identity"), data = diamonds_train, 
##     start = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 
##         0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.23761  -0.10711  -0.00628   0.09001   2.68079  
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    2860.8797   106.3588   26.898  < 2e-16 ***
## carat         -3616.5228   101.2272  -35.727  < 2e-16 ***
## color          -112.0154     1.9102  -58.642  < 2e-16 ***
## clarity        -184.5964     1.7927 -102.970  < 2e-16 ***
## I(carat^2)     3886.1793    31.1401  124.797  < 2e-16 ***
## cluster2         21.9462     3.6092    6.081 1.21e-09 ***
## cluster3        -17.1453     3.8160   -4.493 7.04e-06 ***
## cluster4        -78.7454     3.3094  -23.795  < 2e-16 ***
## cluster5        459.0441    13.3378   34.417  < 2e-16 ***
## cluster6        -43.3751     3.5333  -12.276  < 2e-16 ***
## cluster7        329.4331    11.7776   27.971  < 2e-16 ***
## cluster8        511.6552    11.7190   43.660  < 2e-16 ***
## y              -245.2199    18.9533  -12.938  < 2e-16 ***
## depth           -13.3575     0.9476  -14.096  < 2e-16 ***
## carat:color     486.7423     4.6738  104.142  < 2e-16 ***
## carat:clarity   794.7105     4.6681  170.243  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.02385886)
## 
##     Null deviance: 42126.5  on 42990  degrees of freedom
## Residual deviance:  1022.2  on 42975  degrees of freedom
## AIC: 630859
## 
## Number of Fisher Scoring iterations: 25

To give an example using the principal components, the following model can be constructed.

model_glm_pca = glm(price ~ pca1 + pca2 + pca3 + pca4 + cluster, data = diamonds_train, family = Gamma(link = "sqrt"))
summary(model_glm_pca)
## 
## Call:
## glm(formula = price ~ pca1 + pca2 + pca3 + pca4 + cluster, family = Gamma(link = "sqrt"), 
##     data = diamonds_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.12715  -0.24877  -0.04553   0.16452   1.95690  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 55.02880    0.11968  459.78   <2e-16 ***
## pca1        11.87183    0.03733  318.03   <2e-16 ***
## pca2         3.37035    0.03747   89.96   <2e-16 ***
## pca3         2.25217    0.04057   55.51   <2e-16 ***
## pca4        -0.86510    0.05872  -14.73   <2e-16 ***
## cluster2     6.62809    0.12793   51.81   <2e-16 ***
## cluster3    -3.24163    0.15406  -21.04   <2e-16 ***
## cluster4     3.01052    0.14841   20.29   <2e-16 ***
## cluster5    -6.77135    0.25980  -26.06   <2e-16 ***
## cluster6     1.73995    0.16787   10.37   <2e-16 ***
## cluster7    -2.82795    0.26942  -10.50   <2e-16 ***
## cluster8     7.34767    0.21746   33.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.1129233)
## 
##     Null deviance: 42126.5  on 42990  degrees of freedom
## Residual deviance:  4544.4  on 42979  degrees of freedom
## AIC: 695577
## 
## Number of Fisher Scoring iterations: 11

PCA helps us use fewer features in a model. As expected, however, with fewer features the AIC of the model increases, which means its explanatory power decreases. Since we do not have very many features, we do not need to reduce their number. We could also use all of the principal components instead of the original variables to deal with multicollinearity; however, that model is not investigated in this study. For the best GLM, we can calculate the R^2 value on the test set as follows:

model_glm2_prediction<-predict(model_glm2, newdata=diamonds_test)
model_glm2_r2 <- 1 - (sum((model_glm2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_glm2_r2
## [1] 0.9053068

To visually see the prediction vs actual price values, we can check the below plot.

pred_vs_real_glm <- as.data.frame(cbind(model_glm2_prediction, diamonds_test$price))
colnames(pred_vs_real_glm) = c("prediction", "actual")

pred_vs_real_glm %>%
  ggplot(aes(prediction, actual)) +
  geom_point(color="seagreen3")  +
  theme_minimal()+
  geom_abline(intercept = 0, slope = 1, color="red3", size=2, linetype=2) +
  labs(title = "Prediction vs Actual Values for the Generalized Linear Model",
       subtitle = "Diamonds dataset",
       x = "Predictions",
       y = "Real Values")

5.2. Classification and Regression Tree

model_rpart <- rpart(price~., data=diamonds_train)
fancyRpartPlot(model_rpart, sub="CART Model")

From the plot of the tree, we can see that the nodes are split using only the carat, y, and clarity columns, which means these features are the most effective at reducing the variance in the dataset. We can cross-validate the cp (complexity parameter) argument with the following code:

tr.control = trainControl(method = "cv", number = 10)
cp.grid = expand.grid( .cp = (1:10)*0.001)
tr = train(price~. - pca1 - pca2 - pca3 - pca4, data = diamonds_train, method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
tr
## CART 
## 
## 42991 samples
##    14 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 38691, 38693, 38694, 38691, 38692, 38691, ... 
## Resampling results across tuning parameters:
## 
##   cp     RMSE      Rsquared   MAE     
##   0.001   950.902  0.9430052  580.0119
##   0.002  1036.373  0.9322755  628.2955
##   0.003  1087.797  0.9253495  656.4300
##   0.004  1137.831  0.9182846  685.5459
##   0.005  1145.996  0.9171573  688.0164
##   0.006  1245.355  0.9021746  811.5315
##   0.007  1279.303  0.8967777  846.7474
##   0.008  1279.303  0.8967777  846.7474
##   0.009  1279.303  0.8967777  846.7474
##   0.010  1362.123  0.8829236  883.8798
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.001.

Comparing the cp values, cp = 0.001 gives the lowest RMSE, so we use it.

model_rpart2 <- rpart(price~. - pca1 - pca2 - pca3 - pca4, data=diamonds_train, cp = 0.001)
fancyRpartPlot(model_rpart2, cex=0.4, sub="CART Model")

This is a more detailed tree. From its plot, we can say that the price is predicted from the carat, y, clarity, and color values. To compare the two trees, we need to calculate their R^2 values on the test set.
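For the first tree, the same calculation can be sketched as follows; the corresponding value for the detailed tree is computed in the chunk after it.

model_rpart_prediction <- predict(model_rpart, newdata = diamonds_test)
model_rpart_r2 <- 1 - (sum((model_rpart_prediction - diamonds_test$price)^2) /
                         sum((diamonds_test$price - mean(diamonds_train$price))^2))
model_rpart_r2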

model_rpart2_prediction<-predict(model_rpart2,newdata=diamonds_test)
model_rpart2_r2 <- 1 - (sum((model_rpart2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_rpart2_r2
## [1] 0.944111

The R^2 value for the detailed tree is calculated above; it is higher than the 0.905 obtained by the best generalized linear model. We can therefore conclude that the CART model performs better than the generalized linear model on this dataset.

6. Conclusion

The steps during the assignment can be summarized as below:

  • Preprocessing of the data by checking the accuracy of the variables, NA values, and duplicated rows
  • Exploratory data analysis and visualization of the data
  • Principal component analysis and the K-means algorithm
  • Generalized linear model construction
  • Classification and regression tree model construction
  • Comparison between the models

Important results:

  • Data preprocessing is important, since the quality of the data may affect the estimation of price and mislead us.
  • Exploratory data analysis is useful for understanding the data before constructing models; however, relying on EDA alone may mislead us about the relationships between variables.
  • K-means is a useful algorithm for creating new features.
  • Principal component analysis helps us reduce the number of variables when there are very many of them; PCA is also useful for handling multicollinearity in models.
  • Linear models are widely used, but when a dataset does not fulfill their assumptions we cannot use them; in those cases generalized linear models, with their more flexible assumptions, are convenient.
  • The best CART tree can be found by tuning the cp value.