Diamonds are the result of carbon exposed to extremely high heat and pressure, deep in the earth. This process results in a variety of both internal and external characteristics(i.e. size, shape, color, etc.) which makes every diamond unique. Therefore, jewelry industry uses a systematic way to evaluate and discus these factors. Diamond professionals use the grading system developed by GIA in the 1950s, which established the use of four important factors to describe and classify diamonds: Clarity, Color, Cut, and Carat Weight.These are known as the 4Cs. When used together, they describe the quality of a finished diamond. The value of a finished diamond is based on this combination. The explanation of 4C can be seen below:
The diamonds dataset that we use in this report consists of prices and quality information from about 54,000 diamonds, and is included in the ggplot2 package. The dataset contains information on prices of diamonds, as well as various attributes of diamonds, some of which are known to influence their price (in 2008). These attributes consist of the 4C and some physical measurements such as depth, table, x, y, and z.
In this assignment, firstly, the data will be investigated for preprocessing to improve its quality. Then an exploratory data analysis will be performed by data manipulation and data visualization steps. The main purpose is to identify which variables affect the price of diamonds mostly and come up with a conclusion for the relationship between variables. In addition to these, forecasting models will be studied in order to estimate the price of diamonds with given properties. The processes during the assignment can be listed as below:
library(tidyverse)
library(knitr)
library(tinytex)
library(kableExtra)
library(corrplot)
library(RColorBrewer)
library(ggExtra)
library(rpart)
library(rpart.plot)
library(rattle)
library(data.table)
library(caret)
library(e1071)
The packages used during the assignment can be listed as below:
Before starting the analysis with the dataset, it will be useful to have information about the properties of diamonds. There are 10 columns regarding the properties of diamonds.
carat
: Weight of the diamond (numeric variable)cut
: Quality of the cut, from Fair, Good, Very Good, Premium, Ideal (categoric variable)color
: Color of the diamond, from D (best) to J (worst) (categoric variable)clarity
: A measurement of how clear the diamond is, from I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)(categoric variable)depth
: A measurement of the diamond called total depth percentage equals to z / mean(x, y) = 2 * z / (x + y)table
: width of top of diamond relative to widest point (numeric variable)price
: Price of the diamond in US dollars ranging between $326 to $18,823 (numeric variable)x
: Length of the diamond in mm (numeric variable)y
: Width of the diamond in mm (numeric variable)z
: Depth of the diamond in mm (numeric variable)Diamond color type D is known as the best color while J is the worst, however str()
output shows that the order is wrong, therefore we should identify the levels of the color variable in diamonds dataset.
str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
$color <- factor(diamonds$color, levels = c("J", "I", "H", "G", "F", "E", "D")) diamonds
The usefulness of the results obtained from analyzes depends on quality of the data. For this reason, it is important to investigate NA values and duplicate rows in diamonds dataset.
price
,carat
,depth
,table
,x
, ye
and, z
variables.sum(any(is.na(diamonds)))
%>% filter(price<0 | carat<0 | depth<0 | table<0 | x<0 | y<0 | z<0) %>%nrow()
diamonds sum(duplicated(diamonds))
unique(diamonds)
diamonds <-sum(as.numeric(duplicated(diamonds)))
Unrealistic price values:
The dataset must include realistic values, therefore it shouldn’t contain negative values for the numeric values which correspond to a specific characteristics of a diamond. Price should be greater than zero, therefore, we check dataset to control negative price values. According to the results there is no negative price value, as expected.
Missing x, y, z values:
These values correspond to length, width, and depth, therefore they should be positive. We can impute the wrongly filled zero values if we have other dimensions since depth is a value obtained by a formula which include x,y, and z.
*Depth(%)= z / mean(x, y)*
%>%
diamonds filter(x == 0 & y != 0 & z !=0)
%>%
diamonds filter(y == 0 & x != 0 & z !=0)
%>%
diamonds filter(z == 0 & x != 0 & y !=0)
diamonds %>%
diamonds <- mutate(z = ifelse(z == 0 & x != 0 & y != 0,round(depth * mean(c(x, y)) / 100, 2), z))
We can remove the rows with 0 value in x, y, or z, which cannot be filled by calculation. It can be seen that there are 7 such rows in this dataset.
%>%
diamonds filter(x == 0 | y == 0 | z == 0)
## # A tibble: 7 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.07 Ideal F SI2 61.6 56 4954 0 6.62 0
## 2 1 Very Good H VS2 63.3 53 5139 0 0 0
## 3 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 4 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 5 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 6 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 7 0.71 Good F SI2 64.1 60 2130 0 0 0
%>%
diamonds<-diamonds filter(!(x == 0 | y == 0 | z == 0))
Unrealistic x, y, z values:
Previous part, we make possible corrections to dimensions for 0 values. In this part, we will check the unrealistic values by the help of depth value.
range(diamonds$x)
range(diamonds$y)
range(diamonds$z)
sort(unique(diamonds$x), decreasing = TRUE) %>% head(10)
## [1] 10.74 10.23 10.14 10.02 10.01 10.00 9.86 9.66 9.65 9.54
sort(unique(diamonds$y), decreasing = TRUE) %>% head(10)
## [1] 58.90 31.80 10.54 10.16 10.10 9.94 9.85 9.81 9.63 9.59
sort(unique(diamonds$z), decreasing = TRUE) %>% head(10)
## [1] 31.80 8.06 6.98 6.72 6.43 6.38 6.31 6.27 6.24 6.17
We can calculate these abnormal y and z values by using depth formula to decide if there are any typo with decimals.
%>%
diamonds filter(y > 20) %>%
mutate(calculated_y = (2 * z / depth) / 100 - x) %>%
select(depth, x, z, y, calculated_y)
## # A tibble: 2 x 5
## depth x z y calculated_y
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 58.9 8.09 8.06 58.9 -8.09
## 2 61.8 5.15 5.12 31.8 -5.15
diamonds %>%
diamonds <- filter(y < 20)
%>%
diamonds filter(z > 10) %>%
mutate(calculated_z = depth * mean(c(5.12, 5.15)) / 100) %>%
select(depth, x, y, z, calculated_z)
## # A tibble: 1 x 5
## depth x y z calculated_z
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 61.8 5.12 5.15 31.8 3.17
$z[diamonds$z == 31.8] = diamonds$z[diamonds$z == 31.8] / 10 diamonds
Depth deviation from calculated depth:
Since we know the formula for depth(%) we can calculate, and compare it with the depth
column in the dataset. While small differences can be ignored, rows with large differences are an indication that there is a measurement problem with those diamond, therefore, they can be removed from the dataset.We can plot depth and calculated depth in order to see the differences.
$calculated_depth = 2 * diamonds$z / (diamonds$x + diamonds$y) * 100
diamonds%>%
diamonds ggplot(., aes(x = calculated_depth, y = depth)) +
geom_point(color="seagreen3") +
geom_abline(intercept = 0, slope = 1, color="red3", size=2, linetype=2)+
theme_minimal()+
labs(title="Calculated Depth vs Dataset Depth",
subtitle="Diamonds dataset",
x= "Calculated Depth",
y="Dataset Depth")
Although most of the points are around the line, there are still points with large distances from the line. Now, we should investigate those points. I assumed that if a point’s depth deviates from calculated depth more than 5% percentage, I can remove it from the dataset. I come up with this conclusion after reading this link, which shows the depth intervals for different types of diamonds. More than almost a 5% of depth mistake will result in a wrong identification.These points can be seen below plot.
diamonds %>%
diamonds <- filter((abs(calculated_depth - depth) < 5))
subset(diamonds, select = -c(calculated_depth)) diamonds <-
x, y, and z relationship:
x, y, and z values are belong to length, width, and depth, respectively. For a physical substance, in general all those dimensions are highly correlated, therefore we can check these relationships by using a line graph to see if there are any unrealistic points. As expected, these values are highly correlated. We can assume that these x, y and z values are valid values.
%>%
diamonds ggplot(., aes(x = x, y = y)) +
geom_point(color="royalblue2") +
geom_smooth(color="salmon", size=2, linetype=2)+
theme_minimal()+
labs(title="Diamond Length vs Diamond Width",
subtitle="Diamonds dataset",
x= "Diamond Length",
y="Diamond Width")
%>%
diamonds ggplot(., aes(x = z, y = y)) +
geom_point(color="royalblue2") +
geom_smooth(color="salmon", size=2, linetype=2)+
theme_minimal()+
labs(title="Diamond Depth vs Diamond Width",
subtitle="Diamonds dataset",
x= "Diamond Depth",
y="Diamond Width")
%>%
diamonds ggplot(., aes(x = x, y = z)) +
geom_point(color="royalblue2") +
geom_smooth(color="salmon", size=2, linetype=2)+
theme_minimal()+
labs(title="Diamond Length vs Diamond Depth",
subtitle="Diamonds dataset",
x= "Diamond Length",
y="Diamond Depth")
After preprocessing of the data, the summary and glimpse of the data can be seen below.In this new dataset, we have 53,749 rows.
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1592 J: 2799 SI1 :13024 Min. :50.80
## 1st Qu.:0.4000 Good : 4887 I: 5405 VS2 :12220 1st Qu.:61.00
## Median :0.7000 Very Good:12065 H: 8258 SI2 : 9136 Median :61.80
## Mean :0.7974 Premium :13728 G:11249 VS1 : 8150 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21477 F: 9515 VVS2 : 5054 3rd Qu.:62.50
## Max. :5.0100 E: 9770 VVS1 : 3644 Max. :79.00
## D: 6753 (Other): 2521
## table price x y
## Min. :43.00 Min. : 326 Min. : 3.730 Min. : 3.680
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3930 Mean : 5.731 Mean : 5.733
## 3rd Qu.:59.00 3rd Qu.: 5323 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :10.540
##
## z
## Min. :2.240
## 1st Qu.:2.910
## Median :3.530
## Mean :3.539
## 3rd Qu.:4.030
## Max. :6.980
##
glimpse(diamonds)
## Rows: 53,749
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
Since the mean is greater than median for price, the distribution is right-skewed.
%>% ggplot(.,aes(x=price))+
diamonds geom_vline(aes(xintercept=mean(price)),
color="black", linetype="dashed", size=1)+
geom_histogram(bins=30, color="tomato3", fill="salmon")+
theme_minimal()+
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
labs(title = "Distribution of Price",
subtitle = "Diamonds dataset",
x = "Price")
%>%
diamonds group_by(cut) %>%
summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
mutate(percentage = 100*round(count/sum(count),3))%>%
arrange((cut)) %>%
select(cut, count, percentage, Min_Price, Average_Price, Max_Price ) %>%
kable(col.names = c("Diamond Cut","Count","Percentage", "Minimum Price", "Average Price", "Maximum Price"))%>%
kable_minimal(full_width = F)
Diamond Cut | Count | Percentage | Minimum Price | Average Price | Maximum Price |
---|---|---|---|---|---|
Fair | 1592 | 3.0 | 337 | 4330.16 | 18574 |
Good | 4887 | 9.1 | 327 | 3916.91 | 18707 |
Very Good | 12065 | 22.4 | 336 | 3981.30 | 18818 |
Premium | 13728 | 25.5 | 326 | 4577.45 | 18823 |
Ideal | 21477 | 40.0 | 326 | 3461.78 | 18806 |
%>% ggplot(., aes(cut, price)) +
diamonds geom_boxplot(aes(fill=factor(cut))) +
scale_fill_brewer(palette="Pastel1")+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
theme_minimal()+
labs(title="Price vs Cut",
subtitle="Diamonds dataset",
x="Diamond Cut",
y="Price",
fill="Diamond Cut")
%>%
diamonds ggplot(., aes(x=price)) +
geom_histogram(binwidth=500, position="dodge", fill="pink2") +
theme_minimal()+facet_wrap(~cut, ncol=5)+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
theme(strip.background = element_rect(fill="pink3"))+
labs(title="Price Frequency vs Diamond Cut",
subtitle="Diamonds dataset",
x="Price",
y="Frequency",
fill="Diamond Cut")
Although the best cut type is Ideal, its price is the lowest. According to the average prices, the most expensive diamonds belong to Premium and Fair cut types. These results present that cut is not enough to explain response variable price, since price does not increase while cut feature improves.
%>%
diamonds group_by(color) %>%
summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
mutate(percentage = 100*round(count/sum(count),3))%>%
arrange((color)) %>%
select(color, count, percentage, Min_Price, Average_Price, Max_Price ) %>%
kable(col.names = c("Diamond Color","Count","Percentage", "Minimum Price", "Average Price", "Maximum Price"))%>%
kable_minimal(full_width = F)
Diamond Color | Count | Percentage | Minimum Price | Average Price | Maximum Price |
---|---|---|---|---|---|
J | 2799 | 5.2 | 335 | 5328.11 | 18710 |
I | 5405 | 10.1 | 334 | 5079.12 | 18823 |
H | 8258 | 15.4 | 337 | 4474.49 | 18803 |
G | 11249 | 20.9 | 354 | 3997.28 | 18818 |
F | 9515 | 17.7 | 342 | 3726.73 | 18791 |
E | 9770 | 18.2 | 326 | 3080.42 | 18731 |
D | 6753 | 12.6 | 357 | 3172.01 | 18693 |
%>%
diamonds group_by(color) %>%
summarise(Average_Price = mean(price)) %>%
ggplot(.,aes(x=color, y = Average_Price, fill= color)) +
geom_col(color="black") +
geom_text(aes(label = format(Average_Price, digits = 3)), position = position_dodge(0.9),vjust = 5) +
geom_line(aes(y = Average_Price), color="black", group=1, size=0.8)+
theme_minimal() +
theme(axis.text.x = element_text(angle = 65))+
scale_fill_brewer(palette="Pastel2")+
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
labs(title="Price vs Color",
subtitle="Diamonds dataset",
x="Diamond Color",
y="Average Price",
fill="Diamond color")
%>%
diamonds ggplot(., aes(x=price)) +
geom_histogram(binwidth=500, position="dodge", fill="palegreen3") +
theme_minimal()+facet_wrap(~color, ncol=7)+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
theme(strip.background = element_rect(fill="seagreen3"))+
labs(title="Price Frequency vs Diamond Cut",
subtitle="Diamonds dataset",
x="Price",
y="Frequency",
fill="Diamond Cut")
Although the best color type is D, its price is one of the lowest. According to the average prices, the most expensive diamonds belong to J and I cut types which are actually the worst two color type in this dataset. These results present that color is not enough to explain response variable price, since price does not increase while color feature improves.
%>%
diamonds group_by(clarity) %>%
summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
mutate(percentage = 100*round(count/sum(count),3))%>%
arrange((clarity)) %>%
select(clarity, count, percentage, Min_Price, Average_Price, Max_Price ) %>%
kable(col.names = c("Diamond Clarity","Count","Percentage", "Minimum Price", "Average Price", "Maximum Price"))%>%
kable_minimal(full_width = F)
Diamond Clarity | Count | Percentage | Minimum Price | Average Price | Maximum Price |
---|---|---|---|---|---|
I1 | 737 | 1.4 | 345 | 3927.30 | 18531 |
SI2 | 9136 | 17.0 | 326 | 5052.29 | 18804 |
SI1 | 13024 | 24.2 | 326 | 3993.19 | 18818 |
VS2 | 12220 | 22.7 | 334 | 3925.98 | 18823 |
VS1 | 8150 | 15.2 | 327 | 3839.59 | 18795 |
VVS2 | 5054 | 9.4 | 336 | 3287.22 | 18768 |
VVS1 | 3644 | 6.8 | 336 | 2523.49 | 18777 |
IF | 1784 | 3.3 | 369 | 2870.57 | 18806 |
%>%
diamonds ggplot(., aes(x = clarity, y = price, color = clarity)) +
scale_color_brewer(palette="Set3")+
geom_jitter(alpha=0.7) +
theme_minimal() +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title = "Price vs Clarity",
subtitle = "Diamonds dataset",
x = "Diamond Clarity",
y = "Price",
color = "Diamond Clarity")
%>%
diamonds ggplot(., aes(x=price)) +
geom_histogram(binwidth=500, position="dodge", fill="skyblue3") +
theme_minimal()+facet_wrap(~clarity, ncol=8)+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
theme(strip.background = element_rect(fill="skyblue2"))+
labs(title="Price Frequency vs Diamond Clarity",
subtitle="Diamonds dataset",
x="Price",
y="Frequency",
fill="Diamond Clarity")
Although the best clarity type is IF, its price is the lowest. According to the average prices, the most expensive diamonds belong to SI2 clarity types which is actually the second worst clarity type in this dataset. These results present that clarity is not enough to explain response variable price, since price does not increase while clarity feature improves.
The most frequent carat weight equals to 0.3 in this dataset. From the histogram, we can see the distribution of carat. To see all carats according to their count, following table can be analyzed. It can be said that, most of the carat values in this dataset are less than 1.
function(v) {
getmode <- unique(v)
uniqv <-which.max(tabulate(match(v, uniqv)))]
uniqv[
}getmode(diamonds$carat)
## [1] 0.3
%>% ggplot(.,aes(x=carat))+
diamonds geom_vline(aes(xintercept=mean(carat)),
color="black", linetype="dashed", size=1)+
geom_histogram(bins=30, color="tomato3", fill="salmon")+
theme_minimal()+
labs(title = "Distribution of Carat",
subtitle = "Diamonds dataset",
x = "Carat")
%>%
diamonds group_by(carat) %>%
summarise(carat_count = n(),
minPrice = min(price),
AveragePrice = mean(price),
MaxPrice = max(price))%>%
arrange(desc(carat_count)) %>%
kable(col.names = c("Carat", "Count","Minimum Price", "Average Price", "Maximum Price")) %>%
kable_styling("striped", full_width = T) %>%
scroll_box(width = "100%", height = "300px")
Carat | Count | Minimum Price | Average Price | Maximum Price |
---|---|---|---|---|
0.30 | 2596 | 339 | 680.6367 | 2366 |
0.31 | 2238 | 335 | 708.1997 | 1917 |
1.01 | 2238 | 2017 | 5508.9821 | 16234 |
0.70 | 1980 | 945 | 2516.2434 | 5539 |
0.32 | 1827 | 345 | 719.1916 | 1715 |
1.00 | 1542 | 1681 | 5250.9514 | 16469 |
0.90 | 1484 | 1599 | 3939.5889 | 9182 |
0.41 | 1378 | 467 | 973.0530 | 1745 |
0.40 | 1291 | 484 | 933.0077 | 1908 |
0.71 | 1291 | 1305 | 2674.3106 | 5019 |
0.50 | 1254 | 584 | 1503.8453 | 3378 |
0.33 | 1179 | 366 | 765.1747 | 1389 |
0.51 | 1123 | 826 | 1623.5841 | 4368 |
0.34 | 908 | 413 | 755.7423 | 2346 |
1.02 | 883 | 2344 | 5598.4632 | 17100 |
0.52 | 813 | 870 | 1656.4785 | 3373 |
1.51 | 804 | 3497 | 10552.0211 | 18806 |
1.50 | 787 | 2964 | 10082.9835 | 18691 |
0.72 | 764 | 945 | 2705.7788 | 5086 |
0.53 | 706 | 844 | 1685.0368 | 3827 |
0.42 | 704 | 552 | 994.4545 | 1722 |
0.38 | 669 | 497 | 904.1973 | 1992 |
0.35 | 663 | 409 | 799.5098 | 1672 |
1.20 | 643 | 2360 | 6670.7387 | 16256 |
0.54 | 624 | 935 | 1704.3317 | 4042 |
0.91 | 570 | 1570 | 3981.1561 | 9636 |
0.36 | 569 | 410 | 781.0070 | 1718 |
1.03 | 522 | 1262 | 5583.8966 | 17590 |
0.55 | 495 | 959 | 1725.2283 | 3509 |
0.56 | 492 | 953 | 1831.9980 | 4632 |
0.73 | 492 | 1049 | 2791.2480 | 5154 |
0.43 | 486 | 452 | 995.2737 | 1827 |
1.04 | 474 | 2037 | 5774.6709 | 18542 |
1.21 | 470 | 2396 | 6905.4043 | 17353 |
2.01 | 433 | 5696 | 14772.7136 | 18741 |
0.57 | 429 | 929 | 1798.0443 | 4032 |
0.39 | 397 | 451 | 923.4761 | 1624 |
0.37 | 394 | 451 | 816.0711 | 1764 |
1.52 | 380 | 3105 | 10629.9868 | 17723 |
1.06 | 372 | 2080 | 5947.0081 | 15813 |
1.05 | 361 | 2066 | 5781.6814 | 13060 |
1.07 | 340 | 2260 | 6004.2118 | 18279 |
0.74 | 322 | 1632 | 2821.9006 | 5841 |
0.58 | 310 | 893 | 1798.6484 | 3741 |
1.11 | 308 | 3183 | 6005.5682 | 15464 |
1.22 | 298 | 2699 | 6984.3188 | 14414 |
0.23 | 293 | 326 | 486.1433 | 688 |
1.09 | 285 | 2708 | 6020.0947 | 18231 |
0.80 | 284 | 1232 | 3146.1901 | 6834 |
0.59 | 282 | 903 | 1831.4504 | 4916 |
1.23 | 277 | 3168 | 7290.8881 | 16253 |
1.10 | 275 | 3216 | 5921.1345 | 12748 |
2.00 | 261 | 5051 | 14107.1456 | 18818 |
0.24 | 254 | 336 | 505.1850 | 963 |
0.26 | 253 | 337 | 550.8972 | 814 |
0.77 | 251 | 1651 | 2884.3984 | 5251 |
0.76 | 250 | 1140 | 2891.4520 | 5056 |
0.75 | 249 | 1013 | 2763.4016 | 5288 |
1.12 | 249 | 2854 | 6058.9116 | 14407 |
1.08 | 246 | 2869 | 5774.3415 | 14032 |
1.13 | 246 | 2968 | 6193.4024 | 14525 |
1.24 | 235 | 2940 | 7147.1149 | 15026 |
0.27 | 233 | 361 | 574.7597 | 893 |
0.60 | 226 | 806 | 1956.2434 | 4209 |
0.92 | 226 | 2283 | 4035.2124 | 7544 |
1.53 | 219 | 5631 | 10703.4658 | 17223 |
1.70 | 215 | 5617 | 12110.3209 | 18718 |
0.25 | 212 | 357 | 550.9245 | 1186 |
0.44 | 212 | 648 | 1056.6226 | 2506 |
1.14 | 206 | 2327 | 6354.4078 | 18112 |
0.61 | 204 | 931 | 2069.2451 | 4279 |
0.81 | 199 | 1687 | 3091.5126 | 6338 |
0.28 | 198 | 360 | 580.1212 | 828 |
0.78 | 187 | 1720 | 2918.9091 | 4904 |
1.25 | 187 | 3276 | 7123.2513 | 16823 |
0.46 | 177 | 746 | 1240.2429 | 2139 |
2.02 | 176 | 6346 | 14767.4091 | 18731 |
1.54 | 174 | 4492 | 10865.0287 | 18416 |
1.16 | 172 | 2368 | 6274.2965 | 13595 |
0.79 | 151 | 1809 | 2993.2649 | 4858 |
1.15 | 148 | 3220 | 5962.5811 | 12895 |
1.26 | 146 | 3881 | 7490.7466 | 15247 |
0.93 | 142 | 2374 | 3994.2958 | 5992 |
0.82 | 140 | 2033 | 3083.6500 | 4931 |
0.62 | 135 | 933 | 1973.3704 | 3968 |
1.27 | 134 | 2845 | 7119.1194 | 15377 |
1.31 | 132 | 3697 | 7750.6364 | 17496 |
0.83 | 131 | 1863 | 3181.9924 | 7889 |
0.29 | 130 | 334 | 601.1923 | 1776 |
1.19 | 126 | 2892 | 6483.2381 | 15767 |
1.18 | 123 | 3219 | 6481.1057 | 14847 |
1.55 | 123 | 4965 | 10717.7642 | 18188 |
2.03 | 122 | 6002 | 14710.1148 | 18781 |
1.30 | 121 | 2512 | 7851.6612 | 15874 |
1.71 | 119 | 6320 | 12396.5546 | 18791 |
0.45 | 110 | 672 | 1093.3182 | 1833 |
1.17 | 110 | 2336 | 6224.8818 | 11886 |
1.56 | 108 | 5766 | 10205.9444 | 17455 |
1.28 | 106 | 3355 | 7711.8113 | 18700 |
1.57 | 105 | 5170 | 10444.5143 | 17548 |
0.96 | 103 | 1637 | 3623.4660 | 6807 |
0.63 | 102 | 1065 | 2244.7843 | 6607 |
1.29 | 101 | 2596 | 7033.0198 | 17932 |
0.47 | 99 | 719 | 1272.3131 | 2460 |
1.60 | 95 | 6272 | 11124.2526 | 18780 |
1.32 | 89 | 4140 | 7639.3708 | 15802 |
1.58 | 89 | 4459 | 11427.9888 | 18057 |
1.59 | 89 | 6010 | 10574.3596 | 17552 |
1.33 | 87 | 4634 | 7527.0805 | 18435 |
2.04 | 86 | 7403 | 14840.1744 | 18795 |
0.64 | 80 | 1080 | 2113.5875 | 4281 |
1.35 | 77 | 4082 | 7448.2857 | 14616 |
1.34 | 68 | 4352 | 7758.5000 | 17663 |
0.65 | 65 | 1214 | 2322.6154 | 5052 |
0.95 | 65 | 2195 | 4013.6615 | 6951 |
2.05 | 65 | 7026 | 15647.2462 | 18787 |
0.84 | 64 | 1749 | 3294.9844 | 6399 |
1.61 | 64 | 6915 | 10992.8438 | 18318 |
0.48 | 63 | 841 | 1322.0476 | 2677 |
0.85 | 62 | 1250 | 3204.1774 | 7230 |
1.62 | 61 | 5157 | 11705.8852 | 18281 |
0.94 | 59 | 2571 | 4195.7797 | 7120 |
0.97 | 59 | 1547 | 4026.2373 | 7415 |
2.06 | 59 | 5203 | 14348.6780 | 18779 |
1.72 | 57 | 5240 | 12040.3158 | 18730 |
1.73 | 52 | 6007 | 11875.9615 | 18377 |
2.10 | 51 | 6597 | 14464.3725 | 18648 |
1.36 | 50 | 4158 | 8591.5200 | 15386 |
1.40 | 50 | 4939 | 9108.5200 | 16808 |
1.63 | 50 | 7100 | 11556.3200 | 18179 |
1.75 | 50 | 7276 | 12673.7000 | 17343 |
2.07 | 50 | 6503 | 15384.2800 | 18804 |
0.66 | 48 | 1188 | 2235.7292 | 4311 |
0.67 | 48 | 1244 | 2067.8542 | 3271 |
2.14 | 48 | 5405 | 14905.0417 | 18528 |
1.37 | 46 | 5063 | 8044.0870 | 13439 |
0.49 | 45 | 900 | 1306.3333 | 2199 |
2.09 | 45 | 10042 | 15287.3111 | 18640 |
1.64 | 43 | 4849 | 10738.0000 | 17604 |
2.11 | 43 | 7019 | 14563.4186 | 18575 |
2.08 | 41 | 6150 | 15128.6829 | 18760 |
1.74 | 40 | 4677 | 12270.3250 | 18430 |
1.41 | 39 | 4145 | 9452.2051 | 17216 |
1.39 | 36 | 3914 | 8335.1111 | 13061 |
0.86 | 34 | 1828 | 3289.9706 | 5029 |
1.65 | 32 | 5914 | 10973.8750 | 17729 |
0.87 | 31 | 1827 | 3342.3226 | 5640 |
0.98 | 31 | 1712 | 3469.7742 | 4950 |
2.20 | 31 | 7934 | 15226.8710 | 18364 |
1.66 | 30 | 5310 | 11084.5000 | 16294 |
2.18 | 30 | 10340 | 14901.7000 | 18717 |
1.76 | 28 | 9053 | 12398.5714 | 18178 |
2.22 | 27 | 5607 | 14691.5185 | 18706 |
0.69 | 26 | 1636 | 2206.6154 | 3784 |
1.38 | 26 | 4598 | 8164.3077 | 17598 |
0.68 | 25 | 1341 | 2334.3200 | 4325 |
1.42 | 25 | 6069 | 9487.2000 | 18682 |
1.67 | 25 | 7911 | 12281.4800 | 17351 |
2.12 | 25 | 7508 | 15007.6800 | 18656 |
2.16 | 25 | 8709 | 15860.4800 | 18678 |
1.69 | 24 | 5182 | 11878.0833 | 17803 |
0.88 | 23 | 1964 | 3431.3478 | 5595 |
0.99 | 23 | 1789 | 4406.1739 | 6094 |
2.21 | 23 | 6535 | 15287.9565 | 18483 |
2.15 | 22 | 5430 | 14187.1818 | 18791 |
2.19 | 22 | 10179 | 15521.0909 | 18693 |
0.89 | 21 | 1334 | 3291.2381 | 4854 |
1.47 | 21 | 4898 | 7833.7143 | 12547 |
2.13 | 21 | 12356 | 15220.4286 | 18442 |
1.80 | 20 | 7862 | 12955.1500 | 17273 |
2.28 | 20 | 9068 | 15422.9500 | 18055 |
2.30 | 20 | 7226 | 14861.5500 | 18304 |
1.43 | 19 | 6086 | 9234.9474 | 14429 |
1.68 | 19 | 5765 | 11779.1053 | 16985 |
1.44 | 18 | 4064 | 8021.6111 | 12093 |
1.46 | 18 | 5421 | 8237.4444 | 12261 |
1.83 | 18 | 5083 | 11930.0000 | 18525 |
2.17 | 18 | 6817 | 14416.6111 | 17805 |
1.77 | 17 | 7771 | 11103.1765 | 15278 |
2.29 | 17 | 11502 | 15834.7647 | 18823 |
2.51 | 17 | 14209 | 16616.8824 | 18308 |
2.24 | 16 | 10913 | 15573.9375 | 18299 |
2.25 | 16 | 6653 | 13403.5625 | 18034 |
2.32 | 16 | 6860 | 16042.1875 | 18532 |
2.50 | 16 | 7854 | 14316.1875 | 18447 |
1.45 | 15 | 4320 | 8709.3333 | 17545 |
1.79 | 15 | 9122 | 13144.0667 | 18429 |
2.26 | 15 | 11226 | 15330.2000 | 18426 |
1.82 | 13 | 5993 | 12986.1538 | 15802 |
2.23 | 13 | 7006 | 13947.4615 | 18149 |
2.31 | 13 | 7257 | 15017.3846 | 17715 |
2.40 | 13 | 11988 | 16362.4615 | 18541 |
0.20 | 12 | 345 | 365.1667 | 367 |
1.78 | 12 | 8889 | 12504.5833 | 18128 |
1.91 | 12 | 6632 | 12365.8333 | 17509 |
2.27 | 12 | 5733 | 14198.4167 | 18343 |
3.01 | 12 | 8040 | 15521.6667 | 18710 |
1.49 | 11 | 5407 | 9174.2727 | 18614 |
0.21 | 9 | 326 | 380.2222 | 394 |
1.81 | 9 | 5859 | 11374.4444 | 14177 |
1.86 | 9 | 9791 | 12164.3333 | 17267 |
2.33 | 9 | 8220 | 14089.7778 | 18119 |
2.48 | 9 | 12883 | 16537.1111 | 18692 |
2.52 | 9 | 10076 | 16242.1111 | 18252 |
2.36 | 8 | 11238 | 15756.7500 | 18745 |
2.38 | 8 | 10716 | 15481.3750 | 18559 |
2.53 | 8 | 14659 | 16548.8750 | 18254 |
2.54 | 8 | 12095 | 15125.2500 | 17996 |
3.00 | 8 | 6512 | 12378.5000 | 16970 |
1.48 | 7 | 6216 | 8791.2857 | 15164 |
1.87 | 7 | 9955 | 12933.5714 | 17761 |
1.90 | 7 | 8576 | 11477.1429 | 13919 |
2.35 | 7 | 14185 | 16090.4286 | 17999 |
2.39 | 7 | 15917 | 17182.4286 | 17955 |
2.42 | 7 | 16198 | 16923.5714 | 18615 |
1.93 | 6 | 13278 | 16828.1667 | 18306 |
2.37 | 6 | 14837 | 16525.1667 | 18508 |
2.43 | 6 | 9716 | 15356.0000 | 18692 |
0.22 | 5 | 337 | 391.4000 | 470 |
1.98 | 5 | 12923 | 14667.8000 | 16171 |
2.34 | 5 | 8491 | 11922.0000 | 15818 |
2.41 | 5 | 13563 | 15808.2000 | 17923 |
1.84 | 4 | 7922 | 11108.5000 | 13905 |
1.88 | 4 | 8048 | 10946.2500 | 13607 |
1.89 | 4 | 10055 | 13969.0000 | 17553 |
1.96 | 4 | 5554 | 8642.2500 | 13099 |
1.97 | 4 | 14410 | 15966.0000 | 17535 |
2.44 | 4 | 13027 | 16365.5000 | 18430 |
2.45 | 4 | 11830 | 14593.7500 | 18113 |
1.85 | 3 | 5688 | 11436.0000 | 14375 |
1.94 | 3 | 9344 | 14189.3333 | 18735 |
1.95 | 3 | 5045 | 12340.6667 | 17374 |
1.99 | 3 | 11486 | 14179.0000 | 17713 |
2.46 | 3 | 10470 | 14227.0000 | 16466 |
2.47 | 3 | 14970 | 15644.0000 | 16532 |
2.49 | 3 | 6289 | 13843.0000 | 18325 |
2.55 | 3 | 14351 | 15964.0000 | 18766 |
2.56 | 3 | 15231 | 16723.3333 | 17753 |
2.57 | 3 | 17116 | 17841.6667 | 18485 |
2.58 | 3 | 12500 | 14481.3333 | 16195 |
2.60 | 3 | 17027 | 17535.0000 | 18369 |
2.61 | 3 | 13784 | 16583.0000 | 18756 |
2.63 | 3 | 10437 | 12659.6667 | 16914 |
2.72 | 3 | 6870 | 12088.3333 | 17801 |
2.74 | 3 | 8807 | 14385.0000 | 17184 |
1.92 | 2 | 15364 | 15418.0000 | 15472 |
2.66 | 2 | 16239 | 17367.0000 | 18495 |
2.68 | 2 | 8419 | 9042.0000 | 9665 |
2.75 | 2 | 13156 | 14285.5000 | 15415 |
3.04 | 2 | 15354 | 16956.5000 | 18559 |
4.01 | 2 | 15223 | 15223.0000 | 15223 |
2.59 | 1 | 16465 | 16465.0000 | 16465 |
2.64 | 1 | 17407 | 17407.0000 | 17407 |
2.65 | 1 | 16314 | 16314.0000 | 16314 |
2.67 | 1 | 18686 | 18686.0000 | 18686 |
2.70 | 1 | 14341 | 14341.0000 | 14341 |
2.71 | 1 | 17146 | 17146.0000 | 17146 |
2.77 | 1 | 10424 | 10424.0000 | 10424 |
2.80 | 1 | 15030 | 15030.0000 | 15030 |
3.02 | 1 | 10577 | 10577.0000 | 10577 |
3.05 | 1 | 10453 | 10453.0000 | 10453 |
3.11 | 1 | 9823 | 9823.0000 | 9823 |
3.22 | 1 | 12545 | 12545.0000 | 12545 |
3.24 | 1 | 12300 | 12300.0000 | 12300 |
3.40 | 1 | 15964 | 15964.0000 | 15964 |
3.50 | 1 | 12587 | 12587.0000 | 12587 |
3.51 | 1 | 18701 | 18701.0000 | 18701 |
3.65 | 1 | 11668 | 11668.0000 | 11668 |
3.67 | 1 | 16193 | 16193.0000 | 16193 |
4.00 | 1 | 15984 | 15984.0000 | 15984 |
4.13 | 1 | 17329 | 17329.0000 | 17329 |
4.50 | 1 | 18531 | 18531.0000 | 18531 |
5.01 | 1 | 18018 | 18018.0000 | 18018 |
Price is highly related with x, y, and, z which can be seen below scatter plots. Since depth is a formula obtained by using these variables, the plot belongs to depth is not presented in this part.
ggplot(diamonds, aes(x, price,)) +
geom_point(alpha = 0.5, color="seagreen3") +
geom_smooth(method="loess", se=F,color="red3", size=2, linetype=2 ) +
theme_minimal()+
labs(title = "Price vs Diamond Length",
subtitle = "Diamonds dataset",
x = "Diamond Length",
y = "Price")
ggplot(diamonds, aes(y, price,)) +
geom_point(alpha = 0.5, color="seagreen3") +
geom_smooth(method="loess", se=F,color="red3", size=2, linetype=2 ) +
theme_minimal()+
labs(title = "Price vs Diamond Width",
subtitle = "Diamonds dataset",
x = "Diamond Width",
y = "Price")
ggplot(diamonds, aes(z, price,)) +
geom_point(alpha = 0.5, color="seagreen3") +
geom_smooth(method="loess", se=F,color="red3", size=2, linetype=2 ) +
theme_minimal()+
labs(title = "Price vs Diamond Depth",
subtitle = "Diamonds dataset",
x = "Diamond Depth",
y = "Price")
In order to explore the relationship between 4C and price, following plots can be examined.
ggplot(diamonds, aes(carat, price, color = clarity)) +
geom_point(alpha = 0.5) +
theme_minimal() +scale_color_brewer(palette="Set1")+
geom_smooth(method="loess", se=F) +
labs(title = "Price vs Carat Grouped by Clarity",
subtitle = "Diamonds dataset",
x = "Diamond Carat",
y = "Price",
color = "Diamond Clarity")
ggplot(diamonds, aes(carat, price, color = cut)) +
geom_point(alpha = 0.5) +
theme_minimal() +scale_color_brewer(palette="Set1")+
geom_smooth(method="loess", se=F) +
labs(title = "Price vs Carat Grouped by Cut",
subtitle = "Diamonds dataset",
x = "Diamond Carat",
y = "Price",
color = "Diamond Cut")
ggplot(diamonds, aes(carat, price, color = color)) +
geom_point(alpha = 0.5) +
theme_minimal() +scale_color_brewer(palette="Set1")+
geom_smooth(method="loess", se=F) +
labs(title = "Price vs Carat Grouped by Color",
subtitle = "Diamonds dataset",
x = "Diamond Carat",
y = "Price",
color = "Diamond Color")
Some possible analyses may include Clustering and Principal Component Analysis for our dataset.
Before proceding with these methods, we should divide our dataset into two, one for train and one for test set. Since, our factor variables namely cut, clarity, and color are ordered, we can treat them as numeric values.
set.seed(503)
diamonds %>% mutate(diamond_id = row_number()) %>%
diamonds_test <- group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
anti_join(diamonds %>% mutate(diamond_id = row_number()),
diamonds_train <-by = "diamond_id")
diamonds_test,
diamonds_train[, c(-ncol(diamonds_train))]
diamonds_train = diamonds_test[, c(-ncol(diamonds_test))] diamonds_test =
$cut <- as.numeric(diamonds_train$cut)
diamonds_train$clarity <- as.numeric(diamonds_train$clarity)
diamonds_train$color <- as.numeric(diamonds_train$color)
diamonds_train$cut <- as.numeric(diamonds_test$cut)
diamonds_test$clarity <- as.numeric(diamonds_test$clarity)
diamonds_test$color <- as.numeric(diamonds_test$color) diamonds_test
Principal Component Analysis is suitable for numeric variables, therefore we choose this type of variables from our dataset to continue. It can be seen from summary output that 4 component is enough in order to explain 88% of the data. We should add these components to both train and test datasets for the following price estimation models.
princomp(as.matrix(diamonds_train[,sapply(diamonds_train, class) == "numeric"]),cor=T)
diamonds_pca <-summary(diamonds_pca,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.0742952 1.1892966 1.1065552 0.9926732 0.83121848
## Proportion of Variance 0.4780779 0.1571585 0.1360516 0.1094889 0.07676935
## Cumulative Proportion 0.4780779 0.6352364 0.7712880 0.8807769 0.95754622
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.59218546 0.173206533 0.0320924547 1.923472e-02
## Proportion of Variance 0.03896485 0.003333389 0.0001144362 4.110829e-05
## Cumulative Proportion 0.99651107 0.999844456 0.9999588917 1.000000e+00
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## carat 0.471 0.126 0.868
## cut -0.109 0.595 0.451 0.241 -0.175 0.584
## color -0.158 -0.202 0.802 0.538
## clarity -0.222 0.208 0.215 -0.506 0.778
## depth 0.209 -0.827 0.118 0.492 0.101
## table 0.129 -0.698 0.227 -0.181 0.641
## x 0.475 0.115 -0.279 -0.715 0.397
## y 0.474 0.120 -0.297 0.699 0.410
## z 0.474 0.103 0.132 -0.283 -0.815
predict(diamonds_pca, newdata = diamonds_train)
pca_results <-$pca1 = pca_results[,1]
diamonds_train$pca2 = pca_results[,2]
diamonds_train$pca3 = pca_results[,3]
diamonds_train$pca4 = pca_results[,4]
diamonds_train
predict(diamonds_pca, newdata = diamonds_test)
pca_results_test =$pca1 = pca_results_test[,1]
diamonds_test$pca2 = pca_results_test[,2]
diamonds_test$pca3 = pca_results_test[,3]
diamonds_test$pca4 = pca_results_test[,4] diamonds_test
There are different kinds of clustering methods to create a feature, in this assignment K-means is used. First of all, scaling should be applied since if a columnn have much higher values than the others, it may dominate the results. In order to scale the data, we should find the minimum and maximum values of the train set, and then we will scale both train and test set with the same values.
sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], min)
min_vals = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], max)
max_vals = sweep(sweep(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, min_vals), 2, max_vals - min_vals, "/") diamonds_train_scale <-
For deciding the number of cluster, we can use a for loop from 1 to 15 center, and then select the number of center value.
c()
errors =for (i in (1:15)){
set.seed(1357)
kmeans(diamonds_train_scale,centers=i)
cluster<- c(errors, 100*round(1 - (cluster$tot.withinss/cluster$totss), digits = 3))
errors =
}
data.frame(x=c(1:15), y=errors)
errors_df <-
ggplot(errors_df, aes(x=x, y=y)) +
geom_point(color = "firebrick2") +
geom_line(color="firebrick3") +
geom_text(aes(label = errors), size=3, color = "black", position = position_stack(vjust = 0.95))+
theme_minimal() +
labs(title = "Center Number vs R-Square",
subtitle = "Diamonds dataset",
x = "Cluster Number",
y = "R-Square")
The decrease in errors are slowly changing after the cluster with 8 centers. So, we can say that we should select the model with center equals to 8.
set.seed(1357)
kmeans(diamonds_train_scale,centers=8)
best_cluster =$cluster = as.factor(best_cluster$cluster) diamonds_train
Now, we need to apply clustering process to the test dataset.
sweep(sweep(diamonds_test[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, min_vals), 2, max_vals - min_vals, "/")
diamonds_test_scale <-
function(x) {
closest.cluster <- apply(best_cluster$centers, 1, function(y) sqrt(sum((x-y)^2)))
cluster.dist <-return(which.min(cluster.dist)[1])
}$cluster <- as.factor(apply(diamonds_test_scale, 1, closest.cluster)) diamonds_test
%>%
diamonds_train ggplot(., aes(x = cluster, y = price, color = cluster)) +
scale_color_brewer(palette="Set2")+
geom_jitter(alpha=0.7) +
theme_minimal() +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title = "Price vs Clusters",
subtitle = "Diamonds train dataset",
x = "Diamond Cluster",
y = "Price",
color = "Diamond Cluster")
%>%
diamonds_test ggplot(., aes(x = cluster, y = price, color = cluster)) +
scale_color_brewer(palette="Set2")+
geom_jitter(alpha=0.7) +
theme_minimal() +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
labs(title = "Price vs Clusters",
subtitle = "Diamonds test dataset",
x = "Diamond Cluster",
y = "Price",
color = "Diamond Cluster")
Correlation matrix can be examined to get an idea about the relationships between numeric variables, therefore, correlogram is illustrated below. We can say that carat, x, y, and z columns are highly correlated with price column.
cor(diamonds_train[-c(11:15)])
diamonds_cor<-corrplot(diamonds_cor, method="number")
There are four assumptions to be fulfilled in a linear model:
Since, our dataset does not fulfill these assumptions we will construct a generalized linear model which is more flexible than standard linear model in the terms of assumptions. The responce variable, price, is continuous therefore Gamma or Gaussian family may fit. The lowest AIC value is obtained by Gamma family with identity link function.
glm(price ~ carat+cut+color+clarity+depth+table+x+y+z+cluster, data = diamonds_train, family = Gamma(link = "identity"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5, 0.5,0.5,0.5,0.5,0.5,0.5,0.5))
model_glm <-summary(model_glm)
##
## Call:
## glm(formula = price ~ carat + cut + color + clarity + depth +
## table + x + y + z + cluster, family = Gamma(link = "identity"),
## data = diamonds_train, start = c(0.5, 0.5, 0.5, 0.5, 0.5,
## 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
## 0.5))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0649 -0.1433 -0.0057 0.1532 10.2150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.052e+04 6.219e+02 -33.003 < 2e-16 ***
## carat 7.241e+03 4.341e+01 166.797 < 2e-16 ***
## cut -6.238e+00 3.367e+00 -1.853 0.063928 .
## color 3.589e+01 1.661e+00 21.612 < 2e-16 ***
## clarity 8.070e+01 1.300e+00 62.065 < 2e-16 ***
## depth 3.433e+02 9.938e+00 34.548 < 2e-16 ***
## table -6.750e+00 9.723e-01 -6.942 3.92e-12 ***
## x 3.884e+03 8.192e+01 47.409 < 2e-16 ***
## y 1.082e+03 7.990e+01 13.543 < 2e-16 ***
## z -8.913e+03 2.165e+02 -41.174 < 2e-16 ***
## cluster2 -2.341e+01 6.809e+00 -3.438 0.000586 ***
## cluster3 -7.069e+01 6.950e+00 -10.171 < 2e-16 ***
## cluster4 -8.468e+01 8.813e+00 -9.609 < 2e-16 ***
## cluster5 -2.753e+02 2.304e+01 -11.950 < 2e-16 ***
## cluster6 -6.137e+01 9.056e+00 -6.777 1.24e-11 ***
## cluster7 -2.242e+02 2.195e+01 -10.216 < 2e-16 ***
## cluster8 8.039e+02 2.382e+01 33.742 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.0868286)
##
## Null deviance: 42126.5 on 42990 degrees of freedom
## Residual deviance: 2908.6 on 42974 degrees of freedom
## AIC: 676132
##
## Number of Fisher Scoring iterations: 25
The model can be improved by some arrangement in the explanatory variables.
glm(price ~ carat*color+carat*clarity+I(carat^2)+cluster+y+depth, data = diamonds_train, family = Gamma(link = "identity"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5, 0.5,0.5,0.5,0.5,0.5,0.5))
model_glm2 <-summary(model_glm2)
##
## Call:
## glm(formula = price ~ carat * color + carat * clarity + I(carat^2) +
## cluster + y + depth, family = Gamma(link = "identity"), data = diamonds_train,
## start = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
## 0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.23761 -0.10711 -0.00628 0.09001 2.68079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2860.8797 106.3588 26.898 < 2e-16 ***
## carat -3616.5228 101.2272 -35.727 < 2e-16 ***
## color -112.0154 1.9102 -58.642 < 2e-16 ***
## clarity -184.5964 1.7927 -102.970 < 2e-16 ***
## I(carat^2) 3886.1793 31.1401 124.797 < 2e-16 ***
## cluster2 21.9462 3.6092 6.081 1.21e-09 ***
## cluster3 -17.1453 3.8160 -4.493 7.04e-06 ***
## cluster4 -78.7454 3.3094 -23.795 < 2e-16 ***
## cluster5 459.0441 13.3378 34.417 < 2e-16 ***
## cluster6 -43.3751 3.5333 -12.276 < 2e-16 ***
## cluster7 329.4331 11.7776 27.971 < 2e-16 ***
## cluster8 511.6552 11.7190 43.660 < 2e-16 ***
## y -245.2199 18.9533 -12.938 < 2e-16 ***
## depth -13.3575 0.9476 -14.096 < 2e-16 ***
## carat:color 486.7423 4.6738 104.142 < 2e-16 ***
## carat:clarity 794.7105 4.6681 170.243 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.02385886)
##
## Null deviance: 42126.5 on 42990 degrees of freedom
## Residual deviance: 1022.2 on 42975 degrees of freedom
## AIC: 630859
##
## Number of Fisher Scoring iterations: 25
To make an example with components, folowing model can be constructed.
glm(price ~ pca1 + pca2 + pca3 + pca4 + cluster, data = diamonds_train, family = Gamma(link = "sqrt"))
model_glm_pca =summary(model_glm_pca)
##
## Call:
## glm(formula = price ~ pca1 + pca2 + pca3 + pca4 + cluster, family = Gamma(link = "sqrt"),
## data = diamonds_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.12715 -0.24877 -0.04553 0.16452 1.95690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.02880 0.11968 459.78 <2e-16 ***
## pca1 11.87183 0.03733 318.03 <2e-16 ***
## pca2 3.37035 0.03747 89.96 <2e-16 ***
## pca3 2.25217 0.04057 55.51 <2e-16 ***
## pca4 -0.86510 0.05872 -14.73 <2e-16 ***
## cluster2 6.62809 0.12793 51.81 <2e-16 ***
## cluster3 -3.24163 0.15406 -21.04 <2e-16 ***
## cluster4 3.01052 0.14841 20.29 <2e-16 ***
## cluster5 -6.77135 0.25980 -26.06 <2e-16 ***
## cluster6 1.73995 0.16787 10.37 <2e-16 ***
## cluster7 -2.82795 0.26942 -10.50 <2e-16 ***
## cluster8 7.34767 0.21746 33.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.1129233)
##
## Null deviance: 42126.5 on 42990 degrees of freedom
## Residual deviance: 4544.4 on 42979 degrees of freedom
## AIC: 695577
##
## Number of Fisher Scoring iterations: 11
PCA helps us to use less feature in a model. But, as expected, with fewer feature we will get more AIC value in the model which means the explanatory power of the model decreases. Since we do not have so many features, we do not need to decrease the feature number.We may also use all the principal components instead of the variables to deal with multicollinearity however in this study this model is not investigated. For the best glm model, we can calculate the R^2 value. It can be calculated easily with:
predict(model_glm2, newdata=diamonds_test)
model_glm2_prediction<- 1 - (sum((model_glm2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_glm2_r2 <- model_glm2_r2
## [1] 0.9053068
To visually see the prediction vs actual price values, we can check the below plot.
as.data.frame(cbind(model_glm2_prediction, diamonds_test$price))
pred_vs_real_glm <-colnames(pred_vs_real_glm) = c("prediction", "actual")
%>%
pred_vs_real_glm ggplot(aes(prediction, actual)) +
geom_point(color="seagreen3") +
theme_minimal()+
geom_abline(intercept = 0, slope = 1, color="red3", size=2, linetype=2) +
labs(title = "Prediction vs Actual Values for the Generalized Linear Model",
subtitle = "Diamonds dataset",
x = "Predictions",
y = "Real Values")
rpart(price~., data=diamonds_train)
model_rpart <-fancyRpartPlot(model_rpart, sub="CART Model")
From the plot of a tree, we can see that the nodes are divided with using only carat, y and clarity columns. It means that these are better features to reduce the variance in the dataset. We can make a cross validation for cp argument. To do so, we need to apply this code:
trainControl(method = "cv", number = 10)
tr.control = expand.grid( .cp = (1:10)*0.001)
cp.grid = train(price~. - pca1 - pca2 - pca3 - pca4, data = diamonds_train, method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
tr = tr
## CART
##
## 42991 samples
## 14 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 38691, 38693, 38694, 38691, 38692, 38691, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.001 950.902 0.9430052 580.0119
## 0.002 1036.373 0.9322755 628.2955
## 0.003 1087.797 0.9253495 656.4300
## 0.004 1137.831 0.9182846 685.5459
## 0.005 1145.996 0.9171573 688.0164
## 0.006 1245.355 0.9021746 811.5315
## 0.007 1279.303 0.8967777 846.7474
## 0.008 1279.303 0.8967777 846.7474
## 0.009 1279.303 0.8967777 846.7474
## 0.010 1362.123 0.8829236 883.8798
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.001.
We can see that we need to use 0.001 for cp when we compare with other cp values.
rpart(price~. - pca1 - pca2 - pca3 - pca4, data=diamonds_train, cp = 0.001)
model_rpart2 <-fancyRpartPlot(model_rpart2, cex=0.4, sub="CART Model")
This is a more detailed tree. From the plot of this tree, we can say that we will define the price with respect to carat, y, clarity and color values. To be able to compare these two trees, we need to calculate the R^2 values.
predict(model_rpart2,newdata=diamonds_test)
model_rpart2_prediction<- 1 - (sum((model_rpart2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_rpart2_r2 <- model_rpart2_r2
## [1] 0.944111
The R^2 value for the detailed tree is calculated above. Now, we can compare the generalized linear model and CART. We can conclude by saying CART model is better than Generalized Linear Model.
The steps during the assignment can be summarized as below:
Important results: