0— title: “Assigment3” author: “Talha Ünlü” date: “09 09 2020” output: html_document: code_folding: hide
toc: yes toc_depth: 3 toc_float: collapsed: no theme: united highlight: tango pdf_document: toc: yes toc_depth: ‘3’ —
The dataset contains information on prices of diamonds, as well as various attributes of diamonds, some of which are known to influence their price (in 2008 $s): the 4 Cs (carat, cut, color, and clarity) , as well as some physical measurements (depth, table, price, x, y, and z).Carat is a unit of mass equal to 200 mg and is used for measuring gemstones and pearls. Cut grade is is an objective measure of a diamond’s light performance, or, what we generally think of as sparkle.
In this assigment, we will first examine our data in general terms and then try to get more inside information in the explatory data analysis (EDA) section. Our aim in the part up to now is to know our data and to have information about variables. In the next step, we will try to establish a model. Before starting to set up a model, we will do the PCA (Princible Component Analysis) study, and finally, we will evaluate our results by establishing our model using different algorithms to check each other.
1.Overview of the Data 2.Explotory Data Analysis (EDA) 3.Princible Components Analysis (PCA) 4.Establishing & Evaluating the Models
I have used several packages during the analysis of the diamonds dataset in order to make data manipulation and visualization. The list of packages used in this assignment can be seen below:
1.tidyverse 2.readr 3.ggplot2 4.cvms 5.broom 6.tree 7.randomForest
I used to glimpse() function in order to investigate the diamonds dataset.The glimpse() is a function of dplyr package.
diamonds%>%
glimpse()
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
I created a new variable called volume using x, y and z dependent variables. The aim here is to find the relationship between diamond’s volume and price, and try to obtain different information with the variable we created.
diamond.update<-diamonds%>%
mutate(volume=x*y*z)
ggplot(diamond.update,aes(volume,price,color=color)) +
geom_point()+
xlim(25,100)+
labs(title = "Volume vs Price",
x="Volume of Diamonds",
y="Prices of Diamonds" )
## Warning: Removed 29426 rows containing missing values (geom_point).
When we examine the chart above, although it seems that there are exceptions, we observe that the price of the diamond increases as the color of the diamond gets darker in the range where the number of diamonds is intense.
ggplot(diamond.update,aes(clarity,price,color=cut))+
geom_jitter()+
theme_minimal() +
theme( plot.title = element_text(vjust = 0.5)) +
labs(x = "Clarity Levels",
y = "Prices",
title = "Clarity vs Price",
subtitle = "Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF")
When we examine the chart above, we see that there are expensive and cheap diamonds at almost every clarity level. Also, when the price goes up, there are quite a few diamonds at I1 and IF clarity levels compared to other levels.
After the first two charts, I decided to categorize the price. I categorized Price variables as “Very Low”, “Low”, “Medium”, “High” and “Very High” using the case_when () function.
quant = quantile(diamond.update$price, seq(0, 1, 0.2))
diamonds_price_group <- diamond.update %>%
mutate(price_group = case_when(
price < quant[2] ~ "Very Low",
price < quant[3] ~ "Low",
price < quant[4] ~ "Medium",
price < quant[5] ~ "High",
TRUE ~ "Very High"
)) %>%
mutate(price_group = factor(price_group, levels = c("Very Low", "Low", "Medium", "High", "Very High")))
I have created another variable called price_group and now I will continue to examine the price clarity relation using this variable.
diamonds_price_group%>%
group_by(price_group)%>%
summarise(count=n())%>%
mutate(percentage=100*count/sum(count))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
## price_group count percentage
## <fct> <int> <dbl>
## 1 Very Low 10783 20.0
## 2 Low 10762 20.0
## 3 Medium 10814 20.0
## 4 High 10793 20.0
## 5 Very High 10788 20
diamonds_price_group %>%
group_by(clarity, price_group) %>%
summarize(counter = n()) %>%
ggplot(., aes(x = '', y = counter, fill = price_group)) +
geom_bar(width = 1, stat = "identity", position = "fill") +
coord_polar("y") +
theme_void() +
theme(plot.title = element_text(vjust = 0.5)) +
facet_wrap(~clarity) +
labs(title = "Price Group Analyses of Clarity",
fill = "Price Group")
## `summarise()` regrouping output by 'clarity' (override with `.groups` argument)
As you seen above, pie charts was created in terms of price_group. Consequently, i could observe distribution of clarity levels for each price group.
In this assigment, i first performed PC analysis then I created models using linear model and randomForest algorithms.
I have defined our test and train datasets as given in Assigment 3.3 description.
set.seed(503)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>%
group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
diamonds_test, by = "diamond_id")
diamonds_train
## # A tibble: 43,143 x 11
## carat cut color clarity depth table price x y z diamond_id
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
## 1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 2
## 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 3
## 3 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 4
## 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 5
## 5 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 7
## 6 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 8
## 7 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 9
## 8 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 10
## 9 0.3 Good J SI1 64 55 339 4.25 4.28 2.73 11
## 10 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 12
## # ... with 43,133 more rows
Principal Component Analysis (PCA) is a useful technique for exploratory data analysis, allowing you to better visualize the variation present in a dataset with many variables. It is particularly helpful in the case of “wide” datasets, where you have many variables for each sample. Read more
In order to do PC analysis, my variables must be numeric, so we make our variables numeric.
diamonds.all<-diamonds%>%
mutate(color=as.numeric(color),cut=as.numeric(cut),clarity=as.numeric(clarity))
diamond.pca<-princomp(diamonds.all,cor=TRUE,scores = TRUE)
summary(diamond.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.2376039 1.1989134 1.1108068 0.99307294 0.87777143
## Proportion of Variance 0.5006871 0.1437393 0.1233892 0.09861939 0.07704827
## Cumulative Proportion 0.5006871 0.6444264 0.7678156 0.86643500 0.94348327
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.59853413 0.35742920 0.200419345 0.164186308
## Proportion of Variance 0.03582431 0.01277556 0.004016791 0.002695714
## Cumulative Proportion 0.97930758 0.99208314 0.996099933 0.998795647
## Comp.10
## Standard deviation 0.109743019
## Proportion of Variance 0.001204353
## Cumulative Proportion 1.000000000
diamond.pca$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## carat 0.441 0.165 0.733 0.482
## cut 0.633 0.371 0.220 -0.245 0.586
## color 0.137 0.154 -0.818 -0.521 0.106
## clarity -0.181 0.279 0.221 -0.493 0.737 -0.219
## depth 0.120 -0.839 0.162 0.490
## table 0.107 -0.671 0.303 -0.180 0.640
## price 0.403 0.141 0.303 0.755 -0.361 -0.122
## x 0.441 -0.236 0.304 -0.803
## y 0.435 -0.381 -0.746 -0.247 0.195
## z 0.435 -0.375 0.651 -0.411 0.260
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## Proportion Var 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
## Cumulative Var 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
## Comp.10
## SS loadings 1.0
## Proportion Var 0.1
## Cumulative Var 1.0
As can be seen in Cumulative Proportion, the first five components can explain 94.3% of variation.
ggplot(data.frame(pc=1:7,cum_var=c(0.5006871,0.6444264,0.7678156 ,0.86643500,0.94348327,0.97930758,0.99208314 )),aes(x=pc,y=cum_var)) +
geom_point() +
geom_line()
Linear models describe a continuous response variable as a function of one or more predictor variables.So i used linear regression in prediction model.
fmla<-price~carat+cut+color+clarity+depth+table+x+y+z
Diamond.lm.model<-lm(fmla,diamonds_train)
diamonds_test$lmprediction<-predict(Diamond.lm.model,newdata = diamonds_test)
The linear model’s r square value is 92%. He says that our model works pretty well and is a statistically logical predict.
summary(Diamond.lm.model)
##
## Call:
## lm(formula = fmla, data = diamonds_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21446.0 -593.0 -182.4 378.8 10701.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6032.795 440.677 13.690 < 2e-16 ***
## carat 11306.047 55.032 205.445 < 2e-16 ***
## cut.L 587.057 25.144 23.347 < 2e-16 ***
## cut.Q -300.964 20.114 -14.963 < 2e-16 ***
## cut.C 148.352 17.317 8.567 < 2e-16 ***
## cut^4 -26.568 13.826 -1.922 0.0547 .
## color.L -1949.838 19.404 -100.484 < 2e-16 ***
## color.Q -672.931 17.640 -38.148 < 2e-16 ***
## color.C -161.554 16.458 -9.816 < 2e-16 ***
## color^4 25.973 15.117 1.718 0.0858 .
## color^5 -98.819 14.278 -6.921 4.55e-12 ***
## color^6 -59.129 12.977 -4.556 5.22e-06 ***
## clarity.L 4129.825 33.880 121.897 < 2e-16 ***
## clarity.Q -1954.978 31.602 -61.863 < 2e-16 ***
## clarity.C 997.867 27.030 36.918 < 2e-16 ***
## clarity^4 -382.021 21.572 -17.709 < 2e-16 ***
## clarity^5 242.941 17.611 13.795 < 2e-16 ***
## clarity^6 12.562 15.328 0.820 0.4125
## clarity^7 87.466 13.525 6.467 1.01e-10 ***
## depth -65.123 4.992 -13.045 < 2e-16 ***
## table -29.084 3.250 -8.950 < 2e-16 ***
## x -1020.664 34.775 -29.350 < 2e-16 ***
## y -1.438 19.396 -0.074 0.9409
## z -38.006 33.858 -1.123 0.2616
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1130 on 43119 degrees of freedom
## Multiple R-squared: 0.9202, Adjusted R-squared: 0.9202
## F-statistic: 2.162e+04 on 23 and 43119 DF, p-value: < 2.2e-16
lm.rsquared<-1 - (sum((diamonds_test$lmprediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_test$price))^2))
lm.rsquared # 0.9207
## [1] 0.9180486
ggplot(diamonds_test,aes(lmprediction,price,alpha=0.6)) +
geom_point(color="darkblue")+
geom_abline(color="red",size=1.2)+
theme_minimal() +
theme(legend.position = "none", plot.title = element_text(vjust = 0.5))+
ggtitle("lm-Predictions vs Actual Prices")
We observe that as the price increases, the accuracy of the model decreases. However, our model works very well in the price range that covers a large proportion of our data.
In the random forest approach, a large number of decision trees are created. Every observation is fed into every decision tree. The most common outcome for each observation is used as the final output. A new observation is fed into all the trees and taking a majority vote for each classification model. read more
Diamond.rf.model<-randomForest(fmla,data=diamonds_train,ntree=200)
diamonds_test$rfprediction<-predict(Diamond.rf.model,newdata=diamonds_test)
print(error<-sqrt(mean((diamonds_test$rfprediction-diamonds_test$price)^2)))
## [1] 562.8579
It is expected to perform more accurate prediction since it creates many decision trees in the randomForest algorithm and can include most of the observations in the model. As a matter of fact, the r square value is quite high at 97.9%.
rf.rsquared<-1 - (sum((diamonds_test$rfprediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_test$price))^2))
rf.rsquared #0.9797688
## [1] 0.9797688
ggplot(diamonds_test,aes(rfprediction,price))+
geom_point(color="darkblue")+
geom_abline(color="red",size=1)+
theme_minimal() +
theme(legend.position = "none", plot.title = element_text(vjust = 0.5))+
ggtitle("rf-Predictions vs Actual Prices")