0— title: “Assigment3” author: “Mustafa Keser” date: “11 09 2020” output: html_document:
The diamond data set contains various properties of over 50000 diamonds. These properties affect the price and quality of diamonds. Some of the features in the data set are continuous and some are discrete.
Some Continous Variables;
Prices of the diamonds (in US dollars) Carat of diamonds, this fetaure is related with weight of the diamond where 1carat = 200mg x: Length in mm y: Width in mm z: Depth in mm
Some Discrete Variables;
Cut: Quality of cut (Fair <Good <Very Good <Premium <Ideal) Color: Color of the diamond [D (best), E, F, G, H, I, J (worst)] Clarity-: Clearness amount of the diamond (I1 (worst), SI1, SI2, VS1, VS2, VVS1, VVS2, IF (best))
As this assignment shows, we will work with a large data set. Therefore, we will do Exploratory Data Analysis (EDA) as the first step to analyze large data sets. At this stage, we will first examine the data from certain angles and determine which factors we will focus on. We will try to determine the effects of diamonds’ properties on price. For this, we will first do PCA (Principle Component Analysis) and then create a model. With this model, we will evaluate the results we found.
The stages in the assignment are as follows;
I used the libraries that we learned in the course for the analysis I will do on the Diaomand Data set. I used them to do data manipulation and visualization.
Below you can see the libraries I have tried and used; 1.readr 2.tidyverse 3.broom 4.ggplot2 5.cvms 6.randomForest 7.tree
I used str function to examine the data and see its structure in more detail.
str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
I colored each point according to the clarity quality of the diamond by using geom_point. As can be seen from the graph, some diamonds are lighter but more expensive because they have good clarity. Likewise, although some diamonds are heavier, they are not expensive because they have low clarity. This situation causes the formation of a rainbow shape on the graphic. For these reasons, we see that clarity is an important feature in determining the price of a diamond.
ggplot(diamonds,aes(carat,price,color=clarity))+
geom_point()+
theme_minimal()+
labs(title="Prices vs Carat on distribution of clarity",
x="Carat value of diamonds",
y="Prices of diamonds")
##Prices vs Carat on distribution of color We colored each point according to the color type of the diamond by using geom_point. As can be seen from the graph, some diamonds are lighter but more expensive because they have desired color. Likewise, although some diamonds are heavier, they are not expensive because they have not desired color. This situation causes the formation of a rainbow shape on the graphic. For these reasons, we see that color type is an important feature in determining the price of a diamond.
ggplot(diamonds,aes(carat,price,color=color))+
geom_point()+
theme_minimal()+
labs(title="Prices vs Carat on distribution of color",
x="Carat value of diamonds",
y="Prices of diamonds")
##Prices vs Carat on distribution of cut We colored each point according to the cut type of the diamond by using geom_point. In this graphic we made on Cut Type, the rainbow figure did not exactly form because the Cut Type of most diamonds was determined as ideal and premium. For these reasons, a rainbow dominated by two colors has emerged. But we are still faced with the same situation. Although some diamonds are lighter, they are more expensive because the Cut Type is in the desired shape and likewise, although some diamonds are heavier, they are not expensive because the Cut Type is not as desired. So that cut type is an important feature in determining the price of a diamond.
ggplot(diamonds,aes(carat,price,color=cut))+
geom_point()+
theme_minimal()+
labs(title="Prices vs Carat on distribution of cut",
x="Carat value of diamonds",
y="Prices of diamonds")
ggplot(diamonds,aes(depth,price,color=color))+
geom_point()+
theme_classic() +
labs(x = "Depths of Diamonds",
y = "Prices",
title = "Color vs Price")
##Densities of Diamonds based on Cut,Clarity,Color
##Comments abou Densities of Cut,Clarity,Color When we look at the density of diamonds in the data according to their properties, we come across a different graph. Although the cut type, color and clarity of the diamonds are at the desired level, we see that these diamonds show density at low prices. But this is because the diamonds with good properties in our data set are of very low weight. If all diamonds had a fixed carat value, we would see that diamonds with the Cut Type, Color and Clarity at the desired level show high prices.
g <- ggplot(diamonds, aes(price))
g + geom_density(aes(fill=color), color = NA, alpha=.35) +
labs(title="Density plot",
subtitle="Density Plot Grouped by Number of Color",
caption="Source: In R studio",
x="Price",
fill="# Color")
g <- ggplot(diamonds, aes(price))
g + geom_density(aes(fill=clarity), color = NA, alpha=.35) +
labs(title="Density plot",
subtitle="Density Plot Grouped by Number of clarity",
caption="Source: In R studio",
x="Price",
fill="# Clarity")
g <- ggplot(diamonds, aes(price))
g + geom_density(aes(fill=cut), color = NA, alpha=.35) +
labs(title="Density plot",
subtitle="Density Plot Grouped by Number of cut",
caption="Source: In R studio",
x="Price",
fill="# Cut")
A different implementation of Density display
diamonds %>%
group_by(cut,clarity)%>%
summarize(count=n())%>%
ggplot(.,aes(x="",y=count,fill=clarity))+
geom_bar(width = 1, stat = "identity", position = "fill") +
coord_polar("y") +
theme_minimal() +
theme(plot.title = element_text(vjust = 0.5)) +
facet_wrap(~cut) +
labs(title = " Clarity Analyses on Cutting Type",
fill = "Clarity")
## `summarise()` regrouping output by 'cut' (override with `.groups` argument)
I determined the test and training data Set according to the format given in the Assignment descripton.
set.seed(503)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>%
group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
diamonds_test, by = "diamond_id")
diamonds_train
## # A tibble: 43,143 x 11
## carat cut color clarity depth table price x y z diamond_id
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
## 1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 2
## 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 3
## 3 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 4
## 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 5
## 5 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 7
## 6 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 8
## 7 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 9
## 8 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 10
## 9 0.3 Good J SI1 64 55 339 4.25 4.28 2.73 11
## 10 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 12
## # ... with 43,133 more rows
I tried to detect the most important components using PCA
diamonds.1<-diamonds%>%
mutate(color=as.numeric(color),cut=as.numeric(cut),clarity=as.numeric(clarity))
pca.diamond<-princomp(diamonds.1,cor=TRUE,scores = TRUE)
summary(pca.diamond)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.2376039 1.1989134 1.1108068 0.99307294 0.87777143
## Proportion of Variance 0.5006871 0.1437393 0.1233892 0.09861939 0.07704827
## Cumulative Proportion 0.5006871 0.6444264 0.7678156 0.86643500 0.94348327
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.59853413 0.35742920 0.200419345 0.164186308
## Proportion of Variance 0.03582431 0.01277556 0.004016791 0.002695714
## Cumulative Proportion 0.97930758 0.99208314 0.996099933 0.998795647
## Comp.10
## Standard deviation 0.109743019
## Proportion of Variance 0.001204353
## Cumulative Proportion 1.000000000
plot(pca.diamond,type="l")
Decision tree is a type of supervised learning algorithm that can be used in both regression and classification problems. It works for both categorical and continuous input and output variables.
First I determined my formula, then I set up the model and lastly I made prediction.
formula=price~carat+cut+color+depth+clarity+table+x+y+z
diamonds.tree.model<-tree(formula,diamonds_train)
diamonds_test$tree_prediction<-predict(diamonds.tree.model,newdata = diamonds_test)
diamonds.tree.model
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 43143 6.894e+11 3939
## 2) carat < 0.995 27894 3.482e+10 1632
## 4) carat < 0.625 19820 5.247e+09 1050 *
## 5) carat > 0.625 8074 6.397e+09 3060 *
## 3) carat > 0.995 15249 2.346e+11 8158
## 6) y < 7.195 10296 4.892e+10 6152
## 12) clarity: I1,SI2,SI1,VS2 7824 1.622e+10 5401 *
## 13) clarity: VS1,VVS2,VVS1,IF 2472 1.435e+10 8527 *
## 7) y > 7.195 4953 5.805e+10 12330
## 14) y < 7.855 3223 2.811e+10 10940 *
## 15) y > 7.855 1730 1.214e+10 14920 *
You can see how the nodes of my Decision tree model are determined in the above output.
R.squared.tree<-1 - (sum((diamonds_test$tree_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_test$price))^2))
R.squared.tree #0.8785363
## [1] 0.8785363
plot(diamonds.tree.model)
text(diamonds.tree.model)
summary(diamonds.tree.model)
##
## Regression tree:
## tree(formula = formula, data = diamonds_train)
## Variables actually used in tree construction:
## [1] "carat" "y" "clarity"
## Number of terminal nodes: 6
## Residual mean deviance: 1912000 = 8.246e+10 / 43140
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9873.0 -529.0 -164.2 0.0 542.8 13170.0