0— title: “Assigment3” author: “Mustafa Keser” date: “11 09 2020” output: html_document:


Introduction

diamonds dataset:

The diamond data set contains various properties of over 50000 diamonds. These properties affect the price and quality of diamonds. Some of the features in the data set are continuous and some are discrete.

Some Continous Variables;

Prices of the diamonds (in US dollars) Carat of diamonds, this fetaure is related with weight of the diamond where 1carat = 200mg x: Length in mm y: Width in mm z: Depth in mm

Some Discrete Variables;

Cut: Quality of cut (Fair <Good <Very Good <Premium <Ideal) Color: Color of the diamond [D (best), E, F, G, H, I, J (worst)] Clarity-: Clearness amount of the diamond (I1 (worst), SI1, SI2, VS1, VS2, VVS1, VVS2, IF (best))

What Will I Do

As this assignment shows, we will work with a large data set. Therefore, we will do Exploratory Data Analysis (EDA) as the first step to analyze large data sets. At this stage, we will first examine the data from certain angles and determine which factors we will focus on. We will try to determine the effects of diamonds’ properties on price. For this, we will first do PCA (Principle Component Analysis) and then create a model. With this model, we will evaluate the results we found.

The stages in the assignment are as follows;

  1. Various examinations on data
  2. Doing Exploratory Data Analysis (EDA),
  3. Doing Principle Components Analysis (PCA)
  4. Creation and evaluation of the model

Data Explanation

Used Libraries

I used the libraries that we learned in the course for the analysis I will do on the Diaomand Data set. I used them to do data manipulation and visualization.

Below you can see the libraries I have tried and used; 1.readr 2.tidyverse 3.broom 4.ggplot2 5.cvms 6.randomForest 7.tree

Data

I used str function to examine the data and see its structure in more detail.

str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Summary of Data

summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

Explotory Data Analysis (EDA)

Prices vs Carat on distribution of clarity

I colored each point according to the clarity quality of the diamond by using geom_point. As can be seen from the graph, some diamonds are lighter but more expensive because they have good clarity. Likewise, although some diamonds are heavier, they are not expensive because they have low clarity. This situation causes the formation of a rainbow shape on the graphic. For these reasons, we see that clarity is an important feature in determining the price of a diamond.

ggplot(diamonds,aes(carat,price,color=clarity))+
  geom_point()+
  theme_minimal()+
  labs(title="Prices vs Carat on distribution of clarity",
       x="Carat value of diamonds",
       y="Prices of diamonds")

##Prices vs Carat on distribution of color We colored each point according to the color type of the diamond by using geom_point. As can be seen from the graph, some diamonds are lighter but more expensive because they have desired color. Likewise, although some diamonds are heavier, they are not expensive because they have not desired color. This situation causes the formation of a rainbow shape on the graphic. For these reasons, we see that color type is an important feature in determining the price of a diamond.

ggplot(diamonds,aes(carat,price,color=color))+
  geom_point()+
  theme_minimal()+
  labs(title="Prices vs Carat on distribution of color",
       x="Carat value of diamonds",
       y="Prices of diamonds")

##Prices vs Carat on distribution of cut We colored each point according to the cut type of the diamond by using geom_point. In this graphic we made on Cut Type, the rainbow figure did not exactly form because the Cut Type of most diamonds was determined as ideal and premium. For these reasons, a rainbow dominated by two colors has emerged. But we are still faced with the same situation. Although some diamonds are lighter, they are more expensive because the Cut Type is in the desired shape and likewise, although some diamonds are heavier, they are not expensive because the Cut Type is not as desired. So that cut type is an important feature in determining the price of a diamond.

ggplot(diamonds,aes(carat,price,color=cut))+
  geom_point()+
  theme_minimal()+
  labs(title="Prices vs Carat on distribution of cut",
       x="Carat value of diamonds",
       y="Prices of diamonds")

ggplot(diamonds,aes(depth,price,color=color))+
  geom_point()+
  theme_classic() +
  labs(x = "Depths of Diamonds",
       y = "Prices",
       title = "Color vs Price")

##Densities of Diamonds based on Cut,Clarity,Color

##Comments abou Densities of Cut,Clarity,Color When we look at the density of diamonds in the data according to their properties, we come across a different graph. Although the cut type, color and clarity of the diamonds are at the desired level, we see that these diamonds show density at low prices. But this is because the diamonds with good properties in our data set are of very low weight. If all diamonds had a fixed carat value, we would see that diamonds with the Cut Type, Color and Clarity at the desired level show high prices.

g <- ggplot(diamonds, aes(price))
g + geom_density(aes(fill=color), color = NA, alpha=.35) + 
    labs(title="Density plot", 
         subtitle="Density Plot Grouped by Number of Color",
         caption="Source: In R studio",
         x="Price",
         fill="# Color")

g <- ggplot(diamonds, aes(price))
g + geom_density(aes(fill=clarity), color = NA, alpha=.35) + 
    labs(title="Density plot", 
         subtitle="Density Plot Grouped by Number of clarity",
         caption="Source: In R studio",
         x="Price",
         fill="# Clarity")

g <- ggplot(diamonds, aes(price))
g + geom_density(aes(fill=cut), color = NA, alpha=.35) + 
    labs(title="Density plot", 
         subtitle="Density Plot Grouped by Number of cut",
         caption="Source: In R studio",
         x="Price",
         fill="# Cut")

A different implementation of Density display

diamonds %>%
  group_by(cut,clarity)%>%
  summarize(count=n())%>%
  ggplot(.,aes(x="",y=count,fill=clarity))+
  geom_bar(width = 1, stat = "identity", position = "fill") +
  coord_polar("y") +
  theme_minimal() +
  theme(plot.title = element_text(vjust = 0.5)) +
  facet_wrap(~cut) +
  labs(title = " Clarity Analyses on Cutting Type",
       fill = "Clarity")
## `summarise()` regrouping output by 'cut' (override with `.groups` argument)

Test&Training Datasets

I determined the test and training data Set according to the format given in the Assignment descripton.

set.seed(503)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>% 
    group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()

diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()), 
    diamonds_test, by = "diamond_id")

diamonds_train
## # A tibble: 43,143 x 11
##    carat cut       color clarity depth table price     x     y     z diamond_id
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>      <int>
##  1 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31          2
##  2 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31          3
##  3 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63          4
##  4 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75          5
##  5 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47          7
##  6 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53          8
##  7 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49          9
##  8 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39         10
##  9 0.3   Good      J     SI1      64      55   339  4.25  4.28  2.73         11
## 10 0.23  Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46         12
## # ... with 43,133 more rows

Princible Component Analysis (PCA)

I tried to detect the most important components using PCA

diamonds.1<-diamonds%>%
  mutate(color=as.numeric(color),cut=as.numeric(cut),clarity=as.numeric(clarity))
pca.diamond<-princomp(diamonds.1,cor=TRUE,scores = TRUE)
summary(pca.diamond)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.2376039 1.1989134 1.1108068 0.99307294 0.87777143
## Proportion of Variance 0.5006871 0.1437393 0.1233892 0.09861939 0.07704827
## Cumulative Proportion  0.5006871 0.6444264 0.7678156 0.86643500 0.94348327
##                            Comp.6     Comp.7      Comp.8      Comp.9
## Standard deviation     0.59853413 0.35742920 0.200419345 0.164186308
## Proportion of Variance 0.03582431 0.01277556 0.004016791 0.002695714
## Cumulative Proportion  0.97930758 0.99208314 0.996099933 0.998795647
##                            Comp.10
## Standard deviation     0.109743019
## Proportion of Variance 0.001204353
## Cumulative Proportion  1.000000000
plot(pca.diamond,type="l")

Tree

Decision tree is a type of supervised learning algorithm that can be used in both regression and classification problems. It works for both categorical and continuous input and output variables.

First I determined my formula, then I set up the model and lastly I made prediction.

formula=price~carat+cut+color+depth+clarity+table+x+y+z
diamonds.tree.model<-tree(formula,diamonds_train)
diamonds_test$tree_prediction<-predict(diamonds.tree.model,newdata = diamonds_test)  
diamonds.tree.model
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 43143 6.894e+11  3939  
##    2) carat < 0.995 27894 3.482e+10  1632  
##      4) carat < 0.625 19820 5.247e+09  1050 *
##      5) carat > 0.625 8074 6.397e+09  3060 *
##    3) carat > 0.995 15249 2.346e+11  8158  
##      6) y < 7.195 10296 4.892e+10  6152  
##       12) clarity: I1,SI2,SI1,VS2 7824 1.622e+10  5401 *
##       13) clarity: VS1,VVS2,VVS1,IF 2472 1.435e+10  8527 *
##      7) y > 7.195 4953 5.805e+10 12330  
##       14) y < 7.855 3223 2.811e+10 10940 *
##       15) y > 7.855 1730 1.214e+10 14920 *

You can see how the nodes of my Decision tree model are determined in the above output.

R.squared.tree<-1 - (sum((diamonds_test$tree_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_test$price))^2))

R.squared.tree #0.8785363
## [1] 0.8785363
plot(diamonds.tree.model)
text(diamonds.tree.model)

summary(diamonds.tree.model)
## 
## Regression tree:
## tree(formula = formula, data = diamonds_train)
## Variables actually used in tree construction:
## [1] "carat"   "y"       "clarity"
## Number of terminal nodes:  6 
## Residual mean deviance:  1912000 = 8.246e+10 / 43140 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -9873.0  -529.0  -164.2     0.0   542.8 13170.0