0— title: “Assigment3” author: “Talha Ünlü” date: “09 09 2020” output: html_document: code_folding: hide
toc: yes toc_depth: 3 toc_float: collapsed: no theme: united highlight: tango pdf_document: toc: yes toc_depth: ‘3’ —

Introduction

About the “diamonds” dataset:

The dataset contains information on prices of diamonds, as well as various attributes of diamonds, some of which are known to influence their price (in 2008 $s): the 4 Cs (carat, cut, color, and clarity) , as well as some physical measurements (depth, table, price, x, y, and z).Carat is a unit of mass equal to 200 mg and is used for measuring gemstones and pearls. Cut grade is is an objective measure of a diamond’s light performance, or, what we generally think of as sparkle.

Objectives

In this assigment, we will first examine our data in general terms and then try to get more inside information in the explatory data analysis (EDA) section. Our aim in the part up to now is to know our data and to have information about variables. In the next step, we will try to establish a model. Before starting to set up a model, we will do the PCA (Princible Component Analysis) study, and finally, we will evaluate our results by establishing our model using different algorithms to check each other.

1.Overview of the Data 2.Explotory Data Analysis (EDA) 3.Princible Components Analysis (PCA) 4.Establishing & Evaluating the Models

Data Explanation

Used Libraries

I have used several packages during the analysis of the diamonds dataset in order to make data manipulation and visualization. The list of packages used in this assignment can be seen below:

1.tidyverse 2.readr 3.ggplot2 4.cvms 5.broom 6.tree 7.randomForest

Data

I used to glimpse() function in order to investigate the diamonds dataset.The glimpse() is a function of dplyr package.

diamonds%>%
  glimpse()

## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...

Summary of Data

summary(diamonds)

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
##

Explotory Data Analysis (EDA)

I created a new variable called volume using x, y and z dependent variables. The aim here is to find the relationship between diamond’s volume and price, and try to obtain different information with the variable we created.

diamond.update<-diamonds%>%
  mutate(volume=x*y*z)

ggplot(diamond.update,aes(volume,price,color=color)) +
  geom_point()+
  xlim(25,100)+
  labs(title = "Volume vs Price",
       x="Volume of Diamonds",
       y="Prices of Diamonds" )

## Warning: Removed 29426 rows containing missing values (geom_point).

When we examine the chart above, although it seems that there are exceptions, we observe that the price of the diamond increases as the color of the diamond gets darker in the range where the number of diamonds is intense.

ggplot(diamond.update,aes(clarity,price,color=cut))+
  geom_jitter()+
  theme_minimal() +
  theme( plot.title = element_text(vjust = 0.5)) +
  labs(x = "Clarity Levels",
       y = "Prices",
       title = "Clarity vs Price",
       subtitle = "Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF")

When we examine the chart above, we see that there are expensive and cheap diamonds at almost every clarity level. Also, when the price goes up, there are quite a few diamonds at I1 and IF clarity levels compared to other levels.

After the first two charts, I decided to categorize the price. I categorized Price variables as “Very Low”, “Low”, “Medium”, “High” and “Very High” using the case_when () function.

quant = quantile(diamond.update$price, seq(0, 1, 0.2))

diamonds_price_group <- diamond.update %>%
  mutate(price_group = case_when(
    price < quant[2] ~ "Very Low",
    price < quant[3] ~ "Low",
    price < quant[4] ~ "Medium",
    price < quant[5] ~ "High",
    TRUE ~ "Very High"
  )) %>%
  mutate(price_group = factor(price_group, levels = c("Very Low", "Low", "Medium", "High", "Very High")))

I have created another variable called price_group and now I will continue to examine the price clarity relation using this variable.

diamonds_price_group%>%
  group_by(price_group)%>%
  summarise(count=n())%>%
mutate(percentage=100*count/sum(count))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 5 x 3
##   price_group count percentage
##   <fct>       <int>      <dbl>
## 1 Very Low    10783       20.0
## 2 Low         10762       20.0
## 3 Medium      10814       20.0
## 4 High        10793       20.0
## 5 Very High   10788       20

diamonds_price_group %>%
  group_by(clarity, price_group) %>%
  summarize(counter = n())  %>%
  ggplot(., aes(x = '', y = counter, fill = price_group)) + 
  geom_bar(width = 1, stat = "identity", position = "fill") +
  coord_polar("y") +
  theme_void() +
  theme(plot.title = element_text(vjust = 0.5)) +
  facet_wrap(~clarity) +
  labs(title = "Price Group Analyses of Clarity",
       fill = "Price Group")

## `summarise()` regrouping output by 'clarity' (override with `.groups` argument)

As you seen above, pie charts was created in terms of price_group. Consequently, i could observe distribution of clarity levels for each price group.

Establishing Predictive Models

In this assigment, i first performed PC analysis then I created models using linear model and randomForest algorithms.

Preparing the Test&Training Datasets

I have defined our test and train datasets as given in Assigment 3.3 description.

set.seed(503)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>% 
    group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()

diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()), 
    diamonds_test, by = "diamond_id")

diamonds_train

## # A tibble: 43,143 x 11
##    carat cut       color clarity depth table price     x     y     z diamond_id
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>      <int>
##  1 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31          2
##  2 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31          3
##  3 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63          4
##  4 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75          5
##  5 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47          7
##  6 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53          8
##  7 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49          9
##  8 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39         10
##  9 0.3   Good      J     SI1      64      55   339  4.25  4.28  2.73         11
## 10 0.23  Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46         12
## # ... with 43,133 more rows

Princible Component Analysis (PCA)

Principal Component Analysis (PCA) is a useful technique for exploratory data analysis, allowing you to better visualize the variation present in a dataset with many variables. It is particularly helpful in the case of “wide” datasets, where you have many variables for each sample. Read more

In order to do PC analysis, my variables must be numeric, so we make our variables numeric.

diamonds.all<-diamonds%>%
  mutate(color=as.numeric(color),cut=as.numeric(cut),clarity=as.numeric(clarity))

diamond.pca<-princomp(diamonds.all,cor=TRUE,scores = TRUE)
summary(diamond.pca)

## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.2376039 1.1989134 1.1108068 0.99307294 0.87777143
## Proportion of Variance 0.5006871 0.1437393 0.1233892 0.09861939 0.07704827
## Cumulative Proportion  0.5006871 0.6444264 0.7678156 0.86643500 0.94348327
##                            Comp.6     Comp.7      Comp.8      Comp.9
## Standard deviation     0.59853413 0.35742920 0.200419345 0.164186308
## Proportion of Variance 0.03582431 0.01277556 0.004016791 0.002695714
## Cumulative Proportion  0.97930758 0.99208314 0.996099933 0.998795647
##                            Comp.10
## Standard deviation     0.109743019
## Proportion of Variance 0.001204353
## Cumulative Proportion  1.000000000

diamond.pca$loadings

## 
## Loadings:
##         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## carat    0.441                                     0.165         0.733  0.482 
## cut             0.633  0.371  0.220 -0.245  0.586                             
## color    0.137  0.154        -0.818 -0.521         0.106                      
## clarity -0.181  0.279  0.221 -0.493  0.737        -0.219                      
## depth           0.120 -0.839         0.162  0.490                             
## table    0.107 -0.671  0.303 -0.180         0.640                             
## price    0.403  0.141                0.303         0.755        -0.361 -0.122 
## x        0.441                                    -0.236         0.304 -0.803 
## y        0.435                                    -0.381 -0.746 -0.247  0.195 
## z        0.435                                    -0.375  0.651 -0.411  0.260 
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings       1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0
## Proportion Var    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
## Cumulative Var    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
##                Comp.10
## SS loadings        1.0
## Proportion Var     0.1
## Cumulative Var     1.0

As can be seen in Cumulative Proportion, the first five components can explain 94.3% of variation.

ggplot(data.frame(pc=1:7,cum_var=c(0.5006871,0.6444264,0.7678156 ,0.86643500,0.94348327,0.97930758,0.99208314 )),aes(x=pc,y=cum_var)) +
  geom_point() + 
  geom_line()

Linear Regression Modelling

Linear models describe a continuous response variable as a function of one or more predictor variables.So i used linear regression in prediction model.

fmla<-price~carat+cut+color+clarity+depth+table+x+y+z

Diamond.lm.model<-lm(fmla,diamonds_train)
diamonds_test$lmprediction<-predict(Diamond.lm.model,newdata = diamonds_test)

The linear model’s r square value is 92%. He says that our model works pretty well and is a statistically logical predict.

summary(Diamond.lm.model)

## 
## Call:
## lm(formula = fmla, data = diamonds_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21446.0   -593.0   -182.4    378.8  10701.8 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  6032.795    440.677   13.690  < 2e-16 ***
## carat       11306.047     55.032  205.445  < 2e-16 ***
## cut.L         587.057     25.144   23.347  < 2e-16 ***
## cut.Q        -300.964     20.114  -14.963  < 2e-16 ***
## cut.C         148.352     17.317    8.567  < 2e-16 ***
## cut^4         -26.568     13.826   -1.922   0.0547 .  
## color.L     -1949.838     19.404 -100.484  < 2e-16 ***
## color.Q      -672.931     17.640  -38.148  < 2e-16 ***
## color.C      -161.554     16.458   -9.816  < 2e-16 ***
## color^4        25.973     15.117    1.718   0.0858 .  
## color^5       -98.819     14.278   -6.921 4.55e-12 ***
## color^6       -59.129     12.977   -4.556 5.22e-06 ***
## clarity.L    4129.825     33.880  121.897  < 2e-16 ***
## clarity.Q   -1954.978     31.602  -61.863  < 2e-16 ***
## clarity.C     997.867     27.030   36.918  < 2e-16 ***
## clarity^4    -382.021     21.572  -17.709  < 2e-16 ***
## clarity^5     242.941     17.611   13.795  < 2e-16 ***
## clarity^6      12.562     15.328    0.820   0.4125    
## clarity^7      87.466     13.525    6.467 1.01e-10 ***
## depth         -65.123      4.992  -13.045  < 2e-16 ***
## table         -29.084      3.250   -8.950  < 2e-16 ***
## x           -1020.664     34.775  -29.350  < 2e-16 ***
## y              -1.438     19.396   -0.074   0.9409    
## z             -38.006     33.858   -1.123   0.2616    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1130 on 43119 degrees of freedom
## Multiple R-squared:  0.9202, Adjusted R-squared:  0.9202 
## F-statistic: 2.162e+04 on 23 and 43119 DF,  p-value: < 2.2e-16

lm.rsquared<-1 - (sum((diamonds_test$lmprediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_test$price))^2))
lm.rsquared # 0.9207

## [1] 0.9180486

ggplot(diamonds_test,aes(lmprediction,price,alpha=0.6)) +
  geom_point(color="darkblue")+
  geom_abline(color="red",size=1.2)+
  theme_minimal() +
  theme(legend.position = "none", plot.title = element_text(vjust = 0.5))+
  ggtitle("lm-Predictions vs Actual Prices")

We observe that as the price increases, the accuracy of the model decreases. However, our model works very well in the price range that covers a large proportion of our data.

Random Forest Regression

In the random forest approach, a large number of decision trees are created. Every observation is fed into every decision tree. The most common outcome for each observation is used as the final output. A new observation is fed into all the trees and taking a majority vote for each classification model. read more

  Diamond.rf.model<-randomForest(fmla,data=diamonds_train,ntree=200)
  diamonds_test$rfprediction<-predict(Diamond.rf.model,newdata=diamonds_test)
  print(error<-sqrt(mean((diamonds_test$rfprediction-diamonds_test$price)^2)))

## [1] 562.8579

It is expected to perform more accurate prediction since it creates many decision trees in the randomForest algorithm and can include most of the observations in the model. As a matter of fact, the r square value is quite high at 97.9%.

rf.rsquared<-1 - (sum((diamonds_test$rfprediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_test$price))^2))
rf.rsquared #0.9797688

## [1] 0.9797688

ggplot(diamonds_test,aes(rfprediction,price))+
  geom_point(color="darkblue")+
  geom_abline(color="red",size=1)+
  theme_minimal() +
  theme(legend.position = "none", plot.title = element_text(vjust = 0.5))+
  ggtitle("rf-Predictions vs Actual Prices")