Diamonds are the result of carbon exposed to extremely high heat and pressure deep in the earth. This process produces a variety of internal and external characteristics (e.g., size, shape, color) that make every diamond unique. The jewelry industry therefore uses a systematic way to evaluate and discuss these factors. Diamond professionals use the grading system developed by GIA in the 1950s, which established four important factors to describe and classify diamonds: Clarity, Color, Cut, and Carat Weight. These are known as the 4Cs. The value of a finished diamond is based on this combination. The 4Cs are explained below:
Clarity: The clarity of a diamond is determined by the number, location, and type of inclusions it contains. Inclusions can be microscopic cracks, mineral deposits, or external markings. Clarity is rated on a scale that combines letters and numbers to signify the amount and type of inclusions, ranging from FL (Flawless, the most valuable) to I3.
Color: Most commercially available diamonds are classified by color; more precisely, the most valuable diamonds are those classified as colorless. Color is graded on a letter scale from D to Z, with D representing a colorless diamond. Diamonds graded in the D-F range are the rarest and consequently the most valuable. In practice, diamonds in the G-K range have so little color that the untrained eye cannot see it.
Cut: Cut is the most important characteristic of a diamond. It determines how the light entering the diamond from above is reflected back to the observer's eye. A perfectly cut diamond reflects light optimally.
Carat Weight: In addition to color, clarity, and cut, weight provides a further basis for the valuation of a diamond. The weight of diamonds is measured in carats; one carat equals 1/5 of a gram. Smaller diamonds are more readily available than larger ones, which results in higher per-carat values for heavier stones. A 2-carat diamond will cost more than twice as much as a 1-carat diamond, despite being only twice the weight: the larger the diamond, the rarer it is, and its price rises more than proportionally.
In this assignment we use the diamonds dataset, which contains prices and quality information for 53,940 diamonds and is included in the ggplot2 package. The dataset records the price of each diamond together with various attributes, some of which are known to influence price. These attributes consist of the 4Cs and some physical measurements: depth, table, x, y, and z.
The process followed in the assignment:
Firstly, the data will be examined and preprocessed to improve its quality. Then an exploratory data analysis will be performed through data manipulation and data visualization. The main purpose is to identify which variables affect the price of diamonds most and to draw conclusions about the relationships between variables. In addition, forecasting models will be studied in order to estimate the price of a diamond with given properties.
The packages used in the assignment are listed below:
library(tidyverse)
library(knitr)
library(tinytex)
library(kableExtra)
library(corrplot)
library(RColorBrewer)
library(ggExtra)
library(rpart)
library(rpart.plot)
library(rattle)
library(data.table)
library(caret)
library(e1071)
library(patchwork)
Before starting the analysis, it is useful to review the properties of the diamonds. There are 10 columns describing these properties.
Using the glimpse() function, we obtain the following variables.
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
As with glimpse(), we can get similar results using str(), the structure function from base R.
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Color D is known as the best diamond color and J as the worst; however, the str() output shows that the level order does not reflect this ranking, so we should redefine the levels of the color variable in the diamonds dataset.
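One possible way to do this, sketched in base R (assuming the diamonds tibble is loaded), is to reverse the ordered levels so they run from worst (J) to best (D), matching the worst-to-best orientation of cut and clarity:

```r
# Reverse the ordered levels of color: J < I < H < G < F < E < D
diamonds$color <- factor(diamonds$color,
                         levels = rev(levels(diamonds$color)),
                         ordered = TRUE)
levels(diamonds$color)
```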
For preprocessing, some basic examinations will be made. Some of them are:
## [1] 0
The function above returns the number of NA values in the dataset: there are none, meaning there is no missing value in any row or column.
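The zero above can be produced with a one-line check of this form (a sketch, assuming the diamonds tibble is loaded):

```r
# Total number of NA cells across all rows and columns of the dataset
sum(is.na(diamonds))
```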
## [1] 146
Another important check is for duplicated values. There are 146 duplicated rows, which we need to remove from the dataset before the analysis.
## [1] 0
After applying the unique() function there are no duplicated rows left, and the dataset has 53,794 rows and 10 columns.
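The duplicate check and removal can be sketched as follows (base R; distinct() from dplyr would work equally well):

```r
sum(duplicated(diamonds))    # 146 duplicated rows before cleaning
diamonds <- unique(diamonds) # keep the first occurrence of each row
sum(duplicated(diamonds))    # 0 after cleaning
```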
## Rows: 53,794
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Logically, price should be greater than zero, so we check the dataset for negative price values. According to the results, there is no negative price value.
## # A tibble: 1 x 1
## NegativePrice
## <int>
## 1 0
Unrealistic price values:
The dataset must contain realistic values, so the numeric columns corresponding to the physical characteristics of a diamond should not be negative. As the check above shows, there is no negative price value, as expected.
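The check that produced the NegativePrice output above can be written as a sketch like this:

```r
# Count rows with a negative price; none are expected
diamonds %>%
  summarise(NegativePrice = sum(price < 0))
```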
Missing x, y, z values:
These values correspond to length, width, and depth, so they should be positive. When a zero has been filled in wrongly, we can impute it from the other dimensions, since depth is obtained by a formula involving x, y, and z. This explanation can also be found in the R documentation via ?diamonds.
The calculation is:
total depth percentage = z / mean(x, y) = 2 * z / (x + y)
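As a quick sanity check of the formula, plugging in the first observation of the dataset (x = 3.95, y = 3.98, z = 2.43) reproduces its recorded depth of 61.5 up to rounding:

```r
# Depth percentage for the first observation
x <- 3.95; y <- 3.98; z <- 2.43
2 * z / (x + y) * 100  # about 61.29, close to the recorded 61.5
```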
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## # depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## # depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
## # A tibble: 12 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Premium G SI2 59.1 59 3142 6.55 6.48 0
## 2 1.01 Premium H I1 58.1 59 3167 6.66 6.6 0
## 3 1.1 Premium G SI2 63 59 3696 6.5 6.47 0
## 4 1.01 Premium F SI2 59.2 58 3837 6.5 6.47 0
## 5 1.5 Good G I1 64 61 4731 7.15 7.04 0
## 6 1.15 Ideal G VS2 59.2 56 5564 6.88 6.83 0
## 7 2.18 Premium H SI2 59.4 61 12631 8.49 8.45 0
## 8 2.25 Premium I SI1 61.3 58 15397 8.52 8.42 0
## 9 2.2 Premium H SI1 61.2 59 17265 8.42 8.37 0
## 10 2.02 Premium H VS2 62.7 53 18207 8.02 7.95 0
## 11 2.8 Good G SI2 63.8 58 18788 8.9 8.85 0
## 12 1.12 Premium G I1 60.4 59 2383 6.71 6.67 0
In the 12 rows above, x and y are positive while z is zero, so these rows need not be removed; instead, the missing z values can be filled in using the depth formula:
diamonds = diamonds %>%
  # (x + y) / 2 is the row-wise mean; mean(c(x, y)) would average entire columns
  mutate(z = ifelse(z == 0 & x != 0 & y != 0,
                    round(depth * (x + y) / 2 / 100, 2), z))
However, we need to check one more thing: whether two of these dimensions are zero at the same time. In that case we cannot compute the missing values, because there are two unknowns and only one formula, so we remove those rows from the dataset. The pairwise combinations are checked with filter operations.
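These pairwise checks can be sketched as:

```r
# Rows in which two of the three dimensions are zero at the same time
diamonds %>% filter(x == 0 & y == 0)
diamonds %>% filter(x == 0 & z == 0)
diamonds %>% filter(y == 0 & z == 0)
```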
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Very Good H VS2 63.3 53 5139 0 0 0
## 2 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 3 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 4 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 5 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 6 0.71 Good F SI2 64.1 60 2130 0 0 0
Checking the dataset, we find rows in which both x and y are zero. We check the other combinations as well.
## # A tibble: 7 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.07 Ideal F SI2 61.6 56 4954 0 6.62 0
## 2 1 Very Good H VS2 63.3 53 5139 0 0 0
## 3 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 4 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 5 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 6 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 7 0.71 Good F SI2 64.1 60 2130 0 0 0
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Very Good H VS2 63.3 53 5139 0 0 0
## 2 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 3 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 4 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 5 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 6 0.71 Good F SI2 64.1 60 2130 0 0 0
As a result, there are 7 rows in which both x and z are 0. Since we cannot compute two unknowns from one formula, we remove these rows from the dataset; their number is negligible compared with the whole dataset.
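The removal can be sketched as a single filter that keeps only rows where at most one dimension is zero:

```r
# Drop rows where two (or three) dimensions are zero; they cannot be imputed
diamonds <- diamonds %>%
  filter(!(x == 0 & y == 0), !(x == 0 & z == 0), !(y == 0 & z == 0))
```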
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## # depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
The last filter shows that there are no zero values left in the x, y, and z columns, so we now have a cleaner dataset. One more examination remains: finding outliers. For this purpose, we inspect the numeric variables x, y, and z with the range() function.
For the x and y variables:
range(diamonds$x)
diamonds$x %>%
unique() %>%
sort(decreasing = TRUE)
range(diamonds$y)
diamonds$y %>%
unique() %>%
sort(decreasing = TRUE)
Looking at the output above, there is a huge gap between the highest and the second-highest values of y. The reason may be a data-entry or calculation error, so we can either remove these rows or recompute y using the depth formula.
diamonds %>%
  filter(y > 15) %>%
  mutate(new_y = 200 * z / depth - x) %>%
  select(depth, x, z, y, new_y)
## # A tibble: 2 x 5
##   depth     x     z     y new_y
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  58.9  8.09  8.06  58.9  19.3
## 2  61.8  5.15  5.12  31.8  11.4
When we recompute y by inverting the depth formula (y = 200 * z / depth - x), we still do not obtain plausible values, so we remove these rows from the dataset.
Lastly, we check the range of the z variable.
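A sketch of this check, mirroring the one used for x and y:

```r
range(diamonds$z)
# Inspect the largest z values to spot the outlier
diamonds$z %>%
  unique() %>%
  sort(decreasing = TRUE) %>%
  head()
```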
The z variable also contains one outlier. Manual calculation suggests a misplaced decimal point, so we can either remove this row or replace the value with a corrected one; here we replace it.
# Replace the z outlier with the corrected value (misplaced decimal point)
diamonds$z[diamonds$z == 31.8] <- diamonds$z[diamonds$z == 31.8] / 10
# Recompute depth from x, y, and z so it can be compared with the recorded depth
diamonds$calculated_depth <- 2 * diamonds$z / (diamonds$x + diamonds$y) * 100
Using the calculated depth, obtained from x, y, and z via the formula, and the depth given in the dataset, we can compare the two. The points are expected to fall on a straight line.
diamonds %>%
ggplot(., aes(x = calculated_depth, y = depth)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color="blue", size=1.5) +
theme_minimal() +
labs(title = "The Depth vs Calculated Depth",
x = "Calculated Depth",
y = "Depth (given in the dataset)")
According to the plot, almost all observations follow the expected linear relationship, but some do not. Reference information about diamonds states that the depth percentage generally varies between 56.5 and 65, with an ideal range of 59.5 to 62.9. Hence we can remove observations whose calculated and recorded depths disagree strongly; following the referenced web page, the threshold is set to 9, although this criterion could be changed in other implementations.
diamonds %>%
filter(!(abs(calculated_depth - depth) <9)) %>%
select(calculated_depth, depth, x, y, z)
## # A tibble: 24 x 5
## calculated_depth depth x y z
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 51.2 62.8 6.26 6.19 3.19
## 2 63.1 43 6.32 6.27 3.97
## 3 65.7 44 6.31 6.24 4.12
## 4 51.7 64 7.15 7.04 3.67
## 5 63.0 43 6.53 6.55 4.12
## 6 49.5 59.2 6.88 6.83 3.39
## 7 16.1 60.6 6.62 6.67 1.07
## 8 65.0 55.9 6.77 6.71 4.38
## 9 56.9 67.3 7.85 5.75 3.87
## 10 20.5 61.9 7.43 7.5 1.53
## # ... with 14 more rows
There are 24 such observations, a very small number compared with the whole dataset, so we can remove them.
diamonds = diamonds %>%
filter((abs(calculated_depth - depth) < 9))
diamonds = subset(diamonds, select = -c(calculated_depth))
## Rows: 53,761
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
According to the depth formula and the referenced web information, for well-proportioned diamonds there should be a relationship between the variables x, y, and z. As a final check, we therefore compare x, y, and z with each other.
p1<- diamonds %>%
ggplot(., aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", size=2) +
theme_minimal() + theme(axis.text.x = element_text(angle=90, size=6, vjust=0.5,hjust=1))+
labs(title = "Comparison of \n the x and y variables",
x = "X variable",
y = "Y variable")
p2 <- diamonds %>%
ggplot(., aes(x = z, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", size=2)+
theme_minimal() + theme(axis.text.x = element_text(angle=90, size=6, vjust=0.5,hjust=1))+
labs(title = "Comparison of \n the z and y variables",
x = "Z variable",
y = "Y variable")
p3 <- diamonds %>%
ggplot(., aes(x = x, y = z)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", size=2)+
theme_minimal() + theme(axis.text.x = element_text(angle=90, size=6, vjust=0.5,hjust=1))+
labs(title = "Comparison of \n the x and z variables",
x = "X variable",
y = "Z variable")
(p1 | p2 | p3)
As the plots show, x, y, and z are highly correlated with one another, so we can treat the remaining x, y, and z values as valid.
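This can be quantified with a correlation matrix (a sketch; values close to 1 are expected for all pairs):

```r
# Pairwise correlations among the three dimension variables
diamonds %>%
  select(x, y, z) %>%
  cor() %>%
  round(3)
```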
After preprocessing, the summary and glimpse of the data can be seen below. The cleaned dataset has 53,761 rows.
## carat cut color clarity depth
## Min. :0.2000 Fair : 1592 J: 2800 SI1 :13026 Min. :50.80
## 1st Qu.:0.4000 Good : 4887 I: 5405 VS2 :12221 1st Qu.:61.00
## Median :0.7000 Very Good:12065 H: 8262 SI2 : 9140 Median :61.80
## Mean :0.7974 Premium :13738 G:11252 VS1 : 8151 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21479 F: 9517 VVS2 : 5055 3rd Qu.:62.50
## Max. :5.0100 E: 9772 VVS1 : 3645 Max. :79.00
## D: 6753 (Other): 2523
## table price x y
## Min. :43.00 Min. : 326 Min. : 3.730 Min. : 3.680
## 1st Qu.:56.00 1st Qu.: 951 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3930 Mean : 5.731 Mean : 5.733
## 3rd Qu.:59.00 3rd Qu.: 5322 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :10.540
##
## z
## Min. :2.240
## 1st Qu.:2.910
## Median :3.530
## Mean :3.539
## 3rd Qu.:4.030
## Max. :6.980
##
## Rows: 53,761
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
Since the mean of price is greater than its median, the price distribution is right-skewed.
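This can be confirmed directly by comparing the two location measures:

```r
# mean > median indicates a right-skewed price distribution
c(mean = mean(diamonds$price), median = median(diamonds$price))
```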
diamonds %>%
group_by(cut) %>%
summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
mutate(percentage = 100*round(count/sum(count),3))%>%
arrange((cut)) %>%
select(cut, count, percentage, Min_Price, Average_Price, Max_Price ) %>%
kable(col.names = c("Diamond Cut","Count","Percentage", "Minimum Price", "Average Price", "Maximum Price"))%>%
kable_minimal(full_width = F)
Diamond Cut | Count | Percentage | Minimum Price | Average Price | Maximum Price |
---|---|---|---|---|---|
Fair | 1592 | 3.0 | 337 | 4330.16 | 18574 |
Good | 4887 | 9.1 | 327 | 3916.91 | 18707 |
Very Good | 12065 | 22.4 | 336 | 3981.30 | 18818 |
Premium | 13738 | 25.6 | 326 | 4576.58 | 18823 |
Ideal | 21479 | 40.0 | 326 | 3461.60 | 18806 |
diamonds%>% ggplot(., aes(cut, price)) +
geom_boxplot(aes(fill=factor(cut))) +
scale_fill_brewer(palette="Pastel1")+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
theme_minimal()+
labs(title="Price vs Cut",
subtitle="Diamonds dataset",
x="Diamond Cut",
y="Price",
fill="Diamond Cut")
diamonds%>%
ggplot(., aes(x=price)) +
geom_histogram(binwidth=500, position="dodge", fill="pink2") +
theme_minimal()+facet_wrap(~cut, ncol=5)+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
theme(strip.background = element_rect(fill="pink3"))+
labs(title="Price Frequency vs Diamond Cut",
subtitle="Diamonds dataset",
x="Price",
y="Frequency",
fill="Diamond Cut")
Although Ideal is the best cut type, its average price is the lowest. According to the average prices, the most expensive diamonds belong to the Premium and Fair cut types. These results show that cut alone is not enough to explain the response variable price, since price does not increase as the cut grade improves.
diamonds %>%
group_by(color) %>%
summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
mutate(percentage = 100*round(count/sum(count),3))%>%
arrange((color)) %>%
select(color, count, percentage, Min_Price, Average_Price, Max_Price ) %>%
kable(col.names = c("Diamond Color","Count","Percentage", "Minimum Price", "Average Price", "Maximum Price"))%>%
kable_minimal(full_width = F)
Diamond Color | Count | Percentage | Minimum Price | Average Price | Maximum Price |
---|---|---|---|---|---|
J | 2800 | 5.2 | 335 | 5327.34 | 18710 |
I | 5405 | 10.1 | 334 | 5079.12 | 18823 |
H | 8262 | 15.4 | 337 | 4474.01 | 18803 |
G | 11252 | 20.9 | 354 | 3997.03 | 18818 |
F | 9517 | 17.7 | 342 | 3726.56 | 18791 |
E | 9772 | 18.2 | 326 | 3080.28 | 18731 |
D | 6753 | 12.6 | 357 | 3172.01 | 18693 |
diamonds %>%
group_by(color) %>%
summarise(Average_Price = mean(price)) %>%
ggplot(.,aes(x=color, y = Average_Price, fill= color)) +
geom_col(color="black") +
geom_text(aes(label = format(Average_Price, digits = 3)), position = position_dodge(0.9),vjust = 5) +
geom_line(aes(y = Average_Price), color="black", group=1, size=0.8)+
theme_minimal() +
theme(axis.text.x = element_text(angle = 65))+
scale_fill_brewer(palette="Pastel2")+
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
labs(title="Price vs Color",
subtitle="Diamonds dataset",
x="Diamond Color",
y="Average Price",
fill="Diamond color")
diamonds%>%
ggplot(., aes(x=price)) +
geom_histogram(binwidth=500, position="dodge", fill="palegreen3") +
theme_minimal()+facet_wrap(~color, ncol=7)+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
theme(strip.background = element_rect(fill="seagreen3"))+
labs(title="Price Frequency vs Diamond Color",
subtitle="Diamonds dataset",
x="Price",
y="Frequency",
fill="Diamond Color")
Although D is the best color type, its average price is among the lowest. According to the average prices, the most expensive diamonds belong to the J and I color types, which are actually the two worst colors in this dataset. These results show that color alone is not enough to explain the response variable price, since price does not increase as the color grade improves.
diamonds %>%
group_by(clarity) %>%
summarize(count=n(),Min_Price=min(price),Average_Price = round(mean(price),2),Max_Price=max(price) ) %>%
mutate(percentage = 100*round(count/sum(count),3))%>%
arrange((clarity)) %>%
select(clarity, count, percentage, Min_Price, Average_Price, Max_Price ) %>%
kable(col.names = c("Diamond Clarity","Count","Percentage", "Minimum Price", "Average Price", "Maximum Price"))%>%
kable_minimal(full_width = F)
Diamond Clarity | Count | Percentage | Minimum Price | Average Price | Maximum Price |
---|---|---|---|---|---|
I1 | 739 | 1.4 | 345 | 3924.19 | 18531 |
SI2 | 9140 | 17.0 | 326 | 5051.59 | 18804 |
SI1 | 13026 | 24.2 | 326 | 3992.93 | 18818 |
VS2 | 12221 | 22.7 | 334 | 3925.96 | 18823 |
VS1 | 8151 | 15.2 | 327 | 3839.87 | 18795 |
VVS2 | 5055 | 9.4 | 336 | 3286.97 | 18768 |
VVS1 | 3645 | 6.8 | 336 | 2523.09 | 18777 |
IF | 1784 | 3.3 | 369 | 2870.57 | 18806 |
diamonds %>%
ggplot(., aes(x = clarity, y = price, color = clarity)) +
scale_color_brewer(palette="Set3")+
geom_jitter(alpha=0.7) +
theme_minimal() +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title = "Price vs Clarity",
subtitle = "Diamonds dataset",
x = "Diamond Clarity",
y = "Price",
color = "Diamond Clarity")
diamonds%>%
ggplot(., aes(x=price)) +
geom_histogram(binwidth=500, position="dodge", fill="skyblue3") +
theme_minimal()+facet_wrap(~clarity, ncol=8)+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
theme(strip.background = element_rect(fill="skyblue2"))+
labs(title="Price Frequency vs Diamond Clarity",
subtitle="Diamonds dataset",
x="Price",
y="Frequency",
fill="Diamond Clarity")
Although IF is the best clarity type, its average price is among the lowest. According to the average prices, the most expensive diamonds belong to the SI2 clarity type, which is actually the second-worst clarity grade in this dataset. These results show that clarity alone is not enough to explain the response variable price, since price does not increase as the clarity grade improves.
diamonds %>%
group_by(carat) %>%
summarise(carat_count = n(),
minPrice = min(price),
AveragePrice = mean(price),
MaxPrice = max(price))%>%
arrange(desc(carat_count)) %>%
kable(col.names = c("Carat", "Count","Minimum Price", "Average Price", "Maximum Price")) %>%
kable_styling("striped", full_width = T) %>%
scroll_box(width = "100%", height = "300px")
Carat | Count | Minimum Price | Average Price | Maximum Price |
---|---|---|---|---|
0.30 | 2596 | 339 | 680.6367 | 2366 |
1.01 | 2240 | 2017 | 5507.1902 | 16234 |
0.31 | 2238 | 335 | 708.1997 | 1917 |
0.70 | 1981 | 945 | 2516.0717 | 5539 |
0.32 | 1827 | 345 | 719.1916 | 1715 |
1.00 | 1543 | 1681 | 5249.5846 | 16469 |
0.90 | 1484 | 1599 | 3939.5889 | 9182 |
0.41 | 1378 | 467 | 973.0530 | 1745 |
0.40 | 1291 | 484 | 933.0077 | 1908 |
0.71 | 1291 | 1305 | 2674.3106 | 5019 |
0.50 | 1254 | 584 | 1503.8453 | 3378 |
0.33 | 1179 | 366 | 765.1747 | 1389 |
0.51 | 1123 | 826 | 1623.5841 | 4368 |
0.34 | 908 | 413 | 755.7423 | 2346 |
1.02 | 883 | 2344 | 5598.4632 | 17100 |
0.52 | 813 | 870 | 1656.4785 | 3373 |
1.51 | 804 | 3497 | 10552.0211 | 18806 |
1.50 | 787 | 2964 | 10082.9835 | 18691 |
0.72 | 764 | 945 | 2705.7788 | 5086 |
0.53 | 707 | 844 | 1685.5248 | 3827 |
0.42 | 704 | 552 | 994.4545 | 1722 |
0.38 | 669 | 497 | 904.1973 | 1992 |
0.35 | 663 | 409 | 799.5098 | 1672 |
1.20 | 643 | 2360 | 6670.7387 | 16256 |
0.54 | 624 | 935 | 1704.3317 | 4042 |
0.36 | 570 | 410 | 781.4667 | 1718 |
0.91 | 570 | 1570 | 3981.1561 | 9636 |
1.03 | 522 | 1262 | 5583.8966 | 17590 |
0.55 | 495 | 959 | 1725.2283 | 3509 |
0.56 | 492 | 953 | 1831.9980 | 4632 |
0.73 | 492 | 1049 | 2791.2480 | 5154 |
0.43 | 486 | 452 | 995.2737 | 1827 |
1.04 | 474 | 2037 | 5774.6709 | 18542 |
1.21 | 470 | 2396 | 6905.4043 | 17353 |
2.01 | 433 | 5696 | 14772.7136 | 18741 |
0.57 | 429 | 929 | 1798.0443 | 4032 |
0.39 | 397 | 451 | 923.4761 | 1624 |
0.37 | 394 | 451 | 816.0711 | 1764 |
1.52 | 380 | 3105 | 10629.9868 | 17723 |
1.06 | 372 | 2080 | 5947.0081 | 15813 |
1.05 | 361 | 2066 | 5781.6814 | 13060 |
1.07 | 340 | 2260 | 6004.2118 | 18279 |
0.74 | 322 | 1632 | 2821.9006 | 5841 |
0.58 | 310 | 893 | 1798.6484 | 3741 |
1.11 | 308 | 3183 | 6005.5682 | 15464 |
1.22 | 299 | 2699 | 6971.5151 | 14414 |
0.23 | 293 | 326 | 486.1433 | 688 |
1.09 | 285 | 2708 | 6020.0947 | 18231 |
0.80 | 284 | 1232 | 3146.1901 | 6834 |
0.59 | 282 | 903 | 1831.4504 | 4916 |
1.23 | 277 | 3168 | 7290.8881 | 16253 |
1.10 | 276 | 3216 | 5913.0725 | 12748 |
2.00 | 261 | 5051 | 14107.1456 | 18818 |
0.24 | 254 | 336 | 505.1850 | 963 |
0.26 | 253 | 337 | 550.8972 | 814 |
0.76 | 251 | 1140 | 2889.5936 | 5056 |
0.77 | 251 | 1651 | 2884.3984 | 5251 |
1.12 | 251 | 2383 | 6044.4900 | 14407 |
0.75 | 249 | 1013 | 2763.4016 | 5288 |
1.08 | 246 | 2869 | 5774.3415 | 14032 |
1.13 | 246 | 2968 | 6193.4024 | 14525 |
1.24 | 235 | 2940 | 7147.1149 | 15026 |
0.27 | 233 | 361 | 574.7597 | 893 |
0.60 | 226 | 806 | 1956.2434 | 4209 |
0.92 | 226 | 2283 | 4035.2124 | 7544 |
1.53 | 219 | 5631 | 10703.4658 | 17223 |
1.70 | 215 | 5617 | 12110.3209 | 18718 |
0.25 | 212 | 357 | 550.9245 | 1186 |
0.44 | 212 | 648 | 1056.6226 | 2506 |
1.14 | 206 | 2327 | 6354.4078 | 18112 |
0.61 | 204 | 931 | 2069.2451 | 4279 |
0.81 | 200 | 1687 | 3094.4250 | 6338 |
0.28 | 198 | 360 | 580.1212 | 828 |
0.78 | 187 | 1720 | 2918.9091 | 4904 |
1.25 | 187 | 3276 | 7123.2513 | 16823 |
0.46 | 177 | 746 | 1240.2429 | 2139 |
2.02 | 176 | 6346 | 14767.4091 | 18731 |
1.54 | 174 | 4492 | 10865.0287 | 18416 |
1.16 | 172 | 2368 | 6274.2965 | 13595 |
0.79 | 151 | 1809 | 2993.2649 | 4858 |
1.15 | 148 | 3220 | 5962.5811 | 12895 |
1.26 | 146 | 3881 | 7490.7466 | 15247 |
0.93 | 142 | 2374 | 3994.2958 | 5992 |
0.82 | 140 | 2033 | 3083.6500 | 4931 |
0.62 | 135 | 933 | 1973.3704 | 3968 |
1.27 | 134 | 2845 | 7119.1194 | 15377 |
1.31 | 132 | 3697 | 7750.6364 | 17496 |
0.83 | 131 | 1863 | 3181.9924 | 7889 |
0.29 | 130 | 334 | 601.1923 | 1776 |
1.19 | 126 | 2892 | 6483.2381 | 15767 |
1.18 | 123 | 3219 | 6481.1057 | 14847 |
1.55 | 123 | 4965 | 10717.7642 | 18188 |
2.03 | 122 | 6002 | 14710.1148 | 18781 |
1.30 | 121 | 2512 | 7851.6612 | 15874 |
1.71 | 119 | 6320 | 12396.5546 | 18791 |
0.45 | 110 | 672 | 1093.3182 | 1833 |
1.17 | 110 | 2336 | 6224.8818 | 11886 |
1.56 | 108 | 5766 | 10205.9444 | 17455 |
1.28 | 106 | 3355 | 7711.8113 | 18700 |
1.57 | 105 | 5170 | 10444.5143 | 17548 |
0.96 | 103 | 1637 | 3623.4660 | 6807 |
0.63 | 102 | 1065 | 2244.7843 | 6607 |
1.29 | 101 | 2596 | 7033.0198 | 17932 |
0.47 | 99 | 719 | 1272.3131 | 2460 |
1.60 | 95 | 6272 | 11124.2526 | 18780 |
1.32 | 89 | 4140 | 7639.3708 | 15802 |
1.58 | 89 | 4459 | 11427.9888 | 18057 |
1.59 | 89 | 6010 | 10574.3596 | 17552 |
1.33 | 87 | 4634 | 7527.0805 | 18435 |
2.04 | 86 | 7403 | 14840.1744 | 18795 |
0.64 | 80 | 1080 | 2113.5875 | 4281 |
1.35 | 77 | 4082 | 7448.2857 | 14616 |
1.34 | 68 | 4352 | 7758.5000 | 17663 |
0.65 | 65 | 1214 | 2322.6154 | 5052 |
0.95 | 65 | 2195 | 4013.6615 | 6951 |
2.05 | 65 | 7026 | 15647.2462 | 18787 |
0.84 | 64 | 1749 | 3294.9844 | 6399 |
1.61 | 64 | 6915 | 10992.8438 | 18318 |
0.48 | 63 | 841 | 1322.0476 | 2677 |
0.85 | 62 | 1250 | 3204.1774 | 7230 |
1.62 | 61 | 5157 | 11705.8852 | 18281 |
0.94 | 59 | 2571 | 4195.7797 | 7120 |
0.97 | 59 | 1547 | 4026.2373 | 7415 |
2.06 | 59 | 5203 | 14348.6780 | 18779 |
1.72 | 57 | 5240 | 12040.3158 | 18730 |
1.73 | 52 | 6007 | 11875.9615 | 18377 |
2.10 | 51 | 6597 | 14464.3725 | 18648 |
1.36 | 50 | 4158 | 8591.5200 | 15386 |
1.40 | 50 | 4939 | 9108.5200 | 16808 |
1.63 | 50 | 7100 | 11556.3200 | 18179 |
1.75 | 50 | 7276 | 12673.7000 | 17343 |
2.07 | 50 | 6503 | 15384.2800 | 18804 |
0.66 | 48 | 1188 | 2235.7292 | 4311 |
0.67 | 48 | 1244 | 2067.8542 | 3271 |
2.14 | 48 | 5405 | 14905.0417 | 18528 |
1.37 | 46 | 5063 | 8044.0870 | 13439 |
0.49 | 45 | 900 | 1306.3333 | 2199 |
2.09 | 45 | 10042 | 15287.3111 | 18640 |
1.64 | 43 | 4849 | 10738.0000 | 17604 |
2.11 | 43 | 7019 | 14563.4186 | 18575 |
2.08 | 41 | 6150 | 15128.6829 | 18760 |
1.74 | 40 | 4677 | 12270.3250 | 18430 |
1.41 | 39 | 4145 | 9452.2051 | 17216 |
1.39 | 36 | 3914 | 8335.1111 | 13061 |
0.86 | 34 | 1828 | 3289.9706 | 5029 |
1.65 | 32 | 5914 | 10973.8750 | 17729 |
0.87 | 31 | 1827 | 3342.3226 | 5640 |
0.98 | 31 | 1712 | 3469.7742 | 4950 |
2.20 | 31 | 7934 | 15226.8710 | 18364 |
1.66 | 30 | 5310 | 11084.5000 | 16294 |
2.18 | 30 | 10340 | 14901.7000 | 18717 |
1.76 | 28 | 9053 | 12398.5714 | 18178 |
2.22 | 27 | 5607 | 14691.5185 | 18706 |
0.69 | 26 | 1636 | 2206.6154 | 3784 |
1.38 | 26 | 4598 | 8164.3077 | 17598 |
0.68 | 25 | 1341 | 2334.3200 | 4325 |
1.42 | 25 | 6069 | 9487.2000 | 18682 |
1.67 | 25 | 7911 | 12281.4800 | 17351 |
2.12 | 25 | 7508 | 15007.6800 | 18656 |
2.16 | 25 | 8709 | 15860.4800 | 18678 |
1.69 | 24 | 5182 | 11878.0833 | 17803 |
0.88 | 23 | 1964 | 3431.3478 | 5595 |
0.99 | 23 | 1789 | 4406.1739 | 6094 |
2.21 | 23 | 6535 | 15287.9565 | 18483 |
2.15 | 22 | 5430 | 14187.1818 | 18791 |
2.19 | 22 | 10179 | 15521.0909 | 18693 |
0.89 | 21 | 1334 | 3291.2381 | 4854 |
1.47 | 21 | 4898 | 7833.7143 | 12547 |
2.13 | 21 | 12356 | 15220.4286 | 18442 |
1.80 | 20 | 7862 | 12955.1500 | 17273 |
2.28 | 20 | 9068 | 15422.9500 | 18055 |
2.30 | 20 | 7226 | 14861.5500 | 18304 |
1.43 | 19 | 6086 | 9234.9474 | 14429 |
1.68 | 19 | 5765 | 11779.1053 | 16985 |
1.44 | 18 | 4064 | 8021.6111 | 12093 |
1.46 | 18 | 5421 | 8237.4444 | 12261 |
1.83 | 18 | 5083 | 11930.0000 | 18525 |
2.17 | 18 | 6817 | 14416.6111 | 17805 |
1.77 | 17 | 7771 | 11103.1765 | 15278 |
2.29 | 17 | 11502 | 15834.7647 | 18823 |
2.51 | 17 | 14209 | 16616.8824 | 18308 |
2.24 | 16 | 10913 | 15573.9375 | 18299 |
2.25 | 16 | 6653 | 13403.5625 | 18034 |
2.32 | 16 | 6860 | 16042.1875 | 18532 |
2.50 | 16 | 7854 | 14316.1875 | 18447 |
1.45 | 15 | 4320 | 8709.3333 | 17545 |
1.79 | 15 | 9122 | 13144.0667 | 18429 |
2.26 | 15 | 11226 | 15330.2000 | 18426 |
1.82 | 13 | 5993 | 12986.1538 | 15802 |
2.23 | 13 | 7006 | 13947.4615 | 18149 |
2.31 | 13 | 7257 | 15017.3846 | 17715 |
2.40 | 13 | 11988 | 16362.4615 | 18541 |
0.20 | 12 | 345 | 365.1667 | 367 |
1.78 | 12 | 8889 | 12504.5833 | 18128 |
1.91 | 12 | 6632 | 12365.8333 | 17509 |
2.27 | 12 | 5733 | 14198.4167 | 18343 |
3.01 | 12 | 8040 | 15521.6667 | 18710 |
1.49 | 11 | 5407 | 9174.2727 | 18614 |
0.21 | 9 | 326 | 380.2222 | 394 |
1.81 | 9 | 5859 | 11374.4444 | 14177 |
1.86 | 9 | 9791 | 12164.3333 | 17267 |
2.33 | 9 | 8220 | 14089.7778 | 18119 |
2.48 | 9 | 12883 | 16537.1111 | 18692 |
2.52 | 9 | 10076 | 16242.1111 | 18252 |
2.36 | 8 | 11238 | 15756.7500 | 18745 |
2.38 | 8 | 10716 | 15481.3750 | 18559 |
2.53 | 8 | 14659 | 16548.8750 | 18254 |
2.54 | 8 | 12095 | 15125.2500 | 17996 |
3.00 | 8 | 6512 | 12378.5000 | 16970 |
1.48 | 7 | 6216 | 8791.2857 | 15164 |
1.87 | 7 | 9955 | 12933.5714 | 17761 |
1.90 | 7 | 8576 | 11477.1429 | 13919 |
2.35 | 7 | 14185 | 16090.4286 | 17999 |
2.39 | 7 | 15917 | 17182.4286 | 17955 |
2.42 | 7 | 16198 | 16923.5714 | 18615 |
1.93 | 6 | 13278 | 16828.1667 | 18306 |
2.37 | 6 | 14837 | 16525.1667 | 18508 |
2.43 | 6 | 9716 | 15356.0000 | 18692 |
0.22 | 5 | 337 | 391.4000 | 470 |
1.98 | 5 | 12923 | 14667.8000 | 16171 |
2.34 | 5 | 8491 | 11922.0000 | 15818 |
2.41 | 5 | 13563 | 15808.2000 | 17923 |
1.84 | 4 | 7922 | 11108.5000 | 13905 |
1.88 | 4 | 8048 | 10946.2500 | 13607 |
1.89 | 4 | 10055 | 13969.0000 | 17553 |
1.96 | 4 | 5554 | 8642.2500 | 13099 |
1.97 | 4 | 14410 | 15966.0000 | 17535 |
2.44 | 4 | 13027 | 16365.5000 | 18430 |
2.45 | 4 | 11830 | 14593.7500 | 18113 |
1.85 | 3 | 5688 | 11436.0000 | 14375 |
1.94 | 3 | 9344 | 14189.3333 | 18735 |
1.95 | 3 | 5045 | 12340.6667 | 17374 |
1.99 | 3 | 11486 | 14179.0000 | 17713 |
2.46 | 3 | 10470 | 14227.0000 | 16466 |
2.47 | 3 | 14970 | 15644.0000 | 16532 |
2.49 | 3 | 6289 | 13843.0000 | 18325 |
2.55 | 3 | 14351 | 15964.0000 | 18766 |
2.56 | 3 | 15231 | 16723.3333 | 17753 |
2.57 | 3 | 17116 | 17841.6667 | 18485 |
2.58 | 3 | 12500 | 14481.3333 | 16195 |
2.60 | 3 | 17027 | 17535.0000 | 18369 |
2.61 | 3 | 13784 | 16583.0000 | 18756 |
2.63 | 3 | 10437 | 12659.6667 | 16914 |
2.72 | 3 | 6870 | 12088.3333 | 17801 |
2.74 | 3 | 8807 | 14385.0000 | 17184 |
1.92 | 2 | 15364 | 15418.0000 | 15472 |
2.66 | 2 | 16239 | 17367.0000 | 18495 |
2.68 | 2 | 8419 | 9042.0000 | 9665 |
2.75 | 2 | 13156 | 14285.5000 | 15415 |
3.04 | 2 | 15354 | 16956.5000 | 18559 |
4.01 | 2 | 15223 | 15223.0000 | 15223 |
2.59 | 1 | 16465 | 16465.0000 | 16465 |
2.64 | 1 | 17407 | 17407.0000 | 17407 |
2.65 | 1 | 16314 | 16314.0000 | 16314 |
2.67 | 1 | 18686 | 18686.0000 | 18686 |
2.70 | 1 | 14341 | 14341.0000 | 14341 |
2.71 | 1 | 17146 | 17146.0000 | 17146 |
2.77 | 1 | 10424 | 10424.0000 | 10424 |
2.80 | 1 | 15030 | 15030.0000 | 15030 |
3.02 | 1 | 10577 | 10577.0000 | 10577 |
3.05 | 1 | 10453 | 10453.0000 | 10453 |
3.11 | 1 | 9823 | 9823.0000 | 9823 |
3.22 | 1 | 12545 | 12545.0000 | 12545 |
3.24 | 1 | 12300 | 12300.0000 | 12300 |
3.40 | 1 | 15964 | 15964.0000 | 15964 |
3.50 | 1 | 12587 | 12587.0000 | 12587 |
3.51 | 1 | 18701 | 18701.0000 | 18701 |
3.65 | 1 | 11668 | 11668.0000 | 11668 |
3.67 | 1 | 16193 | 16193.0000 | 16193 |
4.00 | 1 | 15984 | 15984.0000 | 15984 |
4.13 | 1 | 17329 | 17329.0000 | 17329 |
4.50 | 1 | 18531 | 18531.0000 | 18531 |
5.01 | 1 | 18018 | 18018.0000 | 18018 |
The most frequent carat weight in this data set is 0.3. The histogram shows the distribution of carat, and the table above lists every carat value together with its count. Most of the carat values in this data set are less than 1.
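These two claims can be checked directly in R (a quick sketch; `diamonds` ships with the ggplot2 package):

```r
library(ggplot2)  # provides the diamonds data set

# most frequent carat weight
names(which.max(table(diamonds$carat)))  # "0.3"

# share of diamonds lighter than 1 carat (a majority)
mean(diamonds$carat < 1)
```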
Price is strongly related to x, y, and z, as the scatter plots below show. Since depth is derived from these variables, a separate plot for depth is not presented in this part.
p1 <- ggplot(diamonds, aes(x, price)) +
  geom_point(alpha = 0.5, color = "seagreen3") +
  geom_smooth(method = "loess", se = FALSE, color = "red", size = 2, linetype = 2) +
  theme_minimal() +
  labs(title = "Price vs Diamond Length",
       subtitle = "Diamonds dataset",
       x = "Diamond Length (x)",
       y = "Price")
p2 <- ggplot(diamonds, aes(y, price)) +
  geom_point(alpha = 0.5, color = "seagreen3") +
  geom_smooth(method = "loess", se = FALSE, color = "red", size = 2, linetype = 2) +
  theme_minimal() +
  labs(title = "Price vs Diamond Width",
       subtitle = "Diamonds dataset",
       x = "Diamond Width (y)",
       y = "Price")
p3 <- ggplot(diamonds, aes(z, price)) +
  geom_point(alpha = 0.5, color = "seagreen3") +
  geom_smooth(method = "loess", se = FALSE, color = "red", size = 2, linetype = 2) +
  theme_minimal() +
  labs(title = "Price vs Diamond Depth",
       subtitle = "Diamonds dataset",
       x = "Diamond Depth (z)",
       y = "Price")
# combine the three plots side by side (patchwork syntax)
(p1 | p2 | p3)
To model and predict the price variable, we first divide the data set into two parts: a train set and a test set. We fit the model on the train data and then use the test data to check its accuracy. The following code produces the two sets. Because the split is a random process, we call set.seed() so that the same split is reproduced on every run.
set.seed(503)
#test data
diamonds_test <- diamonds %>%
  # add a diamond_id for each row so the split can be undone later
  mutate(diamond_id = row_number()) %>%
  # group by cut, color and clarity so the split is stratified
  group_by(cut, color, clarity) %>%
  # sample_frac() takes 20% of each group as test data
  sample_frac(0.2) %>%
  ungroup()
#train data
#anti_join() keeps the rows of the data set that are not in the test set
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
                            diamonds_test, by = "diamond_id")
#drop the diamond_id column, which was only needed for the split
diamonds_train <- diamonds_train[, -ncol(diamonds_train)]
diamonds_test <- diamonds_test[, -ncol(diamonds_test)]
Train data set:
## Rows: 43,000
## Columns: 10
## $ carat <dbl> 0.21, 0.23, 0.31, 0.24, 0.26, 0.22, 0.30, 0.23, 0.31, 0.20,...
## $ cut <ord> Premium, Good, Good, Very Good, Very Good, Fair, Good, Idea...
## $ color <ord> E, E, J, J, H, E, J, J, J, E, I, J, I, E, J, J, G, I, D, F,...
## $ clarity <ord> SI1, VS1, SI2, VVS2, SI1, VS2, SI1, VS1, SI2, SI2, SI2, SI1...
## $ depth <dbl> 59.8, 56.9, 63.3, 62.8, 61.9, 65.1, 64.0, 62.8, 62.2, 60.2,...
## $ table <dbl> 61, 65, 58, 57, 55, 61, 55, 56, 54, 62, 54, 56, 56, 55, 62,...
## $ price <int> 326, 327, 335, 336, 337, 337, 339, 340, 344, 345, 348, 351,...
## $ x <dbl> 3.89, 4.05, 4.34, 3.94, 4.07, 3.87, 4.25, 3.93, 4.35, 3.79,...
## $ y <dbl> 3.84, 4.07, 4.35, 3.96, 4.11, 3.78, 4.28, 3.90, 4.37, 3.75,...
## $ z <dbl> 2.31, 2.31, 2.75, 2.48, 2.53, 2.49, 2.73, 2.46, 2.71, 2.27,...
Test data set:
## Rows: 10,761
## Columns: 10
## $ carat <dbl> 2.15, 2.00, 0.70, 1.00, 1.72, 1.45, 1.05, 1.02, 0.71, 2.02,...
## $ cut <ord> Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair,...
## $ color <ord> J, J, J, J, J, J, J, J, J, J, J, J, J, J, J, J, J, J, J, J,...
## $ clarity <ord> I1, I1, I1, I1, I1, SI2, SI2, SI2, SI2, SI2, SI1, SI1, SI1,...
## $ depth <dbl> 65.5, 66.5, 64.7, 65.0, 68.5, 58.6, 65.8, 65.0, 65.5, 64.6,...
## $ table <dbl> 57.0, 56.0, 59.0, 59.0, 59.0, 68.0, 59.0, 59.0, 61.0, 55.0,...
## $ price <int> 5430, 6796, 1066, 1784, 5240, 4570, 2789, 3119, 1409, 9853,...
## $ x <dbl> 8.01, 7.93, 5.59, 6.27, 7.31, 7.45, 6.41, 6.34, 5.54, 7.93,...
## $ y <dbl> 7.95, 7.80, 5.50, 6.17, 7.24, 7.33, 6.27, 6.24, 5.49, 7.84,...
## $ z <dbl> 5.23, 5.23, 3.59, 4.05, 4.98, 4.34, 4.18, 4.08, 3.61, 5.09,...
After splitting the data into train and test sets, we dropped the diamond_id column to keep the analysis clean. We can now convert the ordered factor variables to numeric with the as.numeric() function; from this point on, the ordered columns cut, color and clarity are treated as numeric scores.
#convert the ordered factors to numeric and
#overwrite the original columns
diamonds_train$cut <- as.numeric(diamonds_train$cut)
diamonds_train$clarity <- as.numeric(diamonds_train$clarity)
diamonds_train$color <- as.numeric(diamonds_train$color)
diamonds_test$cut <- as.numeric(diamonds_test$cut)
diamonds_test$clarity <- as.numeric(diamonds_test$clarity)
diamonds_test$color <- as.numeric(diamonds_test$color)
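A toy example of what as.numeric() does to an ordered factor: it returns each value's level index, not the printed label.

```r
# ordered factor with three levels: Fair < Good < Ideal
f <- factor(c("Good", "Ideal", "Fair"),
            levels = c("Fair", "Good", "Ideal"), ordered = TRUE)
as.numeric(f)  # 2 3 1 — the position of each value in the level order
```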
Principal Component Analysis (PCA) applies to numeric variables, so we select only the numeric columns from our data set. The summary output shows that 4 components are enough to explain about 88% of the total variance. We add these components to both the train and the test set for the price-estimation models that follow. To run the PCA, we use the princomp() function from the stats package included with base R.
set.seed(157)
diamonds_pca <- princomp(as.matrix(diamonds_train[,sapply(diamonds_train, class) == "numeric"]),cor=T)
summary(diamonds_pca,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.0743446 1.1893677 1.1061659 0.9926647 0.83123137
## Proportion of Variance 0.4781006 0.1571773 0.1359559 0.1094870 0.07677173
## Cumulative Proportion 0.4781006 0.6352779 0.7712338 0.8807208 0.95749254
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.59250776 0.172964901 0.0341800209 2.040935e-02
## Proportion of Variance 0.03900727 0.003324095 0.0001298082 4.628241e-05
## Cumulative Proportion 0.99649981 0.999823909 0.9999537176 1.000000e+00
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## carat 0.471 0.126 0.868
## cut -0.109 0.595 0.452 0.240 -0.176 0.584
## color -0.158 -0.203 0.803 0.537
## clarity -0.222 0.208 0.212 -0.506 0.778
## depth 0.209 -0.828 0.117 0.492 0.101
## table 0.129 -0.698 0.226 -0.182 0.642
## x 0.475 0.115 -0.278 -0.699 0.425
## y 0.474 0.120 -0.299 0.714 0.381
## z 0.474 0.103 0.132 -0.282 -0.815
#add the principal component scores to the data sets;
#since we selected four components, we add four new columns
#Train data set
PCAResults = predict(diamonds_pca, newdata = diamonds_train) # component scores for the train rows
diamonds_train$PCA1 <- PCAResults[,1]
diamonds_train$PCA2 <- PCAResults[,2]
diamonds_train$PCA3 <- PCAResults[,3]
diamonds_train$PCA4 <- PCAResults[,4]
#Test data set
PCAResultsTest = predict(diamonds_pca, newdata = diamonds_test) # component scores for the test rows
diamonds_test$PCA1 <- PCAResultsTest[,1]
diamonds_test$PCA2 <- PCAResultsTest[,2]
diamonds_test$PCA3 <- PCAResultsTest[,3]
diamonds_test$PCA4 <- PCAResultsTest[,4]
There are different clustering methods that can be used to create a feature; in this assignment, k-means is used. First, the data should be scaled, because a column with much larger values than the others would dominate the distance calculations. To scale consistently, we compute the minimum and maximum of each column on the train set, and then scale both the train and the test set with these same values.
MinValues = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], min)
MaxValues = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], max)
Having obtained each variable's minimum and maximum, we apply min-max scaling to the data sets. Since this amounts to subtracting a vector from each row of a matrix (and then dividing each row by another vector), the sweep() function makes the calculation concise.
diamonds_train_scale <- sweep(sweep(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, MinValues), 2, (MaxValues-MinValues), "/")
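As a minimal illustration of the nested sweep() pattern on toy data (not the diamonds set):

```r
m    <- matrix(1:6, nrow = 2)        # columns: (1,2), (3,4), (5,6)
mins <- apply(m, 2, min)             # column minima: 1 3 5
rng  <- apply(m, 2, max) - mins      # column ranges: 1 1 1
# subtract the minima, then divide by the ranges, column by column
scaled <- sweep(sweep(m, 2, mins), 2, rng, "/")
scaled                               # each column rescaled to [0, 1]
```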
To decide on the number of clusters, we run k-means with 1 to 15 centers in a for loop and then choose a suitable number of centers from the results.
errors = c()
for (i in (1:15)){
set.seed(157)
cluster<-kmeans(diamonds_train_scale,centers=i)
errors = c(errors, 100*round(1 - (cluster$tot.withinss/cluster$totss), digits = 3))
}
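The quantity recorded in the loop, 1 - tot.withinss/totss, is the between-cluster share of the total variance. It can be sanity-checked on a toy k-means fit with two well-separated groups, where it should be close to 1:

```r
set.seed(1)
# two 2-D blobs of 10 points each, centered at 0 and at 5
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
km <- kmeans(toy, centers = 2)
1 - km$tot.withinss / km$totss  # near 1: two clusters explain almost all variance
```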
errors_df <- data.frame(x=c(1:15), y=errors)
ggplot(errors_df, aes(x=x, y=y)) +
geom_point(color = "firebrick2") +
geom_line(color="firebrick3") +
geom_text(aes(label = errors), size=3, color = "black", position = position_stack(vjust = 0.95))+
theme_minimal() +
labs(title = "Center Number vs R-Square",
subtitle = "Diamonds dataset",
x = "Cluster Number",
y = "R-Square")
The gain in explained variance slows down noticeably after 8 centers, so we select the model with 8 clusters.
set.seed(157)
best_cluster = kmeans(diamonds_train_scale,centers=8)
diamonds_train$cluster = as.factor(best_cluster$cluster)
diamonds_train %>%
group_by(cluster)%>%
summarise(cluster_count = n()) %>%
ggplot(., aes(x=cluster_count, y = reorder(cluster, cluster_count))) +
geom_col(fill = "paleturquoise4")+
theme_minimal() +
labs(title = "Number of Observations per Cluster",
subtitle = "Diamonds dataset",
x = "Number of observations",
y = "Cluster")
Now we apply the clustering to the test set by assigning each scaled test observation to its nearest cluster center.
diamonds_test_scale <- sweep(sweep(diamonds_test[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, MinValues), 2, (MaxValues-MinValues), "/")
closest.cluster <- function(x) {
# Euclidean distance from observation x to each of the cluster centers
cluster.dist <- apply(best_cluster$centers, 1, function(y) sqrt(sum((x - y)^2)))
# index of the nearest center
return(which.min(cluster.dist)[1])
}
diamonds_test$cluster <- as.factor(apply(diamonds_test_scale, 1, closest.cluster))
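The nearest-center logic inside closest.cluster() can be seen on a toy example:

```r
centers <- rbind(c(0, 0), c(1, 1))   # two toy cluster centers
x <- c(0.9, 0.8)                     # one scaled observation
# Euclidean distance from x to each center
dists <- apply(centers, 1, function(y) sqrt(sum((x - y)^2)))
which.min(dists)  # 2: the observation is assigned to the center at (1, 1)
```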
We now have all the features needed to build models for this data set. In the next section, using all the outputs and processing from the previous sections, we build and compare predictive models.
Before preparing the prediction models, we can check the correlations between the variables.
There are four assumptions to be fulfilled in a linear model: linearity, independence of the errors, constant error variance (homoscedasticity), and normality of the errors.
Since our data set does not fulfill these assumptions, we construct a generalized linear model, which is more flexible than the standard linear model in terms of assumptions. The response variable, price, is continuous and positive, so the Gamma or Gaussian family may fit. The lowest AIC value is obtained with the Gamma family and the identity link function.
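The family comparison behind that AIC statement is not shown in this report. A sketch of how such a comparison might look, on the built-in mtcars data with a positive response and a log link for stability (a toy example, not the actual model selection performed here):

```r
# toy comparison of two glm families by AIC on a positive response
fit_gamma <- glm(mpg ~ wt + hp, data = mtcars, family = Gamma(link = "log"))
fit_gauss <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian())
AIC(fit_gamma, fit_gauss)  # the lower AIC indicates the better-fitting family
```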
model_glm <- glm(price ~ carat + cut + color + clarity + depth + table + x + y + z + cluster,
                 data = diamonds_train, family = Gamma(link = "identity"),
                 start = rep(0.5, 17))
summary(model_glm)
##
## Call:
## glm(formula = price ~ carat + cut + color + clarity + depth +
## table + x + y + z + cluster, family = Gamma(link = "identity"),
## data = diamonds_train, start = c(0.5, 0.5, 0.5, 0.5, 0.5,
## 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
## 0.5))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0700 -0.1955 -0.0545 0.1173 6.5060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20286.918 694.509 -29.210 < 2e-16 ***
## carat 6346.172 36.947 171.765 < 2e-16 ***
## cut -22.296 3.901 -5.716 1.10e-08 ***
## color 86.507 1.669 51.825 < 2e-16 ***
## clarity 75.038 1.657 45.276 < 2e-16 ***
## depth 317.350 11.169 28.413 < 2e-16 ***
## table -4.607 1.101 -4.184 2.87e-05 ***
## x 4202.185 90.019 46.681 < 2e-16 ***
## y 678.821 88.742 7.649 2.06e-14 ***
## z -8267.158 242.533 -34.087 < 2e-16 ***
## cluster2 30.883 9.742 3.170 0.001525 **
## cluster3 56.930 15.625 3.644 0.000269 ***
## cluster4 723.762 25.527 28.353 < 2e-16 ***
## cluster5 -167.412 8.508 -19.677 < 2e-16 ***
## cluster6 105.534 9.531 11.073 < 2e-16 ***
## cluster7 611.267 34.613 17.660 < 2e-16 ***
## cluster8 -72.963 9.573 -7.622 2.56e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.09973595)
##
## Null deviance: 42124.3 on 42999 degrees of freedom
## Residual deviance: 3154.8 on 42983 degrees of freedom
## AIC: 679803
##
## Number of Fisher Scoring iterations: 25
The model can be improved by rearranging the explanatory variables, for example by adding interaction and quadratic terms and dropping redundant predictors.
model_glm2 <- glm(price ~ carat*color + carat*clarity + I(carat^2) + cluster + y + depth,
                  data = diamonds_train, family = Gamma(link = "identity"),
                  start = rep(0.5, 16))
summary(model_glm2)
##
## Call:
## glm(formula = price ~ carat * color + carat * clarity + I(carat^2) +
## cluster + y + depth, family = Gamma(link = "identity"), data = diamonds_train,
## start = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
## 0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.34343 -0.11391 -0.00724 0.09756 2.20066
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1447.7988 69.2925 -20.894 < 2e-16 ***
## carat -7164.3182 41.3076 -173.438 < 2e-16 ***
## color -66.6927 1.8990 -35.119 < 2e-16 ***
## clarity -166.2208 2.0255 -82.064 < 2e-16 ***
## I(carat^2) 5000.0573 18.6769 267.714 < 2e-16 ***
## cluster2 31.4677 3.3334 9.440 < 2e-16 ***
## cluster3 263.7606 10.0305 26.296 < 2e-16 ***
## cluster4 490.0523 12.0700 40.601 < 2e-16 ***
## cluster5 -82.5500 4.6767 -17.651 < 2e-16 ***
## cluster6 61.7204 3.4631 17.822 < 2e-16 ***
## cluster7 462.1753 18.2967 25.260 < 2e-16 ***
## cluster8 -19.1216 3.5598 -5.372 7.85e-08 ***
## y 609.2037 5.0484 120.672 < 2e-16 ***
## depth 10.0415 0.8831 11.371 < 2e-16 ***
## carat:color 413.9402 4.8268 85.759 < 2e-16 ***
## carat:clarity 735.9830 5.0657 145.288 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.02846104)
##
## Null deviance: 42124.3 on 42999 degrees of freedom
## Residual deviance: 1206.6 on 42984 degrees of freedom
## AIC: 638149
##
## Number of Fisher Scoring iterations: 25
As an example that uses the principal components, the following model can be constructed.
model_glm_pca = glm(price ~ PCA1 + PCA2 + PCA3 + PCA4 + cluster, data = diamonds_train, family = Gamma(link = "sqrt"))
summary(model_glm_pca)
##
## Call:
## glm(formula = price ~ PCA1 + PCA2 + PCA3 + PCA4 + cluster, family = Gamma(link = "sqrt"),
## data = diamonds_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.10222 -0.24752 -0.05761 0.15491 1.83183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.64345 0.12940 468.647 < 2e-16 ***
## PCA1 12.08463 0.03760 321.403 < 2e-16 ***
## PCA2 3.00609 0.03860 77.880 < 2e-16 ***
## PCA3 2.24874 0.04084 55.058 < 2e-16 ***
## PCA4 0.70079 0.05198 13.481 < 2e-16 ***
## cluster2 -0.93044 0.14448 -6.440 1.21e-10 ***
## cluster3 -8.00277 0.23054 -34.714 < 2e-16 ***
## cluster4 -1.02700 0.23782 -4.318 1.58e-05 ***
## cluster5 -4.35581 0.16242 -26.819 < 2e-16 ***
## cluster6 -11.68687 0.12298 -95.034 < 2e-16 ***
## cluster7 -8.23586 0.28306 -29.096 < 2e-16 ***
## cluster8 -5.13711 0.14563 -35.276 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.1102179)
##
## Null deviance: 42124 on 42999 degrees of freedom
## Residual deviance: 4306 on 42988 degrees of freedom
## AIC: 693361
##
## Number of Fisher Scoring iterations: 8
PCA lets us use fewer features in a model, but, as expected, fewer features yield a higher AIC, which means the explanatory power of the model decreases. Since we do not have many features, reducing their number is not necessary here. We could also use all of the principal components in place of the original variables to deal with multicollinearity, but that model is not investigated in this study. For the best glm model, we can calculate the R^2 value on the test set:
model_glm2_prediction<-predict(model_glm2, newdata=diamonds_test)
model_glm2_r2 <- 1 - (sum((model_glm2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_glm2_r2
## [1] 0.8613199
The plot below compares the predicted and actual price values.
pred_vs_real_glm <- as.data.frame(cbind(model_glm2_prediction, diamonds_test$price))
colnames(pred_vs_real_glm) = c("prediction", "actual")
pred_vs_real_glm %>%
ggplot(aes(prediction, actual)) +
geom_point(color="seagreen3") +
theme_minimal()+
geom_abline(intercept = 0, slope = 1, color="red3", size=2, linetype=2) +
labs(title = "Prediction vs Actual Values for the Generalized Linear Model",
subtitle = "Diamonds dataset",
x = "Predictions",
y = "Real Values")
Another supervised method is the Classification and Regression Tree model, i.e., CART. It is a predictive model that explains how an outcome variable's values can be predicted from the other variables through a sequence of binary splits. To create the CART model, we use the rpart() function from the rpart package, again fitting on the diamonds_train data set. After creating the model, several visualization functions are available, such as prp() and fancyRpartPlot(); in this example, fancyRpartPlot() (from the rattle package) is used.
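The fitting call itself is not shown in this section. A sketch of how the model_cart object used below could be produced, assuming the diamonds_train set from the split above; the exact formula and settings are assumptions, since the report states that defaults were used:

```r
library(rpart)
library(rattle)  # for fancyRpartPlot()

# assumed call: price against all features, default rpart settings
model_cart <- rpart(price ~ ., data = diamonds_train)
fancyRpartPlot(model_cart)
```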
Examining the resulting tree, we can inspect the nodes and the splitting rules. According to the results, y and clarity are the main splitting variables; in other words, with the default arguments these are the features that reduce the variance in the data set the most. This tree is obtained with the default values; by changing parameters such as the complexity parameter cp, we could obtain different models and select the best among them. In this analysis, the default values are used.
We have now created two different models, one from the glm() function and the other from the CART tree analysis. We still need to evaluate each model on the test data set: first we score the glm model, then the CART model, and then we compare the errors and select the better model.
#GLM prediction model
#To predict price we can use predict function with the best glm model
model_glm2Prediction <-predict(model_glm2, newdata=diamonds_test)
#PRS: Prediction R Square
GLModel2PRS <- 1 - (sum((model_glm2Prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
GLModel2PRS
## [1] 0.8613199
According to the results, the R-squared value is 0.86, meaning the model explains about 86% of the variance in the test set. We make the same calculation for the CART model.
ModelCARTPrediction <-predict(model_cart, newdata=diamonds_test)
#PRS: Prediction R Square
ModelCARTPRS <- 1 - (sum((ModelCARTPrediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
ModelCARTPRS
## [1] 0.8768922
Now we can compare the generalized linear model and CART. The results show that the CART model's R-squared value (0.877) is higher than the GLM's (0.861), so we conclude that the CART model explains the test data better than the generalized linear model.
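Summarizing the two test-set R-squared values reported above:

```r
# R-squared values taken from the model outputs above
results <- data.frame(model   = c("GLM (Gamma, identity link)", "CART"),
                      test_r2 = c(0.8613199, 0.8768922))
results$model[which.max(results$test_r2)]  # "CART"
```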