Diamonds are among the most expensive gemstones in the world. There is a saying that diamonds are rare, and according to this article, the rarer the stone, the more expensive it is.
According to this site, the price of a diamond is related to its shape and the 4Cs.
In this link, you can find some shapes of a diamond; at the end of the shape descriptions, you can find the price table. For any diamond, a better cut is more expensive than a worse cut if all other properties are equal.
In this assignment, the objectives can be listed as follows:
To fulfill these objectives, (i) the data is analyzed and prepared for the analysis, and (ii) a meaningful EDA is presented using packages such as ggplot2, dplyr, and data.table.
# install any required packages that are not already installed
pti <- c("data.table", "tidyverse", "kableExtra", "knitr", "RColorBrewer", "caret", "e1071", "rpart", "rpart.plot", "rattle", "corrplot")
pti <- pti[!(pti %in% installed.packages())]
if(length(pti) > 0){
install.packages(pti)
}
library(tidyverse)
library(data.table)
library(knitr)
library(kableExtra)
library(RColorBrewer)
library(caret)
library(e1071)
library(rpart)
library(rpart.plot)
library(rattle)
library(corrplot)
The dataset in this analysis comes from the ggplot2 package. Before loading it, we can run ?diamonds to obtain information about the dataset. It contains 53940 rows and 10 variables. As mentioned before, there is no shape column in this dataset. Using the glimpse() function, we can list these variables.
#variables of the dataset
diamonds %>%
glimpse()
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
In the previous section, detailed information about the investigated dataset was presented. In this section, the data is analyzed and pre-processed before the detailed analyses. The first step is to check for missing values.
#checking for NA values in the dataset
sum(is.na(diamonds))
## [1] 0
When we check the dataset, there are 0 NA values, meaning there is no missing value in any row or column. The second step is to check for duplicated rows.
sum(duplicated(diamonds))
## [1] 146
The function above returns the number of exactly identical rows in the dataset: there are 146 duplicated rows. Before the analysis, we need to remove them from the data frame, which we can do with the unique() function.
#keeping only the unique rows from the dataset
diamonds <- unique(diamonds)
#re-checking for duplicated rows after the removal
sum(duplicated(diamonds))
## [1] 0
After using the unique() function there are no duplicated rows, and we are left with 53794 rows and 10 columns. The third step is to check the types of the variables (numeric, factor, etc.) with the str() function.
str(diamonds)
## tibble [53,794 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53794] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53794] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53794] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53794] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53794] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53794] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53794] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Looking at the results, we see that the color column is ordered from best (D) to worst (J). Since we want the levels to increase with quality, we reverse the order, keeping the factor ordered:
diamonds$color = factor(diamonds$color, levels = c("J", "I", "H", "G", "F", "E", "D"), ordered = TRUE)
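As a quick illustrative check (on a toy vector, not the full dataset), an ordered factor with these levels lets us compare colors directly, with the worst color J ranking lowest:

```r
# Toy check: with levels J < I < ... < D, J should compare as less than D.
col <- factor(c("D", "J", "G"),
              levels = c("J", "I", "H", "G", "F", "E", "D"),
              ordered = TRUE)
col[2] < col[1]  # TRUE: J ranks below D
```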
The data is now ready for further checks, such as the accuracy of the values. We can use the summary() and head() functions to get more information about the dataset.
summary(diamonds) # summary of each variable in the dataset
## carat cut color clarity depth
## Min. :0.2000 Fair : 1598 J: 2802 SI1 :13032 Min. :43.00
## 1st Qu.:0.4000 Good : 4891 I: 5407 VS2 :12229 1st Qu.:61.00
## Median :0.7000 Very Good:12069 H: 8272 SI2 : 9150 Median :61.80
## Mean :0.7978 Premium :13748 G:11262 VS1 : 8156 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21488 F: 9520 VVS2 : 5056 3rd Qu.:62.50
## Max. :5.0100 E: 9776 VVS1 : 3647 Max. :79.00
## D: 6755 (Other): 2524
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 951 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5327 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.030
## Max. :31.800
##
head(diamonds) #first 6 rows of the data
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
We also check for non-positive price values in the dataset.
diamonds %>%
filter(price <= 0) %>%
summarise(NegativePrice = n())
## # A tibble: 1 x 1
## NegativePrice
## <int>
## 1 0
Logically, a price cannot be less than or equal to zero; for this reason, we check the dataset for negative or zero price values. According to the results all price values are positive, so there is nothing to do for the price column.
Now we can search for non-positive x, y and z values.
diamonds %>%
filter(x <= 0 & y > 0 & z > 0)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## # depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
diamonds %>%
filter(y <= 0 & x > 0 & z > 0)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## # depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
diamonds %>%
filter(z <= 0 & x > 0 & y > 0)
## # A tibble: 12 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Premium G SI2 59.1 59 3142 6.55 6.48 0
## 2 1.01 Premium H I1 58.1 59 3167 6.66 6.6 0
## 3 1.1 Premium G SI2 63 59 3696 6.5 6.47 0
## 4 1.01 Premium F SI2 59.2 58 3837 6.5 6.47 0
## 5 1.5 Good G I1 64 61 4731 7.15 7.04 0
## 6 1.15 Ideal G VS2 59.2 56 5564 6.88 6.83 0
## 7 2.18 Premium H SI2 59.4 61 12631 8.49 8.45 0
## 8 2.25 Premium I SI1 61.3 58 15397 8.52 8.42 0
## 9 2.2 Premium H SI1 61.2 59 17265 8.42 8.37 0
## 10 2.02 Premium H VS2 62.7 53 18207 8.02 7.95 0
## 11 2.8 Good G SI2 63.8 58 18788 8.9 8.85 0
## 12 1.12 Premium G I1 60.4 59 2383 6.71 6.67 0
All x and y values are positive, so for now we don't need to do anything there. For z, however, there are 12 zero values in the column. We can fill them with the help of the x, y and depth columns, because depth = 2 * z / (x + y) * 100, i.e. z = depth * (x + y) / 2 / 100.
diamonds = diamonds %>%
mutate(z = ifelse(z == 0 & x != 0 & y != 0, round(depth * (x + y) / 2 / 100, 2), z))
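As a small sanity check of the relationship used above, we can plug in the values of the first row of the dataset (x = 3.95, y = 3.98, z = 2.43, depth = 61.5); this is only an illustration:

```r
# The recorded depth should be close to 2 * z / (x + y) * 100.
x <- 3.95; y <- 3.98; z <- 2.43; depth <- 61.5
implied_depth <- 2 * z / (x + y) * 100
round(implied_depth, 1)  # 61.3, close to the recorded 61.5
stopifnot(abs(implied_depth - depth) < 1)
```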
So far, we have searched for rows where exactly one of x, y and z is non-positive while the other two are positive; any such row can be recalculated as above. Now, we need to check for rows where two of these three columns are non-positive.
diamonds %>%
filter(x <= 0 & y <= 0)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Very Good H VS2 63.3 53 5139 0 0 0
## 2 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 3 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 4 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 5 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 6 0.71 Good F SI2 64.1 60 2130 0 0 0
diamonds %>%
filter(x <= 0 & z <= 0)
## # A tibble: 7 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.07 Ideal F SI2 61.6 56 4954 0 6.62 0
## 2 1 Very Good H VS2 63.3 53 5139 0 0 0
## 3 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 4 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 5 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 6 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 7 0.71 Good F SI2 64.1 60 2130 0 0 0
diamonds %>%
filter(y <= 0 & z <= 0)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Very Good H VS2 63.3 53 5139 0 0 0
## 2 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 3 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 4 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 5 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 6 0.71 Good F SI2 64.1 60 2130 0 0 0
Looking at the results, there are 7 rows in which both x and z are 0. Removing these rows gives us more accurate data, so we apply the following code chunk.
diamonds = diamonds %>%
filter(!(x == 0 & z == 0))
diamonds %>%
filter(x == 0 | y == 0 | z == 0)
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
## # depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
After this process, we have all positive values in x, y and z columns. Now, we need to check the range of x, y and z columns for outliers.
range(diamonds$x)
## [1] 3.73 10.74
diamonds$x %>% unique() %>% sort(decreasing = TRUE) %>% head(20)
## [1] 10.74 10.23 10.14 10.02 10.01 10.00 9.86 9.66 9.65 9.54 9.53 9.51
## [13] 9.49 9.44 9.42 9.41 9.38 9.36 9.35 9.32
There is nothing wrong with the lengths (x values) of these diamonds.
range(diamonds$y)
## [1] 3.68 58.90
diamonds$y %>% unique() %>% sort(decreasing = TRUE) %>% head(20)
## [1] 58.90 31.80 10.54 10.16 10.10 9.94 9.85 9.81 9.63 9.59 9.48 9.46
## [13] 9.42 9.40 9.38 9.37 9.34 9.32 9.31 9.26
There are 2 values (58.90 and 31.80) that are very large compared to the rest. We need to either remove these rows or recalculate the y values. To avoid shrinking the data, we will try to recalculate them with the help of the depth, x and z values.
diamonds %>%
filter(y > 15) %>%
mutate(new_y = 2 * z / depth * 100 - x) %>%
select(depth, x, z, y, new_y)
## # A tibble: 2 x 5
##   depth     x     z     y new_y
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  58.9  8.09  8.06  58.9  19.3
## 2  61.8  5.15  5.12  31.8  11.4
These results are not plausible values for the y column either, so we remove these rows from the data.
diamonds = diamonds %>%
filter(y < 15)
range(diamonds$z)
## [1] 1.07 31.80
diamonds$z %>% unique() %>% sort(decreasing = TRUE) %>% head(20)
## [1] 31.80 6.98 6.72 6.43 6.38 6.31 6.27 6.24 6.17 6.16 6.13 6.03
## [13] 5.98 5.97 5.92 5.91 5.90 5.86 5.85 5.79
For the z column, there is one row with a very high value compared to the rest. As before, we try to recalculate it.
diamonds %>%
filter(z == 31.8) %>%
mutate(new_z = depth * mean(c(5.12, 5.15)) / 100) %>%
select(z, new_z)
## # A tibble: 1 x 2
## z new_z
## <dbl> <dbl>
## 1 31.8 3.17
So, we can say that this value should be divided by 10. To confirm, we can compare it with the z values of other diamonds that have similar x and y values.
diamonds[abs(diamonds$x - 5) < 1 & abs(diamonds$y - 5) < 1 & abs(diamonds$z - 3.18) < 1, "z"]
## # A tibble: 30,800 x 1
## z
## <dbl>
## 1 2.31
## 2 2.63
## 3 2.75
## 4 2.53
## 5 2.73
## 6 2.71
## 7 2.68
## 8 2.68
## 9 2.7
## 10 2.71
## # ... with 30,790 more rows
There are many z values near 3.18, so we can conclude that this value should be divided by 10, and we update it accordingly.
diamonds$z[diamonds$z == 31.8] = diamonds$z[diamonds$z == 31.8] / 10
Now, we check the depth column. To do so, we calculate a new column called new_depth and compare it with the depth column.
diamonds$new_depth = 2 * diamonds$z / (diamonds$x + diamonds$y) * 100
The depth and new_depth columns should be approximately equal. To see this easily, we can use a scatter plot with a reference line.
diamonds %>%
ggplot(., aes(x = new_depth, y = depth)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color="red", size=1.5)
From the plot, we can say that most depth values are almost equal to new_depth. However, some rows have depth and new_depth values that are very far from each other. We assume their difference should be less than 11 (an assumption based on the range of depths in this link, which says that "(For a princess cut diamond) A very good cut can be between 56 to 67 percent or 75 to 76 percent.", so the maximum depth range among all shapes can be taken as 11).
diamonds %>%
filter(!(abs(new_depth - depth) < 11)) %>%
select(new_depth, depth, x, y, z)
## # A tibble: 20 x 5
## new_depth depth x y z
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 51.2 62.8 6.26 6.19 3.19
## 2 63.1 43 6.32 6.27 3.97
## 3 65.7 44 6.31 6.24 4.12
## 4 51.7 64 7.15 7.04 3.67
## 5 63.0 43 6.53 6.55 4.12
## 6 16.1 60.6 6.62 6.67 1.07
## 7 20.5 61.9 7.43 7.5 1.53
## 8 19.4 60.7 7.31 7.22 1.41
## 9 40.3 59.4 8.49 8.45 3.41
## 10 78.3 59 6.16 6.15 4.82
## 11 41.4 61.3 8.52 8.42 3.51
## 12 78.3 65.6 7.89 7.84 6.16
## 13 79.1 60 6.29 6.25 4.96
## 14 41.8 61.2 8.42 8.37 3.51
## 15 45.0 62.7 8.02 7.95 3.59
## 16 41.2 63.8 8.9 8.85 3.66
## 17 84.3 61.2 4.51 6.02 4.44
## 18 43.9 60.9 4.71 4.68 2.06
## 19 42.3 61 5.31 5.34 2.25
## 20 101. 63.7 5.01 5.04 5.06
There are 20 such observations. Compared with the total number of observations in the dataset, this is a very small number, so we can remove them.
diamonds = diamonds %>%
filter((abs(new_depth - depth) < 11))
diamonds = subset(diamonds, select = -c(new_depth))
Now, we can compare the x, y and z values with each other; these dimensions should be highly correlated.
diamonds %>%
ggplot(., aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm")
diamonds %>%
ggplot(., aes(x = z, y = y)) +
geom_point() +
geom_smooth(method = "lm")
As expected, these values are highly correlated, so we can assume the x, y and z values are valid.
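The agreement can also be expressed as a number; as a toy illustration (using only the first five x and y values of the dataset), the Pearson correlation is close to 1:

```r
# Pearson correlation of the first five x and y values from the dataset.
x <- c(3.95, 3.89, 4.05, 4.20, 4.34)
y <- c(3.98, 3.84, 4.07, 4.23, 4.35)
cor(x, y)  # about 0.99
```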
After pre-processing the data, we perform EDA on the diamonds dataset using different variables.
As mentioned before, cut is one of the most important features of a diamond. In this dataset, cut quality is given as a categorical variable: Fair is the lowest quality, whereas Ideal is the highest. To examine the dataset by this feature, we can look at the percentages of the cut types.
diamonds %>%
group_by(cut) %>%
summarise(count = n()) %>%
mutate(percentage = 100*count/sum(count)) %>%
ggplot(., aes(x = '', y = count, fill = cut)) +
geom_bar(width = 1, stat = "identity", alpha = 0.8) +
coord_polar("y") +
theme_void() +
theme(plot.title = element_text(vjust = 0.5)) +
geom_text(aes(label = paste(format(percentage,digits=2), "%")), size=4, color = "black", position = position_stack(vjust = 0.3)) +
labs(title = "Percentage of Quality of Cut ",
subtitle = st,
fill = "Quality of the Cut")
The pie chart illustrates that most of the diamonds have an Ideal cut. This is expected, because an Ideal cut diamond is more expensive than the other cut types (when every other property is equal). Moreover, the percentages of Premium and Very Good are almost equal, while Fair cuts are rare. We can see the count, minimum, maximum and average prices in the table below.
diamonds %>%
group_by(cut) %>%
summarise(cut_count = n(),
MinPrice = min(price),
AveragePrice = mean(price),
MaxPrice = max(price)) %>%
kable(col.names = c("Cut", "Number of Cut Quality", "Minimum Price", "Average Price", "Maximum Price")) %>%
kable_minimal(full_width = F)
Cut | Number of Cut Quality | Minimum Price | Average Price | Maximum Price |
---|---|---|---|---|
Fair | 1594 | 337 | 4334.275 | 18574 |
Good | 4888 | 327 | 3916.279 | 18707 |
Very Good | 12065 | 336 | 3981.305 | 18818 |
Premium | 13738 | 326 | 4576.581 | 18823 |
Ideal | 21480 | 326 | 3461.698 | 18806 |
While the pie chart presents the percentage of each cut type, the table above shows the counts by cut type, together with the minimum, average and maximum prices. Although Ideal is the most common cut, it is not the most expensive: by average price, Premium diamonds are the most expensive when we look only at the cut variable. This means we cannot explain the response variable (price) with the cut type alone.
Color is another important feature of diamonds, and it also affects the price. First, we group the data by color and count the number of diamonds in each group.
diamonds %>%
group_by(color)%>%
summarise(count = n()) %>%
ggplot(., aes(x=color, y = count, fill = count)) +
geom_col() +
scale_fill_gradient("count", low="yellow", high="red") +
geom_line(aes(y = count), size = 1.2, color="black", group = 1) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Classification Of Diamonds According to the Colors",
subtitle = st,
x = "Diamond Color",
y = "Number of Diamonds")
The bar chart illustrates that G is the most common color, whereas J is the least common; the distribution looks roughly normal. The exact count and price values for each color can be obtained from the table below.
diamonds %>%
group_by(color)%>%
summarise(color_count = n(),
MinPrice = min(price),
AveragePrice = mean(price),
MaxPrice = max(price)) %>%
kable(col.names = c("Color", "Count","Minimum Price", "Average Price", "Maximum Price")) %>%
kable_minimal(full_width = F)
Color | Count | Minimum Price | Average Price | Maximum Price |
---|---|---|---|---|
J | 2800 | 335 | 5327.337 | 18710 |
I | 5405 | 334 | 5079.123 | 18823 |
H | 8263 | 337 | 4474.449 | 18803 |
G | 11253 | 354 | 3997.174 | 18818 |
F | 9517 | 342 | 3726.559 | 18791 |
E | 9773 | 326 | 3080.044 | 18731 |
D | 6754 | 357 | 3172.589 | 18693 |
According to the diamond information given above, the D-F color interval is the highest purity, followed by the near-colorless G-J scale. The results show that G is the most common color, while the most desirable (actually colorless) grades are D-F. Moreover, the table gives the minimum, maximum and average prices, so we can compare diamond prices across colors. By average price, the most expensive diamonds are color J when we look only at the color variable. Again, this means we cannot explain the response variable (price) with the color alone.
Clarity describes whether a diamond contains inclusions or blemishes. In this report, a lollipop-style scatter plot is used to show the clarity classification of the dataset.
diamonds %>%
group_by(clarity) %>%
summarise(clarity_count = n()) %>%
ggplot(.,aes(x=clarity, y = clarity_count, color= clarity_count)) +
geom_point(size=9) +
geom_segment(aes(x=clarity,
xend=clarity,
y=0,
yend=clarity_count))+
theme_minimal() +
theme(axis.text.x = element_text(angle = 90), legend.position = "none")+
labs(title = "Classification Of Diamonds According to the Clarity",
subtitle = st,
x = "Clarity",
y = "Number of Diamonds by Clarity")
According to the results, SI1 is the most common clarity grade, while I1 is the rarest. To get a clearer analysis, the following table shows the number of diamonds by clarity; the minimum, maximum, and average prices can also be obtained from it.
diamonds %>%
mutate(clarity = factor(clarity)) %>%
group_by(clarity) %>%
summarise(clarity_count = n(),
MinPrice = min(price),
AveragePrice = mean(price),
MaxPrice = max(price)) %>%
arrange(desc(clarity_count)) %>%
kable(col.names = c("Clarity", "Count","Minimum Price", "Average Price", "Maximum Price")) %>%
kable_minimal(full_width = F)
Clarity | Count | Minimum Price | Average Price | Maximum Price |
---|---|---|---|---|
SI1 | 13026 | 326 | 3992.932 | 18818 |
VS2 | 12223 | 334 | 3925.842 | 18823 |
SI2 | 9140 | 326 | 5051.592 | 18804 |
VS1 | 8153 | 327 | 3840.792 | 18795 |
VVS2 | 5055 | 336 | 3286.971 | 18768 |
VVS1 | 3645 | 336 | 2523.088 | 18777 |
IF | 1784 | 369 | 2870.570 | 18806 |
I1 | 739 | 345 | 3924.185 | 18531 |
By average price, the most expensive diamonds are clarity SI2 when we look only at the clarity variable. So, once more, we cannot explain the response variable (price) with the clarity alone.
Carat has two meanings: (i) the purity of gold (karat) and (ii) the weight of a gemstone. In this dataset, carat denotes the weight of the diamond. To find the most common carat value, we compute the mode of the carat variable.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(diamonds$carat)
## [1] 0.3
According to the dataset, the most common carat is 0.3.
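To make the behaviour of getmode() concrete, here is a toy example (illustrative values, not the diamonds data); it returns the first most frequent value of a vector:

```r
# getmode() returns the most frequent value in a vector.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(c(0.3, 0.5, 0.3, 1.0, 0.3, 0.5))  # returns 0.3
```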
diamonds %>%
mutate(carat = factor(carat)) %>%
group_by(carat) %>%
summarise(carat_count = n())%>%
arrange(desc(carat_count)) %>%
head(10) %>%
ggplot(., aes(y=carat_count, x = reorder(carat, -carat_count), fill = carat)) +
geom_col() +
geom_text(aes(label = carat_count), size=3, color = "black", position = position_stack(vjust = 0.95)) +
theme_minimal() +
scale_fill_brewer(palette = c("Paired")) +
theme(legend.position = "none") +
labs(title = "The 10 Most Popular Carat Values",
subtitle = st,
x = "Carat",
y = "Count")
The bar chart above shows the distribution of the ten most common carat values. To see the count for every carat value, the following table can be analyzed. In addition to the number of diamonds per carat, the price interval and average price can also be investigated.
diamonds %>%
group_by(carat) %>%
summarise(carat_count = n(),
MinPrice = min(price),
AveragePrice = mean(price),
MaxPrice = max(price))%>%
kable(col.names = c("Carat", "Number of Carats","Minimum Price", "Average Price", "Maximum Price")) %>%
kable_styling("striped", full_width = T) %>%
scroll_box(width = "100%", height = "300px")
Carat | Number of Carats | Minimum Price | Average Price | Maximum Price |
---|---|---|---|---|
0.20 | 12 | 345 | 365.1667 | 367 |
0.21 | 9 | 326 | 380.2222 | 394 |
0.22 | 5 | 337 | 391.4000 | 470 |
0.23 | 293 | 326 | 486.1433 | 688 |
0.24 | 254 | 336 | 505.1850 | 963 |
0.25 | 212 | 357 | 550.9245 | 1186 |
0.26 | 253 | 337 | 550.8972 | 814 |
0.27 | 233 | 361 | 574.7597 | 893 |
0.28 | 198 | 360 | 580.1212 | 828 |
0.29 | 130 | 334 | 601.1923 | 1776 |
0.30 | 2596 | 339 | 680.6367 | 2366 |
0.31 | 2238 | 335 | 708.1997 | 1917 |
0.32 | 1827 | 345 | 719.1916 | 1715 |
0.33 | 1179 | 366 | 765.1747 | 1389 |
0.34 | 909 | 413 | 755.8086 | 2346 |
0.35 | 663 | 409 | 799.5098 | 1672 |
0.36 | 570 | 410 | 781.4667 | 1718 |
0.37 | 394 | 451 | 816.0711 | 1764 |
0.38 | 669 | 497 | 904.1973 | 1992 |
0.39 | 397 | 451 | 923.4761 | 1624 |
0.40 | 1291 | 484 | 933.0077 | 1908 |
0.41 | 1378 | 467 | 973.0530 | 1745 |
0.42 | 704 | 552 | 994.4545 | 1722 |
0.43 | 486 | 452 | 995.2737 | 1827 |
0.44 | 212 | 648 | 1056.6226 | 2506 |
0.45 | 110 | 672 | 1093.3182 | 1833 |
0.46 | 177 | 746 | 1240.2429 | 2139 |
0.47 | 99 | 719 | 1272.3131 | 2460 |
0.48 | 63 | 841 | 1322.0476 | 2677 |
0.49 | 45 | 900 | 1306.3333 | 2199 |
0.50 | 1254 | 584 | 1503.8453 | 3378 |
0.51 | 1123 | 826 | 1623.5841 | 4368 |
0.52 | 813 | 870 | 1656.4785 | 3373 |
0.53 | 707 | 844 | 1685.5248 | 3827 |
0.54 | 624 | 935 | 1704.3317 | 4042 |
0.55 | 495 | 959 | 1725.2283 | 3509 |
0.56 | 492 | 953 | 1831.9980 | 4632 |
0.57 | 429 | 929 | 1798.0443 | 4032 |
0.58 | 310 | 893 | 1798.6484 | 3741 |
0.59 | 282 | 903 | 1831.4504 | 4916 |
0.60 | 226 | 806 | 1956.2434 | 4209 |
0.61 | 204 | 931 | 2069.2451 | 4279 |
0.62 | 135 | 933 | 1973.3704 | 3968 |
0.63 | 102 | 1065 | 2244.7843 | 6607 |
0.64 | 80 | 1080 | 2113.5875 | 4281 |
0.65 | 65 | 1214 | 2322.6154 | 5052 |
0.66 | 48 | 1188 | 2235.7292 | 4311 |
0.67 | 48 | 1244 | 2067.8542 | 3271 |
0.68 | 25 | 1341 | 2334.3200 | 4325 |
0.69 | 26 | 1636 | 2206.6154 | 3784 |
0.70 | 1981 | 945 | 2516.0717 | 5539 |
0.71 | 1291 | 1305 | 2674.3106 | 5019 |
0.72 | 764 | 945 | 2705.7788 | 5086 |
0.73 | 492 | 1049 | 2791.2480 | 5154 |
0.74 | 322 | 1632 | 2821.9006 | 5841 |
0.75 | 249 | 1013 | 2763.4016 | 5288 |
0.76 | 251 | 1140 | 2889.5936 | 5056 |
0.77 | 251 | 1651 | 2884.3984 | 5251 |
0.78 | 187 | 1720 | 2918.9091 | 4904 |
0.79 | 151 | 1809 | 2993.2649 | 4858 |
0.80 | 284 | 1232 | 3146.1901 | 6834 |
0.81 | 200 | 1687 | 3094.4250 | 6338 |
0.82 | 140 | 2033 | 3083.6500 | 4931 |
0.83 | 131 | 1863 | 3181.9924 | 7889 |
0.84 | 64 | 1749 | 3294.9844 | 6399 |
0.85 | 62 | 1250 | 3204.1774 | 7230 |
0.86 | 34 | 1828 | 3289.9706 | 5029 |
0.87 | 31 | 1827 | 3342.3226 | 5640 |
0.88 | 23 | 1964 | 3431.3478 | 5595 |
0.89 | 21 | 1334 | 3291.2381 | 4854 |
0.90 | 1484 | 1599 | 3939.5889 | 9182 |
0.91 | 570 | 1570 | 3981.1561 | 9636 |
0.92 | 226 | 2283 | 4035.2124 | 7544 |
0.93 | 142 | 2374 | 3994.2958 | 5992 |
0.94 | 59 | 2571 | 4195.7797 | 7120 |
0.95 | 65 | 2195 | 4013.6615 | 6951 |
0.96 | 103 | 1637 | 3623.4660 | 6807 |
0.97 | 59 | 1547 | 4026.2373 | 7415 |
0.98 | 31 | 1712 | 3469.7742 | 4950 |
0.99 | 23 | 1789 | 4406.1739 | 6094 |
1.00 | 1544 | 1681 | 5250.7720 | 16469 |
1.01 | 2240 | 2017 | 5507.1902 | 16234 |
1.02 | 883 | 2344 | 5598.4632 | 17100 |
1.03 | 522 | 1262 | 5583.8966 | 17590 |
1.04 | 474 | 2037 | 5774.6709 | 18542 |
1.05 | 361 | 2066 | 5781.6814 | 13060 |
1.06 | 372 | 2080 | 5947.0081 | 15813 |
1.07 | 340 | 2260 | 6004.2118 | 18279 |
1.08 | 246 | 2869 | 5774.3415 | 14032 |
1.09 | 285 | 2708 | 6020.0947 | 18231 |
1.10 | 276 | 3216 | 5913.0725 | 12748 |
1.11 | 308 | 3183 | 6005.5682 | 15464 |
1.12 | 251 | 2383 | 6044.4900 | 14407 |
1.13 | 246 | 2968 | 6193.4024 | 14525 |
1.14 | 206 | 2327 | 6354.4078 | 18112 |
1.15 | 149 | 3220 | 5959.9060 | 12895 |
1.16 | 172 | 2368 | 6274.2965 | 13595 |
1.17 | 110 | 2336 | 6224.8818 | 11886 |
1.18 | 123 | 3219 | 6481.1057 | 14847 |
1.19 | 126 | 2892 | 6483.2381 | 15767 |
1.20 | 643 | 2360 | 6670.7387 | 16256 |
1.21 | 470 | 2396 | 6905.4043 | 17353 |
1.22 | 299 | 2699 | 6971.5151 | 14414 |
1.23 | 277 | 3168 | 7290.8881 | 16253 |
1.24 | 235 | 2940 | 7147.1149 | 15026 |
1.25 | 187 | 3276 | 7123.2513 | 16823 |
1.26 | 146 | 3881 | 7490.7466 | 15247 |
1.27 | 134 | 2845 | 7119.1194 | 15377 |
1.28 | 106 | 3355 | 7711.8113 | 18700 |
1.29 | 101 | 2596 | 7033.0198 | 17932 |
1.30 | 121 | 2512 | 7851.6612 | 15874 |
1.31 | 132 | 3697 | 7750.6364 | 17496 |
1.32 | 89 | 4140 | 7639.3708 | 15802 |
1.33 | 87 | 4634 | 7527.0805 | 18435 |
1.34 | 68 | 4352 | 7758.5000 | 17663 |
1.35 | 77 | 4082 | 7448.2857 | 14616 |
1.36 | 50 | 4158 | 8591.5200 | 15386 |
1.37 | 46 | 5063 | 8044.0870 | 13439 |
1.38 | 26 | 4598 | 8164.3077 | 17598 |
1.39 | 36 | 3914 | 8335.1111 | 13061 |
1.40 | 50 | 4939 | 9108.5200 | 16808 |
1.41 | 39 | 4145 | 9452.2051 | 17216 |
1.42 | 25 | 6069 | 9487.2000 | 18682 |
1.43 | 19 | 6086 | 9234.9474 | 14429 |
1.44 | 18 | 4064 | 8021.6111 | 12093 |
1.45 | 15 | 4320 | 8709.3333 | 17545 |
1.46 | 18 | 5421 | 8237.4444 | 12261 |
1.47 | 21 | 4898 | 7833.7143 | 12547 |
1.48 | 7 | 6216 | 8791.2857 | 15164 |
1.49 | 11 | 5407 | 9174.2727 | 18614 |
1.50 | 787 | 2964 | 10082.9835 | 18691 |
1.51 | 804 | 3497 | 10552.0211 | 18806 |
1.52 | 380 | 3105 | 10629.9868 | 17723 |
1.53 | 219 | 5631 | 10703.4658 | 17223 |
1.54 | 174 | 4492 | 10865.0287 | 18416 |
1.55 | 123 | 4965 | 10717.7642 | 18188 |
1.56 | 108 | 5766 | 10205.9444 | 17455 |
1.57 | 106 | 5170 | 10422.7075 | 17548 |
1.58 | 89 | 4459 | 11427.9888 | 18057 |
1.59 | 89 | 6010 | 10574.3596 | 17552 |
1.60 | 95 | 6272 | 11124.2526 | 18780 |
1.61 | 64 | 6915 | 10992.8438 | 18318 |
1.62 | 61 | 5157 | 11705.8852 | 18281 |
1.63 | 50 | 7100 | 11556.3200 | 18179 |
1.64 | 43 | 4849 | 10738.0000 | 17604 |
1.65 | 32 | 5914 | 10973.8750 | 17729 |
1.66 | 30 | 5310 | 11084.5000 | 16294 |
1.67 | 25 | 7911 | 12281.4800 | 17351 |
1.68 | 19 | 5765 | 11779.1053 | 16985 |
1.69 | 24 | 5182 | 11878.0833 | 17803 |
1.70 | 215 | 5617 | 12110.3209 | 18718 |
1.71 | 119 | 6320 | 12396.5546 | 18791 |
1.72 | 57 | 5240 | 12040.3158 | 18730 |
1.73 | 52 | 6007 | 11875.9615 | 18377 |
1.74 | 40 | 4677 | 12270.3250 | 18430 |
1.75 | 50 | 7276 | 12673.7000 | 17343 |
1.76 | 28 | 9053 | 12398.5714 | 18178 |
1.77 | 17 | 7771 | 11103.1765 | 15278 |
1.78 | 12 | 8889 | 12504.5833 | 18128 |
1.79 | 15 | 9122 | 13144.0667 | 18429 |
1.80 | 20 | 7862 | 12955.1500 | 17273 |
1.81 | 9 | 5859 | 11374.4444 | 14177 |
1.82 | 13 | 5993 | 12986.1538 | 15802 |
1.83 | 18 | 5083 | 11930.0000 | 18525 |
1.84 | 4 | 7922 | 11108.5000 | 13905 |
1.85 | 3 | 5688 | 11436.0000 | 14375 |
1.86 | 9 | 9791 | 12164.3333 | 17267 |
1.87 | 7 | 9955 | 12933.5714 | 17761 |
1.88 | 4 | 8048 | 10946.2500 | 13607 |
1.89 | 4 | 10055 | 13969.0000 | 17553 |
1.90 | 7 | 8576 | 11477.1429 | 13919 |
1.91 | 12 | 6632 | 12365.8333 | 17509 |
1.92 | 2 | 15364 | 15418.0000 | 15472 |
1.93 | 6 | 13278 | 16828.1667 | 18306 |
1.94 | 3 | 9344 | 14189.3333 | 18735 |
1.95 | 3 | 5045 | 12340.6667 | 17374 |
1.96 | 4 | 5554 | 8642.2500 | 13099 |
1.97 | 4 | 14410 | 15966.0000 | 17535 |
1.98 | 5 | 12923 | 14667.8000 | 16171 |
1.99 | 3 | 11486 | 14179.0000 | 17713 |
2.00 | 261 | 5051 | 14107.1456 | 18818 |
2.01 | 433 | 5696 | 14772.7136 | 18741 |
2.02 | 176 | 6346 | 14767.4091 | 18731 |
2.03 | 122 | 6002 | 14710.1148 | 18781 |
2.04 | 86 | 7403 | 14840.1744 | 18795 |
2.05 | 65 | 7026 | 15647.2462 | 18787 |
2.06 | 59 | 5203 | 14348.6780 | 18779 |
2.07 | 50 | 6503 | 15384.2800 | 18804 |
2.08 | 41 | 6150 | 15128.6829 | 18760 |
2.09 | 45 | 10042 | 15287.3111 | 18640 |
2.10 | 51 | 6597 | 14464.3725 | 18648 |
2.11 | 43 | 7019 | 14563.4186 | 18575 |
2.12 | 25 | 7508 | 15007.6800 | 18656 |
2.13 | 21 | 12356 | 15220.4286 | 18442 |
2.14 | 48 | 5405 | 14905.0417 | 18528 |
2.15 | 22 | 5430 | 14187.1818 | 18791 |
2.16 | 25 | 8709 | 15860.4800 | 18678 |
2.17 | 18 | 6817 | 14416.6111 | 17805 |
2.18 | 30 | 10340 | 14901.7000 | 18717 |
2.19 | 22 | 10179 | 15521.0909 | 18693 |
2.20 | 31 | 7934 | 15226.8710 | 18364 |
2.21 | 23 | 6535 | 15287.9565 | 18483 |
2.22 | 27 | 5607 | 14691.5185 | 18706 |
2.23 | 13 | 7006 | 13947.4615 | 18149 |
2.24 | 16 | 10913 | 15573.9375 | 18299 |
2.25 | 16 | 6653 | 13403.5625 | 18034 |
2.26 | 15 | 11226 | 15330.2000 | 18426 |
2.27 | 12 | 5733 | 14198.4167 | 18343 |
2.28 | 20 | 9068 | 15422.9500 | 18055 |
2.29 | 17 | 11502 | 15834.7647 | 18823 |
2.30 | 20 | 7226 | 14861.5500 | 18304 |
2.31 | 13 | 7257 | 15017.3846 | 17715 |
2.32 | 16 | 6860 | 16042.1875 | 18532 |
2.33 | 9 | 8220 | 14089.7778 | 18119 |
2.34 | 5 | 8491 | 11922.0000 | 15818 |
2.35 | 7 | 14185 | 16090.4286 | 17999 |
2.36 | 8 | 11238 | 15756.7500 | 18745 |
2.37 | 6 | 14837 | 16525.1667 | 18508 |
2.38 | 8 | 10716 | 15481.3750 | 18559 |
2.39 | 7 | 15917 | 17182.4286 | 17955 |
2.40 | 13 | 11988 | 16362.4615 | 18541 |
2.41 | 5 | 13563 | 15808.2000 | 17923 |
2.42 | 7 | 16198 | 16923.5714 | 18615 |
2.43 | 6 | 9716 | 15356.0000 | 18692 |
2.44 | 4 | 13027 | 16365.5000 | 18430 |
2.45 | 4 | 11830 | 14593.7500 | 18113 |
2.46 | 3 | 10470 | 14227.0000 | 16466 |
2.47 | 3 | 14970 | 15644.0000 | 16532 |
2.48 | 9 | 12883 | 16537.1111 | 18692 |
2.49 | 3 | 6289 | 13843.0000 | 18325 |
2.50 | 16 | 7854 | 14316.1875 | 18447 |
2.51 | 17 | 14209 | 16616.8824 | 18308 |
2.52 | 9 | 10076 | 16242.1111 | 18252 |
2.53 | 8 | 14659 | 16548.8750 | 18254 |
2.54 | 8 | 12095 | 15125.2500 | 17996 |
2.55 | 3 | 14351 | 15964.0000 | 18766 |
2.56 | 3 | 15231 | 16723.3333 | 17753 |
2.57 | 3 | 17116 | 17841.6667 | 18485 |
2.58 | 3 | 12500 | 14481.3333 | 16195 |
2.59 | 1 | 16465 | 16465.0000 | 16465 |
2.60 | 3 | 17027 | 17535.0000 | 18369 |
2.61 | 3 | 13784 | 16583.0000 | 18756 |
2.63 | 3 | 10437 | 12659.6667 | 16914 |
2.64 | 1 | 17407 | 17407.0000 | 17407 |
2.65 | 1 | 16314 | 16314.0000 | 16314 |
2.66 | 2 | 16239 | 17367.0000 | 18495 |
2.67 | 1 | 18686 | 18686.0000 | 18686 |
2.68 | 2 | 8419 | 9042.0000 | 9665 |
2.70 | 1 | 14341 | 14341.0000 | 14341 |
2.71 | 1 | 17146 | 17146.0000 | 17146 |
2.72 | 3 | 6870 | 12088.3333 | 17801 |
2.74 | 3 | 8807 | 14385.0000 | 17184 |
2.75 | 2 | 13156 | 14285.5000 | 15415 |
2.77 | 1 | 10424 | 10424.0000 | 10424 |
2.80 | 1 | 15030 | 15030.0000 | 15030 |
3.00 | 8 | 6512 | 12378.5000 | 16970 |
3.01 | 12 | 8040 | 15521.6667 | 18710 |
3.02 | 1 | 10577 | 10577.0000 | 10577 |
3.04 | 2 | 15354 | 16956.5000 | 18559 |
3.05 | 1 | 10453 | 10453.0000 | 10453 |
3.11 | 1 | 9823 | 9823.0000 | 9823 |
3.22 | 1 | 12545 | 12545.0000 | 12545 |
3.24 | 1 | 12300 | 12300.0000 | 12300 |
3.40 | 1 | 15964 | 15964.0000 | 15964 |
3.50 | 1 | 12587 | 12587.0000 | 12587 |
3.51 | 1 | 18701 | 18701.0000 | 18701 |
3.65 | 1 | 11668 | 11668.0000 | 11668 |
3.67 | 1 | 16193 | 16193.0000 | 16193 |
4.00 | 1 | 15984 | 15984.0000 | 15984 |
4.01 | 2 | 15223 | 15223.0000 | 15223 |
4.13 | 1 | 17329 | 17329.0000 | 17329 |
4.50 | 1 | 18531 | 18531.0000 | 18531 |
5.01 | 1 | 18018 | 18018.0000 | 18018 |
Up to this point, the most important features of the diamonds have been presented as a count analysis. From here on, the relationships between the variables and price are illustrated.
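A carat-level price summary like the table above can be reproduced with a grouped dplyr chain (a sketch; the column names used here are assumptions, since the code that produced the original table is not shown in this section):

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Grouped price summary per carat value, mirroring the table above
diamonds %>%
  group_by(carat) %>%
  summarise(count = n(),
            min_price = min(price),
            mean_price = mean(price),
            max_price = max(price))
```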
diamonds %>%
group_by(cut, color) %>%
summarise(avg_price = mean(price)) %>%
ggplot(aes(x=cut, y= avg_price, fill = cut)) +
geom_col() +
facet_wrap(~color) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90), legend.position = "none") +
labs(title = "Relationship Analysis Between Price and Cut by Color",
subtitle = st,
y = "Average Price",
x = "Quality of Cut")
Looking at the plots, these two variables alone cannot explain the price variable, because the worst colors have the highest average prices within the best cut categories.
diamonds %>%
group_by(clarity, color) %>%
summarise(MeanPrice = mean(price)) %>%
ggplot(aes(x=clarity, y = MeanPrice, fill = color)) +
geom_col(alpha = 0.8) +
theme_minimal() +
facet_wrap(~color) +
labs(title = "Relationship Analysis Between Price and Clarity by Color",
subtitle = st,
x = "Clarity",
y = "Average Price",
fill = "Color")
The best combination of these two variables does have the highest average price. For the rest, however, average price decreases as clarity increases for every color. So these two variables still cannot explain price on their own.
diamonds %>%
group_by(clarity, cut) %>%
summarise(MeanPrice = mean(price)) %>%
ggplot(aes(x=clarity, y = MeanPrice, fill = cut)) +
geom_col(alpha = 0.8) +
theme_minimal() +
facet_wrap(~cut) +
labs(title = "Relationship Analysis Between Price and Clarity by Cut",
subtitle = st,
x = "Clarity",
y = "Average Price",
fill = "Cut")
We get the same picture as in the previous plot: SI2, the second-worst clarity grade, has the highest average price within every cut type. So these two variables cannot explain the price variable either.
quant = quantile(diamonds$price, seq(0, 1, 0.2))
diamonds_price_group <- diamonds %>%
mutate(price_group = case_when(
price < quant[2] ~ "Very Low",
price < quant[3] ~ "Low",
price < quant[4] ~ "Medium",
price < quant[5] ~ "High",
TRUE ~ "Very High"
)) %>%
mutate(price_group = factor(price_group, levels = c("Very Low", "Low", "Medium", "High", "Very High")))
diamonds_price_group %>%
group_by(cut, price_group) %>%
summarise(count=n()) %>%
mutate(percentage = 100 * count / sum(count)) %>%
ggplot(., aes(x = '', y = count, fill = price_group)) +
geom_bar(width = 1, stat = "identity", position = "fill") +
coord_polar("y") +
theme_void() +
theme(plot.title = element_text(vjust = 0.5)) +
facet_wrap(~cut) +
labs(title = "Price Group Analyses of Cut",
subtitle = st,
fill = "Price Group")
The price groups are fairly evenly distributed within every cut type except Fair, which accounts for only 2.96 percent of the data, so its price-group shares should be read with caution. The same results are shown in the table below.
price_group_vs_cut = diamonds_price_group %>%
group_by(cut, price_group) %>%
summarise(count=n()) %>%
mutate(percentage = 100 * count / sum(count)) %>%
select(cut, price_group, count, percentage)%>%
pivot_wider(id_cols = cut, names_from = price_group, values_from = count)
cut_percentage = diamonds_price_group %>%
group_by(cut) %>%
summarise(count=n()) %>%
mutate(percentage = round(100 * count / sum(count),2)) %>%
select(cut,percentage)
price_group_vs_cut %>%
inner_join(cut_percentage, by = "cut") %>%
kable(col.names = c("Cut", "Very Low","Low", "Medium", "High", "Very High", "Percentage"))%>%
kable_minimal(full_width = F)
Cut | Very Low | Low | Medium | High | Very High | Percentage |
---|---|---|---|---|---|---|
Fair | 68 | 221 | 546 | 463 | 296 | 2.96 |
Good | 953 | 692 | 1032 | 1363 | 848 | 9.09 |
Very Good | 2735 | 1782 | 2448 | 2697 | 2403 | 22.44 |
Premium | 2246 | 2568 | 2297 | 3200 | 3427 | 25.55 |
Ideal | 4745 | 5491 | 4435 | 3024 | 3785 | 39.95 |
diamonds %>%
ggplot(aes(x=price)) +
geom_histogram(aes(y=..density..), position="identity", alpha=0.8, fill = "seashell3", color="seashell4") +
geom_density(alpha=1, size = 1)+
theme_minimal() +
labs(title = "Histogram of Price",
subtitle = st,
x = "Price",
y = "Density")
The histogram shows that the price variable in the diamonds dataset is right skewed.
diamonds %>%
ggplot(aes(x=log(price))) +
geom_histogram(aes(y=..density..), position="identity", alpha=0.8, fill = "seashell3", color="seashell4") +
geom_density(alpha=1, size = 1)+
theme_minimal() +
labs(title = "Histogram of Log Price",
subtitle = st,
x = "Log of Price",
y = "Density")
Even after a log transformation, the price distribution is not close to normal. So plain linear models are a poor fit for this data, and we can try GLMs, decision trees, etc. instead.
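One way to quantify the asymmetry, using the already-loaded e1071 package, is to compare the skewness of price before and after the log transform (a sketch):

```r
library(ggplot2)  # diamonds dataset
library(e1071)    # skewness()

# Skewness near 0 indicates symmetry; a positive value means right skew
skewness(diamonds$price)       # strongly positive
skewness(log(diamonds$price))  # closer to 0, but still not symmetric
```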
First, we need to split the data into training and test sets.
set.seed(503)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>%
group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
diamonds_test, by = "diamond_id")
diamonds_train = diamonds_train[, c(-ncol(diamonds_train))]
diamonds_test = diamonds_test[, c(-ncol(diamonds_test))]
We will treat all the ordered factor columns (cut, color, clarity) as numeric values.
diamonds_train$cut = as.numeric(diamonds_train$cut)
diamonds_train$clarity = as.numeric(diamonds_train$clarity)
diamonds_train$color = as.numeric(diamonds_train$color)
diamonds_test$cut = as.numeric(diamonds_test$cut)
diamonds_test$clarity = as.numeric(diamonds_test$clarity)
diamonds_test$color = as.numeric(diamonds_test$color)
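A quick check of what as.numeric() does to an ordered factor: it returns the underlying level codes, so the quality ordering is preserved as 1, 2, 3, ... (a toy sketch):

```r
# An ordered factor analogous to the cut column
x <- factor(c("Good", "Ideal", "Fair"),
            levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
            ordered = TRUE)

as.numeric(x)  # 2 5 1 -- each value's position in the ordered levels
```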
Since the dataset contains several numeric variables, we can apply PCA to reduce the number of columns.
diamonds_pca <- princomp(as.matrix(diamonds_train[,sapply(diamonds_train, class) == "numeric"]),cor=T)
summary(diamonds_pca,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.0743448 1.1892870 1.1060252 0.9927188 0.83129152
## Proportion of Variance 0.4781007 0.1571560 0.1359213 0.1094989 0.07678284
## Cumulative Proportion 0.4781007 0.6352566 0.7711779 0.8806769 0.95745974
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.59265505 0.172985644 0.0353090908 2.125041e-02
## Proportion of Variance 0.03902667 0.003324893 0.0001385258 5.017554e-05
## Cumulative Proportion 0.99648641 0.999811299 0.9999498245 1.000000e+00
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## carat 0.471 0.126 0.868
## cut -0.109 0.596 0.451 0.239 -0.175 0.585
## color -0.158 -0.203 0.802 0.537
## clarity -0.222 0.208 0.212 -0.506 0.778
## depth 0.207 -0.828 0.118 0.492 0.101
## table 0.129 -0.697 0.226 -0.183 0.642
## x 0.475 0.115 -0.277 -0.722 0.387
## y 0.474 0.120 -0.299 0.692 0.420
## z 0.474 0.103 0.132 -0.282 -0.815
From the summary of the PCA model, we can see that four components explain about 88% of the variance. We can therefore choose four components for modeling. Now we need to add these components to both the train and test datasets.
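The cumulative proportions in the summary can also be computed directly from the component standard deviations (a sketch that assumes the diamonds_pca object created above):

```r
# Proportion of variance per component, from the component standard deviations
prop_var <- diamonds_pca$sdev^2 / sum(diamonds_pca$sdev^2)
cumsum(prop_var)  # the fourth entry is about 0.88
```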
pca_results = predict(diamonds_pca, newdata = diamonds_train)
diamonds_train$pca1 = pca_results[,1]
diamonds_train$pca2 = pca_results[,2]
diamonds_train$pca3 = pca_results[,3]
diamonds_train$pca4 = pca_results[,4]
pca_results_test = predict(diamonds_pca, newdata = diamonds_test)
diamonds_test$pca1 = pca_results_test[,1]
diamonds_test$pca2 = pca_results_test[,2]
diamonds_test$pca3 = pca_results_test[,3]
diamonds_test$pca4 = pca_results_test[,4]
We can create a new feature using clustering; here we use the KMeans algorithm. Before running KMeans we need to scale the data, because a column with much larger values than the others would dominate the distance calculations. For min-max scaling we take the minimum and maximum values from the training set, and we scale both the training and test sets with those same values.
min_vals = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], min)
max_vals = sapply(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], max)
diamonds_train_scale <- sweep(sweep(diamonds_train[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, min_vals), 2, max_vals - min_vals, "/")
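As a sanity check after min-max scaling, every column of the scaled training data should lie in [0, 1] (a sketch using the diamonds_train_scale object from above):

```r
# (x - min) / (max - min) maps each training column onto [0, 1]
all(sapply(diamonds_train_scale,
           function(col) min(col) >= 0 && max(col) <= 1))  # should be TRUE
```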
To get the same clusters every run, we fix the random seed. To select the number of clusters, we loop the centers argument over 1 to 15 and pick a value based on the error plot.
errors = c()
for (i in (1:15)){
set.seed(1234) #provide getting same results with random function
cluster<-kmeans(diamonds_train_scale,centers=i) # application of the k-means function with i number of group size
errors = c(errors, 100*round(1 - (cluster$tot.withinss/cluster$totss), digits = 3)) # calculation of the fulfillment of the clusters to data.
}
errors_df <- data.frame(x=c(1:15), y=errors) # creating data frame with errors.
ggplot(errors_df, aes(x=x, y=y)) +
geom_point(color = "seashell4") +
geom_line(color="seashell4") +
geom_text(aes(label = errors), size=3, color = "black", position = position_stack(vjust = 0.95))+
theme_minimal() +
labs(title = "Classification of Observations",
subtitle = st,
x = "Number of Clusters",
y = "Explained Variance (%)")
The error curve flattens after 7 centers, so we select the model with 7 clusters.
set.seed(1234)
best_cluster = kmeans(diamonds_train_scale,centers=7)
diamonds_train$cluster = as.factor(best_cluster$cluster)
Now we need to apply the clustering to the test dataset. To do this, I used the method in this link.
diamonds_test_scale <- sweep(sweep(diamonds_test[,c("cut", "clarity", "color", "carat", "depth", "table", "x", "y", "z")], 2, min_vals), 2, max_vals - min_vals, "/")
closest.cluster <- function(x) {
cluster.dist <- apply(best_cluster$centers, 1, function(y) sqrt(sum((x-y)^2)))
return(which.min(cluster.dist)[1])
}
diamonds_test$cluster <- as.factor(apply(diamonds_test_scale, 1, closest.cluster))
After these processes, we have enough data to create a model.
At the beginning, we need to check the correlation of the features with the price column.
diamonds_cor<-cor(diamonds_train[-c(11:15)])
corrplot(diamonds_cor, method="number")
From the plot above, the carat, x, y and z columns are highly correlated with the price column, so they should appear in any model. To begin with, however, we will include all columns in the model.
For simplicity, we start with a linear regression model, leaving out the PCA columns for now.
model_lm1 = lm(price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train)
summary(model_lm1)
##
## Call:
## lm(formula = price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23380.8 -624.4 -111.3 487.6 10016.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23516.947 1284.448 -18.309 < 2e-16 ***
## carat 11553.086 65.487 176.419 < 2e-16 ***
## cut 168.773 12.865 13.119 < 2e-16 ***
## color 283.070 5.670 49.920 < 2e-16 ***
## clarity 501.646 4.475 112.094 < 2e-16 ***
## depth 358.929 19.878 18.056 < 2e-16 ***
## table -16.708 3.386 -4.934 8.07e-07 ***
## x -347.272 147.355 -2.357 0.0184 *
## y 3718.053 151.287 24.576 < 2e-16 ***
## z -7446.634 320.787 -23.214 < 2e-16 ***
## cluster2 -48.540 25.392 -1.912 0.0559 .
## cluster3 76.119 42.557 1.789 0.0737 .
## cluster4 -42.650 35.186 -1.212 0.2255
## cluster5 250.682 35.456 7.070 1.57e-12 ***
## cluster6 234.953 35.374 6.642 3.13e-11 ***
## cluster7 440.818 26.276 16.776 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1190 on 42987 degrees of freedom
## Multiple R-squared: 0.9106, Adjusted R-squared: 0.9106
## F-statistic: 2.92e+04 on 15 and 42987 DF, p-value: < 2.2e-16
We created a linear model. For now, we will not try to improve it; instead we check the linear model assumptions. To do so, we plot the residuals.
plot(model_lm1,1)
plot(model_lm1,2)
The first plot shows that the residual variance is not constant, which means this dataset is not well suited to ordinary linear regression. For better models we can use generalized linear models (GLMs) instead.
Since the response variable is continuous, we can use the Gamma and Gaussian families.
For easy implementation, we can use the code chunks that we learned in IE508.
for(link in c("identity", "log", "inverse", "sqrt")){
fitgamma = glm(price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train, family = Gamma(link = link), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
print(AIC(fitgamma))
}
## [1] 673749.4
## [1] 1096312
## [1] 711941.7
## [1] 644408.7
for(link in c("identity", "log", "inverse")){
fitgaussian = glm(price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train, family = gaussian(link = link), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
print(AIC(fitgaussian))
}
## [1] 731124.9
## [1] 5602415
## [1] 864186
Comparing the AIC values of the glm models, the best model is the one with the Gamma family and sqrt link function. So we continue with these arguments for glm.
model_glm = glm(price ~ . - pca1 - pca2 - pca3 - pca4, data = diamonds_train, family = Gamma(link = "sqrt"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm)
##
## Call:
## glm(formula = price ~ . - pca1 - pca2 - pca3 - pca4, family = Gamma(link = "sqrt"),
## data = diamonds_train, start = c(0.5, 0.5, 0.5, 0.5, 0.5,
## 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.27202 -0.12157 -0.02088 0.09054 2.29577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -75.03830 4.80955 -15.602 < 2e-16 ***
## carat 52.76661 0.34888 151.246 < 2e-16 ***
## cut 0.53402 0.04191 12.743 < 2e-16 ***
## color 1.19880 0.01677 71.493 < 2e-16 ***
## clarity 2.01002 0.01348 149.150 < 2e-16 ***
## depth 0.72617 0.07474 9.716 < 2e-16 ***
## table 0.01401 0.01082 1.295 0.195384
## x 16.04449 0.61542 26.071 < 2e-16 ***
## y -3.56428 0.62223 -5.728 1.02e-08 ***
## z -12.25907 1.42896 -8.579 < 2e-16 ***
## cluster2 -0.12907 0.06921 -1.865 0.062195 .
## cluster3 -1.58676 0.14697 -10.797 < 2e-16 ***
## cluster4 -2.87269 0.13135 -21.870 < 2e-16 ***
## cluster5 -0.10937 0.10869 -1.006 0.314332
## cluster6 0.35828 0.10104 3.546 0.000392 ***
## cluster7 1.38220 0.10099 13.686 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.03404458)
##
## Null deviance: 42126.8 on 43002 degrees of freedom
## Residual deviance: 1393.5 on 42987 degrees of freedom
## AIC: 644409
##
## Number of Fisher Scoring iterations: 8
The table variable is not significant in this model, so we can remove it.
model_glm2 = glm(price ~ . - pca1 - pca2 - pca3 - pca4 - table, data = diamonds_train, family = Gamma(link = "sqrt"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm2)
##
## Call:
## glm(formula = price ~ . - pca1 - pca2 - pca3 - pca4 - table,
## family = Gamma(link = "sqrt"), data = diamonds_train, start = c(0.5,
## 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
## 0.5, 0.5, 0.5))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.27264 -0.12163 -0.02084 0.09062 2.28622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -73.19219 4.56920 -16.019 < 2e-16 ***
## carat 52.81087 0.34721 152.102 < 2e-16 ***
## cut 0.50918 0.03704 13.748 < 2e-16 ***
## color 1.19915 0.01676 71.540 < 2e-16 ***
## clarity 2.00950 0.01347 149.222 < 2e-16 ***
## depth 0.71171 0.07371 9.655 < 2e-16 ***
## x 16.05383 0.61528 26.092 < 2e-16 ***
## y -3.63697 0.61902 -5.875 4.25e-09 ***
## z -12.17752 1.42666 -8.536 < 2e-16 ***
## cluster2 -0.12735 0.06919 -1.841 0.065677 .
## cluster3 -1.61398 0.14540 -11.101 < 2e-16 ***
## cluster4 -2.87373 0.13132 -21.884 < 2e-16 ***
## cluster5 -0.14129 0.10604 -1.332 0.182730
## cluster6 0.32846 0.09832 3.341 0.000836 ***
## cluster7 1.38189 0.10096 13.687 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.0340283)
##
## Null deviance: 42126.8 on 43002 degrees of freedom
## Residual deviance: 1393.6 on 42988 degrees of freedom
## AIC: 644408
##
## Number of Fisher Scoring iterations: 8
We can also test removing the cluster column.
model_glm3 = glm(price ~ . - pca1 - pca2 - pca3 - pca4 - table - cluster, data = diamonds_train, family = Gamma(link = "sqrt"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
summary(model_glm3)
##
## Call:
## glm(formula = price ~ . - pca1 - pca2 - pca3 - pca4 - table -
## cluster, family = Gamma(link = "sqrt"), data = diamonds_train,
## start = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.25890 -0.12326 -0.02196 0.08701 2.25745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -86.25420 4.62895 -18.634 < 2e-16 ***
## carat 50.94390 0.34313 148.469 < 2e-16 ***
## cut 0.47216 0.01724 27.390 < 2e-16 ***
## color 1.33121 0.01096 121.467 < 2e-16 ***
## clarity 2.04246 0.01187 172.057 < 2e-16 ***
## depth 0.89650 0.07468 12.005 < 2e-16 ***
## x 17.25907 0.60617 28.472 < 2e-16 ***
## y -2.30449 0.60142 -3.832 0.000127 ***
## z -15.66734 1.44552 -10.839 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.03557524)
##
## Null deviance: 42126.8 on 43002 degrees of freedom
## Residual deviance: 1441.6 on 42994 degrees of freedom
## AIC: 645863
##
## Number of Fisher Scoring iterations: 9
So the cluster feature makes our model better (the AIC rises when it is removed). Now we check whether the numeric columns can be replaced with the PCA components.
model_glm4 = glm(price ~ pca1 + pca2 + pca3 + pca4 + cluster, data = diamonds_train, family = Gamma(link = "sqrt"))
summary(model_glm4)
##
## Call:
## glm(formula = price ~ pca1 + pca2 + pca3 + pca4 + cluster, family = Gamma(link = "sqrt"),
## data = diamonds_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.19335 -0.26935 -0.06091 0.16983 1.92149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.58457 0.12355 457.991 < 2e-16 ***
## pca1 11.83082 0.04129 286.500 < 2e-16 ***
## pca2 4.07612 0.04255 95.803 < 2e-16 ***
## pca3 3.14036 0.04449 70.588 < 2e-16 ***
## pca4 -0.30761 0.06086 -5.054 4.34e-07 ***
## cluster2 -3.32685 0.14251 -23.345 < 2e-16 ***
## cluster3 -2.57743 0.27792 -9.274 < 2e-16 ***
## cluster4 -7.05281 0.27171 -25.957 < 2e-16 ***
## cluster5 2.92040 0.17373 16.810 < 2e-16 ***
## cluster6 3.86938 0.15565 24.859 < 2e-16 ***
## cluster7 1.72285 0.21070 8.177 2.99e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.1343288)
##
## Null deviance: 42126.8 on 43002 degrees of freedom
## Residual deviance: 5343.6 on 42992 degrees of freedom
## AIC: 702856
##
## Number of Fisher Scoring iterations: 12
PCA lets us use fewer features in a model, but, as expected, fewer features give a higher AIC. Since we have only 13 numeric features, there is no real need to reduce their number.
To compare the glm model with the other models, we predict on the test data and calculate the R squared value.
model_glm2_prediction<-predict(model_glm2,newdata=diamonds_test)
model_glm2_r2 <- 1 - (sum((model_glm2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_glm2_r2
## [1] -0.9323168
The result is negative, meaning the model predicts worse than simply using the training mean. We can change the family/link function of the glm to get a better model. The second-best model by AIC used the Gamma family with the identity link function, so we fit the same model with that combination.
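One detail worth noting before switching models: predict() on a glm object returns predictions on the link scale by default, so with the sqrt link the values above are square roots of prices. Requesting the response scale may close much of this gap (a sketch, assuming the model_glm2 object fitted above):

```r
# type = "response" applies the inverse link, giving predictions in price units
glm2_resp <- predict(model_glm2, newdata = diamonds_test, type = "response")

glm2_resp_r2 <- 1 - sum((glm2_resp - diamonds_test$price)^2) /
  sum((diamonds_test$price - mean(diamonds_train$price))^2)
```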
model_glm5 = glm(price ~ . - pca1 - pca2 - pca3 - pca4 - table, data = diamonds_train, family = Gamma(link = "identity"), start = c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5))
model_glm5_prediction<-predict(model_glm5,newdata=diamonds_test)
model_glm5_r2 <- 1 - (sum((model_glm5_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_glm5_r2
## [1] 0.8122059
This is a far better model. We can also plot the real values against the predictions.
pred_vs_real_glm = as.data.frame(cbind(model_glm5_prediction, diamonds_test$price))
colnames(pred_vs_real_glm) = c("predictions", "real")
pred_vs_real_glm %>%
ggplot(aes(predictions, real)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color="red", size=1.5) +
labs(title = "Prediction vs Real Values For Best GLM Model",
subtitle = st,
x = "Predictions",
y = "Real Values")
model_rpart <- rpart(price~., data=diamonds_train)
prp(model_rpart)
From the plot of the tree, we can see that the nodes are split using only the carat, y and clarity columns. This means these are the best features for reducing variance in the dataset with the default arguments. We can also cross-validate the cp argument with the following code chunk (these lines are taken from the Analytics Edge course):
# Number of folds
tr.control = trainControl(method = "cv", number = 10)
# cp values
cp.grid = expand.grid( .cp = (1:10)*0.001)
# Cross-validation
tr = train(price~. - pca1 - pca2 - pca3 - pca4, data = diamonds_train, method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
tr
## CART
##
## 43003 samples
## 14 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 38703, 38704, 38702, 38701, 38703, 38702, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.001 944.8551 0.9436613 577.6079
## 0.002 1037.4473 0.9320865 628.6129
## 0.003 1079.4290 0.9264358 653.3161
## 0.004 1143.8428 0.9174002 689.0738
## 0.005 1146.8377 0.9169674 690.2158
## 0.006 1241.1488 0.9026995 809.9428
## 0.007 1277.7140 0.8968892 845.0753
## 0.008 1277.7140 0.8968892 845.0753
## 0.009 1277.7140 0.8968892 845.0753
## 0.010 1334.7119 0.8871967 870.7893
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.001.
Comparing the candidate values, cp = 0.001 gives the lowest RMSE, so we use it.
model_rpart2 <- rpart(price~. - pca1 - pca2 - pca3 - pca4, data=diamonds_train, cp = 0.001)
prp(model_rpart2)
This is a more detailed tree. From its plot, we can say that price is determined mainly by the carat, y, clarity and color values. To compare the two trees, we calculate their R squared values: first we predict on the test data, then we compute R squared.
model_rpart_prediction<-predict(model_rpart,newdata=diamonds_test)
model_rpart_r2 <- 1 - (sum((model_rpart_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_rpart_r2
## [1] 0.881602
model_rpart2_prediction<-predict(model_rpart2,newdata=diamonds_test)
model_rpart2_r2 <- 1 - (sum((model_rpart2_prediction-diamonds_test$price)^2)/sum((diamonds_test$price-mean(diamonds_train$price))^2))
model_rpart2_r2
## [1] 0.9456136
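Since the same out-of-sample R squared formula is repeated for every model, it could be wrapped in a small helper (a sketch; r_squared is a hypothetical name, not a function used elsewhere in this post):

```r
# Out-of-sample R^2: 1 - SSE / SST, with SST centred on the training mean
r_squared <- function(predictions, actual, train_mean) {
  1 - sum((predictions - actual)^2) / sum((actual - train_mean)^2)
}

# e.g. r_squared(model_rpart2_prediction, diamonds_test$price,
#                mean(diamonds_train$price))
```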
From the results, we can see that the second model is better than the first.
To compare the best tree model with the best glm model, we compare their R squared values.
model_rpart2_r2 > model_glm5_r2
## [1] TRUE
So, our CART model is better than the glm model.
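The final comparison can be gathered into one small table (a sketch that assumes the R squared objects computed above):

```r
# Collect the test-set R^2 of the candidate models into a single table
tibble(model = c("GLM (Gamma, identity link)",
                 "CART (default cp)",
                 "CART (cp = 0.001)"),
       r_squared = c(model_glm5_r2, model_rpart_r2, model_rpart2_r2)) %>%
  arrange(desc(r_squared)) %>%
  kable(col.names = c("Model", "Test R-squared")) %>%
  kable_minimal(full_width = F)
```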
We prepared and checked the data before modeling. Before building models, we applied the PCA and KMeans algorithms. As expected, the columns created by PCA gave worse results; PCA is more useful for data with many features. With KMeans, we created a cluster feature for our models after min-max scaling the numeric columns (so that no single column dominates the distances), and this feature improved every model. After building the best linear, generalized linear and decision tree models, the decision tree performed better than the others.
Kaggle Notebook1 - Predicting Diamond Prices with Linear Regression
Kaggle Notebook2 - Diamond Exploration Price Modeling
EDA Example with Diamonds data set
Diamonds-Wikipedia
Color Cheatsheet
Geom Histogram
Ideal Cut
Depth Percentages
Mode Function
Division of Vector and Matrix
Cluster for Test Data
R2 for GLM