Assignment3-Mef Big Data Msc

##Loading Packages

Lading Diamonds Data And First Glance

I have loaded the data from ggplot library. After that using glimpse function to get summary information about the data.

Luckily, there is no null value in the diamonds data.

diamonds<-ggplot2::diamonds

glimpse(diamonds)

## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...

data.frame(Total_Missing=sapply(diamonds,function(x){sum(is.na(x))}))

##         Total_Missing
## carat               0
## cut                 0
## color               0
## clarity             0
## depth               0
## table               0
## price               0
## x                   0
## y                   0
## z                   0

diamonds <-rename(diamonds,purity=clarity)

Variable Types and Descriptions of Variables

Variable Types

There are three categorical variable namely, cut, color and purity.

Besides, there are numerical and integer values which are price and carat,table and dimension of diamonds : (x,y,z).

Variables Description

carat weight of the diamond (0.2–5.01)

cut quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color diamond colour, from J (worst) to D (best)

clarity a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

depth total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)

table width of top of diamond relative to widest point (43–95)

price price in US dollars ($326–$18,823)

x length in mm (0–10.74)

y width in mm (0–58.9)

z depth in mm (0–31.8)

vis_dat(diamonds)

print("Cut Levels")

## [1] "Cut Levels"

levels(diamonds$cut)

## [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"

print("Color Levels")

## [1] "Color Levels"

levels(diamonds$color)

## [1] "D" "E" "F" "G" "H" "I" "J"

print("Purity Levels")

## [1] "Purity Levels"

levels(diamonds$purity)

## [1] "I1"   "SI2"  "SI1"  "VS2"  "VS1"  "VVS2" "VVS1" "IF"

One-Hot Encoding

Ensuring the data frame including only numerical value, categorical values need to one-Hot encoding process

diamonds_numeric <- diamonds  %>% dplyr:: mutate(colorNum = ifelse(color == "D", 1, ifelse(color == "E", 2, ifelse(color == "F", 3, ifelse(color == "G", 4, ifelse(color == "H",5, ifelse(color == "I" , 6,7)))))), cutNum = ifelse(cut=="Fair", 1, ifelse(cut=="Good", 2, ifelse(cut == "Very Good", 3, ifelse(cut=="Premium", 4, 5)))), purityNum = ifelse(purity == "I1", 1, ifelse(purity == "SI2", 2, ifelse(purity == "SI1",3, ifelse(purity=="VS2", 4, ifelse(purity=="VS1",5, ifelse(purity=="VVS2",6,ifelse(purity == "VVS1", 7, 8)))))))) %>% dplyr::select(-color,-cut, -purity)
#diamonds_numeric <- diamonds_numeric %>% select(-price)
diamonds_numeric <- as.data.frame(diamonds_numeric)
head(diamonds_numeric)

##   carat depth table price    x    y    z colorNum cutNum purityNum
## 1  0.23  61.5    55   326 3.95 3.98 2.43        2      5         2
## 2  0.21  59.8    61   326 3.89 3.84 2.31        2      4         3
## 3  0.23  56.9    65   327 4.05 4.07 2.31        2      2         5
## 4  0.29  62.4    58   334 4.20 4.23 2.63        6      4         4
## 5  0.31  63.3    58   335 4.34 4.35 2.75        7      2         2
## 6  0.24  62.8    57   336 3.94 3.96 2.48        7      3         6

Distribution Of Continous Variables

Price and carat have right tail skewed shape and also comparing with others these two table have wider range distribution.

par(mfrow=c(1,1)) # to divide the grid area

# Plotting cut vs price
a <- ggplot(data =diamonds_numeric)+scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
  geom_density(aes(x=price),alpha=1.5,fill="blue")+
  xlab("Price")+
  ggtitle("Dist. Of Price")

# Plotting color vs price
b <- ggplot(data = diamonds_numeric)+
  geom_density(aes(x=carat),alpha=1,fill="red")+
  xlab("Carat")+
  ggtitle("Dist Of Carat")

# Plotting clarity vs price
c <- ggplot(data = diamonds_numeric)+
  geom_density(aes(x=depth),alpha=1,fill="green")+
  xlab("Depth")+
  ggtitle("Dist. Of Depth")

# Plotting carat vs price
d <- ggplot(data = diamonds_numeric)+
  geom_density(aes(x=table),alpha=1,fill="orange")+
  xlab("Table")+
  ggtitle("Dist. Of Table")

# Plotting depth vs price
e <- ggplot(data = diamonds_numeric)+
  geom_density(aes(x=x),alpha=1,fill="pink")+
  xlab("x")+
  ggtitle("Dist. Of X")

# Plotting table vs price
f <- ggplot(data = diamonds_numeric)+
  geom_density(aes(x=y),alpha=1,fill="purple")+
  xlab("y")+
  ggtitle("Dist. Of Y")

# Plotting X vs price
g <- ggplot(data = diamonds_numeric)+
  geom_density(aes(x=z),alpha=1,fill="purple")+
  xlab("z")+
  ggtitle("Dist. Of Z")


grid.arrange(a, b, c,d,e,f,g, ncol = 3, nrow = 3)

##Correlation Matrix

As seen in the table below. Price,carat,x,y,z has nonignorable correlation degree each other. Although it could be also considered to eleminate some variable less correlation than 0.50 of each value, I decided to keep all of them because there are only 9 variables.

par(mfrow=c(1,1)) # to divide the grid area

CorrMat <- round(cor(diamonds_numeric),3)
ggcorrplot(CorrMat, hc.order = TRUE, 
           #type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of Numeric Values", 
           ggtheme=theme_bw)

#Data Cleaning part 1

Thanks to the code on linked below, It helps me fistly finding 3 values beyond the 75th quartile of y,z and changing them mean of these variables.After then, It also is detecting 20 values which has zeros in any dimension x,y,z dimension. The code also changing replacing these zeros with means of regarding dimensions. Getting rid of 23 outliers with two process mentioned above, getting higher metric results in the following Random Forest algorithm deployment.

https://rpubs.com/ankurmehta/diamond_outliers

diamonds_corrected <- diamonds_numeric

diamonds_corrected %>% dplyr::filter(y>11 | z>11)

##   carat depth table price    x     y     z colorNum cutNum purityNum
## 1  2.00  58.9  57.0 12210 8.09 58.90  8.06        5      4         2
## 2  0.51  61.8  54.7  1970 5.12  5.15 31.80        2      3         5
## 3  0.51  61.8  55.0  2075 5.15 31.80  5.12        2      5         5

diamonds_corrected <- diamonds_corrected %>%dplyr:: mutate(y=case_when(y>50 ~ 8.09, y>30 ~ 5.15, TRUE ~ y),z=case_when(z>30 ~ 0.6*5.15, TRUE ~ z))

diamonds_corrected %>% filter(x==0 |y==0 | z==0)

##    carat depth table price    x    y z colorNum cutNum purityNum
## 1   1.00  59.1    59  3142 6.55 6.48 0        4      4         2
## 2   1.01  58.1    59  3167 6.66 6.60 0        5      4         1
## 3   1.10  63.0    59  3696 6.50 6.47 0        4      4         2
## 4   1.01  59.2    58  3837 6.50 6.47 0        3      4         2
## 5   1.50  64.0    61  4731 7.15 7.04 0        4      2         1
## 6   1.07  61.6    56  4954 0.00 6.62 0        3      5         2
## 7   1.00  63.3    53  5139 0.00 0.00 0        5      3         4
## 8   1.15  59.2    56  5564 6.88 6.83 0        4      5         4
## 9   1.14  57.5    67  6381 0.00 0.00 0        4      1         5
## 10  2.18  59.4    61 12631 8.49 8.45 0        5      4         2
## 11  1.56  62.2    54 12800 0.00 0.00 0        4      5         4
## 12  2.25  61.3    58 15397 8.52 8.42 0        6      4         3
## 13  1.20  62.1    59 15686 0.00 0.00 0        1      4         7
## 14  2.20  61.2    59 17265 8.42 8.37 0        5      4         3
## 15  2.25  62.8    59 18034 0.00 0.00 0        5      4         2
## 16  2.02  62.7    53 18207 8.02 7.95 0        5      4         4
## 17  2.80  63.8    58 18788 8.90 8.85 0        4      2         2
## 18  0.71  64.1    60  2130 0.00 0.00 0        3      2         2
## 19  0.71  64.1    60  2130 0.00 0.00 0        3      2         2
## 20  1.12  60.4    59  2383 6.71 6.67 0        4      4         1

diamonds_corrected <- diamonds_corrected %>% group_by(carat,cutNum) %>% 
  mutate(x=case_when(x==0 ~ mean(x), TRUE ~ x), y=case_when(y==0 ~ mean(y), TRUE ~ y))
diamonds_corrected %>% filter(x==0 |y==0 | z==0)

## # A tibble: 20 x 10
## # Groups:   carat, cutNum [17]
##    carat depth table price     x     y     z colorNum cutNum purityNum
##    <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>  <dbl>     <dbl>
##  1  1     59.1    59  3142  6.55  6.48     0        4      4         2
##  2  1.01  58.1    59  3167  6.66  6.6      0        5      4         1
##  3  1.1   63      59  3696  6.5   6.47     0        4      4         2
##  4  1.01  59.2    58  3837  6.5   6.47     0        3      4         2
##  5  1.5   64      61  4731  7.15  7.04     0        4      2         1
##  6  1.07  61.6    56  4954  6.53  6.62     0        3      5         2
##  7  1     63.3    53  5139  6.35  6.38     0        5      3         4
##  8  1.15  59.2    56  5564  6.88  6.83     0        4      5         4
##  9  1.14  57.5    67  6381  3.28  3.27     0        4      1         5
## 10  2.18  59.4    61 12631  8.49  8.45     0        5      4         2
## 11  1.56  62.2    54 12800  7.28  7.29     0        4      5         4
## 12  2.25  61.3    58 15397  8.52  8.42     0        6      4         3
## 13  1.2   62.1    59 15686  6.80  6.74     0        1      4         7
## 14  2.2   61.2    59 17265  8.42  8.37     0        5      4         3
## 15  2.25  62.8    59 18034  7.28  7.22     0        5      4         2
## 16  2.02  62.7    53 18207  8.02  7.95     0        5      4         4
## 17  2.8   63.8    58 18788  8.9   8.85     0        4      2         2
## 18  0.71  64.1    60  2130  5.59  5.60     0        3      2         2
## 19  0.71  64.1    60  2130  5.59  5.60     0        3      2         2
## 20  1.12  60.4    59  2383  6.71  6.67     0        4      4         1

diamonds_corrected %>% filter(carat==1.14, cutNum==0)

## # A tibble: 0 x 10
## # Groups:   carat, cutNum [0]
## # ... with 10 variables: carat <dbl>, depth <dbl>, table <dbl>, price <int>,
## #   x <dbl>, y <dbl>, z <dbl>, colorNum <dbl>, cutNum <dbl>, purityNum <dbl>

diamonds_corrected <- diamonds_corrected %>% group_by(carat,cutNum) %>% 
  mutate(x=case_when(x==0 ~ mean(x[x!=0]), TRUE ~ x), y=case_when(y==0 ~ mean(y[y!=0]), TRUE ~ y))
diamonds_corrected %>% filter(x==0 |y==0 | z==0)

## # A tibble: 20 x 10
## # Groups:   carat, cutNum [17]
##    carat depth table price     x     y     z colorNum cutNum purityNum
##    <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>  <dbl>     <dbl>
##  1  1     59.1    59  3142  6.55  6.48     0        4      4         2
##  2  1.01  58.1    59  3167  6.66  6.6      0        5      4         1
##  3  1.1   63      59  3696  6.5   6.47     0        4      4         2
##  4  1.01  59.2    58  3837  6.5   6.47     0        3      4         2
##  5  1.5   64      61  4731  7.15  7.04     0        4      2         1
##  6  1.07  61.6    56  4954  6.53  6.62     0        3      5         2
##  7  1     63.3    53  5139  6.35  6.38     0        5      3         4
##  8  1.15  59.2    56  5564  6.88  6.83     0        4      5         4
##  9  1.14  57.5    67  6381  3.28  3.27     0        4      1         5
## 10  2.18  59.4    61 12631  8.49  8.45     0        5      4         2
## 11  1.56  62.2    54 12800  7.28  7.29     0        4      5         4
## 12  2.25  61.3    58 15397  8.52  8.42     0        6      4         3
## 13  1.2   62.1    59 15686  6.80  6.74     0        1      4         7
## 14  2.2   61.2    59 17265  8.42  8.37     0        5      4         3
## 15  2.25  62.8    59 18034  7.28  7.22     0        5      4         2
## 16  2.02  62.7    53 18207  8.02  7.95     0        5      4         4
## 17  2.8   63.8    58 18788  8.9   8.85     0        4      2         2
## 18  0.71  64.1    60  2130  5.59  5.60     0        3      2         2
## 19  0.71  64.1    60  2130  5.59  5.60     0        3      2         2
## 20  1.12  60.4    59  2383  6.71  6.67     0        4      4         1

diamonds_corrected <- diamonds_corrected %>% mutate(z=case_when(z==0 ~ 0.3*(x+y),TRUE ~ z))
diamonds_corrected %>% filter(x==0 |y==0 | z==0)

## # A tibble: 0 x 10
## # Groups:   carat, cutNum [0]
## # ... with 10 variables: carat <dbl>, depth <dbl>, table <dbl>, price <int>,
## #   x <dbl>, y <dbl>, z <dbl>, colorNum <dbl>, cutNum <dbl>, purityNum <dbl>

Data Cleaning part 2

In this time, I am going to apply minimum covariance determinant technique. It calculates means and covariance matrix based on the most central subset of the data. In this way, I am going to eleminate 4065 outliers from total row of 53965 complete dataset.

diamonds_cleaning<-diamonds_corrected
output95 <- cov.mcd(diamonds_cleaning, quantile.used = nrow(diamonds_cleaning)*.95)
mhmcd95 <- mahalanobis(diamonds_cleaning, output95$center, output95$cov)
alpha <- .001
cutoff <- (qchisq(p = 1 - alpha, df = ncol(diamonds_cleaning)))
names_outlier_MCD95 <- which(mhmcd95 > cutoff)
excluded_mcd95 <- names_outlier_MCD95
data_clean_mcd <- diamonds_cleaning[-excluded_mcd95, ]
nrow(diamonds_cleaning[excluded_mcd95, ])

## [1] 4071

nrow(data_clean_mcd)

## [1] 49869

df<-data_clean_mcd

Convert Price and Carat to Log-based Type

One last step before diving to Random Forest, I consider to apply log function on carat and price variables. Because they were spread wide range area with the shape of Left-skewed tail.

df<-df %>% mutate_at(vars(matches("price","carat")), log)

RandomForest

Split Data

First we need to split data respect to trained(%75) and test(%25) format. Our goal value is y_train, y_test and rest of them are assigned to X_train and x_test

# Create features and target
X <- df %>% dplyr::select(carat, depth, table, purityNum, cutNum, colorNum,x,y,z)
y <- df$price

# Split data into training and test sets
index <- createDataPartition(y, p=0.75, list=FALSE)
X_train <- X[ index, ]

## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

X_test <- X[-index, ]
y_train <- y[index]
y_test<-y[-index]

Apply Random Forest

In the final step, randomForest is applied with standard paramaters such as maxnodes equal to ten

and ntree equals to ten.

Using with ggplot,It is easier to see that predicted price matching with actual price direction very-well on y-axis.

Due to the getting high R2 metric score which is ~ %93, there is no need to further implementation like Tuning parameters of Random Forest function.

regr <- randomForest(x = X_train, y = y_train , maxnodes = 10, ntree = 10)

predictions <- predict(regr, X_test)

result <- X_test
result['price'] <- y_test
result['prediction']<-  predictions

head(result)

## # A tibble: 6 x 11
## # Groups:   carat, cutNum [6]
##   carat depth table purityNum cutNum colorNum     x     y     z price prediction
##   <dbl> <dbl> <dbl>     <dbl>  <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
## 1  0.21  59.8    61         3      4        2  3.89  3.84  2.31  5.79       6.43
## 2  0.31  63.3    58         2      2        7  4.34  4.35  2.75  5.81       6.42
## 3  0.23  62.8    56         5      5        7  3.93  3.9   2.46  5.83       6.46
## 4  0.22  60.4    61         3      4        3  3.88  3.84  2.33  5.83       6.43
## 5  0.3   62.2    57         4      3        7  4.28  4.3   2.67  5.88       6.45
## 6  0.33  61.8    55         2      5        6  4.49  4.51  2.78  6.00       6.46

ggplot(  ) + 
  geom_point( aes(x = X_test$carat, y = y_test, color = 'red', alpha = 0.5) ) + 
  geom_point( aes(x = X_test$carat , y = predictions, color = 'blue',  alpha = 0.5)) + 
  labs(x = "Carat", y = "Price", color = "", alpha = 'Transperency') +
  scale_color_manual(labels = c( "Predicted", "Real"), values = c("blue", "red"))

print(paste0('R2: ' ,caret::postResample(predictions , y_test)['Rsquared'] ))

## [1] "R2: 0.938689332349006"