At first, we need to load the required packages.
pti <- c("data.table", "tidyverse", "lubridate", "knitr", "tinytex", "rpart", "randomForest", "rpart.plot", "caret")
pti <- pti[!(pti %in% installed.packages())]
if(length(pti)>0){
install.packages(pti)
}
library(tidyverse)
library(lubridate)
library(knitr)
library(tinytex)
library(data.table)
library(randomForest)
library(rpart)
library(rpart.plot)
We download the preprocessed data. You can see the preprocessing steps from this link.
#data uploading
carmarket = readRDS(gzcon(url("https://github.com/pjournal/boun01g-data-mine-r-s/blob/gh-pages/Project/turkey_car_market_EDA?raw=true")))
This is the structure of the data.
## Classes 'data.table' and 'data.frame': 8834 obs. of 17 variables:
## $ Date : Date, format: "2020-04-18" "2020-04-18" ...
## $ Year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ Month : Factor w/ 5 levels "3","4","5","6",..: 2 2 2 3 2 3 3 2 2 3 ...
## $ Brand : Factor w/ 36 levels "Acura","Alfa Romeo",..: 24 3 3 21 21 30 30 24 24 30 ...
## $ Vehicle_Type_Group: Factor w/ 387 levels "1 Series","100 Series",..: 200 319 319 281 281 285 285 324 200 285 ...
## $ Vehicle_Type : Factor w/ 1716 levels "-","0.9","0.9 Icon",..: 1369 1671 1671 1660 1660 1660 1662 1681 1350 1662 ...
## $ Model_Year : num 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ Fuel_Type : Factor w/ 4 levels "Diesel","Electricity",..: 1 2 2 1 1 4 4 1 1 4 ...
## $ Gear : Factor w/ 3 levels "Automatic","Manual",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ CCM : Factor w/ 14 levels "Don't Know","1300 cc and below",..: 7 2 2 7 7 7 7 7 7 7 ...
## $ Horse_Power : Factor w/ 15 levels "Don't Know","100 HP and below",..: 4 9 9 6 6 8 8 4 4 8 ...
## $ Color : Factor w/ 28 levels "Amaranth","Beige",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Body_Type : Factor w/ 13 levels "Crossover","Glass Van",..: 7 13 13 7 7 4 4 11 7 4 ...
## $ Seller : Factor w/ 3 levels "Authority","Galery",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Seller_Status : Factor w/ 4 levels "0 km","2nd Hand",..: 1 2 1 1 1 1 1 1 1 1 ...
## $ Kilometers : int 0 250 0 0 0 0 0 0 0 0 ...
## $ Price : int 2720000 2550000 2500000 2425000 2390000 2375000 2375000 2350000 2335000 2310000 ...
## - attr(*, ".internal.selfref")=<externalptr>
While setting up the model, we encountered an error when the level number of factor type variables exceeded 32. So we overcame this problem by adding some variables to dummy variables.
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
dmy <- dummyVars(" ~ .", data = carmarket[, .(Brand, Fuel_Type, Gear, Seller_Status)])
carmarket_encoded <- data.frame(predict(dmy, newdata = carmarket[, .(Brand, Fuel_Type, Gear, Seller_Status)]))
carmarket_encoded$Kilometers = carmarket$Kilometers
carmarket_encoded$Model_Year = carmarket$Model_Year
carmarket_encoded$CCM = carmarket$CCM
carmarket_encoded$Horse_Power = carmarket$Horse_Power
carmarket_encoded$Price = carmarket$Price
carmarket_encoded$Seller = carmarket$Seller
We divided our data into train and test. We will set up our model using the train part, then we will try to estimate the car prices using the model we have set up in test data. Consequently, we will push our model by looking at various metrics.
Manually, we evaluated our formulas, then we specified the formula to use in the algorithm below.
Random Forest algorithm builds the multiple decision trees which are known as forest and glue them together to urge a more accurate and stable prediction.So we used randomForest()
carmarket.train.rf<-randomForest(fmla,data=carmarket.train,ntree=116)
carmarket.test$rfpredict<-predict(carmarket.train.rf,carmarket.test)
print(error<-sqrt(mean((carmarket.test$rfpredict-carmarket.test$Price)^2)))
## [1] 72356.63
The r-square value is a metric that shows how well our prediction and actual observations fit each other. Our r-squared value is 0.8313471. This value shows that our model explains real data well.
R_squared <- 1 - (sum((carmarket.test$rfpredict-carmarket.test$Price)^2)/sum((carmarket.test$Price-mean(carmarket.test$Price))^2))
R_squared # 0.8313471
## [1] 0.8315713
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.Our rmse value is 72404.77, which clearly shows that the prediction we made is acceptable.
residuals<-carmarket.test$Price-carmarket.test$rfpredict
relative_error <- residuals/carmarket.test$Price
print(rmse<- sqrt(mean(residuals^2))) #72404.77
## [1] 72356.63
## [1] 0.4610714
#MSE = (1/n) * Σ(actual – prediction)2
mse.rf<-(1/nrow(carmarket))*sum(residuals^2)
mse.rf # 1581517991
## [1] 1579415675
We used varImpPlot() function to see how effective it is in which variable model.
We visualized the regression we set up
ggplot(carmarket.test,aes(rfpredict,Price))+
geom_point()+
geom_abline(color="blue")+
coord_cartesian(xlim = c(0, 350000), ylim = c(0, 350000))+
ggtitle("Prediction Price vs Actual Price")
Below is a basic table of the prediction and actual values we have made.
## Price rfpredict
## 1: 2720000 1663762
## 2: 2375000 1799627
## 3: 2350000 1113409
## 4: 2310000 1799627
## 5: 2250000 2089320
## 6: 1980000 1882624
## 7: 1945000 1565244
## 8: 1795000 1458017
## 9: 1715000 1192977
## 10: 1690000 1734953
Although the above results are satisfactory, we wanted to check our model above using another algorithm. That’s why we used the rpart package and reached the following results.
set.seed(492)
ind<-sample(2,nrow(carmarket_encoded),replace = TRUE,prob = c(0.7,0.3))
rptrain<-carmarket[ind==1, ]
rptest<-carmarket[ind==2, ]
fmla1<-Price~Gear+Horse_Power+Color+Kilometers+Model_Year+Fuel_Type+Body_Type+Seller_Status
We established our model using the rpart library.
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
We calculated the r square value of the model we built. We found the value as 0.7391716. We have observed that our model does not give a good result as a random forest.
Rpart_r.squared <- 1 - (sum((rptest$prediction-rptest$Price)^2)/sum((rptest$Price-mean(rptest$Price))^2))
Rpart_r.squared # 0.7391716
## [1] 0.7391716
#MSE = (1/n) * Σ(actual – prediction)2
mse.rpart<-(1/nrow(rptest))*sum((rptest$Price-rptest$prediction)^2)
mse.rpart #8107658376
## [1] 8107658376
We visualized the our rpart model
ggplot(rptest,aes(prediction,Price))+
geom_point()+
geom_abline(color="blue")+
coord_cartesian(xlim = c(0, 350000), ylim = c(0, 350000))+
ggtitle("Prediction Price vs Actual Price")
The table which display actual price versus prediction is below
## Price prediction
## 1: 2720000 2291666.67
## 2: 2375000 2375000.00
## 3: 2350000 2291666.67
## 4: 2310000 2375000.00
## 5: 2250000 2294500.00
## ---
## 2661: 10000 30347.11
## 2662: 10000 30347.11
## 2663: 10000 30347.11
## 2664: 10000 30347.11
## 2665: 9900 30347.11
As a third method, we wanted to establish a Linear model. For this, we quickly made the necessary data preparations and set up our model.
model <- lm(Price ~ Brand + as.factor(Model_Year) + Fuel_Type + Gear + CCM + Horse_Power + Color + Body_Type + Seller + Seller_Status, data = carmarket)
anova(model)
## Analysis of Variance Table
##
## Response: Price
## Df Sum Sq Mean Sq F value Pr(>F)
## Brand 31 1.1762e+14 3.7943e+12 324.7658 < 2.2e-16 ***
## as.factor(Model_Year) 45 6.0360e+13 1.3413e+12 114.8091 < 2.2e-16 ***
## Fuel_Type 3 4.4216e+12 1.4739e+12 126.1520 < 2.2e-16 ***
## Gear 2 9.8737e+11 4.9368e+11 42.2557 < 2.2e-16 ***
## CCM 11 1.3373e+13 1.2157e+12 104.0559 < 2.2e-16 ***
## Horse_Power 12 3.0655e+12 2.5546e+11 21.8653 < 2.2e-16 ***
## Color 27 2.7239e+12 1.0089e+11 8.6351 < 2.2e-16 ***
## Body_Type 12 6.9846e+11 5.8205e+10 4.9820 2.655e-08 ***
## Seller 2 5.1788e+11 2.5894e+11 22.1636 2.506e-10 ***
## Seller_Status 3 6.9278e+11 2.3093e+11 19.7655 9.158e-13 ***
## Residuals 8685 1.0147e+14 1.1683e+10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
carmarket <- carmarket[carmarket$Model_Year != 1961 & carmarket$Model_Year != 1977 & carmarket$Model_Year != 1978,]
carmarket <- carmarket[carmarket$CCM != "Don't Know" & carmarket$CCM != "4501-5000 cc" & carmarket$CCM != "5501-6000 cc" & carmarket$CCM != "6001 cc and above",]
carmarket <- carmarket[carmarket$Color != "Amaranth" & carmarket$Color != "Magenta" & carmarket$Color != "Pink",]
pred_length <- nrow(carmarket)
fit_lm_error <- c()
fit_lm_sq_error <- c()
We use for loop in order to find the best linear model.
for(i in 1:pred_length){
fit_lm <- lm(Price ~ Brand + as.factor(Model_Year) + Fuel_Type + Gear + CCM + Color + Body_Type + Seller + Seller_Status + Kilometers, data = carmarket[-i,])
fit_lm_pred <- (predict(fit_lm, carmarket[i,]))^2
fit_lm_error[i] <- carmarket$Price[i] - fit_lm_pred
fit_lm_sq_error[i] = (carmarket$Price[i] - fit_lm_pred)^2
}
As can be seen from the histogram, the error of our model is very low.
## [1] 136667798459