At first, we need to load the required packages.

pti <- c("data.table", "tidyverse", "lubridate", "knitr", "tinytex", "rpart", "randomForest", "rpart.plot", "caret")
pti <- pti[!(pti %in% installed.packages())]
if(length(pti)>0){
    install.packages(pti)
}

library(tidyverse)
library(lubridate)
library(knitr)
library(tinytex)
library(data.table)
library(randomForest)
library(rpart)
library(rpart.plot)

Reading the Data

We download the preprocessed data. You can see the preprocessing steps from this link.

#data uploading
carmarket = readRDS(gzcon(url("https://github.com/pjournal/boun01g-data-mine-r-s/blob/gh-pages/Project/turkey_car_market_EDA?raw=true")))

This is the structure of the data.

str(carmarket)
## Classes 'data.table' and 'data.frame':   8834 obs. of  17 variables:
##  $ Date              : Date, format: "2020-04-18" "2020-04-18" ...
##  $ Year              : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ Month             : Factor w/ 5 levels "3","4","5","6",..: 2 2 2 3 2 3 3 2 2 3 ...
##  $ Brand             : Factor w/ 36 levels "Acura","Alfa Romeo",..: 24 3 3 21 21 30 30 24 24 30 ...
##  $ Vehicle_Type_Group: Factor w/ 387 levels "1 Series","100 Series",..: 200 319 319 281 281 285 285 324 200 285 ...
##  $ Vehicle_Type      : Factor w/ 1716 levels "-","0.9","0.9 Icon",..: 1369 1671 1671 1660 1660 1660 1662 1681 1350 1662 ...
##  $ Model_Year        : num  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ Fuel_Type         : Factor w/ 4 levels "Diesel","Electricity",..: 1 2 2 1 1 4 4 1 1 4 ...
##  $ Gear              : Factor w/ 3 levels "Automatic","Manual",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ CCM               : Factor w/ 14 levels "Don't Know","1300 cc and below",..: 7 2 2 7 7 7 7 7 7 7 ...
##  $ Horse_Power       : Factor w/ 15 levels "Don't Know","100 HP and below",..: 4 9 9 6 6 8 8 4 4 8 ...
##  $ Color             : Factor w/ 28 levels "Amaranth","Beige",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Body_Type         : Factor w/ 13 levels "Crossover","Glass Van",..: 7 13 13 7 7 4 4 11 7 4 ...
##  $ Seller            : Factor w/ 3 levels "Authority","Galery",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Seller_Status     : Factor w/ 4 levels "0 km","2nd Hand",..: 1 2 1 1 1 1 1 1 1 1 ...
##  $ Kilometers        : int  0 250 0 0 0 0 0 0 0 0 ...
##  $ Price             : int  2720000 2550000 2500000 2425000 2390000 2375000 2375000 2350000 2335000 2310000 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Data Preparation

While setting up the model, we encountered an error when the level number of factor type variables exceeded 32. So we overcame this problem by adding some variables to dummy variables.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
dmy <- dummyVars(" ~ .", data = carmarket[, .(Brand, Fuel_Type, Gear, Seller_Status)])
carmarket_encoded <- data.frame(predict(dmy, newdata = carmarket[, .(Brand, Fuel_Type, Gear, Seller_Status)]))
carmarket_encoded$Kilometers = carmarket$Kilometers
carmarket_encoded$Model_Year = carmarket$Model_Year
carmarket_encoded$CCM = carmarket$CCM
carmarket_encoded$Horse_Power = carmarket$Horse_Power
carmarket_encoded$Price = carmarket$Price
carmarket_encoded$Seller = carmarket$Seller

We divided our data into train and test. We will set up our model using the train part, then we will try to estimate the car prices using the model we have set up in test data. Consequently, we will push our model by looking at various metrics.

set.seed(492)
ind<-sample(2,nrow(carmarket_encoded),replace = TRUE,prob = c(0.7,0.3))
carmarket.test<-carmarket[ind==2,]
carmarket.train<-carmarket[ind==1, ] 

Determining the Formula

Manually, we evaluated our formulas, then we specified the formula to use in the algorithm below.

fmla<-Price~Gear+Horse_Power+Color+Kilometers+Model_Year+Fuel_Type+Body_Type #Seller Status ekleyince düştü
fmla1<-Price~Gear+Color+Kilometers+Model_Year+Fuel_Type+Body_Type+Seller_Status

Using the Random Forest Algorithm

Random Forest algorithm builds the multiple decision trees which are known as forest and glue them together to urge a more accurate and stable prediction.So we used randomForest()

  carmarket.train.rf<-randomForest(fmla,data=carmarket.train,ntree=116)
  carmarket.test$rfpredict<-predict(carmarket.train.rf,carmarket.test)
  print(error<-sqrt(mean((carmarket.test$rfpredict-carmarket.test$Price)^2)))
## [1] 72356.63

The r-square value is a metric that shows how well our prediction and actual observations fit each other. Our r-squared value is 0.8313471. This value shows that our model explains real data well.

R_squared <- 1 - (sum((carmarket.test$rfpredict-carmarket.test$Price)^2)/sum((carmarket.test$Price-mean(carmarket.test$Price))^2))
R_squared # 0.8313471
## [1] 0.8315713

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.Our rmse value is 72404.77, which clearly shows that the prediction we made is acceptable.

residuals<-carmarket.test$Price-carmarket.test$rfpredict
relative_error <- residuals/carmarket.test$Price

print(rmse<- sqrt(mean(residuals^2)))  #72404.77
## [1] 72356.63
print(rmse.rel<-sqrt(mean(relative_error^2))) # 0.4448891
## [1] 0.4610714
#MSE = (1/n) * Σ(actual – prediction)2

mse.rf<-(1/nrow(carmarket))*sum(residuals^2)
mse.rf # 1581517991
## [1] 1579415675

We used varImpPlot() function to see how effective it is in which variable model.

varImpPlot(carmarket.train.rf)

We visualized the regression we set up

ggplot(carmarket.test,aes(rfpredict,Price))+
  geom_point()+
  geom_abline(color="blue")+
  coord_cartesian(xlim = c(0, 350000), ylim = c(0, 350000))+
  ggtitle("Prediction Price vs Actual Price")

Below is a basic table of the prediction and actual values we have made.

results<- carmarket.test[,17:18]
head(results,n=10)
##       Price rfpredict
##  1: 2720000   1663762
##  2: 2375000   1799627
##  3: 2350000   1113409
##  4: 2310000   1799627
##  5: 2250000   2089320
##  6: 1980000   1882624
##  7: 1945000   1565244
##  8: 1795000   1458017
##  9: 1715000   1192977
## 10: 1690000   1734953

Using the Rpart

Although the above results are satisfactory, we wanted to check our model above using another algorithm. That’s why we used the rpart package and reached the following results.

set.seed(492)
ind<-sample(2,nrow(carmarket_encoded),replace = TRUE,prob = c(0.7,0.3))

rptrain<-carmarket[ind==1, ] 
rptest<-carmarket[ind==2, ]

fmla1<-Price~Gear+Horse_Power+Color+Kilometers+Model_Year+Fuel_Type+Body_Type+Seller_Status

We established our model using the rpart library.

rp = rpart(fmla1, data=rptrain, control=rpart.control(minsplit=3,minbucket=1,cp=0.001))
prp(rp)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting

rptest$prediction<-predict(rp,newdata=rptest)

We calculated the r square value of the model we built. We found the value as 0.7391716. We have observed that our model does not give a good result as a random forest.

Rpart_r.squared <- 1 - (sum((rptest$prediction-rptest$Price)^2)/sum((rptest$Price-mean(rptest$Price))^2))
Rpart_r.squared # 0.7391716
## [1] 0.7391716
#MSE = (1/n) * Σ(actual – prediction)2

mse.rpart<-(1/nrow(rptest))*sum((rptest$Price-rptest$prediction)^2)
mse.rpart #8107658376
## [1] 8107658376

We visualized the our rpart model

ggplot(rptest,aes(prediction,Price))+
  geom_point()+
  geom_abline(color="blue")+
  coord_cartesian(xlim = c(0, 350000), ylim = c(0, 350000))+
  ggtitle("Prediction Price vs Actual Price")

The table which display actual price versus prediction is below

PricevsPrediction_table<-rptest[,17:18]
PricevsPrediction_table
##         Price prediction
##    1: 2720000 2291666.67
##    2: 2375000 2375000.00
##    3: 2350000 2291666.67
##    4: 2310000 2375000.00
##    5: 2250000 2294500.00
##   ---                   
## 2661:   10000   30347.11
## 2662:   10000   30347.11
## 2663:   10000   30347.11
## 2664:   10000   30347.11
## 2665:    9900   30347.11

Linear Modelling

As a third method, we wanted to establish a Linear model. For this, we quickly made the necessary data preparations and set up our model.

model <- lm(Price ~ Brand + as.factor(Model_Year) + Fuel_Type + Gear + CCM + Horse_Power + Color + Body_Type + Seller + Seller_Status, data = carmarket)
anova(model)
## Analysis of Variance Table
## 
## Response: Price
##                         Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Brand                   31 1.1762e+14 3.7943e+12 324.7658 < 2.2e-16 ***
## as.factor(Model_Year)   45 6.0360e+13 1.3413e+12 114.8091 < 2.2e-16 ***
## Fuel_Type                3 4.4216e+12 1.4739e+12 126.1520 < 2.2e-16 ***
## Gear                     2 9.8737e+11 4.9368e+11  42.2557 < 2.2e-16 ***
## CCM                     11 1.3373e+13 1.2157e+12 104.0559 < 2.2e-16 ***
## Horse_Power             12 3.0655e+12 2.5546e+11  21.8653 < 2.2e-16 ***
## Color                   27 2.7239e+12 1.0089e+11   8.6351 < 2.2e-16 ***
## Body_Type               12 6.9846e+11 5.8205e+10   4.9820 2.655e-08 ***
## Seller                   2 5.1788e+11 2.5894e+11  22.1636 2.506e-10 ***
## Seller_Status            3 6.9278e+11 2.3093e+11  19.7655 9.158e-13 ***
## Residuals             8685 1.0147e+14 1.1683e+10                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
carmarket <- carmarket[carmarket$Model_Year != 1961 & carmarket$Model_Year != 1977 & carmarket$Model_Year != 1978,]
carmarket <- carmarket[carmarket$CCM != "Don't Know" & carmarket$CCM != "4501-5000 cc" & carmarket$CCM != "5501-6000 cc" & carmarket$CCM != "6001 cc and above",]
carmarket <- carmarket[carmarket$Color != "Amaranth" & carmarket$Color != "Magenta" & carmarket$Color != "Pink",]
pred_length <- nrow(carmarket)
fit_lm_error <- c()
fit_lm_sq_error <- c()

We use for loop in order to find the best linear model.

for(i in 1:pred_length){
  fit_lm <- lm(Price ~ Brand + as.factor(Model_Year) + Fuel_Type + Gear + CCM + Color + Body_Type + Seller + Seller_Status + Kilometers, data = carmarket[-i,])
  fit_lm_pred <- (predict(fit_lm, carmarket[i,]))^2
  fit_lm_error[i] <- carmarket$Price[i] - fit_lm_pred
  fit_lm_sq_error[i] = (carmarket$Price[i] - fit_lm_pred)^2
}

As can be seen from the histogram, the error of our model is very low.

hist(fit_lm_error, breaks = 20)

rmse_fit_lm <- sqrt(mean(fit_lm_sq_error))
rmse_fit_lm #204853658239
## [1] 136667798459