TURKEY CAR MARKET 2020

1. Data

Our data is gathered from the Kaggle. Also you can find the csv in our Github Repository. The data are about the online car advertisements in Turkey market. There are 15 columns in the dataset. These are

  1. Date (the date that the advertisement is published)
  2. Brand (the brand of the car)
  3. Vehicle_Type_Group (the vehicle type group of the car like Ranger, Sorento etc.)
  4. Vehicle_Type (the vehicle type of the car like 1.6, Broadway etc.)
  5. Model_Year (the year the car is produced)
  6. Fuel_Type (fuel type that the car has like Diesel, Electricity etc.)
  7. Gear (gear type the car has like Automatic, Manual etc.)
  8. CCM (the indicator of the motor capacity)
  9. Horse_Power (the indicator of the motor power)
  10. Color (the color of the car)
  11. Body_Type (the type of the body of the car like Sedan, Minivan etc.)
  12. Seller (the owner of the car)
  13. Seller_Status (the status of the usage of the car like 0 km, 2nd Hand etc.)
  14. Kilometers (the kilometer that the car traveled)
  15. Price (the price of the car in that advertisement)

This data is prepared from the Turkish sites, so all values sand columns are in Turkish.

1.1 Packages for Preprocessing Steps

Before the preprocessing, we need to upload required packages which are given in the below:

  1. tidyverse
  2. shiny
  3. ggplot2
  4. lubridate
  5. knitr
  6. tinytex
#Required packages
pti <- c("data.table", "tidyverse", "lubridate", "knitr", "tinytex", "scales")
pti <- pti[!(pti %in% installed.packages())]
if(length(pti)>0){
    install.packages(pti)
}

library(tidyverse)
library(lubridate)
library(knitr)
library(tinytex)
library(data.table)
library(scales)
#library(kableExtra)

1.2 Preprocessing of the Data

At first, we need to upload the dataset. To do so, you can run the following command:

#data uploading
carmarket = fread('https://raw.githubusercontent.com/pjournal/boun01g-data-mine-r-s/gh-pages/data.table/turkey_car_market.csv')

Our data set which we obtain from the kaggle is created by using Turkish language. For this reason, before the exploratory analysis, we transform our name of the columns to English. For this, we use the following code.

#changing column names by using English meaning
col_names = c('Date', 'Brand', 'Vehicle_Type_Group', 'Vehicle_Type', 'Model_Year', 'Fuel_Type', 'Gear', 'CCM', 'Horse_Power', 'Color', 'Body_Type', 'Seller', 'Seller_Status', 'Kilometers', 'Price')
names(carmarket) = col_names

For the values, we use the following code.

carmarket = carmarket %>%
                      #Brand type
                      mutate(Brand = as.character(Brand),
                             Brand = if_else(Brand == 'TofaÅŸ', 'Tofas', Brand),
                             
                             Brand = as.factor(Brand))%>%
                      #Seller
                      mutate(Seller = as.character(Seller),
                             Seller = if_else(Seller == 'Galeriden', 'Galery', if_else(Seller == 'Sahibinden', 'Owner', 'Authority')),
                             Seller = as.factor(Seller)) %>%  #unique(carmarket$Seller) #Turkish seller names are changed by English version
  
                      #fuel type
                      mutate(Fuel_Type = as.character(Fuel_Type),
                             Fuel_Type = if_else(Fuel_Type == 'Dizel', 'Diesel', 
                                                 if_else(Fuel_Type == 'Benzin/LPG', 'Gasoline', 
                                                         if_else(Fuel_Type == 'Hibrit', 'Hybrid', 'Electricity'))),
                             Fuel_Type = as.factor(Fuel_Type)) %>%
  
                      #Gear
                      mutate(Gear = as.character(Gear),
                             Gear = if_else(Gear == 'Otomatik Vites', 'Automatic', 
                                              if_else(Gear == 'Yarı Otomatik Vites', 'Semi Automatic', 'Manual')),
                             Gear = as.factor(Gear))%>%
  
                      #Seller Status
                      mutate(Seller_Status = as.character(Seller_Status),
                             Seller_Status = if_else(Seller_Status == '2. El', '2nd Hand', 
                                                 if_else(Seller_Status == 'Klasik', 'Classic', 
                                                             if_else(Seller_Status == "0 km", "0 km", "Damaged"))),
                             Seller_Status = as.factor(Seller_Status))%>%
                      
                      #Vehicle Type Group
                      mutate(Vehicle_Type_Group = as.character(Vehicle_Type_Group),
                             Vehicle_Type_Group = case_when(Vehicle_Type_Group == 'DiÄŸer' ~ 'Other',
                                        Vehicle_Type_Group == 'A Serisi' ~ 'A Series',
                                        Vehicle_Type_Group == '100 Serisi' ~ '100 Series',
                                        Vehicle_Type_Group == '5 Serisi' ~ '5 Series',
                                        Vehicle_Type_Group == 'S Serisi' ~ 'S Series',
                                        Vehicle_Type_Group == '7 Serisi' ~ '7 Series',
                                        Vehicle_Type_Group == '3 Serisi' ~ '3 Series',
                                        Vehicle_Type_Group == '1 Serisi' ~ '1 Series',
                                        Vehicle_Type_Group == '4 Serisi' ~ '4 Series',
                                        Vehicle_Type_Group == '2 Serisi' ~ '2 Series',
                                        Vehicle_Type_Group == '6 Serisi' ~ '6 Series',
                                        Vehicle_Type_Group == 'Z Serisi' ~ 'Z Series',
                                        Vehicle_Type_Group == 'M Serisi' ~ 'M Series',
                                        Vehicle_Type_Group == '500 Ailesi' ~ '500 Family',
                                        Vehicle_Type_Group == 'E Serisi' ~ 'E Series',
                                        Vehicle_Type_Group == 'B Serisi' ~ 'B Series',
                                        Vehicle_Type_Group == 'G Serisi' ~ 'G Series',
                                        Vehicle_Type_Group == 'Å\u009eahin' ~ 'Sahin',
                                        Vehicle_Type_Group == 'Serçe' ~ 'Serce',
                                        Vehicle_Type_Group == 'DoÄŸan' ~ 'Dogan',
                                        TRUE ~ Vehicle_Type_Group),
                              Vehicle_Type_Group = as.factor(Vehicle_Type_Group)) %>%
  
                        #Vehicle Type
                        mutate(Vehicle_Type = as.character(Vehicle_Type),
                               Vehicle_Type = case_when(Vehicle_Type == 'DiÄŸer' ~ 'Other',
                                  Vehicle_Type == '1.6 BlueHDI 130HP INTENSIVE EAT6 7 KiÅŸilik' ~ '1.6 BlueHDI 130HP INTENSIVE EAT6 7 Seats',
                                  Vehicle_Type == '1.3 CDTI Enjoy 111.Yıl' ~ '1.3 CDTI Enjoy 111',
                                  Vehicle_Type == '1.3 GLS Otm.' ~ '1.3 GLS Auto',
                                  Vehicle_Type == '1.4 120.Yıl' ~ '1.4 120',
                                  Vehicle_Type == '1.4 CVVT Select Otm.' ~ '1.4 CVVT Select Auto',
                                  Vehicle_Type == '1.4 CVVT Style Otm.' ~ '1.4 CVVT Style Auto',
                                  Vehicle_Type == '0.9\n' ~ '0.9',
                                  Vehicle_Type == '1.0 TFSI Sportback\n' ~ '1.0 TFSI Sportback',
                                  Vehicle_Type == '1.0\n' ~ '1.0',
                                  Vehicle_Type == '1.2 D-CVVT\n' ~ '1.2 D-CVVT',
                                  Vehicle_Type == '1.2 DOHC\n' ~ '1.2 DOHC',
                                  Vehicle_Type == '1.2 TSI Sport Tourer\n' ~ '1.2 TSI Sport Tourer',
                                  Vehicle_Type == '1.2 TSI\n' ~ '1.2 TSI',
                                  Vehicle_Type == '1.2 VTi\n' ~ '1.2 VTi',
                                  Vehicle_Type == '1.2\n' ~ '1.2',
                                  Vehicle_Type == '1.3 CDTI \n' ~ '1.3 CDTI',
                                  Vehicle_Type == '1.3 Mjet\n' ~ '1.3 Mjet',
                                  Vehicle_Type == '1.3 Multijet Active Plus\n' ~ '1.3 Multijet Active Plus',
                                  Vehicle_Type == '1.3 Multijet Emotion Plus\n' ~ '1.3 Multijet Emotion Plus',
                                  Vehicle_Type == '1.3 Multijet\n' ~ '1.3 Multijet',
                                  Vehicle_Type == '1.3\n' ~ '1.3',
                                  Vehicle_Type == '1.4 CRDi' ~ '1.4 CRDi',
                                  Vehicle_Type == '1.4 CVVT\n' ~ '1.4 CVVT',
                                  Vehicle_Type == '1.4 Fire\n' ~ '1.4 Fire',
                                  Vehicle_Type == '1.4 HDi X\n' ~ '1.4 HDi X',
                                  Vehicle_Type == '1.4 HDi\n' ~ '1.4 HDI',
                                  Vehicle_Type == '1.4 ie SX\n' ~ '1.4 ie SX',
                                  Vehicle_Type == '1.4 PopStar\n' ~ '1.4 PopStar',
                                  Vehicle_Type == '1.4 T\n' ~ '1.4 T',
                                  Vehicle_Type == '1.4 TDCi\n' ~ '1.4 TDCi',
                                  Vehicle_Type == '1.4 TDI\n' ~ '1.4 TDI',
                                  Vehicle_Type == '1.4 TSI Sport Coupe Cupra\n' ~ '1.4 TSI Sport Coupe Cupra',
                                  Vehicle_Type == '1.4 Twinport\n' ~ '1.4 Twinport',
                                  Vehicle_Type == '1.4\n' ~ '1.4',
                                  Vehicle_Type == '1.5 CRDi \n' ~ '1.5 CRDi',
                                  Vehicle_Type == '1.5 DCI Ambiance' ~ '1.5 Dci Ambiance',
                                  Vehicle_Type == '1.5 dCi Ambiance\n' ~ '1.5 dCi Ambiance',
                                  Vehicle_Type == '1.5 dCi Authentique' ~ '1.5 DCi Authentique',
                                  Vehicle_Type == '1.5 DCi Extreme' ~ '1.5 dCi Extreme',
                                  Vehicle_Type == '1.5 dCi Grandtour Authentique\n' ~ '1.5 dCi Grandtour Authentique Edition',
                                  Vehicle_Type == '1.5 DCi Expression' ~ '1.5 dCi Expression',
                                  Vehicle_Type == '1.5 dCi Grandtour Executive\n' ~ '1.5 dCi Grandtour Executive',
                                  Vehicle_Type == '1.5 dCi GT-Line\n' ~ '1.5 dCi GT Line',
                                  Vehicle_Type == '1.5 DCi Privilege' ~ '1.5 dCi Privilege',
                                  Vehicle_Type == '1.5 DCI Tekna' ~ '1.5 dCi Tekna',
                                  Vehicle_Type == '1.5 DCI Visia' ~ '1.5 dCi Visia',
                                  Vehicle_Type == '1.5 dCi\n' ~ '1.5 dCi',
                                  Vehicle_Type == '1.5 TDCi Titanium' ~ '1.5 TDCI Titanium',
                                  Vehicle_Type == '1.5\n' ~ '1.5',
                                  Vehicle_Type == '1.6 BlueHDI 130HP INTENSIVE EAT6 7 KiÅŸilik' ~ '1.6 BlueHDI 130HP INTENSIVE EAT6 7 Seats',
                                  Vehicle_Type == '1.6 BlueHDI Active' ~ '1.6 BlueHDi Active',
                                  Vehicle_Type == '1.6 CDTI  Grand Sport 120.Yıl' ~ '1.6 CDTI  Grand Sport 120',
                                  Vehicle_Type == '1.6 CDTI  Cosmo' ~ '1.6 CDTI Cosmo',
                                  Vehicle_Type == '1.6 CDTI  Business' ~ '1.6 CDTIBusiness',
                                  Vehicle_Type == '1.6 CDTI  Design' ~ '1.6 CDTI Design',
                                  Vehicle_Type == '1.6 CDTI  Elite' ~ '1.6 CDTI Elite',
                                  Vehicle_Type == '1.6 CDTI  Enjoy' ~ '1.6 CDTI Enjoy',
                                  Vehicle_Type == '1.6 CDTI  Sport' ~ '1.6 CDTI Sport',
                                  Vehicle_Type == '1.6 CDTI\n' ~ '1.6 CDTI',
                                  Vehicle_Type == '1.6 CR TDI Elegance\n' ~ '1.6 CR TDI Elegance',
                                  Vehicle_Type == '1.6 CRDI Concept Plus' ~ '1.6 CRDi Concept Plus',
                                  Vehicle_Type == '1.6 CRDi Elite\n' ~ '1.6 CRDi Elite',
                                  Vehicle_Type == '1.6 D Türkiye Paketi' ~ '1.6 D Turkey Package',
                                  Vehicle_Type == '1.6 dCi Tekna Sky Pack\n' ~ '1.6 dCi Tekna Sky Pack',
                                  Vehicle_Type == '1.6 e-HDi\n' ~ '1.6 e-HDi',
                                  Vehicle_Type == '1.6 Enjoy 111. Yıl' ~ '1.6 Enjoy 111',
                                  Vehicle_Type == '1.6 GDI\n' ~ '1.6 GDI',
                                  Vehicle_Type == '1.6 HDi \n' ~ '1.6 HDi',
                                  Vehicle_Type == '1.6 HDi\n' ~ '1.6 HDi',
                                  Vehicle_Type == '1.6 HDi Confort' ~ '1.6 HDi Comfort',
                                  Vehicle_Type == '1.6 HDI SX PK' ~ '1.6 HDi SX PK',
                                  Vehicle_Type == '1.6 Invite Otm' ~ '1.6 Invite Auto',
                                  Vehicle_Type == '1.6 LT\n' ~ '1.6 LT',
                                  Vehicle_Type == '1.6 Mjet\n' ~ '1.6 Mjet',
                                  Vehicle_Type == '1.6 Multijet Emotion Plus\n' ~ '1.6 Multijet Emotion Plus',
                                  Vehicle_Type == '1.6 Multijet\n' ~ '1.6 Multijet',
                                  Vehicle_Type == '1.6 Privelege' ~ '1.6 Privilege',
                                  Vehicle_Type == '1.6 Türkiye Paketi' ~ '1.6 Turkey Package',
                                  Vehicle_Type == '1.6 TDCÄ° Ghia' ~ '1.6 TDCi Ghia',
                                  Vehicle_Type == '1.6 TDCi\n' ~ '1.6 TDCi',
                                  Vehicle_Type == '1.6 TDI \n' ~ '1.6 TDI',
                                  Vehicle_Type == '1.6 TDI Ambiente' ~ '1.6 TDi Ambiente',
                                  Vehicle_Type == '1.6 TDI Ambition Sportback\n' ~ '1.6 TDI Ambition Sportback',
                                  Vehicle_Type == '1.6 TDI Attraction Sportback\n' ~ '1.6 TDI Attraction Sportback',
                                  Vehicle_Type == '1.6 TDI Elegance' ~ '1.6 TDi Elegance',
                                  Vehicle_Type == '1.6 TDI GreenLine Elegance\n' ~ '1.6 TDI GreenLine Elegance',
                                  Vehicle_Type == '1.6 TDI Sportback\n' ~ '1.6 TDI Sportback',
                                  Vehicle_Type == '1.6 TDI\n' ~ '1.6 TDI',
                                  Vehicle_Type == '1.6 Ti-VCT\n' ~ '1.6 Ti-VCT',
                                  Vehicle_Type == '1.6 VTES ES' ~ '1.6 VTEC ES',
                                  Vehicle_Type == '1.6 XHT Cosmo\n' ~ '1.6 XHT Cosmo',
                                  Vehicle_Type == '1.6\n' ~ '1.6',
                                  Vehicle_Type == '1.7\n' ~ '1.7',
                                  Vehicle_Type == '1.8\n' ~ '1.8',
                                  Vehicle_Type == 'Serçe' ~ 'Serce',
                                  Vehicle_Type == 'Kartal 5 Vites' ~ 'Kartal 5 Transmissions',
                                  Vehicle_Type == 'EVO 1.4 Fire Active' ~ 'EVO 1.4 Active Fire',
                                  Vehicle_Type == 'Country 4x2\n' ~ 'Country 4x2',
                                  Vehicle_Type == 'Å\u009eahin 5 vites' ~ 'Sahin 5 Transmissions',
                                  Vehicle_Type == '4x2\n' ~ '4x2',
                                  Vehicle_Type == '3.0 SDV6\n' ~ '3.0 SDV6',
                                  Vehicle_Type == '200 BlueTEC AMG\n' ~ '200 BlueTEC AMG',
                                  Vehicle_Type == '200 BlueEfficiency Avantgarde\n' ~ '200 BlueEfficiency Avantgarde',
                                  Vehicle_Type == '2.5 Çift Kabin 4x4' ~ '2.5 Double Cabinet 4x4',
                                  Vehicle_Type == '2.0i sDrive\n' ~ '2.0i sDrive',
                                  Vehicle_Type == '2.0\n' ~ '2.0',
                                  Vehicle_Type == '2.0 TDI\n' ~ '2.0 TDI',
                                  Vehicle_Type == '2.0 TDCi\n' ~ '2.0 TDCi',
                                  Vehicle_Type == '2.0 SRTi \n' ~ '2.0 SRTi',
                                  Vehicle_Type == '2.0 Quattro\n' ~ '2.0 Quattro',
                                  Vehicle_Type == '2.0 Ghia Otm.' ~ '2.0 Ghia Auto',
                                  Vehicle_Type == '2.0 CRDi GLS\n' ~ '2.0 CRDi GLS',
                                  Vehicle_Type == '2.0 24V LT Otm.' ~ '2.0 24V LT Auto',
                                  Vehicle_Type == '180 CDI AMG\n' ~ '180 CDI AMG',
                                  Vehicle_Type == '180 BlueEfficiency AMG Sport\n' ~ '180 BlueEfficiency AMG Sport',
                                  Vehicle_Type == '180 BlueEfficiency Fascination\n' ~ '180 BlueEfficiency Fascination',
                                  Vehicle_Type == '150 Otm.' ~ '150 Auto',
                                  Vehicle_Type == '1.9 TDI\n' ~ '1.9 TDI',
                                  Vehicle_Type == '1.8\n' ~ '1.8',
                                  Vehicle_Type == '1.7\n' ~ '1.7',
                                  Vehicle_Type == '1.6\n' ~ '1.6',
                                  TRUE ~ Vehicle_Type),
                            Vehicle_Type = as.factor(Vehicle_Type)) %>%
  
                  #CCM
                  mutate(CCM = as.character(CCM),
                         CCM = case_when(CCM == '1300 cc ve altı' ~ '1300 cc and below',
                                         CCM == '6001 cc ve üzeri' ~ '6001 cc and above',
                                         CCM == 'Bilmiyorum' | CCM == '-' ~ 'Don\'t Know',
                                         TRUE ~ CCM),
                         CCM = factor(CCM, levels = c('Don\'t Know', '1300 cc and below', '1301-1600 cc', '1601-1800 cc', '1801-2000 cc', '2001-2500 cc', '2501-3000 cc', '3001-3500 cc', '3501-4000 cc', '4001-4500 cc', '4501-5000 cc', '5001-5500 cc', '5501-6000 cc', '6001 cc and above'))) %>%
  
                  #Horse Power
                  mutate(Horse_Power = as.character(Horse_Power),
                         Horse_Power = case_when(Horse_Power == '100 BG ve altı' ~ '100 HP and below',
                                                 Horse_Power == '50 BG ve altı' ~ '100 HP and below',
                                                 Horse_Power == '51-75 BG' ~ '100 HP and below',
                                                 Horse_Power == '601 BG ve üzeri' ~ '601 HP and above',
                                                 Horse_Power == '101-125 BG' ~ '101-125 HP',
                                                 Horse_Power == '126-150 BG' ~ '126-150 HP',
                                                 Horse_Power == '151-175 BG' ~ '151-175 HP',
                                                 Horse_Power == '176-200 BG' ~ '176-200 HP',
                                                 Horse_Power == '201-225 BG' ~ '201-225 HP',
                                                 Horse_Power == '226-250 BG' ~ '226-250 HP',
                                                 Horse_Power == '251-275 BG' ~ '251-275 HP',
                                                 Horse_Power == '276-300 BG' ~ '276-300 HP',
                                                 Horse_Power == '301-325 BG' ~ '301-325 HP',
                                                 Horse_Power == '326-350 BG' ~ '326-350 HP',
                                                 Horse_Power == '376-400 BG' ~ '376-400 HP',
                                                 Horse_Power == '451-475 BG' ~ '451-475 HP',
                                                 Horse_Power == '76-100 BG' ~ '100 HP and below',
                                                 Horse_Power == 'Bilmiyorum' | Horse_Power == '-' ~ 'Don\'t Know',
                                                 TRUE ~ Horse_Power),
                         Horse_Power = factor(Horse_Power, levels = c('Don\'t Know', '100 HP and below', '101-125 HP', '126-150 HP', '151-175 HP', '176-200 HP', '201-225 HP', '226-250 HP', '251-275 HP', '276-300 HP', '301-325 HP', '326-350 HP', '376-400 HP', '451-475 HP', '601 HP and above'))) %>%
  
                  #Color
                  mutate(Color = as.character(Color),
                         Color = case_when(Color == 'Å\u009eampanya' ~ 'Champagne',
                                           Color == 'Altın' ~ 'Gold',
                                           Color == 'Amarant' ~ 'Amaranth',
                                           Color == 'Bal Rengi' ~ 'Honey',
                                           Color == 'Bej' ~ 'Beige',
                                           Color == 'Beyaz' ~ 'White',
                                           Color == 'Bordo' ~ 'Burgundy',
                                           Color == 'DiÄŸer' ~ 'Other',
                                           Color == 'Füme' ~ 'Smoked',
                                           Color == 'Gümüş' ~ 'Silver',
                                           Color == 'Gümüş Gri' ~ 'Silver Gray',
                                           Color == 'Gri' ~ 'Gray',
                                           Color == 'Ihlamur' ~ 'Linden',
                                           Color == 'Kırmızı' ~ 'Red',
                                           Color == 'Kahverengi' ~ 'Brown',
                                           Color == 'Krem' ~ 'Cream',
                                           Color == 'Kum Rengi' ~ 'Sand Color',
                                           Color == 'Lacivert' ~ 'Dark Blue',
                                           Color == 'Mavi' ~ 'Blue',
                                           Color == 'Mor' ~ 'Purple',
                                           Color == 'Pembe' ~ 'Pink',
                                           Color == 'Sarı' ~ 'Yellow',
                                           Color == 'Siyah' ~ 'Black',
                                           Color == 'Turkuaz' ~ 'Turquoise',
                                           Color == 'Turuncu' ~ 'Orange',
                                           Color == 'YeÅŸil' ~ 'Green',
                                           Color == 'Zeytin Gri' ~ 'Olive Gray',
                                           Color == 'Eflatun' ~ 'Magenta',
                                           TRUE ~ Color),
                         Color = as.factor(Color)) %>%
  
                  #Body Type
                  mutate(Body_Type = as.character(Body_Type),
                         Body_Type = case_when(Body_Type == 'Üstü Açık / Cabriolet' ~ 'Open Top / Cabriolet',
                                               Body_Type == 'Arazi Aracı' ~ 'Off-road Vehicle',
                                               Body_Type == 'Camlı Van' ~ 'Glass Van',
                                               Body_Type == 'DiÄŸer' ~ 'Other',
                                               Body_Type == 'Hatchback 3 Kapı' ~ 'Hatchback 3 Doors',
                                               Body_Type == 'Hatchback 5 Kapı' ~ 'Hatchback 5 Doors',
                                               Body_Type == 'Spor / Coupe' ~ 'Sport / Coupe',
                                               TRUE ~ Body_Type),
                         Body_Type = as.factor(Body_Type)) 

We can add the year and month of advertisements in Year and Month column. To be able to see all columns in order, we put the columns about the date to the beginning.

carmarket[, Date := dmy(Date)]

carmarket = carmarket[, Year:= year(Date)]
carmarket = carmarket[, Month:= month(Date)]
carmarket$Month = as.factor(carmarket$Month)

setcolorder(carmarket, c("Date", "Year", "Month", col_names[2:15]))

From now on, we have all columns ready for future analysis and models.

1.3 Data Information

#To see column variable clearly
carmarket %>% 
  glimpse()
## Rows: 9,044
## Columns: 17
## $ Date               <date> 2020-05-27, 2020-06-16, 2020-06-14, 2020-06-11,...
## $ Year               <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, ...
## $ Month              <fct> 5, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 4, 4, 4, ...
## $ Brand              <fct> Jaguar, Acura, Acura, Acura, Acura, Acura, Acura...
## $ Vehicle_Type_Group <fct> XF, CL, CL, CL, CL, CL, CL, CL, RSX, RSX, CL, CL...
## $ Vehicle_Type       <fct> 2.0 D Prestige Plus, -, 2.2, -, 2.2, 2.2, 2.2, -...
## $ Model_Year         <dbl> 2017, 2015, 1994, 2013, 2010, 1999, 2014, 2014, ...
## $ Fuel_Type          <fct> Diesel, Diesel, Gasoline, Diesel, Diesel, Diesel...
## $ Gear               <fct> Automatic, Semi Automatic, Manual, Manual, Autom...
## $ CCM                <fct> 1801-2000 cc, 1301-1600 cc, 1301-1600 cc, 1301-1...
## $ Horse_Power        <fct> 176-200 HP, 101-125 HP, 101-125 HP, 100 HP and b...
## $ Color              <fct> Dark Blue, Blue, Turquoise, Brown, White, Dark B...
## $ Body_Type          <fct> Hatchback 5 Doors, Sedan, Sedan, Sedan, Sedan, R...
## $ Seller             <fct> Galery, Owner, Owner, Owner, Owner, Galery, Gale...
## $ Seller_Status      <fct> 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand...
## $ Kilometers         <int> 26100, 127000, 175000, 325, 207000, 320000, 183,...
## $ Price              <int> 634500, 151500, 19750, 52000, 148750, 42500, 170...

As we added two more columns, we have 17 columns. We have some categorical columns and numerical columns.

1.4 Handling “Don’t Know” Values

As always, we need to check for null values in the data.

sum(any(is.na(carmarket)))
## [1] 0

When we look for null values, the data don’t have any null values. But, there are some rows that contains Don’t Know value in CCM and Horse_Power columns.

carmarket[CCM == 'Don\'t Know', .N]
## [1] 109
carmarket[Horse_Power == 'Don\'t Know', .N]
## [1] 5550

For 109 Don’t Know values in CCM column, we can make imputation process. CCM is the volume of the motor of the car. It means that cars that have the same Brand, Fuel_Type and Model_Year values would have the same CCM value (we assume that all cars in the advertisements are produced for Turkey and even if these three variables covers almost 80% of the information about CCM, we will fill this column with respect to these three values.)

keys = c("Brand", "Model_Year", "Fuel_Type")
carmarket.CCM.DK = data.table(carmarket[CCM == 'Don\'t Know'], key = keys)
carmarket.CCM.FULL = data.table(carmarket[CCM != 'Don\'t Know'], key = keys)
carmarket.CCM.DK %>% glimpse()
## Rows: 109
## Columns: 17
## $ Date               <date> 2020-04-20, 2020-03-21, 2020-06-15, 2020-03-25,...
## $ Year               <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, ...
## $ Month              <fct> 4, 3, 6, 3, 6, 3, 5, 3, 6, 4, 5, 5, 6, 4, 4, 6, ...
## $ Brand              <fct> Acura, Acura, Audi, Audi, Audi, Audi, Audi, Audi...
## $ Vehicle_Type_Group <fct> CL, TL, A4, A3, A4, A4, A4, A3, 3 Series, X5, Ch...
## $ Vehicle_Type       <fct> 2.2, Other, 2.0 TDI, 1.4 TFSI, 2.0 TDI, 2.0 TDI,...
## $ Model_Year         <dbl> 2012, 2016, 2009, 2010, 2013, 2013, 2014, 2014, ...
## $ Fuel_Type          <fct> Diesel, Diesel, Diesel, Electricity, Diesel, Die...
## $ Gear               <fct> Manual, Manual, Semi Automatic, Manual, Automati...
## $ CCM                <fct> Don't Know, Don't Know, Don't Know, Don't Know, ...
## $ Horse_Power        <fct> Don't Know, Don't Know, 126-150 HP, 126-150 HP, ...
## $ Color              <fct> Silver Gray, White, Smoked, White, Black, White,...
## $ Body_Type          <fct> Off-road Vehicle, Sedan, Sedan, Hatchback 3 Door...
## $ Seller             <fct> Galery, Owner, Owner, Owner, Galery, Owner, Gale...
## $ Seller_Status      <fct> 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand...
## $ Kilometers         <int> 270, 172000, 302000, 99000, 315000, 200000, 2830...
## $ Price              <int> 82500, 92000, 117500, 95000, 175000, 178000, 225...
carmarket.CCM.FULL %>% glimpse()
## Rows: 8,935
## Columns: 17
## $ Date               <date> 2020-06-14, 2020-06-04, 2020-05-24, 2020-04-27,...
## $ Year               <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, ...
## $ Month              <fct> 6, 6, 5, 4, 4, 4, 4, 3, 4, 4, 6, 5, 4, 4, 4, 4, ...
## $ Brand              <fct> Acura, Acura, Acura, Acura, Acura, Acura, Acura,...
## $ Vehicle_Type_Group <fct> CL, CL, CL, CL, CL, CL, CL, CL, CL, CL, CL, CL, ...
## $ Vehicle_Type       <fct> 2.2, 2.2, -, -, -, -, Other, 2.2, -, -, 2.2, -, ...
## $ Model_Year         <dbl> 1994, 1999, 2001, 2004, 2004, 2004, 2004, 2008, ...
## $ Fuel_Type          <fct> Gasoline, Diesel, Gasoline, Diesel, Gasoline, Ga...
## $ Gear               <fct> Manual, Manual, Manual, Manual, Manual, Manual, ...
## $ CCM                <fct> 1301-1600 cc, 1801-2000 cc, 1301-1600 cc, 1301-1...
## $ Horse_Power        <fct> 101-125 HP, 101-125 HP, 101-125 HP, 100 HP and b...
## $ Color              <fct> Turquoise, Dark Blue, Black, White, Dark Blue, B...
## $ Body_Type          <fct> Sedan, Roadster, Hatchback 5 Doors, Hatchback 5 ...
## $ Seller             <fct> Owner, Galery, Owner, Galery, Owner, Owner, Owne...
## $ Seller_Status      <fct> 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand...
## $ Kilometers         <int> 175000, 320000, 252000, 230000, 18000, 240000, 1...
## $ Price              <int> 19750, 42500, 36750, 32000, 34500, 50000, 55000,...

We divided the data into 2 groups with respect to CCM column.

calculate_mode = function(x){
  uniques = unique(x)
  uniques[which.max(tabulate(match(x, uniques)))]
}

We created a function to get the mode of the CCM column of data.table which has no Don’t Know values in that column.

for (i in 1: nrow(carmarket.CCM.DK)){
  brand = carmarket.CCM.DK[i, Brand]
  model_year = carmarket.CCM.DK[i, Model_Year]
  fuel_type = carmarket.CCM.DK[i, Fuel_Type]
  carmarket.CCM.DK[i, CCM := calculate_mode(carmarket.CCM.FULL[Brand == brand & Model_Year == model_year & Fuel_Type == fuel_type, .(Brand, Model_Year, Fuel_Type, CCM)])$CCM]
}

We applied this function for all the rows of the carmarket.CCM.DK data.table.

carmarket.CCM.DK[is.na(CCM), .N]
## [1] 9

After this process, we filled 100 rows and couldn’t fill 9 rows. With respect to number of observations in the data frame, which is 9044, we can drop these rows and add these two data.tables.

carmarket = rbind(na.omit(carmarket.CCM.DK, 'CCM'), carmarket.CCM.FULL)

Now, we need to consider these steps for Horse_Power. Horse_Power shows the measurement unit of power that engine has. It means that cars that have the same Brand and CCM values would have the same CCM value. We can do the same step in one chunk.

keys = c("Brand", "CCM")
carmarket.HP.DK = data.table(carmarket[Horse_Power == 'Don\'t Know'], key = keys)
carmarket.HP.FULL = data.table(carmarket[Horse_Power != 'Don\'t Know'], key = keys)
carmarket.HP.DK %>% glimpse()
## Rows: 5,544
## Columns: 17
## $ Date               <date> 2020-04-14, 2020-04-20, 2020-04-18, 2020-04-18,...
## $ Year               <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, ...
## $ Month              <fct> 4, 4, 4, 4, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ Brand              <fct> Acura, Acura, Alfa Romeo, Alfa Romeo, Alfa Romeo...
## $ Vehicle_Type_Group <fct> CL, CL, Other, 156, Giulietta, Other, Giulietta,...
## $ Vehicle_Type       <fct> -, 2.2, Other, 1.6 TS, 1.6 JTD Distinctive, Othe...
## $ Model_Year         <dbl> 2011, 2012, 2013, 1998, 2012, 2013, 2013, 2016, ...
## $ Fuel_Type          <fct> Diesel, Diesel, Diesel, Gasoline, Diesel, Electr...
## $ Gear               <fct> Manual, Manual, Manual, Manual, Manual, Semi Aut...
## $ CCM                <fct> 1300 cc and below, 1801-2000 cc, 1300 cc and bel...
## $ Horse_Power        <fct> Don't Know, Don't Know, Don't Know, Don't Know, ...
## $ Color              <fct> Black, Silver Gray, Red, Red, White, Blue, White...
## $ Body_Type          <fct> Hatchback 5 Doors, Off-road Vehicle, Hatchback 3...
## $ Seller             <fct> Owner, Galery, Galery, Galery, Galery, Galery, G...
## $ Seller_Status      <fct> 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand...
## $ Kilometers         <int> 182, 270, 182000, 262000, 222000, 99000, 85000, ...
## $ Price              <int> 62000, 82500, 72950, 32000, 69000, 105900, 12750...
carmarket.HP.FULL %>% glimpse()
## Rows: 3,491
## Columns: 17
## $ Date               <date> 2020-04-15, 2020-04-15, 2020-05-08, 2020-04-14,...
## $ Year               <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, ...
## $ Month              <fct> 4, 4, 5, 4, 6, 5, 4, 4, 4, 3, 6, 6, 6, 5, 6, 4, ...
## $ Brand              <fct> Acura, Acura, Acura, Acura, Acura, Acura, Acura,...
## $ Vehicle_Type_Group <fct> CL, CL, CL, CL, CL, CL, CL, CL, CL, CL, CL, CL, ...
## $ Vehicle_Type       <fct> -, -, -, 2.2, 2.2, -, -, -, Other, 2.2, -, 2.2, ...
## $ Model_Year         <dbl> 2009, 2009, 2010, 2016, 1994, 2001, 2004, 2004, ...
## $ Fuel_Type          <fct> Diesel, Diesel, Diesel, Electricity, Gasoline, G...
## $ Gear               <fct> Manual, Manual, Manual, Manual, Manual, Manual, ...
## $ CCM                <fct> 1300 cc and below, 1300 cc and below, 1300 cc an...
## $ Horse_Power        <fct> 100 HP and below, 100 HP and below, 100 HP and b...
## $ Color              <fct> Burgundy, Burgundy, Gray, Blue, Turquoise, Black...
## $ Body_Type          <fct> Sedan, Sedan, Sedan, Hatchback 5 Doors, Sedan, H...
## $ Seller             <fct> Owner, Owner, Galery, Galery, Owner, Owner, Gale...
## $ Seller_Status      <fct> 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand...
## $ Kilometers         <int> 150000, 150000, 232000, 70000, 175000, 252000, 2...
## $ Price              <int> 34000, 34000, 52000, 77500, 19750, 36750, 32000,...
for (i in 1: nrow(carmarket.HP.DK)){
  brand = carmarket.HP.DK[i, Brand]
  ccm = carmarket.HP.DK[i, CCM]
  carmarket.HP.DK[i, Horse_Power := calculate_mode(carmarket.HP.FULL[Brand == brand & CCM == ccm, .(Brand, CCM, Horse_Power)])$Horse_Power]
}

carmarket.HP.DK[is.na(Horse_Power), .N]
## [1] 200

After this process, we filled 5344 rows and couldn’t fill 200 rows. With respect to number of observations in the data frame, which is 9035, we can drop these rows and add these two data.tables.

carmarket = rbind(na.omit(carmarket.HP.DK, 'Horse_Power'), carmarket.HP.FULL)

After these processes, we have a full dataset. You can see the summary of the dataset below.

summary(carmarket)
##       Date                 Year      Month         Brand     
##  Min.   :2019-12-23   Min.   :2019   3 : 285   Renault:2079  
##  1st Qu.:2020-04-18   1st Qu.:2020   4 :5248   Fiat   : 653  
##  Median :2020-04-18   Median :2020   5 :1778   Opel   : 643  
##  Mean   :2020-05-01   Mean   :2020   6 :1522   Hyundai: 639  
##  3rd Qu.:2020-05-18   3rd Qu.:2020   12:   2   Ford   : 601  
##  Max.   :2020-06-16   Max.   :2020             BMW    : 598  
##                                                (Other):3622  
##  Vehicle_Type_Group                  Vehicle_Type    Model_Year  
##  Clio   : 490       Other                  : 994   Min.   :1959  
##  Megane : 443       1.5 dCi Joy            : 409   1st Qu.:2010  
##  Fluence: 426       1.5 dCi Touch          : 377   Median :2014  
##  Symbol : 420       1.5 dCi Icon           : 183   Mean   :2012  
##  Focus  : 359       1.5 dCi Ambiance       : 150   3rd Qu.:2017  
##  Astra  : 332       1.5 dCi SportTourer Joy: 105   Max.   :2020  
##  (Other):6365       (Other)                :6617                 
##        Fuel_Type                Gear                     CCM      
##  Diesel     :5813   Automatic     :2325   1301-1600 cc     :5992  
##  Electricity:1348   Manual        :4463   1300 cc and below: 984  
##  Gasoline   :1647   Semi Automatic:2047   1801-2000 cc     : 815  
##  Hybrid     :  27                         1601-1800 cc     : 428  
##                                           2501-3000 cc     : 317  
##                                           2001-2500 cc     : 205  
##                                           (Other)          :  94  
##            Horse_Power           Color                  Body_Type   
##  100 HP and below:5441   White      :4225   Sedan            :4556  
##  101-125 HP      :1315   Black      :1159   Hatchback 5 Doors:2292  
##  126-150 HP      : 892   Gray       :1004   Off-road Vehicle :1149  
##  151-175 HP      : 484   Silver Gray: 640   Station Wagon    : 240  
##  176-200 HP      : 267   Smoked     : 420   Glass Van        : 149  
##  276-300 HP      : 202   Red        : 379   Sport / Coupe    : 132  
##  (Other)         : 234   (Other)    :1008   (Other)          : 317  
##        Seller      Seller_Status    Kilometers          Price        
##  Authority:  67   0 km    : 234   Min.   :      0   Min.   :   5500  
##  Galery   :7152   2nd Hand:8552   1st Qu.:  68000   1st Qu.:  60000  
##  Owner    :1616   Classic :  10   Median : 125000   Median :  94500  
##                   Damaged :  39   Mean   : 135596   Mean   : 141747  
##                                   3rd Qu.: 193000   3rd Qu.: 150000  
##                                   Max.   :1850000   Max.   :5086500  
## 

1.5 Check Accuracy of Values

Before starting any of the analysis, we need to check the accuracy of values. For example, we can look for the Price column that has negative values.

carmarket[Price < 0, .N]
## [1] 0

So, we can say that the values in the Price column are valid values. We can look for the Kilometers column that has negative values.

carmarket[Kilometers < 0, .N]
## [1] 0

So, we can say that the values in the Kilometers column are valid values. We can look for the Model_Year column that has values that can not be a year.

sort(unique(carmarket[, Model_Year]))
##  [1] 1959 1961 1976 1977 1978 1979 1981 1982 1983 1984 1985 1986 1987 1988 1989
## [16] 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
## [31] 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
## [46] 2020
carmarket[Model_Year > year(Date), .N]
## [1] 0

So, we can say that the values in the Model_Year column are valid values and model years of all cars in the advertisements are appropriate with dates of advertisements. We can check for the advertisements whose Seller_Status is 2nd Hand and Kilometers is 0.

carmarket[Seller_Status == "2nd Hand" & Kilometers == 0, .N]
## [1] 6
carmarket[Seller_Status == "2nd Hand" & Kilometers == 0, Seller]
## [1] Galery Owner  Owner  Galery Galery Owner 
## Levels: Authority Galery Owner

There are 6 advertisements meaning that the Seller (Galery, Owner) has a secondhand car, which means the car is used, but it is not driven any kilometers. In practice it is not a common case but we can assume that it can be correct.

We can check for the highest 10 price in the dataset.

setorder(carmarket, -Price) %>% head(10)
##           Date Year Month      Brand Vehicle_Type_Group        Vehicle_Type
##  1: 2020-05-12 2020     5    Hyundai                I20       1.4 CRDI Jump
##  2: 2020-04-18 2020     4   Mercedes           G Series               400 D
##  3: 2020-04-18 2020     4       Audi                 RS                RS 6
##  4: 2020-04-18 2020     4       Audi                 RS                RS 6
##  5: 2020-05-12 2020     5 Land Rover              Other               Other
##  6: 2020-04-22 2020     4 Land Rover              Other               Other
##  7: 2020-05-12 2020     5    Porsche           Panamera               Other
##  8: 2020-05-01 2020     5    Porsche           Panamera Panamera 4 E-Hybrid
##  9: 2020-04-18 2020     4   Mercedes           S Series           S 400 AMG
## 10: 2020-04-17 2020     4   Mercedes           G Series               350 D
##     Model_Year   Fuel_Type      Gear               CCM      Horse_Power Color
##  1:       2013      Diesel    Manual 1300 cc and below 100 HP and below White
##  2:       2020      Diesel Automatic      2501-3000 cc       126-150 HP Black
##  3:       2020 Electricity Automatic 1300 cc and below       251-275 HP Black
##  4:       2020 Electricity Automatic 1300 cc and below       251-275 HP Black
##  5:       2020      Diesel Automatic      2501-3000 cc       176-200 HP Black
##  6:       2020      Diesel Automatic      2501-3000 cc       176-200 HP Black
##  7:       2020      Hybrid Automatic      2501-3000 cc       226-250 HP Black
##  8:       2020      Hybrid Automatic      2501-3000 cc       226-250 HP Black
##  9:       2020      Diesel Automatic      2501-3000 cc       126-150 HP Black
## 10:       2020      Diesel Automatic      2501-3000 cc       126-150 HP Black
##             Body_Type Seller Seller_Status Kilometers   Price
##  1: Hatchback 5 Doors  Owner      2nd Hand     140000 5086500
##  2:  Off-road Vehicle Galery          0 km          0 2720000
##  3:     Station Wagon Galery      2nd Hand        250 2550000
##  4:     Station Wagon Galery          0 km          0 2500000
##  5:  Off-road Vehicle Galery          0 km          0 2425000
##  6:  Off-road Vehicle Galery          0 km          0 2390000
##  7: Hatchback 5 Doors Galery          0 km          0 2375000
##  8: Hatchback 5 Doors Galery          0 km          0 2375000
##  9:             Sedan Galery          0 km          0 2350000
## 10:  Off-road Vehicle Galery          0 km          0 2335000

When we look at the first row we see a car whose horse power is 100 HP or below, CCM is 1300 CC or below, secondhand, kilometer is 140000 and brand is Hyundai i20. Its price is more than 5m Turkish Liras. When we search for some advertisements about the model, we come up with prices at most 100000 Turkish Liras. In reality this row is not possible since the Seller does not want to show-off with his/her car. So, we will remove that row from the data.

carmarket = carmarket[Price != 5086500,]

Another control in the data is to check for outliers.

qtl1 = quantile(carmarket[, Price], 0.25)
qtl3 = quantile(carmarket[, Price], 0.75)

iqr = qtl3 - qtl1

lower = qtl1 - 1.5 * iqr
upper = qtl3 + 1.5 * iqr

carmarket[Price > upper | Price < lower, .SD]
##            Date Year Month      Brand Vehicle_Type_Group         Vehicle_Type
##   1: 2020-04-18 2020     4   Mercedes           G Series                400 D
##   2: 2020-04-18 2020     4       Audi                 RS                 RS 6
##   3: 2020-04-18 2020     4       Audi                 RS                 RS 6
##   4: 2020-05-12 2020     5 Land Rover              Other                Other
##   5: 2020-04-22 2020     4 Land Rover              Other                Other
##  ---                                                                         
## 738: 2020-04-18 2020     4        BMW           5 Series 525d xDrive  Premium
## 739: 2020-06-07 2020     6      Honda               CR-V 1.6 i-DTEC Executive
## 740: 2020-04-18 2020     4   Mercedes           E Series            E 180 AMG
## 741: 2020-04-18 2020     4   Mercedes                GLA                Other
## 742: 2020-04-18 2020     4    Peugeot                508                Other
##      Model_Year   Fuel_Type           Gear               CCM      Horse_Power
##   1:       2020      Diesel      Automatic      2501-3000 cc       126-150 HP
##   2:       2020 Electricity      Automatic 1300 cc and below       251-275 HP
##   3:       2020 Electricity      Automatic 1300 cc and below       251-275 HP
##   4:       2020      Diesel      Automatic      2501-3000 cc       176-200 HP
##   5:       2020      Diesel      Automatic      2501-3000 cc       176-200 HP
##  ---                                                                         
## 738:       2014      Diesel Semi Automatic      1801-2000 cc       276-300 HP
## 739:       2017      Diesel      Automatic      1301-1600 cc 100 HP and below
## 740:       2014 Electricity      Automatic      1301-1600 cc       151-175 HP
## 741:       2016 Electricity Semi Automatic      1301-1600 cc       151-175 HP
## 742:       2020      Diesel      Automatic      1301-1600 cc 100 HP and below
##      Color        Body_Type Seller Seller_Status Kilometers   Price
##   1: Black Off-road Vehicle Galery          0 km          0 2720000
##   2: Black    Station Wagon Galery      2nd Hand        250 2550000
##   3: Black    Station Wagon Galery          0 km          0 2500000
##   4: Black Off-road Vehicle Galery          0 km          0 2425000
##   5: Black Off-road Vehicle Galery          0 km          0 2390000
##  ---                                                               
## 738: White            Sedan Galery      2nd Hand     191000  285000
## 739: White Off-road Vehicle Galery      2nd Hand      53500  285000
## 740: Black            Sedan Galery      2nd Hand      90000  285000
## 741: White Off-road Vehicle Galery      2nd Hand      67467  285000
## 742: Black            Sedan Galery      2nd Hand        400  285000

There are 742 outliers in the data with respect to Price values. This is normal, because there are many low-segment or mid-segment cars in the advertisements and these outliers are some of the high-segment car companies. There is nothing wrong with this data.

We can check for the combination Vehicle_Type and Vehicle_Type_Group columns whether both of them are matching with each other. But, it will take more time than the value it would add.

Now, our data is ready for making further analysis. So, we need to export the data for exploratory data analysis (EDA).

saveRDS(carmarket, file = "turkey_car_market_EDA")

For the analysis, you can follow this link

References

  1. Mode Function
  2. Kaggle