Our data is gathered from the Kaggle. Also you can find the csv in our Github Repository. The data are about the online car advertisements in Turkey market. There are 15 columns in the dataset. These are
Date
(the date that the advertisement is published)Brand
(the brand of the car)Vehicle_Type_Group
(the vehicle type group of the car like Ranger, Sorento etc.)Vehicle_Type
(the vehicle type of the car like 1.6, Broadway etc.)Model_Year
(the year the car is produced)Fuel_Type
(fuel type that the car has like Diesel, Electricity etc.)Gear
(gear type the car has like Automatic, Manual etc.)CCM
(the indicator of the motor capacity)Horse_Power
(the indicator of the motor power)Color
(the color of the car)Body_Type
(the type of the body of the car like Sedan, Minivan etc.)Seller
(the owner of the car)Seller_Status
(the status of the usage of the car like 0 km, 2nd Hand etc.)Kilometers
(the kilometer that the car traveled)Price
(the price of the car in that advertisement)This data is prepared from the Turkish sites, so all values sand columns are in Turkish.
Before the preprocessing, we need to upload required packages which are given in the below:
#Required packages
pti <- c("data.table", "tidyverse", "lubridate", "knitr", "tinytex", "scales")
pti <- pti[!(pti %in% installed.packages())]
if(length(pti)>0){
install.packages(pti)
}
library(tidyverse)
library(lubridate)
library(knitr)
library(tinytex)
library(data.table)
library(scales)
#library(kableExtra)
At first, we need to upload the dataset. To do so, you can run the following command:
#data uploading
carmarket = fread('https://raw.githubusercontent.com/pjournal/boun01g-data-mine-r-s/gh-pages/data.table/turkey_car_market.csv')
Our data set which we obtain from the kaggle is created by using Turkish language. For this reason, before the exploratory analysis, we transform our name of the columns to English. For this, we use the following code.
#changing column names by using English meaning
col_names = c('Date', 'Brand', 'Vehicle_Type_Group', 'Vehicle_Type', 'Model_Year', 'Fuel_Type', 'Gear', 'CCM', 'Horse_Power', 'Color', 'Body_Type', 'Seller', 'Seller_Status', 'Kilometers', 'Price')
names(carmarket) = col_names
For the values, we use the following code.
carmarket = carmarket %>%
#Brand type
mutate(Brand = as.character(Brand),
Brand = if_else(Brand == 'TofaÅŸ', 'Tofas', Brand),
Brand = as.factor(Brand))%>%
#Seller
mutate(Seller = as.character(Seller),
Seller = if_else(Seller == 'Galeriden', 'Galery', if_else(Seller == 'Sahibinden', 'Owner', 'Authority')),
Seller = as.factor(Seller)) %>% #unique(carmarket$Seller) #Turkish seller names are changed by English version
#fuel type
mutate(Fuel_Type = as.character(Fuel_Type),
Fuel_Type = if_else(Fuel_Type == 'Dizel', 'Diesel',
if_else(Fuel_Type == 'Benzin/LPG', 'Gasoline',
if_else(Fuel_Type == 'Hibrit', 'Hybrid', 'Electricity'))),
Fuel_Type = as.factor(Fuel_Type)) %>%
#Gear
mutate(Gear = as.character(Gear),
Gear = if_else(Gear == 'Otomatik Vites', 'Automatic',
if_else(Gear == 'Yarı Otomatik Vites', 'Semi Automatic', 'Manual')),
Gear = as.factor(Gear))%>%
#Seller Status
mutate(Seller_Status = as.character(Seller_Status),
Seller_Status = if_else(Seller_Status == '2. El', '2nd Hand',
if_else(Seller_Status == 'Klasik', 'Classic',
if_else(Seller_Status == "0 km", "0 km", "Damaged"))),
Seller_Status = as.factor(Seller_Status))%>%
#Vehicle Type Group
mutate(Vehicle_Type_Group = as.character(Vehicle_Type_Group),
Vehicle_Type_Group = case_when(Vehicle_Type_Group == 'DiÄŸer' ~ 'Other',
Vehicle_Type_Group == 'A Serisi' ~ 'A Series',
Vehicle_Type_Group == '100 Serisi' ~ '100 Series',
Vehicle_Type_Group == '5 Serisi' ~ '5 Series',
Vehicle_Type_Group == 'S Serisi' ~ 'S Series',
Vehicle_Type_Group == '7 Serisi' ~ '7 Series',
Vehicle_Type_Group == '3 Serisi' ~ '3 Series',
Vehicle_Type_Group == '1 Serisi' ~ '1 Series',
Vehicle_Type_Group == '4 Serisi' ~ '4 Series',
Vehicle_Type_Group == '2 Serisi' ~ '2 Series',
Vehicle_Type_Group == '6 Serisi' ~ '6 Series',
Vehicle_Type_Group == 'Z Serisi' ~ 'Z Series',
Vehicle_Type_Group == 'M Serisi' ~ 'M Series',
Vehicle_Type_Group == '500 Ailesi' ~ '500 Family',
Vehicle_Type_Group == 'E Serisi' ~ 'E Series',
Vehicle_Type_Group == 'B Serisi' ~ 'B Series',
Vehicle_Type_Group == 'G Serisi' ~ 'G Series',
Vehicle_Type_Group == 'Å\u009eahin' ~ 'Sahin',
Vehicle_Type_Group == 'Serçe' ~ 'Serce',
Vehicle_Type_Group == 'DoÄŸan' ~ 'Dogan',
TRUE ~ Vehicle_Type_Group),
Vehicle_Type_Group = as.factor(Vehicle_Type_Group)) %>%
#Vehicle Type
mutate(Vehicle_Type = as.character(Vehicle_Type),
Vehicle_Type = case_when(Vehicle_Type == 'DiÄŸer' ~ 'Other',
Vehicle_Type == '1.6 BlueHDI 130HP INTENSIVE EAT6 7 KiÅŸilik' ~ '1.6 BlueHDI 130HP INTENSIVE EAT6 7 Seats',
Vehicle_Type == '1.3 CDTI Enjoy 111.Yıl' ~ '1.3 CDTI Enjoy 111',
Vehicle_Type == '1.3 GLS Otm.' ~ '1.3 GLS Auto',
Vehicle_Type == '1.4 120.Yıl' ~ '1.4 120',
Vehicle_Type == '1.4 CVVT Select Otm.' ~ '1.4 CVVT Select Auto',
Vehicle_Type == '1.4 CVVT Style Otm.' ~ '1.4 CVVT Style Auto',
Vehicle_Type == '0.9\n' ~ '0.9',
Vehicle_Type == '1.0 TFSI Sportback\n' ~ '1.0 TFSI Sportback',
Vehicle_Type == '1.0\n' ~ '1.0',
Vehicle_Type == '1.2 D-CVVT\n' ~ '1.2 D-CVVT',
Vehicle_Type == '1.2 DOHC\n' ~ '1.2 DOHC',
Vehicle_Type == '1.2 TSI Sport Tourer\n' ~ '1.2 TSI Sport Tourer',
Vehicle_Type == '1.2 TSI\n' ~ '1.2 TSI',
Vehicle_Type == '1.2 VTi\n' ~ '1.2 VTi',
Vehicle_Type == '1.2\n' ~ '1.2',
Vehicle_Type == '1.3 CDTI \n' ~ '1.3 CDTI',
Vehicle_Type == '1.3 Mjet\n' ~ '1.3 Mjet',
Vehicle_Type == '1.3 Multijet Active Plus\n' ~ '1.3 Multijet Active Plus',
Vehicle_Type == '1.3 Multijet Emotion Plus\n' ~ '1.3 Multijet Emotion Plus',
Vehicle_Type == '1.3 Multijet\n' ~ '1.3 Multijet',
Vehicle_Type == '1.3\n' ~ '1.3',
Vehicle_Type == '1.4 CRDi' ~ '1.4 CRDi',
Vehicle_Type == '1.4 CVVT\n' ~ '1.4 CVVT',
Vehicle_Type == '1.4 Fire\n' ~ '1.4 Fire',
Vehicle_Type == '1.4 HDi X\n' ~ '1.4 HDi X',
Vehicle_Type == '1.4 HDi\n' ~ '1.4 HDI',
Vehicle_Type == '1.4 ie SX\n' ~ '1.4 ie SX',
Vehicle_Type == '1.4 PopStar\n' ~ '1.4 PopStar',
Vehicle_Type == '1.4 T\n' ~ '1.4 T',
Vehicle_Type == '1.4 TDCi\n' ~ '1.4 TDCi',
Vehicle_Type == '1.4 TDI\n' ~ '1.4 TDI',
Vehicle_Type == '1.4 TSI Sport Coupe Cupra\n' ~ '1.4 TSI Sport Coupe Cupra',
Vehicle_Type == '1.4 Twinport\n' ~ '1.4 Twinport',
Vehicle_Type == '1.4\n' ~ '1.4',
Vehicle_Type == '1.5 CRDi \n' ~ '1.5 CRDi',
Vehicle_Type == '1.5 DCI Ambiance' ~ '1.5 Dci Ambiance',
Vehicle_Type == '1.5 dCi Ambiance\n' ~ '1.5 dCi Ambiance',
Vehicle_Type == '1.5 dCi Authentique' ~ '1.5 DCi Authentique',
Vehicle_Type == '1.5 DCi Extreme' ~ '1.5 dCi Extreme',
Vehicle_Type == '1.5 dCi Grandtour Authentique\n' ~ '1.5 dCi Grandtour Authentique Edition',
Vehicle_Type == '1.5 DCi Expression' ~ '1.5 dCi Expression',
Vehicle_Type == '1.5 dCi Grandtour Executive\n' ~ '1.5 dCi Grandtour Executive',
Vehicle_Type == '1.5 dCi GT-Line\n' ~ '1.5 dCi GT Line',
Vehicle_Type == '1.5 DCi Privilege' ~ '1.5 dCi Privilege',
Vehicle_Type == '1.5 DCI Tekna' ~ '1.5 dCi Tekna',
Vehicle_Type == '1.5 DCI Visia' ~ '1.5 dCi Visia',
Vehicle_Type == '1.5 dCi\n' ~ '1.5 dCi',
Vehicle_Type == '1.5 TDCi Titanium' ~ '1.5 TDCI Titanium',
Vehicle_Type == '1.5\n' ~ '1.5',
Vehicle_Type == '1.6 BlueHDI 130HP INTENSIVE EAT6 7 KiÅŸilik' ~ '1.6 BlueHDI 130HP INTENSIVE EAT6 7 Seats',
Vehicle_Type == '1.6 BlueHDI Active' ~ '1.6 BlueHDi Active',
Vehicle_Type == '1.6 CDTI Grand Sport 120.Yıl' ~ '1.6 CDTI Grand Sport 120',
Vehicle_Type == '1.6 CDTI Cosmo' ~ '1.6 CDTI Cosmo',
Vehicle_Type == '1.6 CDTI Business' ~ '1.6 CDTIBusiness',
Vehicle_Type == '1.6 CDTI Design' ~ '1.6 CDTI Design',
Vehicle_Type == '1.6 CDTI Elite' ~ '1.6 CDTI Elite',
Vehicle_Type == '1.6 CDTI Enjoy' ~ '1.6 CDTI Enjoy',
Vehicle_Type == '1.6 CDTI Sport' ~ '1.6 CDTI Sport',
Vehicle_Type == '1.6 CDTI\n' ~ '1.6 CDTI',
Vehicle_Type == '1.6 CR TDI Elegance\n' ~ '1.6 CR TDI Elegance',
Vehicle_Type == '1.6 CRDI Concept Plus' ~ '1.6 CRDi Concept Plus',
Vehicle_Type == '1.6 CRDi Elite\n' ~ '1.6 CRDi Elite',
Vehicle_Type == '1.6 D Türkiye Paketi' ~ '1.6 D Turkey Package',
Vehicle_Type == '1.6 dCi Tekna Sky Pack\n' ~ '1.6 dCi Tekna Sky Pack',
Vehicle_Type == '1.6 e-HDi\n' ~ '1.6 e-HDi',
Vehicle_Type == '1.6 Enjoy 111. Yıl' ~ '1.6 Enjoy 111',
Vehicle_Type == '1.6 GDI\n' ~ '1.6 GDI',
Vehicle_Type == '1.6 HDi \n' ~ '1.6 HDi',
Vehicle_Type == '1.6 HDi\n' ~ '1.6 HDi',
Vehicle_Type == '1.6 HDi Confort' ~ '1.6 HDi Comfort',
Vehicle_Type == '1.6 HDI SX PK' ~ '1.6 HDi SX PK',
Vehicle_Type == '1.6 Invite Otm' ~ '1.6 Invite Auto',
Vehicle_Type == '1.6 LT\n' ~ '1.6 LT',
Vehicle_Type == '1.6 Mjet\n' ~ '1.6 Mjet',
Vehicle_Type == '1.6 Multijet Emotion Plus\n' ~ '1.6 Multijet Emotion Plus',
Vehicle_Type == '1.6 Multijet\n' ~ '1.6 Multijet',
Vehicle_Type == '1.6 Privelege' ~ '1.6 Privilege',
Vehicle_Type == '1.6 Türkiye Paketi' ~ '1.6 Turkey Package',
Vehicle_Type == '1.6 TDCÄ° Ghia' ~ '1.6 TDCi Ghia',
Vehicle_Type == '1.6 TDCi\n' ~ '1.6 TDCi',
Vehicle_Type == '1.6 TDI \n' ~ '1.6 TDI',
Vehicle_Type == '1.6 TDI Ambiente' ~ '1.6 TDi Ambiente',
Vehicle_Type == '1.6 TDI Ambition Sportback\n' ~ '1.6 TDI Ambition Sportback',
Vehicle_Type == '1.6 TDI Attraction Sportback\n' ~ '1.6 TDI Attraction Sportback',
Vehicle_Type == '1.6 TDI Elegance' ~ '1.6 TDi Elegance',
Vehicle_Type == '1.6 TDI GreenLine Elegance\n' ~ '1.6 TDI GreenLine Elegance',
Vehicle_Type == '1.6 TDI Sportback\n' ~ '1.6 TDI Sportback',
Vehicle_Type == '1.6 TDI\n' ~ '1.6 TDI',
Vehicle_Type == '1.6 Ti-VCT\n' ~ '1.6 Ti-VCT',
Vehicle_Type == '1.6 VTES ES' ~ '1.6 VTEC ES',
Vehicle_Type == '1.6 XHT Cosmo\n' ~ '1.6 XHT Cosmo',
Vehicle_Type == '1.6\n' ~ '1.6',
Vehicle_Type == '1.7\n' ~ '1.7',
Vehicle_Type == '1.8\n' ~ '1.8',
Vehicle_Type == 'Serçe' ~ 'Serce',
Vehicle_Type == 'Kartal 5 Vites' ~ 'Kartal 5 Transmissions',
Vehicle_Type == 'EVO 1.4 Fire Active' ~ 'EVO 1.4 Active Fire',
Vehicle_Type == 'Country 4x2\n' ~ 'Country 4x2',
Vehicle_Type == 'Å\u009eahin 5 vites' ~ 'Sahin 5 Transmissions',
Vehicle_Type == '4x2\n' ~ '4x2',
Vehicle_Type == '3.0 SDV6\n' ~ '3.0 SDV6',
Vehicle_Type == '200 BlueTEC AMG\n' ~ '200 BlueTEC AMG',
Vehicle_Type == '200 BlueEfficiency Avantgarde\n' ~ '200 BlueEfficiency Avantgarde',
Vehicle_Type == '2.5 Çift Kabin 4x4' ~ '2.5 Double Cabinet 4x4',
Vehicle_Type == '2.0i sDrive\n' ~ '2.0i sDrive',
Vehicle_Type == '2.0\n' ~ '2.0',
Vehicle_Type == '2.0 TDI\n' ~ '2.0 TDI',
Vehicle_Type == '2.0 TDCi\n' ~ '2.0 TDCi',
Vehicle_Type == '2.0 SRTi \n' ~ '2.0 SRTi',
Vehicle_Type == '2.0 Quattro\n' ~ '2.0 Quattro',
Vehicle_Type == '2.0 Ghia Otm.' ~ '2.0 Ghia Auto',
Vehicle_Type == '2.0 CRDi GLS\n' ~ '2.0 CRDi GLS',
Vehicle_Type == '2.0 24V LT Otm.' ~ '2.0 24V LT Auto',
Vehicle_Type == '180 CDI AMG\n' ~ '180 CDI AMG',
Vehicle_Type == '180 BlueEfficiency AMG Sport\n' ~ '180 BlueEfficiency AMG Sport',
Vehicle_Type == '180 BlueEfficiency Fascination\n' ~ '180 BlueEfficiency Fascination',
Vehicle_Type == '150 Otm.' ~ '150 Auto',
Vehicle_Type == '1.9 TDI\n' ~ '1.9 TDI',
Vehicle_Type == '1.8\n' ~ '1.8',
Vehicle_Type == '1.7\n' ~ '1.7',
Vehicle_Type == '1.6\n' ~ '1.6',
TRUE ~ Vehicle_Type),
Vehicle_Type = as.factor(Vehicle_Type)) %>%
#CCM
mutate(CCM = as.character(CCM),
CCM = case_when(CCM == '1300 cc ve altı' ~ '1300 cc and below',
CCM == '6001 cc ve üzeri' ~ '6001 cc and above',
CCM == 'Bilmiyorum' | CCM == '-' ~ 'Don\'t Know',
TRUE ~ CCM),
CCM = factor(CCM, levels = c('Don\'t Know', '1300 cc and below', '1301-1600 cc', '1601-1800 cc', '1801-2000 cc', '2001-2500 cc', '2501-3000 cc', '3001-3500 cc', '3501-4000 cc', '4001-4500 cc', '4501-5000 cc', '5001-5500 cc', '5501-6000 cc', '6001 cc and above'))) %>%
#Horse Power
mutate(Horse_Power = as.character(Horse_Power),
Horse_Power = case_when(Horse_Power == '100 BG ve altı' ~ '100 HP and below',
Horse_Power == '50 BG ve altı' ~ '100 HP and below',
Horse_Power == '51-75 BG' ~ '100 HP and below',
Horse_Power == '601 BG ve üzeri' ~ '601 HP and above',
Horse_Power == '101-125 BG' ~ '101-125 HP',
Horse_Power == '126-150 BG' ~ '126-150 HP',
Horse_Power == '151-175 BG' ~ '151-175 HP',
Horse_Power == '176-200 BG' ~ '176-200 HP',
Horse_Power == '201-225 BG' ~ '201-225 HP',
Horse_Power == '226-250 BG' ~ '226-250 HP',
Horse_Power == '251-275 BG' ~ '251-275 HP',
Horse_Power == '276-300 BG' ~ '276-300 HP',
Horse_Power == '301-325 BG' ~ '301-325 HP',
Horse_Power == '326-350 BG' ~ '326-350 HP',
Horse_Power == '376-400 BG' ~ '376-400 HP',
Horse_Power == '451-475 BG' ~ '451-475 HP',
Horse_Power == '76-100 BG' ~ '100 HP and below',
Horse_Power == 'Bilmiyorum' | Horse_Power == '-' ~ 'Don\'t Know',
TRUE ~ Horse_Power),
Horse_Power = factor(Horse_Power, levels = c('Don\'t Know', '100 HP and below', '101-125 HP', '126-150 HP', '151-175 HP', '176-200 HP', '201-225 HP', '226-250 HP', '251-275 HP', '276-300 HP', '301-325 HP', '326-350 HP', '376-400 HP', '451-475 HP', '601 HP and above'))) %>%
#Color
mutate(Color = as.character(Color),
Color = case_when(Color == 'Å\u009eampanya' ~ 'Champagne',
Color == 'Altın' ~ 'Gold',
Color == 'Amarant' ~ 'Amaranth',
Color == 'Bal Rengi' ~ 'Honey',
Color == 'Bej' ~ 'Beige',
Color == 'Beyaz' ~ 'White',
Color == 'Bordo' ~ 'Burgundy',
Color == 'DiÄŸer' ~ 'Other',
Color == 'Füme' ~ 'Smoked',
Color == 'Gümüş' ~ 'Silver',
Color == 'Gümüş Gri' ~ 'Silver Gray',
Color == 'Gri' ~ 'Gray',
Color == 'Ihlamur' ~ 'Linden',
Color == 'Kırmızı' ~ 'Red',
Color == 'Kahverengi' ~ 'Brown',
Color == 'Krem' ~ 'Cream',
Color == 'Kum Rengi' ~ 'Sand Color',
Color == 'Lacivert' ~ 'Dark Blue',
Color == 'Mavi' ~ 'Blue',
Color == 'Mor' ~ 'Purple',
Color == 'Pembe' ~ 'Pink',
Color == 'Sarı' ~ 'Yellow',
Color == 'Siyah' ~ 'Black',
Color == 'Turkuaz' ~ 'Turquoise',
Color == 'Turuncu' ~ 'Orange',
Color == 'YeÅŸil' ~ 'Green',
Color == 'Zeytin Gri' ~ 'Olive Gray',
Color == 'Eflatun' ~ 'Magenta',
TRUE ~ Color),
Color = as.factor(Color)) %>%
#Body Type
mutate(Body_Type = as.character(Body_Type),
Body_Type = case_when(Body_Type == 'Üstü Açık / Cabriolet' ~ 'Open Top / Cabriolet',
Body_Type == 'Arazi Aracı' ~ 'Off-road Vehicle',
Body_Type == 'Camlı Van' ~ 'Glass Van',
Body_Type == 'DiÄŸer' ~ 'Other',
Body_Type == 'Hatchback 3 Kapı' ~ 'Hatchback 3 Doors',
Body_Type == 'Hatchback 5 Kapı' ~ 'Hatchback 5 Doors',
Body_Type == 'Spor / Coupe' ~ 'Sport / Coupe',
TRUE ~ Body_Type),
Body_Type = as.factor(Body_Type))
We can add the year and month of advertisements in Year
and Month
column. To be able to see all columns in order, we put the columns about the date to the beginning.
carmarket[, Date := dmy(Date)]
carmarket = carmarket[, Year:= year(Date)]
carmarket = carmarket[, Month:= month(Date)]
carmarket$Month = as.factor(carmarket$Month)
setcolorder(carmarket, c("Date", "Year", "Month", col_names[2:15]))
From now on, we have all columns ready for future analysis and models.
## Rows: 9,044
## Columns: 17
## $ Date <date> 2020-05-27, 2020-06-16, 2020-06-14, 2020-06-11,...
## $ Year <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, ...
## $ Month <fct> 5, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 4, 4, 4, ...
## $ Brand <fct> Jaguar, Acura, Acura, Acura, Acura, Acura, Acura...
## $ Vehicle_Type_Group <fct> XF, CL, CL, CL, CL, CL, CL, CL, RSX, RSX, CL, CL...
## $ Vehicle_Type <fct> 2.0 D Prestige Plus, -, 2.2, -, 2.2, 2.2, 2.2, -...
## $ Model_Year <dbl> 2017, 2015, 1994, 2013, 2010, 1999, 2014, 2014, ...
## $ Fuel_Type <fct> Diesel, Diesel, Gasoline, Diesel, Diesel, Diesel...
## $ Gear <fct> Automatic, Semi Automatic, Manual, Manual, Autom...
## $ CCM <fct> 1801-2000 cc, 1301-1600 cc, 1301-1600 cc, 1301-1...
## $ Horse_Power <fct> 176-200 HP, 101-125 HP, 101-125 HP, 100 HP and b...
## $ Color <fct> Dark Blue, Blue, Turquoise, Brown, White, Dark B...
## $ Body_Type <fct> Hatchback 5 Doors, Sedan, Sedan, Sedan, Sedan, R...
## $ Seller <fct> Galery, Owner, Owner, Owner, Owner, Galery, Gale...
## $ Seller_Status <fct> 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand...
## $ Kilometers <int> 26100, 127000, 175000, 325, 207000, 320000, 183,...
## $ Price <int> 634500, 151500, 19750, 52000, 148750, 42500, 170...
As we added two more columns, we have 17 columns. We have some categorical columns and numerical columns.
As always, we need to check for null values in the data.
## [1] 0
When we look for null values, the data don’t have any null values. But, there are some rows that contains Don’t Know value in CCM
and Horse_Power
columns.
## [1] 109
## [1] 5550
For 109 Don’t Know values in CCM
column, we can make imputation process. CCM
is the volume of the motor of the car. It means that cars that have the same Brand
, Fuel_Type
and Model_Year
values would have the same CCM
value (we assume that all cars in the advertisements are produced for Turkey and even if these three variables covers almost 80% of the information about CCM
, we will fill this column with respect to these three values.)
keys = c("Brand", "Model_Year", "Fuel_Type")
carmarket.CCM.DK = data.table(carmarket[CCM == 'Don\'t Know'], key = keys)
carmarket.CCM.FULL = data.table(carmarket[CCM != 'Don\'t Know'], key = keys)
carmarket.CCM.DK %>% glimpse()
## Rows: 109
## Columns: 17
## $ Date <date> 2020-04-20, 2020-03-21, 2020-06-15, 2020-03-25,...
## $ Year <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, ...
## $ Month <fct> 4, 3, 6, 3, 6, 3, 5, 3, 6, 4, 5, 5, 6, 4, 4, 6, ...
## $ Brand <fct> Acura, Acura, Audi, Audi, Audi, Audi, Audi, Audi...
## $ Vehicle_Type_Group <fct> CL, TL, A4, A3, A4, A4, A4, A3, 3 Series, X5, Ch...
## $ Vehicle_Type <fct> 2.2, Other, 2.0 TDI, 1.4 TFSI, 2.0 TDI, 2.0 TDI,...
## $ Model_Year <dbl> 2012, 2016, 2009, 2010, 2013, 2013, 2014, 2014, ...
## $ Fuel_Type <fct> Diesel, Diesel, Diesel, Electricity, Diesel, Die...
## $ Gear <fct> Manual, Manual, Semi Automatic, Manual, Automati...
## $ CCM <fct> Don't Know, Don't Know, Don't Know, Don't Know, ...
## $ Horse_Power <fct> Don't Know, Don't Know, 126-150 HP, 126-150 HP, ...
## $ Color <fct> Silver Gray, White, Smoked, White, Black, White,...
## $ Body_Type <fct> Off-road Vehicle, Sedan, Sedan, Hatchback 3 Door...
## $ Seller <fct> Galery, Owner, Owner, Owner, Galery, Owner, Gale...
## $ Seller_Status <fct> 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand...
## $ Kilometers <int> 270, 172000, 302000, 99000, 315000, 200000, 2830...
## $ Price <int> 82500, 92000, 117500, 95000, 175000, 178000, 225...
## Rows: 8,935
## Columns: 17
## $ Date <date> 2020-06-14, 2020-06-04, 2020-05-24, 2020-04-27,...
## $ Year <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, ...
## $ Month <fct> 6, 6, 5, 4, 4, 4, 4, 3, 4, 4, 6, 5, 4, 4, 4, 4, ...
## $ Brand <fct> Acura, Acura, Acura, Acura, Acura, Acura, Acura,...
## $ Vehicle_Type_Group <fct> CL, CL, CL, CL, CL, CL, CL, CL, CL, CL, CL, CL, ...
## $ Vehicle_Type <fct> 2.2, 2.2, -, -, -, -, Other, 2.2, -, -, 2.2, -, ...
## $ Model_Year <dbl> 1994, 1999, 2001, 2004, 2004, 2004, 2004, 2008, ...
## $ Fuel_Type <fct> Gasoline, Diesel, Gasoline, Diesel, Gasoline, Ga...
## $ Gear <fct> Manual, Manual, Manual, Manual, Manual, Manual, ...
## $ CCM <fct> 1301-1600 cc, 1801-2000 cc, 1301-1600 cc, 1301-1...
## $ Horse_Power <fct> 101-125 HP, 101-125 HP, 101-125 HP, 100 HP and b...
## $ Color <fct> Turquoise, Dark Blue, Black, White, Dark Blue, B...
## $ Body_Type <fct> Sedan, Roadster, Hatchback 5 Doors, Hatchback 5 ...
## $ Seller <fct> Owner, Galery, Owner, Galery, Owner, Owner, Owne...
## $ Seller_Status <fct> 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand...
## $ Kilometers <int> 175000, 320000, 252000, 230000, 18000, 240000, 1...
## $ Price <int> 19750, 42500, 36750, 32000, 34500, 50000, 55000,...
We divided the data into 2 groups with respect to CCM
column.
We created a function to get the mode of the CCM
column of data.table which has no Don’t Know values in that column.
for (i in 1: nrow(carmarket.CCM.DK)){
brand = carmarket.CCM.DK[i, Brand]
model_year = carmarket.CCM.DK[i, Model_Year]
fuel_type = carmarket.CCM.DK[i, Fuel_Type]
carmarket.CCM.DK[i, CCM := calculate_mode(carmarket.CCM.FULL[Brand == brand & Model_Year == model_year & Fuel_Type == fuel_type, .(Brand, Model_Year, Fuel_Type, CCM)])$CCM]
}
We applied this function for all the rows of the carmarket.CCM.DK
data.table.
## [1] 9
After this process, we filled 100 rows and couldn’t fill 9 rows. With respect to number of observations in the data frame, which is 9044, we can drop these rows and add these two data.tables.
Now, we need to consider these steps for Horse_Power
. Horse_Power
shows the measurement unit of power that engine has. It means that cars that have the same Brand
and CCM
values would have the same CCM
value. We can do the same step in one chunk.
keys = c("Brand", "CCM")
carmarket.HP.DK = data.table(carmarket[Horse_Power == 'Don\'t Know'], key = keys)
carmarket.HP.FULL = data.table(carmarket[Horse_Power != 'Don\'t Know'], key = keys)
carmarket.HP.DK %>% glimpse()
## Rows: 5,544
## Columns: 17
## $ Date <date> 2020-04-14, 2020-04-20, 2020-04-18, 2020-04-18,...
## $ Year <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, ...
## $ Month <fct> 4, 4, 4, 4, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ Brand <fct> Acura, Acura, Alfa Romeo, Alfa Romeo, Alfa Romeo...
## $ Vehicle_Type_Group <fct> CL, CL, Other, 156, Giulietta, Other, Giulietta,...
## $ Vehicle_Type <fct> -, 2.2, Other, 1.6 TS, 1.6 JTD Distinctive, Othe...
## $ Model_Year <dbl> 2011, 2012, 2013, 1998, 2012, 2013, 2013, 2016, ...
## $ Fuel_Type <fct> Diesel, Diesel, Diesel, Gasoline, Diesel, Electr...
## $ Gear <fct> Manual, Manual, Manual, Manual, Manual, Semi Aut...
## $ CCM <fct> 1300 cc and below, 1801-2000 cc, 1300 cc and bel...
## $ Horse_Power <fct> Don't Know, Don't Know, Don't Know, Don't Know, ...
## $ Color <fct> Black, Silver Gray, Red, Red, White, Blue, White...
## $ Body_Type <fct> Hatchback 5 Doors, Off-road Vehicle, Hatchback 3...
## $ Seller <fct> Owner, Galery, Galery, Galery, Galery, Galery, G...
## $ Seller_Status <fct> 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand...
## $ Kilometers <int> 182, 270, 182000, 262000, 222000, 99000, 85000, ...
## $ Price <int> 62000, 82500, 72950, 32000, 69000, 105900, 12750...
## Rows: 3,491
## Columns: 17
## $ Date <date> 2020-04-15, 2020-04-15, 2020-05-08, 2020-04-14,...
## $ Year <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, ...
## $ Month <fct> 4, 4, 5, 4, 6, 5, 4, 4, 4, 3, 6, 6, 6, 5, 6, 4, ...
## $ Brand <fct> Acura, Acura, Acura, Acura, Acura, Acura, Acura,...
## $ Vehicle_Type_Group <fct> CL, CL, CL, CL, CL, CL, CL, CL, CL, CL, CL, CL, ...
## $ Vehicle_Type <fct> -, -, -, 2.2, 2.2, -, -, -, Other, 2.2, -, 2.2, ...
## $ Model_Year <dbl> 2009, 2009, 2010, 2016, 1994, 2001, 2004, 2004, ...
## $ Fuel_Type <fct> Diesel, Diesel, Diesel, Electricity, Gasoline, G...
## $ Gear <fct> Manual, Manual, Manual, Manual, Manual, Manual, ...
## $ CCM <fct> 1300 cc and below, 1300 cc and below, 1300 cc an...
## $ Horse_Power <fct> 100 HP and below, 100 HP and below, 100 HP and b...
## $ Color <fct> Burgundy, Burgundy, Gray, Blue, Turquoise, Black...
## $ Body_Type <fct> Sedan, Sedan, Sedan, Hatchback 5 Doors, Sedan, H...
## $ Seller <fct> Owner, Owner, Galery, Galery, Owner, Owner, Gale...
## $ Seller_Status <fct> 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand, 2nd Hand...
## $ Kilometers <int> 150000, 150000, 232000, 70000, 175000, 252000, 2...
## $ Price <int> 34000, 34000, 52000, 77500, 19750, 36750, 32000,...
for (i in 1: nrow(carmarket.HP.DK)){
brand = carmarket.HP.DK[i, Brand]
ccm = carmarket.HP.DK[i, CCM]
carmarket.HP.DK[i, Horse_Power := calculate_mode(carmarket.HP.FULL[Brand == brand & CCM == ccm, .(Brand, CCM, Horse_Power)])$Horse_Power]
}
carmarket.HP.DK[is.na(Horse_Power), .N]
## [1] 200
After this process, we filled 5344 rows and couldn’t fill 200 rows. With respect to number of observations in the data frame, which is 9035, we can drop these rows and add these two data.tables.
After these processes, we have a full dataset. You can see the summary of the dataset below.
## Date Year Month Brand
## Min. :2019-12-23 Min. :2019 3 : 285 Renault:2079
## 1st Qu.:2020-04-18 1st Qu.:2020 4 :5248 Fiat : 653
## Median :2020-04-18 Median :2020 5 :1778 Opel : 643
## Mean :2020-05-01 Mean :2020 6 :1522 Hyundai: 639
## 3rd Qu.:2020-05-18 3rd Qu.:2020 12: 2 Ford : 601
## Max. :2020-06-16 Max. :2020 BMW : 598
## (Other):3622
## Vehicle_Type_Group Vehicle_Type Model_Year
## Clio : 490 Other : 994 Min. :1959
## Megane : 443 1.5 dCi Joy : 409 1st Qu.:2010
## Fluence: 426 1.5 dCi Touch : 377 Median :2014
## Symbol : 420 1.5 dCi Icon : 183 Mean :2012
## Focus : 359 1.5 dCi Ambiance : 150 3rd Qu.:2017
## Astra : 332 1.5 dCi SportTourer Joy: 105 Max. :2020
## (Other):6365 (Other) :6617
## Fuel_Type Gear CCM
## Diesel :5813 Automatic :2325 1301-1600 cc :5992
## Electricity:1348 Manual :4463 1300 cc and below: 984
## Gasoline :1647 Semi Automatic:2047 1801-2000 cc : 815
## Hybrid : 27 1601-1800 cc : 428
## 2501-3000 cc : 317
## 2001-2500 cc : 205
## (Other) : 94
## Horse_Power Color Body_Type
## 100 HP and below:5441 White :4225 Sedan :4556
## 101-125 HP :1315 Black :1159 Hatchback 5 Doors:2292
## 126-150 HP : 892 Gray :1004 Off-road Vehicle :1149
## 151-175 HP : 484 Silver Gray: 640 Station Wagon : 240
## 176-200 HP : 267 Smoked : 420 Glass Van : 149
## 276-300 HP : 202 Red : 379 Sport / Coupe : 132
## (Other) : 234 (Other) :1008 (Other) : 317
## Seller Seller_Status Kilometers Price
## Authority: 67 0 km : 234 Min. : 0 Min. : 5500
## Galery :7152 2nd Hand:8552 1st Qu.: 68000 1st Qu.: 60000
## Owner :1616 Classic : 10 Median : 125000 Median : 94500
## Damaged : 39 Mean : 135596 Mean : 141747
## 3rd Qu.: 193000 3rd Qu.: 150000
## Max. :1850000 Max. :5086500
##
Before starting any of the analysis, we need to check the accuracy of values. For example, we can look for the Price
column that has negative values.
## [1] 0
So, we can say that the values in the Price
column are valid values. We can look for the Kilometers
column that has negative values.
## [1] 0
So, we can say that the values in the Kilometers
column are valid values. We can look for the Model_Year
column that has values that can not be a year.
## [1] 1959 1961 1976 1977 1978 1979 1981 1982 1983 1984 1985 1986 1987 1988 1989
## [16] 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
## [31] 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
## [46] 2020
## [1] 0
So, we can say that the values in the Model_Year
column are valid values and model years of all cars in the advertisements are appropriate with dates of advertisements. We can check for the advertisements whose Seller_Status
is 2nd Hand and Kilometers
is 0.
## [1] 6
## [1] Galery Owner Owner Galery Galery Owner
## Levels: Authority Galery Owner
There are 6 advertisements meaning that the Seller
(Galery, Owner) has a secondhand car, which means the car is used, but it is not driven any kilometers. In practice it is not a common case but we can assume that it can be correct.
We can check for the highest 10 price in the dataset.
## Date Year Month Brand Vehicle_Type_Group Vehicle_Type
## 1: 2020-05-12 2020 5 Hyundai I20 1.4 CRDI Jump
## 2: 2020-04-18 2020 4 Mercedes G Series 400 D
## 3: 2020-04-18 2020 4 Audi RS RS 6
## 4: 2020-04-18 2020 4 Audi RS RS 6
## 5: 2020-05-12 2020 5 Land Rover Other Other
## 6: 2020-04-22 2020 4 Land Rover Other Other
## 7: 2020-05-12 2020 5 Porsche Panamera Other
## 8: 2020-05-01 2020 5 Porsche Panamera Panamera 4 E-Hybrid
## 9: 2020-04-18 2020 4 Mercedes S Series S 400 AMG
## 10: 2020-04-17 2020 4 Mercedes G Series 350 D
## Model_Year Fuel_Type Gear CCM Horse_Power Color
## 1: 2013 Diesel Manual 1300 cc and below 100 HP and below White
## 2: 2020 Diesel Automatic 2501-3000 cc 126-150 HP Black
## 3: 2020 Electricity Automatic 1300 cc and below 251-275 HP Black
## 4: 2020 Electricity Automatic 1300 cc and below 251-275 HP Black
## 5: 2020 Diesel Automatic 2501-3000 cc 176-200 HP Black
## 6: 2020 Diesel Automatic 2501-3000 cc 176-200 HP Black
## 7: 2020 Hybrid Automatic 2501-3000 cc 226-250 HP Black
## 8: 2020 Hybrid Automatic 2501-3000 cc 226-250 HP Black
## 9: 2020 Diesel Automatic 2501-3000 cc 126-150 HP Black
## 10: 2020 Diesel Automatic 2501-3000 cc 126-150 HP Black
## Body_Type Seller Seller_Status Kilometers Price
## 1: Hatchback 5 Doors Owner 2nd Hand 140000 5086500
## 2: Off-road Vehicle Galery 0 km 0 2720000
## 3: Station Wagon Galery 2nd Hand 250 2550000
## 4: Station Wagon Galery 0 km 0 2500000
## 5: Off-road Vehicle Galery 0 km 0 2425000
## 6: Off-road Vehicle Galery 0 km 0 2390000
## 7: Hatchback 5 Doors Galery 0 km 0 2375000
## 8: Hatchback 5 Doors Galery 0 km 0 2375000
## 9: Sedan Galery 0 km 0 2350000
## 10: Off-road Vehicle Galery 0 km 0 2335000
When we look at the first row we see a car whose horse power is 100 HP or below, CCM is 1300 CC or below, secondhand, kilometer is 140000 and brand is Hyundai i20. Its price is more than 5m Turkish Liras. When we search for some advertisements about the model, we come up with prices at most 100000 Turkish Liras. In reality this row is not possible since the Seller
does not want to show-off with his/her car. So, we will remove that row from the data.
Another control in the data is to check for outliers.
qtl1 = quantile(carmarket[, Price], 0.25)
qtl3 = quantile(carmarket[, Price], 0.75)
iqr = qtl3 - qtl1
lower = qtl1 - 1.5 * iqr
upper = qtl3 + 1.5 * iqr
carmarket[Price > upper | Price < lower, .SD]
## Date Year Month Brand Vehicle_Type_Group Vehicle_Type
## 1: 2020-04-18 2020 4 Mercedes G Series 400 D
## 2: 2020-04-18 2020 4 Audi RS RS 6
## 3: 2020-04-18 2020 4 Audi RS RS 6
## 4: 2020-05-12 2020 5 Land Rover Other Other
## 5: 2020-04-22 2020 4 Land Rover Other Other
## ---
## 738: 2020-04-18 2020 4 BMW 5 Series 525d xDrive Premium
## 739: 2020-06-07 2020 6 Honda CR-V 1.6 i-DTEC Executive
## 740: 2020-04-18 2020 4 Mercedes E Series E 180 AMG
## 741: 2020-04-18 2020 4 Mercedes GLA Other
## 742: 2020-04-18 2020 4 Peugeot 508 Other
## Model_Year Fuel_Type Gear CCM Horse_Power
## 1: 2020 Diesel Automatic 2501-3000 cc 126-150 HP
## 2: 2020 Electricity Automatic 1300 cc and below 251-275 HP
## 3: 2020 Electricity Automatic 1300 cc and below 251-275 HP
## 4: 2020 Diesel Automatic 2501-3000 cc 176-200 HP
## 5: 2020 Diesel Automatic 2501-3000 cc 176-200 HP
## ---
## 738: 2014 Diesel Semi Automatic 1801-2000 cc 276-300 HP
## 739: 2017 Diesel Automatic 1301-1600 cc 100 HP and below
## 740: 2014 Electricity Automatic 1301-1600 cc 151-175 HP
## 741: 2016 Electricity Semi Automatic 1301-1600 cc 151-175 HP
## 742: 2020 Diesel Automatic 1301-1600 cc 100 HP and below
## Color Body_Type Seller Seller_Status Kilometers Price
## 1: Black Off-road Vehicle Galery 0 km 0 2720000
## 2: Black Station Wagon Galery 2nd Hand 250 2550000
## 3: Black Station Wagon Galery 0 km 0 2500000
## 4: Black Off-road Vehicle Galery 0 km 0 2425000
## 5: Black Off-road Vehicle Galery 0 km 0 2390000
## ---
## 738: White Sedan Galery 2nd Hand 191000 285000
## 739: White Off-road Vehicle Galery 2nd Hand 53500 285000
## 740: Black Sedan Galery 2nd Hand 90000 285000
## 741: White Off-road Vehicle Galery 2nd Hand 67467 285000
## 742: Black Sedan Galery 2nd Hand 400 285000
There are 742 outliers in the data with respect to Price
values. This is normal, because there are many low-segment or mid-segment cars in the advertisements and these outliers are some of the high-segment car companies. There is nothing wrong with this data.
We can check for the combination Vehicle_Type
and Vehicle_Type_Group
columns whether both of them are matching with each other. But, it will take more time than the value it would add.
Now, our data is ready for making further analysis. So, we need to export the data for exploratory data analysis (EDA).
For the analysis, you can follow this link