The dataset covers up all the recorded service information of x company’s dealers in Turkey. The data is belong the dates between 2019 and 2020. This data is obtained from x company where one of our team member is working for.In the dataset we have 561,142 rows and 17 columns which gives different kind of parameters. According to the parameters a reader can get information about details of service processes such as model number, km, warranty start & end date, prices, vehicle id etc.
You can find variables as below:
In this project, we will first investigate the data for preprocessing to improve its quality. Then we will perform an exploratory data analysis(EDA) by data manipulation and data visualization steps.
The packages used during the project can be listed as below:
We can use the glimpse function to inspect our data. By using it,each column is represented in a row with its data type and first few entries. We have 561,142 rows and 17 variables.
library(readxl)
library(tidyverse)
library(lubridate)
library(openxlsx)
library(rio)
library(reshape)
library(kableExtra)
raw_data <- rio::import("https://github.com/pjournal/mef04g-madagaskar/blob/gh-pages/Data/x_vehicle_company_service_dataset_v1.xlsx?raw=True")
raw_dt=na.omit(raw_data)
glimpse(raw_dt)
## Rows: 561,142
## Columns: 17
## $ Malzeme <chr> "malzeme_236", "malzeme_2156", "malzeme_...
## $ `Sasi No` <chr> "sasi_no_9883", "sasi_no_9883", "sasi_no...
## $ `müsteri No` <chr> "müsteri_15", "müsteri_15", "müsteri_15"...
## $ `Is Emri No` <chr> "000000000127819", "000000000127819", "0...
## $ `Malzeme tipi` <chr> "Diger", "Parca", "Diger", "Parca", "Dig...
## $ `Islem tipi` <chr> "Ariza", "Ariza", "Ariza", "Ariza", "Ari...
## $ `Arac giris tarihi` <dttm> 2019-01-22, 2019-01-22, 2019-01-22, 201...
## $ `Arac cikis tarihi` <dttm> 2019-01-20, 2019-01-20, 2019-01-20, 201...
## $ Miktar <dbl> 1.0, 2.0, 20.0, 1.0, 1.0, 10.0, 1.0, 1.0...
## $ `Net fiyat` <dbl> 209.96415, 97.63875, 0.01485, 905.86485,...
## $ Model <chr> "model_23", "model_23", "model_23", "mod...
## $ `Yil-Model` <dbl> 2016, 2016, 2016, 2017, 2017, 2017, 2016...
## $ `Is emri kapanis tarihi` <chr> "25/01/2019", "25/01/2019", "25/01/2019"...
## $ `Arac KM` <dbl> 732544, 732544, 732544, 499700, 499700, ...
## $ `Firma Sehir` <chr> "Ankara", "Ankara", "Ankara", "Ankara", ...
## $ `Garanti Baslangic tarihi` <dttm> 2016-10-04, 2016-10-04, 2016-10-04, 201...
## $ `Garanti Bitis tarihi` <dttm> 2018-10-04, 2018-10-04, 2018-10-04, 201...
Transforming the job_closed_date column from character to datetime by using a function of lubridate which is as.POSIXct
raw_dt1 = raw_dt
raw_dt1$'Is emri kapanis tarihi' = str_replace_all(raw_dt1$'Is emri kapanis tarihi', "/","-")
raw_dt1$'Is emri kapanis tarihi' <- as.POSIXct(raw_dt1$'Is emri kapanis tarihi',format = "%d-%m-%Y")
glimpse(raw_dt1)
## Rows: 561,142
## Columns: 17
## $ Malzeme <chr> "malzeme_236", "malzeme_2156", "malzeme_...
## $ `Sasi No` <chr> "sasi_no_9883", "sasi_no_9883", "sasi_no...
## $ `müsteri No` <chr> "müsteri_15", "müsteri_15", "müsteri_15"...
## $ `Is Emri No` <chr> "000000000127819", "000000000127819", "0...
## $ `Malzeme tipi` <chr> "Diger", "Parca", "Diger", "Parca", "Dig...
## $ `Islem tipi` <chr> "Ariza", "Ariza", "Ariza", "Ariza", "Ari...
## $ `Arac giris tarihi` <dttm> 2019-01-22, 2019-01-22, 2019-01-22, 201...
## $ `Arac cikis tarihi` <dttm> 2019-01-20, 2019-01-20, 2019-01-20, 201...
## $ Miktar <dbl> 1.0, 2.0, 20.0, 1.0, 1.0, 10.0, 1.0, 1.0...
## $ `Net fiyat` <dbl> 209.96415, 97.63875, 0.01485, 905.86485,...
## $ Model <chr> "model_23", "model_23", "model_23", "mod...
## $ `Yil-Model` <dbl> 2016, 2016, 2016, 2017, 2017, 2017, 2016...
## $ `Is emri kapanis tarihi` <dttm> 2019-01-25, 2019-01-25, 2019-01-25, 201...
## $ `Arac KM` <dbl> 732544, 732544, 732544, 499700, 499700, ...
## $ `Firma Sehir` <chr> "Ankara", "Ankara", "Ankara", "Ankara", ...
## $ `Garanti Baslangic tarihi` <dttm> 2016-10-04, 2016-10-04, 2016-10-04, 201...
## $ `Garanti Bitis tarihi` <dttm> 2018-10-04, 2018-10-04, 2018-10-04, 201...
Renaming of the columns by using rename() function in order to get clear information about data.
service_dt <- dplyr::rename(raw_dt1, "Material_id" = 1,
"Vehicle_id" = 2,
"Dealer_id" = 3,
"Job_order_number" = 4,
"material_type" = 5,
"process_type" = 6,
"beginning_date" = 7,
"ending_date" = 8,
"quantity" = 9,
"price" = 10,
"Model" = 11,
"production_date" = 12,
"job_closed_date" = 13,
"vehicle_km" = 14,
"dealer_city" = 15,
"warranty_beginning" = 16,
"warranty_ending" = 17)
glimpse(service_dt)
## Rows: 561,142
## Columns: 17
## $ Material_id <chr> "malzeme_236", "malzeme_2156", "malzeme_14571", ...
## $ Vehicle_id <chr> "sasi_no_9883", "sasi_no_9883", "sasi_no_9883", ...
## $ Dealer_id <chr> "müsteri_15", "müsteri_15", "müsteri_15", "müste...
## $ Job_order_number <chr> "000000000127819", "000000000127819", "000000000...
## $ material_type <chr> "Diger", "Parca", "Diger", "Parca", "Diger", "Ya...
## $ process_type <chr> "Ariza", "Ariza", "Ariza", "Ariza", "Ariza", "Ar...
## $ beginning_date <dttm> 2019-01-22, 2019-01-22, 2019-01-22, 2019-11-28,...
## $ ending_date <dttm> 2019-01-20, 2019-01-20, 2019-01-20, 2019-11-18,...
## $ quantity <dbl> 1.0, 2.0, 20.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0, 1...
## $ price <dbl> 209.96415, 97.63875, 0.01485, 905.86485, 220.908...
## $ Model <chr> "model_23", "model_23", "model_23", "model_23", ...
## $ production_date <dbl> 2016, 2016, 2016, 2017, 2017, 2017, 2016, 2018, ...
## $ job_closed_date <dttm> 2019-01-25, 2019-01-25, 2019-01-25, 2019-11-28,...
## $ vehicle_km <dbl> 732544, 732544, 732544, 499700, 499700, 499700, ...
## $ dealer_city <chr> "Ankara", "Ankara", "Ankara", "Ankara", "Ankara"...
## $ warranty_beginning <dttm> 2016-10-04, 2016-10-04, 2016-10-04, 2017-07-19,...
## $ warranty_ending <dttm> 2018-10-04, 2018-10-04, 2018-10-04, 2019-07-19,...
Trend charts were created by comparing process volumes in services for 2019 and 2020. While 2019 monthly data show a similar trend, 2020 transaction volumes show highly volatility due to the pandemic effect.
dt_2019<- service_dt%>%
group_by(month=lubridate::month(beginning_date),year=lubridate::year(beginning_date)) %>%
filter(year == 2019) %>% summarise(total_quantity = sum(quantity))
graph_2019 <- dt_2019 %>% select(month,total_quantity)
dt_2020<- service_dt%>%
group_by(month=lubridate::month(beginning_date),year=lubridate::year(beginning_date))%>%
filter(year == 2020) %>% summarise(total_quantity = sum(quantity))
graph_2020 <- dt_2020 %>% select(month,total_quantity)
dt_graph = merge(graph_2019, graph_2020, by="month", )
dt_graph_2 <- reshape::melt(dt_graph, id.var='month')
p<-ggplot(dt_graph_2, aes(x=month, y=value, col=variable)) +geom_line(size=0.9) +
ggtitle("Vehicle Operation Trend", "2019-2020 Operations in Services") +
xlab("Month") + ylab("Total Quantity") + labs(color="Year") + theme_classic()
p + scale_color_manual(name="Year",
labels = c("2019", "2020"),
values = c("total_quantity.x"="#D16103",
"total_quantity.y"="#4E84C4"))
The process types in the dataset were shown by proportionally according to their volume. Among the total data for two years, it was seen that the most transactions were made in fault type with 83%
plot_1<- service_dt%>%
mutate(count_total = n())%>%
group_by(process_type)%>%
mutate(count_vehicle_id = n())%>%
select(process_type,count_vehicle_id, count_total)%>%
mutate(rate_process_type = round(100*count_vehicle_id/count_total))%>%
arrange(desc(rate_process_type))%>%
distinct(process_type, rate_process_type)
print(plot_1)
## # A tibble: 11 x 2
## # Groups: process_type [11]
## process_type rate_process_type
## <chr> <dbl>
## 1 Ariza 83
## 2 P.Bakim 6
## 3 Kontrol 4
## 4 Hasar 2
## 5 Kampanya 2
## 6 Garanti 2
## 7 Yol Yardım 0
## 8 Dahili 0
## 9 PDI 0
## 10 Aksesuar 0
## 11 P.Bak+Ariza 0
The most processed fault type was filtered and the model distribution was created.The most processed model type with fault type is seen as model 23 from the graphic.
plot_2<- service_dt%>%
filter(process_type == "Ariza")%>%
group_by(Model)%>%
summarise(count_vehicle_id = n())%>%
arrange(desc(Model))
ggplot(plot_2, aes(x = Model, y = count_vehicle_id, fill= Model))+ ggtitle("Model Distribution in Fault Type") +
geom_bar(stat="identity") +theme(axis.text.x = element_text(angle=90, size=9, vjust=0.5,hjust=1))
In this part, The distribution of the vehicle kms was plotted according to the models. After the extreme values were discovered, a different idea of analysis emerged to expel these values.
plot_3<- service_dt %>%
group_by(Model)%>%
summarise(min_km=min(vehicle_km), max_km = max(vehicle_km), mean_km = mean(vehicle_km))%>%
pivot_longer(cols = c(-Model))
print(plot_3)
## # A tibble: 81 x 3
## Model name value
## <chr> <chr> <dbl>
## 1 model_1 min_km 0
## 2 model_1 max_km 1433502
## 3 model_1 mean_km 330476.
## 4 model_10 min_km 100
## 5 model_10 max_km 481610
## 6 model_10 mean_km 201121.
## 7 model_11 min_km 100
## 8 model_11 max_km 1109000
## 9 model_11 mean_km 426548.
## 10 model_12 min_km 30000
## # ... with 71 more rows
In order to make sure that outliers actually exist the boxplot was created. From the graph which was drawn at below, the outliers could be seen very clearly especially in max graph.
plot_3<- service_dt %>%
group_by(Model)%>%
summarise(min_km=min(vehicle_km), max_km = max(vehicle_km), mean_km = mean(vehicle_km))%>%
pivot_longer(cols = c(-Model))%>%
print(plot_3)
## # A tibble: 81 x 3
## Model name value
## <chr> <chr> <dbl>
## 1 model_1 min_km 0
## 2 model_1 max_km 1433502
## 3 model_1 mean_km 330476.
## 4 model_10 min_km 100
## 5 model_10 max_km 481610
## 6 model_10 mean_km 201121.
## 7 model_11 min_km 100
## 8 model_11 max_km 1109000
## 9 model_11 mean_km 426548.
## 10 model_12 min_km 30000
## # ... with 71 more rows
As it can be seen from the codes which was written at below, a new dataset(“Eliminated”) was created after the determination of the upper and lower ranges.
Q <- quantile(service_dt$vehicle_km, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(service_dt$vehicle_km)
up <- Q[2]+1.5*iqr # Upper Range
low<- Q[1]-1.5*iqr # Lower Range
eliminated<- subset(service_dt, service_dt$vehicle_km> (Q[1] - 1.5*iqr) & service_dt$vehicle_km < (Q[2]+1.5*iqr))
Thanks to the elimination of the outliers the range of vehicle kms graph was decreased. Therefore, the graph could be analyzed in more detail as if with a magnifying glass.
plot_4<- eliminated %>%
group_by(Model)%>%
summarise(min_km=min(vehicle_km), max_km = max(vehicle_km), mean_km = mean(vehicle_km))%>%
pivot_longer(cols = c(-Model))
print(plot_4)
## # A tibble: 81 x 3
## Model name value
## <chr> <chr> <dbl>
## 1 model_1 min_km 0
## 2 model_1 max_km 1312210
## 3 model_1 mean_km 329754.
## 4 model_10 min_km 100
## 5 model_10 max_km 481610
## 6 model_10 mean_km 201121.
## 7 model_11 min_km 100
## 8 model_11 max_km 1109000
## 9 model_11 mean_km 426548.
## 10 model_12 min_km 30000
## # ... with 71 more rows
In order to understand which material types are used more commonly in which process, the following analysis was conducted. It was seen that the “diğer”, “parça” and “oil” materials were ranked in order of high to low, respectively, according to their usage amounts.
plot_5<- eliminated%>%
group_by(process_type, material_type)%>%
summarise(count = sum(quantity))%>%
arrange(desc(count))
kable(plot_5,col.names = c("Process_type", "Material_type.", "Total_quantity"))%>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = "100%", height = "400px")
Process_type | Material_type. | Total_quantity |
---|---|---|
Ariza | Diger | 904975.05 |
Ariza | Parca | 362572.14 |
Ariza | Yag | 134444.46 |
P.Bakim | Diger | 62750.75 |
P.Bakim | Yag | 32835.40 |
Kontrol | Diger | 32586.03 |
Kampanya | Diger | 24405.98 |
Kontrol | Parca | 22615.95 |
Hasar | Diger | 21179.68 |
P.Bakim | Parca | 18936.55 |
Kampanya | Yag | 16248.20 |
Garanti | Diger | 11857.30 |
Garanti | Parca | 10737.15 |
Hasar | Parca | 10176.05 |
Kampanya | Parca | 9446.61 |
Kontrol | Yag | 7087.30 |
Yol Yardım | Diger | 5803.45 |
Dahili | Diger | 4587.10 |
Yol Yardım | Parca | 3288.90 |
Garanti | Yag | 1661.10 |
Dahili | Parca | 1014.00 |
Yol Yardım | Yag | 474.60 |
Aksesuar | Diger | 162.00 |
Dahili | Yag | 142.50 |
PDI | Diger | 68.00 |
Ariza | Aksesuar | 61.00 |
PDI | Parca | 38.00 |
Hasar | Yag | 33.00 |
PDI | Yag | 24.50 |
Aksesuar | Parca | 13.00 |
Kampanya | Aksesuar | 13.00 |
P.Bak+Ariza | Yag | 9.50 |
P.Bak+Ariza | Diger | 4.00 |
P.Bak+Ariza | Parca | 3.00 |
Kontrol | Aksesuar | 1.00 |
P.Bakim | Aksesuar | 1.00 |
ggplot(plot_5, aes(y=count, x = process_type, fill = process_type)) +
geom_col() +
facet_wrap(~material_type, scales ="free") +
theme_minimal() +
theme(legend.position = "none") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
theme(axis.text.x = element_text(angle = 90), legend.position = "none")+
labs(title = "Total Quantity According to Material Type ",
x = "Process Type",
y = "Quantity")
Unlike the usage amount graph, when the total part costs are sorted by much less, it is seen that “parts” come first. And then “other” and “oil” followed. The most important reason for this is that the unit cost of “other” part is very low compared to “part”.
plot_6<- eliminated%>%
group_by(process_type, material_type)%>%
summarise(total_cost = sum(price))%>%
arrange(desc(total_cost))
plot_6
## # A tibble: 36 x 3
## # Groups: process_type [11]
## process_type material_type total_cost
## <chr> <chr> <dbl>
## 1 Ariza Parca 144742205.
## 2 Ariza Diger 80304717.
## 3 Garanti Parca 14356548.
## 4 Kontrol Parca 7035741.
## 5 Hasar Diger 6040451.
## 6 P.Bakim Parca 5957358.
## 7 Hasar Parca 5788021.
## 8 Dahili Parca 4251055.
## 9 Kampanya Parca 3593142.
## 10 Kontrol Diger 2451356.
## # ... with 26 more rows
ggplot(plot_6, aes(y=total_cost, x = process_type, fill = process_type)) +
geom_col() +
facet_wrap(~material_type, scales ="free") +
theme_minimal() +
theme(legend.position = "none") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
theme(axis.text.x = element_text(angle = 90), legend.position = "none")+
labs(title = "Total Cost According to Material Type ",
x = "Process Type",
y = "Total Cost")
The number of processes applied in the cities where the services were located was calculated. Their ratios within the total transactions were calculated.
plot_6<- eliminated%>%
count(dealer_city)%>%
mutate(percentage = 100*round(n/sum(n),3))%>%
arrange(desc(percentage))
kable(plot_6,col.names = c("Dealer_City", "Number of Process.", "Percentage"))%>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = "100%", height = "400px")
Dealer_City | Number of Process. | Percentage |
---|---|---|
İstanbul | 110742 | 20.4 |
Antalya | 87578 | 16.2 |
İzmir | 68050 | 12.6 |
Ankara | 47715 | 8.8 |
Kocaeli | 42691 | 7.9 |
Bursa | 24247 | 4.5 |
Denizli | 21804 | 4.0 |
Gaziantep | 17274 | 3.2 |
Erzurum | 15446 | 2.9 |
Muğla | 14238 | 2.6 |
Nevşehir | 10552 | 1.9 |
Balıkesir | 9945 | 1.8 |
Mersin | 9692 | 1.8 |
Konya | 8715 | 1.6 |
Sakarya | 6632 | 1.2 |
Bolu | 5316 | 1.0 |
Samsun | 4766 | 0.9 |
Trabazon | 4406 | 0.8 |
Düzce | 3571 | 0.7 |
Eskişehir | 3752 | 0.7 |
Hatay | 3256 | 0.6 |
Ada | 2218 | 0.4 |
Adıyaman | 2211 | 0.4 |
Çakkale | 2326 | 0.4 |
Şanlıurfa | 1958 | 0.4 |
Aksaray | 1761 | 0.3 |
Aydın | 1586 | 0.3 |
Uşak | 1729 | 0.3 |
Çankırı | 1212 | 0.2 |
Edirne | 1317 | 0.2 |
Elazığ | 1274 | 0.2 |
Karaman | 656 | 0.1 |
Kayseri | 514 | 0.1 |
Kütahya | 602 | 0.1 |
Malatya | 420 | 0.1 |
Sivas | 433 | 0.1 |
Tekirdağ | 371 | 0.1 |
Afyonkarahisar | 60 | 0.0 |
Çorum | 260 | 0.0 |
Diyarbakır | 33 | 0.0 |
Kahramanmaraş | 6 | 0.0 |
Niğde | 164 | 0.0 |
Van | 249 | 0.0 |
The calculated rates were shown in the graph. It was observed that the transactions were concentrated in the Marmara Region, where industry and transportation are intense, and in the Akdeniz and Ege regions, where tourism activities are intense.
plot_6%>%
filter(percentage >= 1.0)%>%
ggplot(aes(x=percentage, y=reorder(dealer_city, percentage), fill=percentage)) +
geom_col() +
scale_fill_gradient("percentage", low="thistle1", high="thistle4") +
geom_text(aes(label = paste(format(percentage,digits=1), "%")), size=4, position = position_stack(vjust = 0.5)) +
theme_minimal() +
theme(legend.position = "none", plot.title = element_text(vjust = 0.5)) +
labs(x = "Percentages",
y = "Dealers",
title = "The Most Popular Dealers")
In this section, the dataset is grouped according to the model production date variables, and it is aimed to find the model with the highest service expenditure. Model 23 was seen to be the most costly model.
plot_7 <- eliminated %>%
group_by(Model, production_date) %>%
summarise(total_price = sum(price)) %>%
mutate(age_of_car = as.numeric(format(Sys.Date(), "%Y"))-production_date)%>%
arrange(desc(total_price))
plot_7
## # A tibble: 122 x 4
## # Groups: Model [27]
## Model production_date total_price age_of_car
## <chr> <dbl> <dbl> <dbl>
## 1 model_23 2016 30377790. 4
## 2 model_1 2016 23435455. 4
## 3 model_23 2017 22175864. 3
## 4 model_21 2017 21019086. 3
## 5 model_21 2018 20072235. 2
## 6 model_9 2017 14764589. 3
## 7 model_9 2016 11963495. 4
## 8 model_21 2016 11251275. 4
## 9 model_27 2018 9431047. 2
## 10 model_2 2015 9139873. 5
## # ... with 112 more rows
When filtered according to Model 23, which has the highest service cost, it was seen that the vehicles’ service costs with the production year 2016 and 2017 were the two highest models.
plot_8 <- eliminated %>%
group_by(Model, production_date) %>%
filter(Model =="model_23")%>%
summarise(total_price = sum(price)) %>%
arrange(desc(total_price))
plot_8
## # A tibble: 8 x 3
## # Groups: Model [1]
## Model production_date total_price
## <chr> <dbl> <dbl>
## 1 model_23 2016 30377790.
## 2 model_23 2017 22175864.
## 3 model_23 2015 7957232.
## 4 model_23 2014 7569433.
## 5 model_23 2013 2674717.
## 6 model_23 2012 2146688.
## 7 model_23 2018 893249.
## 8 model_23 2011 215316.
According to vehicle age, the rate of warranty coverage of service expenses according to vehicle age has been examined in the chart below. Since the general warranty period is 3 years, the number of transactions made within the warranty scope in the first 3 years has been high but has become very low after 4 years. After the age of 7, all transactions are made out of warranty.
plot_9 <- eliminated %>%
mutate(Warranty_status = ifelse(beginning_date>warranty_ending,"out_of_warranty","warranty_expenses"))%>%
mutate(age_of_car = as.numeric(format(Sys.Date(), "%Y"))-production_date)%>%
group_by(Model, age_of_car, Warranty_status, )%>%
filter(Model =="model_23")%>%
summarise(total_price = sum(price)) %>%
group_by(age_of_car)%>%
mutate(rate_of_expenses = round(total_price / sum(total_price)*100, 2))%>%
arrange(desc(age_of_car))
plot_9
## # A tibble: 13 x 5
## # Groups: age_of_car [8]
## Model age_of_car Warranty_status total_price rate_of_expenses
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 model_23 9 out_of_warranty 215316. 100
## 2 model_23 8 out_of_warranty 2146688. 100
## 3 model_23 7 out_of_warranty 2674717. 100
## 4 model_23 6 out_of_warranty 7436641. 98.2
## 5 model_23 6 warranty_expenses 132791. 1.75
## 6 model_23 5 out_of_warranty 7743090. 97.3
## 7 model_23 5 warranty_expenses 214141. 2.69
## 8 model_23 4 out_of_warranty 29284083. 96.4
## 9 model_23 4 warranty_expenses 1093707. 3.6
## 10 model_23 3 out_of_warranty 12064010. 54.4
## 11 model_23 3 warranty_expenses 10111853. 45.6
## 12 model_23 2 out_of_warranty 174675. 19.6
## 13 model_23 2 warranty_expenses 718573. 80.4
ggplot(plot_9, aes(x = age_of_car, y = rate_of_expenses, fill= Warranty_status))+
geom_bar(stat="identity", position = "dodge") +theme(axis.text.x = element_text(angle=0, size=10, vjust=0,hjust=1))+scale_x_continuous(breaks=seq(0,9,1)) + labs(x="The Rate of Expenses (TL)", y="Age of Car (Year)", color="Warranty Status", title = "The Rate of Warranty Coverage of Service Expenses")