1. Introduction

1.1 Data

The dataset covers up all the recorded service information of x company’s dealers in Turkey. The data is belong the dates between 2019 and 2020. This data is obtained from x company where one of our team member is working for.In the dataset we have 561,142 rows and 17 columns which gives different kind of parameters. According to the parameters a reader can get information about details of service processes such as model number, km, warranty start & end date, prices, vehicle id etc.

You can find variables as below:

  • Material_id: Material id
  • Vehicle_id: Vehicle id
  • Dealer_id: Dealer id
  • Job_order_number: Job order number
  • material_type: Material type that used in service process
  • process_type: Process type
  • beginning_date: Date of beginning the service process
  • ending_date: Date of ending the service process
  • quantity: Quantity of part or material used
  • price: Price
  • Model:Model of vehicle
  • production_date: Production year of model
  • job_closed_date: Job order ending date
  • vehicle_km: Kilometer of vehicle
  • dealer_city: City of the dealer
  • warranty_beginning: Date of beginning the warranty
  • warranty_ending: Date of ending the warranty

1.2. Objectives

In this project, we will first investigate the data for preprocessing to improve its quality. Then we will perform an exploratory data analysis(EDA) by data manipulation and data visualization steps.

  1. Data Preprocessing
  2. Data Manipulation
  3. Data Visualization
  4. Interactive Shiny App

1.3 Used Packages

The packages used during the project can be listed as below:

  • tidyverse
  • shiny
  • ggplot2
  • lubridate
  • knitr
  • rio
  • reshape
  • kableExtra

We can use the glimpse function to inspect our data. By using it,each column is represented in a row with its data type and first few entries. We have 561,142 rows and 17 variables.

library(readxl)
library(tidyverse)
library(lubridate)
library(openxlsx)
library(rio)
library(reshape)
library(kableExtra)

raw_data <- rio::import("https://github.com/pjournal/mef04g-madagaskar/blob/gh-pages/Data/x_vehicle_company_service_dataset_v1.xlsx?raw=True")

raw_dt=na.omit(raw_data)

glimpse(raw_dt)
## Rows: 561,142
## Columns: 17
## $ Malzeme                    <chr> "malzeme_236", "malzeme_2156", "malzeme_...
## $ `Sasi No`                  <chr> "sasi_no_9883", "sasi_no_9883", "sasi_no...
## $ `müsteri No`               <chr> "müsteri_15", "müsteri_15", "müsteri_15"...
## $ `Is Emri No`               <chr> "000000000127819", "000000000127819", "0...
## $ `Malzeme tipi`             <chr> "Diger", "Parca", "Diger", "Parca", "Dig...
## $ `Islem tipi`               <chr> "Ariza", "Ariza", "Ariza", "Ariza", "Ari...
## $ `Arac giris tarihi`        <dttm> 2019-01-22, 2019-01-22, 2019-01-22, 201...
## $ `Arac cikis tarihi`        <dttm> 2019-01-20, 2019-01-20, 2019-01-20, 201...
## $ Miktar                     <dbl> 1.0, 2.0, 20.0, 1.0, 1.0, 10.0, 1.0, 1.0...
## $ `Net fiyat`                <dbl> 209.96415, 97.63875, 0.01485, 905.86485,...
## $ Model                      <chr> "model_23", "model_23", "model_23", "mod...
## $ `Yil-Model`                <dbl> 2016, 2016, 2016, 2017, 2017, 2017, 2016...
## $ `Is emri kapanis tarihi`   <chr> "25/01/2019", "25/01/2019", "25/01/2019"...
## $ `Arac KM`                  <dbl> 732544, 732544, 732544, 499700, 499700, ...
## $ `Firma Sehir`              <chr> "Ankara", "Ankara", "Ankara", "Ankara", ...
## $ `Garanti Baslangic tarihi` <dttm> 2016-10-04, 2016-10-04, 2016-10-04, 201...
## $ `Garanti Bitis tarihi`     <dttm> 2018-10-04, 2018-10-04, 2018-10-04, 201...

Transforming the job_closed_date column from character to datetime by using a function of lubridate which is as.POSIXct

raw_dt1 = raw_dt
raw_dt1$'Is emri kapanis tarihi' = str_replace_all(raw_dt1$'Is emri kapanis tarihi', "/","-")
raw_dt1$'Is emri kapanis tarihi' <- as.POSIXct(raw_dt1$'Is emri kapanis tarihi',format = "%d-%m-%Y")
glimpse(raw_dt1)
## Rows: 561,142
## Columns: 17
## $ Malzeme                    <chr> "malzeme_236", "malzeme_2156", "malzeme_...
## $ `Sasi No`                  <chr> "sasi_no_9883", "sasi_no_9883", "sasi_no...
## $ `müsteri No`               <chr> "müsteri_15", "müsteri_15", "müsteri_15"...
## $ `Is Emri No`               <chr> "000000000127819", "000000000127819", "0...
## $ `Malzeme tipi`             <chr> "Diger", "Parca", "Diger", "Parca", "Dig...
## $ `Islem tipi`               <chr> "Ariza", "Ariza", "Ariza", "Ariza", "Ari...
## $ `Arac giris tarihi`        <dttm> 2019-01-22, 2019-01-22, 2019-01-22, 201...
## $ `Arac cikis tarihi`        <dttm> 2019-01-20, 2019-01-20, 2019-01-20, 201...
## $ Miktar                     <dbl> 1.0, 2.0, 20.0, 1.0, 1.0, 10.0, 1.0, 1.0...
## $ `Net fiyat`                <dbl> 209.96415, 97.63875, 0.01485, 905.86485,...
## $ Model                      <chr> "model_23", "model_23", "model_23", "mod...
## $ `Yil-Model`                <dbl> 2016, 2016, 2016, 2017, 2017, 2017, 2016...
## $ `Is emri kapanis tarihi`   <dttm> 2019-01-25, 2019-01-25, 2019-01-25, 201...
## $ `Arac KM`                  <dbl> 732544, 732544, 732544, 499700, 499700, ...
## $ `Firma Sehir`              <chr> "Ankara", "Ankara", "Ankara", "Ankara", ...
## $ `Garanti Baslangic tarihi` <dttm> 2016-10-04, 2016-10-04, 2016-10-04, 201...
## $ `Garanti Bitis tarihi`     <dttm> 2018-10-04, 2018-10-04, 2018-10-04, 201...

Renaming of the columns by using rename() function in order to get clear information about data.

service_dt <- dplyr::rename(raw_dt1, "Material_id" = 1,
                 "Vehicle_id" = 2,
                 "Dealer_id" = 3,
                 "Job_order_number" = 4,
                 "material_type" = 5,
                 "process_type" = 6,
                 "beginning_date" = 7,
                 "ending_date" = 8,
                 "quantity" = 9,
                 "price" = 10,
                 "Model" = 11,
                 "production_date" = 12,
                 "job_closed_date" = 13,
                 "vehicle_km" = 14,
                 "dealer_city" = 15,
                 "warranty_beginning" = 16,
                 "warranty_ending" = 17)


glimpse(service_dt)
## Rows: 561,142
## Columns: 17
## $ Material_id        <chr> "malzeme_236", "malzeme_2156", "malzeme_14571", ...
## $ Vehicle_id         <chr> "sasi_no_9883", "sasi_no_9883", "sasi_no_9883", ...
## $ Dealer_id          <chr> "müsteri_15", "müsteri_15", "müsteri_15", "müste...
## $ Job_order_number   <chr> "000000000127819", "000000000127819", "000000000...
## $ material_type      <chr> "Diger", "Parca", "Diger", "Parca", "Diger", "Ya...
## $ process_type       <chr> "Ariza", "Ariza", "Ariza", "Ariza", "Ariza", "Ar...
## $ beginning_date     <dttm> 2019-01-22, 2019-01-22, 2019-01-22, 2019-11-28,...
## $ ending_date        <dttm> 2019-01-20, 2019-01-20, 2019-01-20, 2019-11-18,...
## $ quantity           <dbl> 1.0, 2.0, 20.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0, 1...
## $ price              <dbl> 209.96415, 97.63875, 0.01485, 905.86485, 220.908...
## $ Model              <chr> "model_23", "model_23", "model_23", "model_23", ...
## $ production_date    <dbl> 2016, 2016, 2016, 2017, 2017, 2017, 2016, 2018, ...
## $ job_closed_date    <dttm> 2019-01-25, 2019-01-25, 2019-01-25, 2019-11-28,...
## $ vehicle_km         <dbl> 732544, 732544, 732544, 499700, 499700, 499700, ...
## $ dealer_city        <chr> "Ankara", "Ankara", "Ankara", "Ankara", "Ankara"...
## $ warranty_beginning <dttm> 2016-10-04, 2016-10-04, 2016-10-04, 2017-07-19,...
## $ warranty_ending    <dttm> 2018-10-04, 2018-10-04, 2018-10-04, 2019-07-19,...

2. Exploratory Data Analysis - EDA

2.1 Vehicle Operation Trend

Trend charts were created by comparing process volumes in services for 2019 and 2020. While 2019 monthly data show a similar trend, 2020 transaction volumes show highly volatility due to the pandemic effect.

dt_2019<- service_dt%>%
  group_by(month=lubridate::month(beginning_date),year=lubridate::year(beginning_date)) %>%
  filter(year == 2019) %>% summarise(total_quantity = sum(quantity))
graph_2019 <- dt_2019 %>% select(month,total_quantity)
dt_2020<- service_dt%>%
  group_by(month=lubridate::month(beginning_date),year=lubridate::year(beginning_date))%>%
  filter(year == 2020) %>% summarise(total_quantity = sum(quantity))
graph_2020 <- dt_2020 %>% select(month,total_quantity)
dt_graph = merge(graph_2019, graph_2020, by="month", )
dt_graph_2 <- reshape::melt(dt_graph, id.var='month')
p<-ggplot(dt_graph_2, aes(x=month, y=value, col=variable)) +geom_line(size=0.9) + 
  ggtitle("Vehicle Operation Trend", "2019-2020 Operations in Services") + 
  xlab("Month") + ylab("Total Quantity") + labs(color="Year") + theme_classic()
p + scale_color_manual(name="Year",
                      labels = c("2019", "2020"),
                      values = c("total_quantity.x"="#D16103",
                                 "total_quantity.y"="#4E84C4"))

2.2 Proportion of Process Types

The process types in the dataset were shown by proportionally according to their volume. Among the total data for two years, it was seen that the most transactions were made in fault type with 83%

plot_1<- service_dt%>%
  mutate(count_total = n())%>%
  group_by(process_type)%>%
  mutate(count_vehicle_id = n())%>%
  select(process_type,count_vehicle_id, count_total)%>%
  mutate(rate_process_type = round(100*count_vehicle_id/count_total))%>%
  arrange(desc(rate_process_type))%>%
  distinct(process_type, rate_process_type)
print(plot_1)
## # A tibble: 11 x 2
## # Groups:   process_type [11]
##    process_type rate_process_type
##    <chr>                    <dbl>
##  1 Ariza                       83
##  2 P.Bakim                      6
##  3 Kontrol                      4
##  4 Hasar                        2
##  5 Kampanya                     2
##  6 Garanti                      2
##  7 Yol Yardım                   0
##  8 Dahili                       0
##  9 PDI                          0
## 10 Aksesuar                     0
## 11 P.Bak+Ariza                  0
ggplot(plot_1, aes(x="", y=rate_process_type, fill=process_type)) + geom_bar(stat="identity") + 
  coord_polar("y") + ggtitle("Proportion of Process Types") +
  theme_minimal()+ theme(axis.title.y = element_blank(), plot.title = element_text(hjust=0.5))

2.3 Model Distribution in Fault Type

The most processed fault type was filtered and the model distribution was created.The most processed model type with fault type is seen as model 23 from the graphic.

plot_2<- service_dt%>%
  filter(process_type == "Ariza")%>%
  group_by(Model)%>%
  summarise(count_vehicle_id = n())%>%
  arrange(desc(Model))

ggplot(plot_2, aes(x = Model, y = count_vehicle_id, fill= Model))+ ggtitle("Model Distribution in Fault Type") +
  geom_bar(stat="identity") +theme(axis.text.x = element_text(angle=90, size=9, vjust=0.5,hjust=1))

2.4 Analyzing Min, Max and Mean of The Vehicle Kms

In this part, The distribution of the vehicle kms was plotted according to the models. After the extreme values were discovered, a different idea of analysis emerged to expel these values.

plot_3<- service_dt %>%
  group_by(Model)%>%
  summarise(min_km=min(vehicle_km), max_km = max(vehicle_km), mean_km = mean(vehicle_km))%>%
  pivot_longer(cols = c(-Model))
print(plot_3)
## # A tibble: 81 x 3
##    Model    name       value
##    <chr>    <chr>      <dbl>
##  1 model_1  min_km        0 
##  2 model_1  max_km  1433502 
##  3 model_1  mean_km  330476.
##  4 model_10 min_km      100 
##  5 model_10 max_km   481610 
##  6 model_10 mean_km  201121.
##  7 model_11 min_km      100 
##  8 model_11 max_km  1109000 
##  9 model_11 mean_km  426548.
## 10 model_12 min_km    30000 
## # ... with 71 more rows
 ggplot(plot_3, aes(x= Model,y = value, color = name)) +
   geom_point()+ theme(axis.text.x = element_text(angle=90, size=9, vjust=0.5,hjust=1))+
   labs(title = "Min, Max and Mean Kms Accoring to the Models")

2.5 Discovering Outliers

In order to make sure that outliers actually exist the boxplot was created. From the graph which was drawn at below, the outliers could be seen very clearly especially in max graph.

plot_3<- service_dt %>%
  group_by(Model)%>%
  summarise(min_km=min(vehicle_km), max_km = max(vehicle_km), mean_km = mean(vehicle_km))%>%
  pivot_longer(cols = c(-Model))%>%
  print(plot_3)
## # A tibble: 81 x 3
##    Model    name       value
##    <chr>    <chr>      <dbl>
##  1 model_1  min_km        0 
##  2 model_1  max_km  1433502 
##  3 model_1  mean_km  330476.
##  4 model_10 min_km      100 
##  5 model_10 max_km   481610 
##  6 model_10 mean_km  201121.
##  7 model_11 min_km      100 
##  8 model_11 max_km  1109000 
##  9 model_11 mean_km  426548.
## 10 model_12 min_km    30000 
## # ... with 71 more rows
 ggplot(plot_3, aes(x= name, y=value, fill= name)) +
  geom_boxplot() +
  theme_test() +
  labs(title = "KMs in Boxplot", y = "Km")+
  scale_y_continuous(breaks=seq(0, 24315000, 5000000))

2.6 Elimination of The Outliers

As it can be seen from the codes which was written at below, a new dataset(“Eliminated”) was created after the determination of the upper and lower ranges.

Q <- quantile(service_dt$vehicle_km, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(service_dt$vehicle_km)
up <-  Q[2]+1.5*iqr # Upper Range  
low<- Q[1]-1.5*iqr # Lower Range

eliminated<- subset(service_dt, service_dt$vehicle_km> (Q[1] - 1.5*iqr) & service_dt$vehicle_km < (Q[2]+1.5*iqr))

Thanks to the elimination of the outliers the range of vehicle kms graph was decreased. Therefore, the graph could be analyzed in more detail as if with a magnifying glass.

plot_4<- eliminated %>%
  group_by(Model)%>%
  summarise(min_km=min(vehicle_km), max_km = max(vehicle_km), mean_km = mean(vehicle_km))%>%
  pivot_longer(cols = c(-Model))
print(plot_4)
## # A tibble: 81 x 3
##    Model    name       value
##    <chr>    <chr>      <dbl>
##  1 model_1  min_km        0 
##  2 model_1  max_km  1312210 
##  3 model_1  mean_km  329754.
##  4 model_10 min_km      100 
##  5 model_10 max_km   481610 
##  6 model_10 mean_km  201121.
##  7 model_11 min_km      100 
##  8 model_11 max_km  1109000 
##  9 model_11 mean_km  426548.
## 10 model_12 min_km    30000 
## # ... with 71 more rows
 ggplot(plot_4, aes(x= Model,y = value, color = name)) +
   geom_point()+ theme(axis.text.x = element_text(angle=90, size=9, vjust=0.5,hjust=1))

2.7 The Usage Amount of The Material Types

In order to understand which material types are used more commonly in which process, the following analysis was conducted. It was seen that the “diğer”, “parça” and “oil” materials were ranked in order of high to low, respectively, according to their usage amounts.

plot_5<- eliminated%>%
  group_by(process_type, material_type)%>%
  summarise(count = sum(quantity))%>%
  arrange(desc(count))
kable(plot_5,col.names = c("Process_type", "Material_type.", "Total_quantity"))%>%
   kable_styling("striped", full_width = F) %>%
  scroll_box(width = "100%", height = "400px")
Process_type Material_type. Total_quantity
Ariza Diger 904975.05
Ariza Parca 362572.14
Ariza Yag 134444.46
P.Bakim Diger 62750.75
P.Bakim Yag 32835.40
Kontrol Diger 32586.03
Kampanya Diger 24405.98
Kontrol Parca 22615.95
Hasar Diger 21179.68
P.Bakim Parca 18936.55
Kampanya Yag 16248.20
Garanti Diger 11857.30
Garanti Parca 10737.15
Hasar Parca 10176.05
Kampanya Parca 9446.61
Kontrol Yag 7087.30
Yol Yardım Diger 5803.45
Dahili Diger 4587.10
Yol Yardım Parca 3288.90
Garanti Yag 1661.10
Dahili Parca 1014.00
Yol Yardım Yag 474.60
Aksesuar Diger 162.00
Dahili Yag 142.50
PDI Diger 68.00
Ariza Aksesuar 61.00
PDI Parca 38.00
Hasar Yag 33.00
PDI Yag 24.50
Aksesuar Parca 13.00
Kampanya Aksesuar 13.00
P.Bak+Ariza Yag 9.50
P.Bak+Ariza Diger 4.00
P.Bak+Ariza Parca 3.00
Kontrol Aksesuar 1.00
P.Bakim Aksesuar 1.00
  ggplot(plot_5, aes(y=count, x = process_type, fill = process_type)) +
  geom_col() +
  facet_wrap(~material_type, scales ="free") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  theme(axis.text.x = element_text(angle = 90), legend.position = "none")+
  labs(title = "Total Quantity According to Material Type ",
       x = "Process Type",
       y = "Quantity")

Unlike the usage amount graph, when the total part costs are sorted by much less, it is seen that “parts” come first. And then “other” and “oil” followed. The most important reason for this is that the unit cost of “other” part is very low compared to “part”.

plot_6<- eliminated%>%
  group_by(process_type, material_type)%>%
  summarise(total_cost = sum(price))%>%
  arrange(desc(total_cost))
plot_6
## # A tibble: 36 x 3
## # Groups:   process_type [11]
##    process_type material_type total_cost
##    <chr>        <chr>              <dbl>
##  1 Ariza        Parca         144742205.
##  2 Ariza        Diger          80304717.
##  3 Garanti      Parca          14356548.
##  4 Kontrol      Parca           7035741.
##  5 Hasar        Diger           6040451.
##  6 P.Bakim      Parca           5957358.
##  7 Hasar        Parca           5788021.
##  8 Dahili       Parca           4251055.
##  9 Kampanya     Parca           3593142.
## 10 Kontrol      Diger           2451356.
## # ... with 26 more rows
ggplot(plot_6, aes(y=total_cost, x = process_type, fill = process_type)) +
  geom_col() +
  facet_wrap(~material_type, scales ="free") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  theme(axis.text.x = element_text(angle = 90), legend.position = "none")+
  labs(title = "Total Cost According to Material Type ",
       x = "Process Type",
       y = "Total Cost")

2.9 The Total Price According to Types of Model

In this section, the dataset is grouped according to the model production date variables, and it is aimed to find the model with the highest service expenditure. Model 23 was seen to be the most costly model.

plot_7 <- eliminated %>%
  group_by(Model, production_date) %>%
  summarise(total_price = sum(price)) %>%
  mutate(age_of_car = as.numeric(format(Sys.Date(), "%Y"))-production_date)%>%
  arrange(desc(total_price))
plot_7
## # A tibble: 122 x 4
## # Groups:   Model [27]
##    Model    production_date total_price age_of_car
##    <chr>              <dbl>       <dbl>      <dbl>
##  1 model_23            2016   30377790.          4
##  2 model_1             2016   23435455.          4
##  3 model_23            2017   22175864.          3
##  4 model_21            2017   21019086.          3
##  5 model_21            2018   20072235.          2
##  6 model_9             2017   14764589.          3
##  7 model_9             2016   11963495.          4
##  8 model_21            2016   11251275.          4
##  9 model_27            2018    9431047.          2
## 10 model_2             2015    9139873.          5
## # ... with 112 more rows
ggplot(plot_7, aes(x = Model, y = total_price, fill= Model))+
  geom_point(stat="identity", position = "stack") +theme(axis.text.x = element_text(angle=90, size=9, vjust=0.5,hjust=1))+
  labs(x="Model", y="Total Price (TL)", fill ="Types of Model", title = "The Total Price According to Types of Model")

2.10 The Most Costly Model: Model 23

When filtered according to Model 23, which has the highest service cost, it was seen that the vehicles’ service costs with the production year 2016 and 2017 were the two highest models.

plot_8 <- eliminated %>%
  group_by(Model, production_date) %>%
  filter(Model =="model_23")%>%
  summarise(total_price = sum(price)) %>%
  arrange(desc(total_price))
  
plot_8
## # A tibble: 8 x 3
## # Groups:   Model [1]
##   Model    production_date total_price
##   <chr>              <dbl>       <dbl>
## 1 model_23            2016   30377790.
## 2 model_23            2017   22175864.
## 3 model_23            2015    7957232.
## 4 model_23            2014    7569433.
## 5 model_23            2013    2674717.
## 6 model_23            2012    2146688.
## 7 model_23            2018     893249.
## 8 model_23            2011     215316.

2.11 The Rate of Warranty Coverage of Service Expenses Belongs to Model 23

According to vehicle age, the rate of warranty coverage of service expenses according to vehicle age has been examined in the chart below. Since the general warranty period is 3 years, the number of transactions made within the warranty scope in the first 3 years has been high but has become very low after 4 years. After the age of 7, all transactions are made out of warranty.

plot_9 <- eliminated %>%
  mutate(Warranty_status = ifelse(beginning_date>warranty_ending,"out_of_warranty","warranty_expenses"))%>%
  mutate(age_of_car = as.numeric(format(Sys.Date(), "%Y"))-production_date)%>%
  group_by(Model, age_of_car, Warranty_status, )%>%
  filter(Model =="model_23")%>%
  summarise(total_price = sum(price)) %>%
  group_by(age_of_car)%>%
  mutate(rate_of_expenses = round(total_price / sum(total_price)*100, 2))%>%
  arrange(desc(age_of_car))
  plot_9
## # A tibble: 13 x 5
## # Groups:   age_of_car [8]
##    Model    age_of_car Warranty_status   total_price rate_of_expenses
##    <chr>         <dbl> <chr>                   <dbl>            <dbl>
##  1 model_23          9 out_of_warranty       215316.           100   
##  2 model_23          8 out_of_warranty      2146688.           100   
##  3 model_23          7 out_of_warranty      2674717.           100   
##  4 model_23          6 out_of_warranty      7436641.            98.2 
##  5 model_23          6 warranty_expenses     132791.             1.75
##  6 model_23          5 out_of_warranty      7743090.            97.3 
##  7 model_23          5 warranty_expenses     214141.             2.69
##  8 model_23          4 out_of_warranty     29284083.            96.4 
##  9 model_23          4 warranty_expenses    1093707.             3.6 
## 10 model_23          3 out_of_warranty     12064010.            54.4 
## 11 model_23          3 warranty_expenses   10111853.            45.6 
## 12 model_23          2 out_of_warranty       174675.            19.6 
## 13 model_23          2 warranty_expenses     718573.            80.4
ggplot(plot_9, aes(x = age_of_car, y = rate_of_expenses, fill= Warranty_status))+
  geom_bar(stat="identity", position = "dodge") +theme(axis.text.x = element_text(angle=0, size=10, vjust=0,hjust=1))+scale_x_continuous(breaks=seq(0,9,1)) + labs(x="The Rate of Expenses (TL)", y="Age of Car (Year)", color="Warranty Status", title = "The Rate of Warranty Coverage of Service Expenses")