Data Preprocessing

Importing Necessary Libraries

library(tidyverse)
library(rio)
library(knitr)

Reading & Preparing Datasets

This project uses four main datasets for the analysis. The steps for reading and preparing each dataset are described below.

Dataset 1 - Unemployed Job Searching People by the Channel

The original dataset can be found at this link.

The dataset reports the job search methods used by unemployed people, broken down by gender. It covers January 2014 to August 2020.

Since the data in the file does not start at cell A1, the cell range is passed as a parameter during import.
All columns are renamed during import.
The original dataset leaves the year cell empty on some rows, so these cells are filled downward with the correct year value.
The data table is converted to a tibble.
The month column includes both Turkish and English month names. Using the str_split_fixed function, only the English name is kept.
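
The fill and split steps above can be sketched on a small toy table. The values below are hypothetical and only imitate the layout described in the list (a year given once per block of rows, and a month written as "Turkish - English"); they are not taken from the real file.

```r
library(tidyr)    # fill()
library(stringr)  # str_split_fixed()
library(tibble)   # as_tibble()

# Hypothetical toy data imitating the raw sheet layout (not the real figures):
# the year appears only on the first row of each year, and the month column
# holds a Turkish and an English name separated by " - ".
raw <- data.frame(
  year  = c(2014, NA, NA, 2015, NA),
  month = c("Ocak - January", "Şubat - February", "Mart - March",
            "Ocak - January", "Şubat - February"),
  total_unemployed = c(10, 11, 12, 13, 14),
  stringsAsFactors = FALSE
)

# Carry the last non-missing year downward, then convert to a tibble.
toy <- raw %>%
  fill(year, .direction = "down") %>%
  as_tibble()

# Split each month string into two pieces and keep the second (English) one.
toy$month <- str_split_fixed(toy$month, " - ", 2)[, 2]

print(toy)
```

Because str_split_fixed always returns a matrix with the requested number of columns, indexing with [, 2] works even for rows where the separator is missing; such rows simply yield an empty string.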

All Genders Dataset

job_search_overall <- import("https://github.com/pjournal/mef04g-rhapsody/blob/gh-pages/Project_Data/search_channel.xls?raw=true", 
                             range = "TÜRKİYE!A6:P91", 
                             col_names = c("year", "month", "total_unemployed", "to_employers", "to_relatives", "to_emp_office", "to_emp_agencies", "to_newspaper", "insert_ad_to_newspaper", "take_interview", "look_place_equip_to_est_bus", "look_credit_license_to_est_bus", "wait_call_from_emp_office", "wait_result_of_app", "wait_result_of_comp_for_public_sec", "others")) %>% 
  fill(year, .direction = "down") %>%
  as_tibble()

job_search_overall$month <- str_split_fixed(job_search_overall$month, " - ", 2)[,2]

The final tibble is as follows.

## Rows: 86
## Columns: 16
## $ year                               <dbl> 2014, 2014, 2014, 2014, 2014, 2014…
## $ month                              <chr> "January", "February", "March", "A…
## $ total_unemployed                   <dbl> 2804, 2825, 2747, 2579, 2551, 2654…
## $ to_employers                       <dbl> 1826, 1814, 1731, 1621, 1603, 1689…
## $ to_relatives                       <dbl> 2581, 2605, 2535, 2364, 2326, 2417…
## $ to_emp_office                      <dbl> 592, 573, 531, 489, 500, 507, 555,…
## $ to_emp_agencies                    <dbl> 422, 420, 404, 391, 395, 425, 433,…
## $ to_newspaper                       <dbl> 828, 846, 846, 828, 852, 903, 909,…
## $ insert_ad_to_newspaper             <dbl> 185, 214, 229, 214, 203, 199, 185,…
## $ take_interview                     <dbl> 159, 171, 153, 154, 156, 173, 205,…
## $ look_place_equip_to_est_bus        <dbl> 66, 65, 71, 55, 62, 57, 58, 50, 46…
## $ look_credit_license_to_est_bus     <dbl> 33, 37, 37, 32, 35, 39, 41, 37, 33…
## $ wait_call_from_emp_office          <dbl> 420, 424, 398, 377, 352, 366, 396,…
## $ wait_result_of_app                 <dbl> 1227, 1261, 1232, 1115, 1108, 1162…
## $ wait_result_of_comp_for_public_sec <dbl> 101, 95, 65, 61, 62, 124, 191, 240…
## $ others                             <dbl> 7, 7, 6, 4, 3, 2, 1, 1, 2, 4, 3, 3…

Male Dataset

job_search_male <- import("https://github.com/pjournal/mef04g-rhapsody/blob/gh-pages/Project_Data/search_channel.xls?raw=true", 
                          range = "TÜRKİYE!R6:AG91", 
                          col_names = c("year", "month", "total_unemployed", "to_employers", "to_relatives", "to_emp_office", "to_emp_agencies", "to_newspaper", "insert_ad_to_newspaper", "take_interview", "look_place_equip_to_est_bus", "look_credit_license_to_est_bus", "wait_call_from_emp_office", "wait_result_of_app", "wait_result_of_comp_for_public_sec", "others")) %>% 
  fill(year, .direction = "down") %>%
  as_tibble()

job_search_male$month <- str_split_fixed(job_search_male$month, " - ", 2)[,2]

The final tibble is as follows.

## Rows: 86
## Columns: 16
## $ year                               <dbl> 2014, 2014, 2014, 2014, 2014, 2014…
## $ month                              <chr> "January", "February", "March", "A…
## $ total_unemployed                   <dbl> 1889, 1882, 1803, 1675, 1616, 1683…
## $ to_employers                       <dbl> 1262, 1235, 1161, 1081, 1036, 1086…
## $ to_relatives                       <dbl> 1760, 1747, 1668, 1543, 1488, 1556…
## $ to_emp_office                      <dbl> 356, 353, 329, 290, 292, 292, 322,…
## $ to_emp_agencies                    <dbl> 246, 251, 226, 203, 209, 230, 227,…
## $ to_newspaper                       <dbl> 511, 519, 513, 473, 490, 517, 519,…
## $ insert_ad_to_newspaper             <dbl> 114, 129, 142, 128, 126, 119, 108,…
## $ take_interview                     <dbl> 83, 88, 80, 83, 84, 82, 102, 108, …
## $ look_place_equip_to_est_bus        <dbl> 56, 57, 61, 44, 48, 43, 43, 38, 35…
## $ look_credit_license_to_est_bus     <dbl> 28, 33, 31, 24, 26, 30, 34, 30, 29…
## $ wait_call_from_emp_office          <dbl> 248, 259, 246, 223, 207, 209, 227,…
## $ wait_result_of_app                 <dbl> 763, 784, 754, 690, 676, 704, 741,…
## $ wait_result_of_comp_for_public_sec <dbl> 44, 44, 32, 32, 32, 59, 82, 97, 87…
## $ others                             <dbl> 3, 4, 4, 4, 3, 2, 1, 1, 1, 2, 1, 1…

Female Dataset

job_search_female <- import("https://github.com/pjournal/mef04g-rhapsody/blob/gh-pages/Project_Data/search_channel.xls?raw=true", 
                            range = "TÜRKİYE!AI6:AX91", 
                            col_names = c("year", "month", "total_unemployed", "to_employers", "to_relatives", "to_emp_office", "to_emp_agencies", "to_newspaper", "insert_ad_to_newspaper", "take_interview", "look_place_equip_to_est_bus", "look_credit_license_to_est_bus", "wait_call_from_emp_office", "wait_result_of_app", "wait_result_of_comp_for_public_sec", "others")) %>% 
  fill(year, .direction = "down") %>%
  as_tibble()

job_search_female$month <- str_split_fixed(job_search_female$month, " - ", 2)[,2]

The final tibble is as follows.

## Rows: 86
## Columns: 16
## $ year                               <dbl> 2014, 2014, 2014, 2014, 2014, 2014…
## $ month                              <chr> "January", "February", "March", "A…
## $ total_unemployed                   <dbl> 915, 942, 944, 903, 935, 971, 1065…
## $ to_employers                       <dbl> 564, 579, 570, 540, 568, 603, 644,…
## $ to_relatives                       <dbl> 822, 857, 868, 822, 838, 861, 940,…
## $ to_emp_office                      <dbl> 236, 220, 202, 199, 208, 215, 233,…
## $ to_emp_agencies                    <dbl> 176, 170, 177, 188, 186, 195, 206,…
## $ to_newspaper                       <dbl> 318, 328, 333, 354, 362, 386, 390,…
## $ insert_ad_to_newspaper             <dbl> 71, 85, 87, 87, 77, 80, 78, 91, 11…
## $ take_interview                     <dbl> 76, 83, 73, 70, 72, 91, 103, 118, …
## $ look_place_equip_to_est_bus        <dbl> 11, 8, 10, 11, 14, 14, 15, 12, 11,…
## $ look_credit_license_to_est_bus     <dbl> 6, 3, 7, 8, 9, 9, 7, 7, 3, 4, 7, 9…
## $ wait_call_from_emp_office          <dbl> 172, 165, 152, 154, 145, 156, 169,…
## $ wait_result_of_app                 <dbl> 464, 478, 478, 425, 432, 458, 496,…
## $ wait_result_of_comp_for_public_sec <dbl> 57, 50, 33, 28, 30, 65, 109, 142, …
## $ others                             <dbl> 3, 3, 2, 1, 0, 0, 0, 1, 2, 2, 2, 1…

Dataset 2 - Employed & Unemployed by Educational Level

The original dataset can be found at this link.

This dataset includes the number of employed and unemployed people by educational level and gender. It covers January 2014 to August 2020.

Since the data in the file does not start at cell A1, the cell range is passed as a parameter during import.
All columns are renamed during import.
There are two empty columns in the file. These are dropped using the select function.
The original dataset leaves the year cell empty on some rows, so these cells are filled downward with the correct year value.
The data table is converted to a tibble.
The month column includes both Turkish and English month names. Using the str_split_fixed function, only the English name is kept.
Two January entries in the month column are formatted differently from the rest, so the split returns an empty string for them. These empty strings are located and replaced with January.
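
As a rough illustration of the last two cleaning steps, the sketch below drops a placeholder column and repairs a month entry that does not split cleanly. The toy values, the column names lf_total and emp_total, and the irregular "Ocak-January" entry are assumptions made for the example; the source does not say how the two January cells actually differ.

```r
library(dplyr)    # select()
library(stringr)  # str_split_fixed()
library(tibble)   # tibble()

# Hypothetical toy data: one placeholder column that is entirely empty and
# one month entry whose separator lacks the surrounding spaces.
raw <- tibble(
  year        = c(2014, 2014, 2015),
  month       = c("Ocak - January", "Şubat - February", "Ocak-January"),
  lf_total    = c(1, 2, 3),
  empty_col_1 = c(NA, NA, NA),
  emp_total   = c(4, 5, 6)
)

# Drop the placeholder column by name.
toy <- raw %>% select(-empty_col_1)

# When the " - " separator is not found, the second piece is an empty string.
toy$month <- str_split_fixed(toy$month, " - ", 2)[, 2]

# Replace the empty strings; this is only safe if every affected row is January.
toy$month[toy$month == ""] <- "January"

print(toy)
```

A blanket replacement like this assumes that no other month is affected, so it may be worth inspecting the rows with an empty month before overwriting them.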

All Genders Dataset

educational_level_overall <- import("https://github.com/pjournal/mef04g-rhapsody/blob/gh-pages/Project_Data/educational_level.xls?raw=true", 
                                    range = "TÜRKİYE!A7:S92", 
                                    col_names = c("year", "month", "lf_illeterate", "lf_less_than_hs", "lf_highschool", "lf_voc_hs", "lf_higher_ed", "empty_col_1", "emp_illeterate", "emp_less_than_hs", "emp_highschool", "emp_voc_hs", "emp_higher_ed", "empty_col_2", "unemp_illeterate", "unemp_less_than_hs", "unemp_highschool", "unemp_voc_hs", "unemp_higher_ed" )) %>% 
  select(-empty_col_1, -empty_col_2) %>% 
  fill(year, .direction = "down") %>%
  as_tibble()

educational_level_overall$month <- str_split_fixed(educational_level_overall$month, " - ", 2)[,2]

educational_level_overall$month[educational_level_overall$month == ""] <- "January"

The final tibble is as follows.

## Rows: 86
## Columns: 17
## $ year               <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 20…
## $ month              <chr> "January", "February", "March", "April", "May", "J…
## $ lf_illeterate      <dbl> 1018, 1087, 1141, 1210, 1248, 1285, 1199, 1179, 11…
## $ lf_less_than_hs    <dbl> 15209, 15587, 15954, 16287, 16492, 16585, 16388, 1…
## $ lf_highschool      <dbl> 2991, 3028, 3020, 2987, 3028, 3012, 2992, 2962, 29…
## $ lf_voc_hs          <dbl> 2672, 2733, 2726, 2765, 2795, 2849, 2979, 2961, 29…
## $ lf_higher_ed       <dbl> 5370, 5390, 5489, 5524, 5527, 5509, 5719, 5826, 59…
## $ emp_illeterate     <dbl> 942, 1008, 1061, 1135, 1180, 1226, 1139, 1117, 111…
## $ emp_less_than_hs   <dbl> 13610, 13979, 14400, 14858, 15110, 15186, 14942, 1…
## $ emp_highschool     <dbl> 2646, 2680, 2667, 2656, 2667, 2642, 2601, 2602, 25…
## $ emp_voc_hs         <dbl> 2381, 2430, 2444, 2502, 2539, 2573, 2666, 2645, 26…
## $ emp_higher_ed      <dbl> 4876, 4902, 5011, 5043, 5043, 4959, 5061, 5083, 51…
## $ unemp_illeterate   <dbl> 76, 78, 79, 75, 68, 59, 59, 62, 75, 86, 86, 79, 73…
## $ unemp_less_than_hs <dbl> 1599, 1607, 1554, 1429, 1382, 1400, 1447, 1462, 15…
## $ unemp_highschool   <dbl> 344, 348, 353, 331, 360, 370, 390, 361, 348, 341, …
## $ unemp_voc_hs       <dbl> 291, 303, 282, 262, 257, 276, 313, 316, 323, 329, …
## $ unemp_higher_ed    <dbl> 494, 488, 478, 481, 484, 550, 657, 743, 769, 725, …

Dataset 3 - Unemployment by Occupational Group

The original dataset can be found at this link.

This dataset includes the number of unemployed people by occupational group and gender. It covers January 2014 to August 2020.

Since the data in the file does not start at cell A1, the cell range is passed as a parameter during import.
All columns are renamed during import.
The original dataset leaves the year cell empty on some rows, so these cells are filled downward with the correct year value.
The data table is converted to a tibble.
The month column includes both Turkish and English month names. Using the str_split_fixed function, only the English name is kept.
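
Since the same cleaning steps are applied to the overall, male, and female tables, they could also be wrapped in a small helper. The function clean_year_month below is hypothetical and not part of the original code, and the toy input only imitates the sheet layout.

```r
library(tidyr)    # fill()
library(stringr)  # str_split_fixed()
library(tibble)   # as_tibble()

# Hypothetical helper (not in the original report): applies the shared
# cleaning steps to any imported table with year and month columns.
clean_year_month <- function(df) {
  df <- df %>%
    fill(year, .direction = "down") %>%
    as_tibble()
  # Keep only the English part of "Turkish - English" month labels.
  df$month <- str_split_fixed(df$month, " - ", 2)[, 2]
  df
}

# Toy input with made-up values, standing in for one imported range.
raw_block <- data.frame(
  year    = c(2014, NA, 2015),
  month   = c("Ocak - January", "Şubat - February", "Ocak - January"),
  manager = c(1, 2, 3),
  stringsAsFactors = FALSE
)

cleaned <- clean_year_month(raw_block)
print(cleaned)
```

In the real pipeline, such a helper would be applied to the result of each import call, so that only the range argument differs between the three tables.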

All Genders Dataset

occ_group_overall <- import("https://github.com/pjournal/mef04g-rhapsody/blob/gh-pages/Project_Data/occupational_group.xls?raw=true", 
                            range = "TÜRKİYE!A7:L92", 
                            col_names = c("year", "month", "total_unemployed", "manager", "prof", "tech", "cleric", "service", "agricul", "trade", "operator", "elemantary")) %>%
  fill(year, .direction = "down") %>%
  as_tibble()

occ_group_overall$month <- str_split_fixed(occ_group_overall$month, " - ", 2)[,2]

The final tibble is as follows.

## Rows: 86
## Columns: 12
## $ year             <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014…
## $ month            <chr> "January", "February", "March", "April", "May", "Jun…
## $ total_unemployed <dbl> 2804, 2825, 2747, 2579, 2551, 2654, 2867, 2944, 3064…
## $ manager          <dbl> 63, 69, 79, 69, 60, 62, 66, 66, 60, 52, 64, 67, 65, …
## $ prof             <dbl> 238, 217, 204, 227, 244, 286, 342, 390, 396, 351, 31…
## $ tech             <dbl> 180, 169, 166, 157, 170, 180, 206, 218, 236, 250, 25…
## $ cleric           <dbl> 342, 350, 354, 351, 376, 395, 415, 434, 461, 462, 44…
## $ service          <dbl> 653, 691, 693, 623, 618, 653, 732, 710, 704, 670, 69…
## $ agricul          <dbl> 32, 27, 26, 23, 25, 25, 24, 23, 22, 20, 20, 20, 24, …
## $ trade            <dbl> 479, 472, 441, 399, 384, 368, 392, 382, 406, 404, 43…
## $ operator         <dbl> 257, 274, 260, 271, 253, 258, 255, 258, 271, 283, 30…
## $ elemantary       <dbl> 560, 555, 523, 458, 421, 427, 435, 462, 507, 552, 56…

Male Dataset

occ_group_male <- import("https://github.com/pjournal/mef04g-rhapsody/blob/gh-pages/Project_Data/occupational_group.xls?raw=true", 
                         range = "TÜRKİYE!N7:Y92", 
                         col_names = c("year", "month", "total_unemployed", "manager", "prof", "tech", "cleric", "service", "agricul", "trade", "operator", "elemantary")) %>%
  fill(year, .direction = "down") %>%
  as_tibble()

occ_group_male$month <- str_split_fixed(occ_group_male$month, " - ", 2)[,2]

The final tibble is as follows.

## Rows: 86
## Columns: 12
## $ year             <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014…
## $ month            <chr> "January", "February", "March", "April", "May", "Jun…
## $ total_unemployed <dbl> 1889, 1882, 1803, 1675, 1616, 1683, 1801, 1810, 1844…
## $ manager          <dbl> 47, 51, 56, 52, 45, 49, 53, 53, 49, 43, 50, 50, 50, …
## $ prof             <dbl> 106, 97, 86, 96, 100, 115, 146, 152, 151, 131, 139, …
## $ tech             <dbl> 97, 91, 89, 87, 86, 97, 108, 120, 118, 121, 125, 135…
## $ cleric           <dbl> 144, 133, 135, 113, 130, 137, 143, 143, 155, 161, 15…
## $ service          <dbl> 408, 418, 422, 381, 380, 406, 450, 439, 409, 394, 41…
## $ agricul          <dbl> 29, 25, 22, 20, 21, 22, 22, 20, 20, 18, 19, 19, 21, …
## $ trade            <dbl> 434, 424, 395, 366, 346, 325, 350, 337, 364, 358, 40…
## $ operator         <dbl> 240, 259, 248, 251, 233, 236, 235, 241, 252, 267, 27…
## $ elemantary       <dbl> 385, 384, 349, 310, 275, 296, 293, 305, 326, 358, 36…

Female Dataset

occ_group_female <- import("https://github.com/pjournal/mef04g-rhapsody/blob/gh-pages/Project_Data/occupational_group.xls?raw=true", 
                           range = "TÜRKİYE!AA7:AL92", 
                           col_names = c("year", "month", "total_unemployed", "manager", "prof", "tech", "cleric", "service", "agricul", "trade", "operator", "elemantary")) %>%
  fill(year, .direction = "down") %>%
  as_tibble()

occ_group_female$month <- str_split_fixed(occ_group_female$month, " - ", 2)[,2]

The final tibble is as follows.

## Rows: 86
## Columns: 12
## $ year             <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014…
## $ month            <chr> "January", "February", "March", "April", "May", "Jun…
## $ total_unemployed <dbl> 915, 942, 944, 903, 935, 971, 1065, 1133, 1220, 1194…
## $ manager          <dbl> 16, 18, 23, 18, 16, 13, 12, 14, 11, 9, 14, 17, 15, 2…
## $ prof             <dbl> 132, 120, 118, 130, 144, 171, 196, 238, 245, 220, 17…
## $ tech             <dbl> 84, 77, 77, 71, 84, 83, 97, 98, 118, 129, 133, 118, …
## $ cleric           <dbl> 199, 217, 220, 238, 246, 258, 272, 292, 306, 301, 28…
## $ service          <dbl> 244, 273, 271, 242, 238, 247, 282, 271, 295, 277, 28…
## $ agricul          <dbl> 3, 2, 3, 3, 3, 2, 2, 3, 3, 2, 2, 1, 2, 2, 1, 1, 1, 2…
## $ trade            <dbl> 45, 48, 46, 34, 38, 43, 42, 45, 42, 46, 36, 41, 42, …
## $ operator         <dbl> 17, 16, 12, 20, 21, 22, 20, 17, 19, 16, 23, 32, 19, …
## $ elemantary       <dbl> 175, 171, 174, 148, 146, 131, 142, 157, 181, 194, 19…

Dataset 4 - Unemployment of Higher Education Graduates by Major

The original dataset can be found at this link.

This dataset contains the annual numbers of employed and unemployed people by field of education. It covers 2014 to 2019.

Since the data in the file does not start at cell A1, the cell range is passed as a parameter during import.
All columns are renamed during import.
The original dataset leaves the year cell empty on some rows, so these cells are filled downward with the correct year value.
The data table is converted to a tibble.

Dataset

last_graduated_major <- import("https://github.com/pjournal/mef04g-rhapsody/blob/gh-pages/Project_Data/major_field.xls?raw=true", 
                               range = "TURKIYE!A6:Y41", 
                               col_names = c("year", "statistics", "higher_ed_grad", "education", "arts", "humanities", "languages", "social_sci", "journalism", "business", "law", "biology_env_related_sci", "physical_sci", "math_stat", "info_communication_tech", "engineering", "manufacturing_processing", "architecture_construction", "agriculture_forestry_fishery", "veterinary", "health", "welfare", "personal_services", "occupational_health_transport_services", "security_services")) %>%
  fill(year, .direction = "down") %>%
  as_tibble()

The final tibble is as follows.

## Rows: 36
## Columns: 25
## $ year                                   <dbl> 2014, 2014, 2014, 2014, 2014, …
## $ statistics                             <chr> "İşgücü                      \…
## $ higher_ed_grad                         <dbl> 5691.00000, 606.00000, 5085.00…
## $ education                              <dbl> 786.000000, 58.000000, 728.000…
## $ arts                                   <dbl> 141.00000, 23.00000, 118.00000…
## $ humanities                             <dbl> 164.000000, 11.000000, 153.000…
## $ languages                              <dbl> 111.000000, 9.000000, 102.0000…
## $ social_sci                             <dbl> 527.00000, 59.00000, 468.00000…
## $ journalism                             <dbl> 24.00000, 7.00000, 17.00000, 2…
## $ business                               <dbl> 1547.00000, 211.00000, 1336.00…
## $ law                                    <dbl> 109.000000, 8.000000, 101.0000…
## $ biology_env_related_sci                <dbl> 75.00000, 11.00000, 64.00000, …
## $ physical_sci                           <dbl> 154.00000, 22.00000, 132.00000…
## $ math_stat                              <dbl> 86.000000, 8.000000, 78.000000…
## $ info_communication_tech                <dbl> 119.00000, 20.00000, 99.00000,…
## $ engineering                            <dbl> 662.000000, 58.000000, 604.000…
## $ manufacturing_processing               <dbl> 129.00000, 17.00000, 112.00000…
## $ architecture_construction              <dbl> 230.00000, 25.00000, 205.00000…
## $ agriculture_forestry_fishery           <dbl> 129.00000, 15.00000, 114.00000…
## $ veterinary                             <dbl> 45.000000, 3.000000, 42.000000…
## $ health                                 <dbl> 350.000000, 11.000000, 339.000…
## $ welfare                                <dbl> 26.00000, 5.00000, 21.00000, 1…
## $ personal_services                      <dbl> 139.00000, 18.00000, 121.00000…
## $ occupational_health_transport_services <dbl> 16.00000, 2.00000, 14.00000, 1…
## $ security_services                      <dbl> 119.000000, 3.000000, 116.0000…

Creating the .RData File

After all the datasets are prepared, the tibbles are saved into a single file named project_all_data.RData. In further analysis, loading this one .RData file is sufficient to access all the necessary data.

save(job_search_overall, job_search_male, job_search_female, educational_level_overall,
     occ_group_overall, occ_group_male, occ_group_female, last_graduated_major, 
     file = "project_all_data.RData")
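
For later analysis, the saved objects can be restored with load(). The sketch below shows the round trip on toy objects written to a temporary file; the object names toy_a and toy_b and their values are hypothetical stand-ins for the prepared tibbles.

```r
# Toy objects standing in for the prepared tibbles (hypothetical values).
toy_a <- data.frame(year = c(2014, 2015), total = c(1, 2))
toy_b <- data.frame(year = 2014, major = "law")

# Save to a temporary file so the sketch does not touch the project file.
rdata_path <- tempfile(fileext = ".RData")
save(toy_a, toy_b, file = rdata_path)

# load() restores objects under their original names. Loading into a
# separate environment avoids overwriting anything in the workspace.
loaded <- new.env()
restored_names <- load(rdata_path, envir = loaded)

print(restored_names)
print(loaded$toy_a)
```

In an analysis script, calling load("project_all_data.RData") without the envir argument would place the tibbles directly in the global environment under the names used in the save() call.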

The resulting .RData file is available at this link.


Data Preprocessing

Group Rhapsody

26 Dec, 2020
