Main purpose of this document is to introduce a major data manipulation package, dplyr
, with a contemporary subject. There are seven fundamental dplyr
functions: select
/rename
, filter
, distinct
, arrange
, mutate
/transmute
, group_by
and summarise
. In this document, we will cover every one of them and there will be supplementary functions to carry out data operations. Also, pipe operator (%>%
) will be briefly introduced.
There are two prerequisites to start: Install tidyverse
package and putting the relevant data set into the working directory (write getwd()
in the console to locate your working directory). In this document, topic of the data set is the ballot box level result data of Istanbul Mayoral Elections of June 23, 2019 (p.s. it is a re-run of March 2019 election).
To install the package run install.packages("tidyverse")
in the console and select a mirror (first one is quite ok). Once you install the library you can always call it with library(tidyverse)
command (no need to reinstall). You can download the data set from its GitHub Repository.
library(tidyverse) #tidyverse is a package group which includes dplyr as well
raw_df <- readRDS("local_election_20190623_education.rds")
First of those above commands calls the package (different from installing) The second command assigns the data to raw_df
variable. There are two types of assignment operators in R: <-
and =
. No operation in R is permanent unless you assign it to somewhere (There are exceptions, though. See data.table
package for instance.). We will benefit from this property in this document as well. No matter how many operations we do on each example we will always start from the original data frame.
Let’s do a simple sanity check. The output of the following command reads “31,186 x 14”" in the first line, which means there are 31,186 rows and 14 columns in the data frame. There are abbreviations of data types under each column name. These are usually character/string (
print(raw_df,n=3)
## # A tibble: 31,186 x 14
## sandik_ID sandik_NO ilce_ID ilce_ADI muhtarlik_ID muhtarlik_ADI
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 3445931 1025 3 ADALAR 95627 MADEN MAH.
## 2 3445970 1020 3 ADALAR 95627 MADEN MAH.
## 3 3445991 1032 3 ADALAR 95628 NİZAM MAH.
## # … with 3.118e+04 more rows, and 8 more variables: secmen_SAYISI <dbl>,
## # oy_KULLANAN_SECMEN_SAYISI <dbl>, gecerli_OY_TOPLAMI <dbl>, CHP <dbl>,
## # AK_PARTI <dbl>, SAADET <dbl>, VATAN_PARTISI <dbl>,
## # bagimsiz_TOPLAM_OY <dbl>
Also we can use glimpse
function to inspect. Using glimpse
each column is represented in a row with its data type and first few entries.
raw_df %>% glimpse()
## Observations: 31,186
## Variables: 14
## $ sandik_ID <chr> "3445931", "3445970", "3445991", "3445…
## $ sandik_NO <chr> "1025", "1020", "1032", "1019", "1029"…
## $ ilce_ID <chr> "3", "3", "3", "3", "3", "3", "3", "3"…
## $ ilce_ADI <chr> "ADALAR", "ADALAR", "ADALAR", "ADALAR"…
## $ muhtarlik_ID <chr> "95627", "95627", "95628", "95627", "9…
## $ muhtarlik_ADI <chr> "MADEN MAH.", "MADEN MAH.", "NİZAM MAH…
## $ secmen_SAYISI <dbl> 329, 283, 338, 283, 292, 336, 330, 328…
## $ oy_KULLANAN_SECMEN_SAYISI <dbl> 257, 242, 271, 248, 248, 282, 271, 270…
## $ gecerli_OY_TOPLAMI <dbl> 252, 234, 271, 246, 244, 272, 269, 265…
## $ CHP <dbl> 160, 176, 189, 189, 168, 175, 187, 179…
## $ AK_PARTI <dbl> 90, 56, 81, 57, 76, 97, 82, 86, 75, 93…
## $ SAADET <dbl> 2, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0,…
## $ VATAN_PARTISI <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0,…
## $ bagimsiz_TOPLAM_OY <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Did you notice the expression we used this time? Pipe operator makes data analysis and transformation very easy and civilized. We will use pipes frequently in this document and in the future.
We can connect many functions without calling the variable multiple times with the help of the pipe operator.
In this sections basic functions of dplyr
are introduced. Every function will be used in the following examples after it has been introduced. To limit the number of displayed rows, print command is used. You can ignore that part in your exercises.
select
/rename
Select, as the name suggests, is used to select columns. For instance, suppose we only use district (ilçe) and neighborhood (muhtarlık) names, and the number of electorates (seçmen sayısı).
raw_df %>% select(ilce_ADI,muhtarlik_ADI,secmen_SAYISI) %>% print(n=3)
## # A tibble: 31,186 x 3
## ilce_ADI muhtarlik_ADI secmen_SAYISI
## <chr> <chr> <dbl>
## 1 ADALAR MADEN MAH. 329
## 2 ADALAR MADEN MAH. 283
## 3 ADALAR NİZAM MAH. 338
## # … with 3.118e+04 more rows
If we wanted to write the above expression without the pipe operator, we could go with the sad expression below. You can extrapolate how complicated things can be without the pipe.
print(select(raw_df,ilce_ADI,muhtarlik_ADI,secmen_SAYISI),n=3)
We can use rename
to rename columns (again as the name suggests). Let’s change sandik_ID to sandik_id.
raw_df %>% rename(sandik_id = sandik_ID) %>% print(n=3)
## # A tibble: 31,186 x 14
## sandik_id sandik_NO ilce_ID ilce_ADI muhtarlik_ID muhtarlik_ADI
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 3445931 1025 3 ADALAR 95627 MADEN MAH.
## 2 3445970 1020 3 ADALAR 95627 MADEN MAH.
## 3 3445991 1032 3 ADALAR 95628 NİZAM MAH.
## # … with 3.118e+04 more rows, and 8 more variables: secmen_SAYISI <dbl>,
## # oy_KULLANAN_SECMEN_SAYISI <dbl>, gecerli_OY_TOPLAMI <dbl>, CHP <dbl>,
## # AK_PARTI <dbl>, SAADET <dbl>, VATAN_PARTISI <dbl>,
## # bagimsiz_TOPLAM_OY <dbl>
p.s. We can rename columns inside select function.
Select has many convenient sub operators and special expressions. If we know the order of columns, we can use the scope (:
) expression to get all the columns determined by the scope. Suppose, we want ballot box id (sandik_ID) and number of votes per party.
raw_df %>% select(sandik_id=sandik_ID,CHP:bagimsiz_TOPLAM_OY) %>% print(n=3)
## # A tibble: 31,186 x 6
## sandik_id CHP AK_PARTI SAADET VATAN_PARTISI bagimsiz_TOPLAM_OY
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3445931 160 90 2 0 0
## 2 3445970 176 56 2 0 0
## 3 3445991 189 81 0 1 0
## # … with 3.118e+04 more rows
We can eliminate unwanted columns by putting -
before the names. Suppose I am not interested in ID values, want to remove all ID related columns from the data set, and all ID related column names end with "_ID". We can do it using -
and ends_with
.
raw_df %>% select(-ends_with("_ID")) %>% print(n=3)
## # A tibble: 31,186 x 11
## sandik_NO ilce_ADI muhtarlik_ADI secmen_SAYISI oy_KULLANAN_SEC…
## <chr> <chr> <chr> <dbl> <dbl>
## 1 1025 ADALAR MADEN MAH. 329 257
## 2 1020 ADALAR MADEN MAH. 283 242
## 3 1032 ADALAR NİZAM MAH. 338 271
## # … with 3.118e+04 more rows, and 6 more variables:
## # gecerli_OY_TOPLAMI <dbl>, CHP <dbl>, AK_PARTI <dbl>, SAADET <dbl>,
## # VATAN_PARTISI <dbl>, bagimsiz_TOPLAM_OY <dbl>
There are similar expressions for other purposes, such as starts_with
and contains
. You can see all the expressions in the Cheat Sheet link given at the end of this document.
filter
Filter (no more “as the name suggests”, as you already figured it out) helps filter rows according to given criteria. It is highly similar with Excel’s filter functionality (but much much more flexible and reliable).
Let’s say we only want the ballot boxes in SARIYER district.
raw_df %>% filter(ilce_ADI == "SARIYER") %>% print(n=3)
## # A tibble: 753 x 14
## sandik_ID sandik_NO ilce_ID ilce_ADI muhtarlik_ID muhtarlik_ADI
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 3471320 1009 497 SARIYER 96135 BAHÇEKÖY MER…
## 2 3471321 1010 497 SARIYER 96135 BAHÇEKÖY MER…
## 3 3471317 1006 497 SARIYER 96135 BAHÇEKÖY MER…
## # … with 750 more rows, and 8 more variables: secmen_SAYISI <dbl>,
## # oy_KULLANAN_SECMEN_SAYISI <dbl>, gecerli_OY_TOPLAMI <dbl>, CHP <dbl>,
## # AK_PARTI <dbl>, SAADET <dbl>, VATAN_PARTISI <dbl>,
## # bagimsiz_TOPLAM_OY <dbl>
By using ==
operator, we bring the values in ilce_ADI (district) column which are exactly equal to “SARIYER”. There are other expressions such as not equal to (!=
), greater than (or equal to) (>
,>=
), smaller than (or equal to) (<
,<=
), in (%in%
) and some more.
Suppose we want the rows that are not in Beşiktaş or Beyoğlu.
raw_df %>% filter(!(ilce_ADI %in% c("BEŞİKTAŞ","BEYOĞLU"))) %>% print(n=3)
## # A tibble: 30,269 x 14
## sandik_ID sandik_NO ilce_ID ilce_ADI muhtarlik_ID muhtarlik_ADI
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 3445931 1025 3 ADALAR 95627 MADEN MAH.
## 2 3445970 1020 3 ADALAR 95627 MADEN MAH.
## 3 3445991 1032 3 ADALAR 95628 NİZAM MAH.
## # … with 3.027e+04 more rows, and 8 more variables: secmen_SAYISI <dbl>,
## # oy_KULLANAN_SECMEN_SAYISI <dbl>, gecerli_OY_TOPLAMI <dbl>, CHP <dbl>,
## # AK_PARTI <dbl>, SAADET <dbl>, VATAN_PARTISI <dbl>,
## # bagimsiz_TOPLAM_OY <dbl>
At the same time we can make comparisons between columns and combine multiple criteria to create more complex filters. Here we use AND (&
) and OR (|
) operators to combine criteria.
Suppose we want to see neighborhood names (muhtarlik_ADI) of the rows in ZEYTİNBURNU where also CHP had more votes than AK_PARTI.
raw_df %>% filter(ilce_ADI == "ZEYTİNBURNU" & CHP > AK_PARTI) %>% select(muhtarlik_ADI) %>% print(n=3)
## # A tibble: 345 x 1
## muhtarlik_ADI
## <chr>
## 1 KAZLIÇEŞME MAH.
## 2 TELSİZ MAH.
## 3 GÖKALP MAH.
## # … with 342 more rows
distinct
Distinct returns unique rows, based on the values of one or a combination of several columns. In the above example we listed all rows that match the given criteria and saw 345 names. Instead, we need the name of each eligible neighborhood only once.
raw_df %>% filter(ilce_ADI == "ZEYTİNBURNU" & CHP > AK_PARTI) %>% distinct(muhtarlik_ADI) %>% print(n=3)
## # A tibble: 13 x 1
## muhtarlik_ADI
## <chr>
## 1 KAZLIÇEŞME MAH.
## 2 TELSİZ MAH.
## 3 GÖKALP MAH.
## # … with 10 more rows
Suppose we don’t want to lose other columns, then we change .keep_all
parameter to TRUE (default is FALSE.
raw_df %>% filter(ilce_ADI == "ZEYTİNBURNU" & CHP > AK_PARTI) %>% distinct(muhtarlik_ADI,.keep_all = TRUE) %>% print(n=3)
## # A tibble: 13 x 14
## sandik_ID sandik_NO ilce_ID ilce_ADI muhtarlik_ID muhtarlik_ADI
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 3464008 1046 632 ZEYTİNB… 96320 KAZLIÇEŞME M…
## 2 3466714 2273 632 ZEYTİNB… 96326 TELSİZ MAH.
## 3 3463809 1010 632 ZEYTİNB… 96319 GÖKALP MAH.
## # … with 10 more rows, and 8 more variables: secmen_SAYISI <dbl>,
## # oy_KULLANAN_SECMEN_SAYISI <dbl>, gecerli_OY_TOPLAMI <dbl>, CHP <dbl>,
## # AK_PARTI <dbl>, SAADET <dbl>, VATAN_PARTISI <dbl>,
## # bagimsiz_TOPLAM_OY <dbl>
Distinct keeps the first occurance and removes any duplicates. For instance if the same value is seen on 3rd and 5th rows, it keeps 3rd row and removes 5th row.
arrange
Arrange sorts rows from A to Z or smallest to largest. It has great similarity with Excel’s Sort functionality.
Let’s arrange by neighborhood names (muhtarlik_ADI).
raw_df %>% arrange(muhtarlik_ADI) %>% distinct(muhtarlik_ADI,ilce_ADI) %>% print(n=3)
## # A tibble: 980 x 2
## muhtarlik_ADI ilce_ADI
## <chr> <chr>
## 1 100. YIL MAH. BAĞCILAR
## 2 15 TEMMUZ MAH. BAĞCILAR
## 3 19 MAYIS MAH. BÜYÜKÇEKMECE
## # … with 977 more rows
To reverse the order (Z-A, largest to smallest), we should wrap columns with desc
. We can also do operations inside arrange. For instance let’s order Saadet Party’s (SAADET) highest voting rates with respect to valid votes (gecerli_OY_TOPLAMI).
raw_df %>% arrange(desc(SAADET/gecerli_OY_TOPLAMI)) %>% select(sandik_ID,ilce_ADI,muhtarlik_ADI,gecerli_OY_TOPLAMI,SAADET) %>% print(n=3)
## # A tibble: 31,186 x 5
## sandik_ID ilce_ADI muhtarlik_ADI gecerli_OY_TOPLAMI SAADET
## <chr> <chr> <chr> <dbl> <dbl>
## 1 3452129 FATİH BALAT MAH. 286 14
## 2 3446252 ATAŞEHİR BARBAROS MAH. 267 13
## 3 3469136 SULTANGAZİ CEBECİ MAH. 290 14
## # … with 3.118e+04 more rows
We can use multiple columns inside arrange which every sort will take place where values of the previous sort are equal. For instance let’s sort by number of voters and valid votes count.
raw_df %>% select(sandik_ID,ilce_ADI,muhtarlik_ADI,secmen_SAYISI,gecerli_OY_TOPLAMI) %>% arrange(desc(secmen_SAYISI),desc(gecerli_OY_TOPLAMI)) %>% print(n=3)
## # A tibble: 31,186 x 5
## sandik_ID ilce_ADI muhtarlik_ADI secmen_SAYISI gecerli_OY_TOPLAMI
## <chr> <chr> <chr> <dbl> <dbl>
## 1 3471189 BEYLİKDÜZÜ BARIŞ MAH. 355 306
## 2 3470764 ÜMRANİYE MEHMET AKİF MAH. 355 306
## 3 3471191 BEYLİKDÜZÜ BARIŞ MAH. 355 298
## # … with 3.118e+04 more rows
mutate
/transmute
Mutate is the function when we do operations and calculations using other columns. For instance let’s calculate AK PARTI’s vote percentages. (round
function is used for rounding)
raw_df %>% select(sandik_ID,muhtarlik_ADI,ilce_ADI,gecerli_OY_TOPLAMI,AK_PARTI) %>% mutate(AK_PARTI_oran = round(AK_PARTI/gecerli_OY_TOPLAMI,2)) %>% print(n=3)
## # A tibble: 31,186 x 6
## sandik_ID muhtarlik_ADI ilce_ADI gecerli_OY_TOPLA… AK_PARTI AK_PARTI_oran
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 3445931 MADEN MAH. ADALAR 252 90 0.36
## 2 3445970 MADEN MAH. ADALAR 234 56 0.24
## 3 3445991 NİZAM MAH. ADALAR 271 81 0.3
## # … with 3.118e+04 more rows
You can use many R functions (from both base functions and other packages). For instance to calculate competition between two major parties, we can use the following expression.
raw_df %>% mutate(winner = ifelse(AK_PARTI > CHP, "AK Parti", "CHP"),winner=ifelse(AK_PARTI==CHP,"Tie",winner)) %>% select(sandik_ID,muhtarlik_ADI,ilce_ADI,gecerli_OY_TOPLAMI,winner) %>% print(n=3)
## # A tibble: 31,186 x 5
## sandik_ID muhtarlik_ADI ilce_ADI gecerli_OY_TOPLAMI winner
## <chr> <chr> <chr> <dbl> <chr>
## 1 3445931 MADEN MAH. ADALAR 252 CHP
## 2 3445970 MADEN MAH. ADALAR 234 CHP
## 3 3445991 NİZAM MAH. ADALAR 271 CHP
## # … with 3.118e+04 more rows
Transmute has the same functionality as mutate with the additional property similar to select. Transmute returns only the columns included in the function. Suppose we want to calculate percentage vote differences between AK PARTI and CHP.
raw_df %>% transmute(sandik_ID,muhtarlik_ADI,ilce_ADI,gecerli_OY_TOPLAMI,percentage_diff=round((AK_PARTI-CHP)/gecerli_OY_TOPLAMI,2)) %>% arrange(desc(percentage_diff)) %>% print(n=3)
## # A tibble: 31,186 x 5
## sandik_ID muhtarlik_ADI ilce_ADI gecerli_OY_TOPLAMI percentage_diff
## <chr> <chr> <chr> <dbl> <dbl>
## 1 3454863 AKFIRAT MAH. TUZLA 313 0.95
## 2 3454874 AKFIRAT MAH. TUZLA 314 0.92
## 3 3454884 AKFIRAT MAH. TUZLA 310 0.9
## # … with 3.118e+04 more rows
group_by
/summarise
Finally we will learn how to calculate summary tables. It is similar to Pivot Tables in Excel. group_by
is the grouping function, summarise
is the summarising function.
For instance let’s calculate number of ballot boxes (i.e. rows) at each district. We will use a special function n()
to calculate number of rows.
raw_df %>% group_by(ilce_ADI) %>% summarise(n_of_boxes=n()) %>% print(n=3)
## # A tibble: 39 x 2
## ilce_ADI n_of_boxes
## <chr> <int>
## 1 ADALAR 39
## 2 ARNAVUTKÖY 524
## 3 ATAŞEHİR 885
## # … with 36 more rows
There are some other summarizing functions for instance total (sum
), minimum (min
), maximum (max
), average/mean (mean
), median (median
) that can be used with summarise.
We can make groups with the combination of multiple columns. Let’s find neighborhoods with the highest number of voters. Some functions can be used inside groups, so we need to remove grouping attribute using ungroup
function after we summarise.
raw_df %>% group_by(muhtarlik_ADI,ilce_ADI) %>% summarise(toplam_secmen=sum(secmen_SAYISI)) %>% ungroup() %>% arrange(desc(toplam_secmen)) %>% print(n=3)
## # A tibble: 980 x 3
## muhtarlik_ADI ilce_ADI toplam_secmen
## <chr> <chr> <dbl>
## 1 ATAKENT MAH. KÜÇÜKÇEKMECE 64836
## 2 ADNAN KAHVECİ MAH. BEYLİKDÜZÜ 61582
## 3 ZAFER MAH. BAHÇELİEVLER 58084
## # … with 977 more rows
## # A tibble: 2 x 1
## ilce_ADI
## <chr>
## 1 EYÜPSULTAN
## 2 GÜNGÖREN
## # A tibble: 798 x 4
## muhtarlik_ADI ilce_ADI gecerli_OY_TOPLA… CHP
## <chr> <chr> <dbl> <dbl>
## 1 SİLİVRİ 3 NOLU L TİPİ KAPALI CEZA İNFA… SİLİVRİ 166 161
## 2 HUZUR MAH. SARIYER 297 288
## 3 SİLİVRİ 6 NOLU L TİPİ KAPALI CEZA İNFA… SİLİVRİ 155 150
## 4 AKAT MAH. BEŞİKTAŞ 267 258
## 5 SİLİVRİ 1 NOLU L TİPİ KAPALI CEZA İNFA… SİLİVRİ 108 104
## 6 GÖZTEPE MAH. KADIKÖY 268 258
## 7 SİLİVRİ 2 NOLU L TİPİ KAPALI CEZA İNFA… SİLİVRİ 163 156
## 8 CADDEBOSTAN MAH. KADIKÖY 276 264
## 9 FENERBAHÇE MAH. KADIKÖY 275 263
## 10 BOSTANCI MAH. KADIKÖY 272 260
## # … with 788 more rows
## # A tibble: 1 x 7
## # Groups: muhtarlik_ID [1]
## muhtarlik_ID muhtarlik_ADI CHP AK_PARTI SAADET VATAN_PARTISI Diger
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 96035 ÇAĞLAYAN MAH. 0.563 0.431 0.004 0.001 0.001
## # A tibble: 39 x 2
## ilce_ADI gecerli_oran
## <chr> <dbl>
## 1 ARNAVUTKÖY 0.973
## 2 SULTANBEYLİ 0.974
## 3 ESENYURT 0.976
## 4 ÇATALCA 0.977
## 5 SİLİVRİ 0.977
## 6 ZEYTİNBURNU 0.977
## 7 SULTANGAZİ 0.977
## 8 ESENLER 0.978
## 9 BAĞCILAR 0.978
## 10 GAZİOSMANPAŞA 0.978
## # … with 29 more rows
## # A tibble: 39 x 2
## ilce_ADI kazanilan_muht_orani
## <chr> <dbl>
## 1 SULTANBEYLİ 1
## 2 ESENLER 0.941
## 3 BAĞCILAR 0.909
## 4 SULTANGAZİ 0.733
## 5 ARNAVUTKÖY 0.710
## 6 PENDİK 0.649
## 7 ŞİLE 0.645
## 8 ÜMRANİYE 0.590
## 9 SANCAKTEPE 0.579
## 10 BEYKOZ 0.556
## # … with 29 more rows
abs
function to get absolute values e.g. abs(CHP-AK_PARTI))## # A tibble: 39 x 4
## ilce_ADI toplam_oy CHP AK_PARTI
## <chr> <dbl> <dbl> <dbl>
## 1 FATİH 221380 0.495 0.494
## 2 ŞİLE 25022 0.498 0.494
## 3 BEYKOZ 153583 0.498 0.493
## 4 GÜNGÖREN 169957 0.488 0.503
## 5 KAĞITHANE 259726 0.486 0.506
## 6 ÇEKMEKÖY 145908 0.507 0.486
## 7 BAYRAMPAŞA 166233 0.507 0.485
## 8 BAŞAKŞEHİR 219934 0.476 0.514
## 9 BEYOĞLU 135354 0.516 0.475
## 10 ÜMRANİYE 406548 0.475 0.516
## # … with 29 more rows
abs
function inside mutate)## # A tibble: 603 x 6
## muhtarlik_ADI ilce_ADI secmen_SAYISI AK_PARTI CHP fark
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 YAKUPLU MAH. BEYLİKDÜZÜ 29865 11829 11824 5
## 2 BALIKYOLU MAH. ESENYURT 17660 6850 6838 12
## 3 CUMHURİYET MAH. ÜSKÜDAR 27713 11689 11671 18
## 4 HARMANTEPE MAH. KAĞITHANE 15577 6201 6180 21
## 5 TAHTAKALE MAH. AVCILAR 34244 14312 14334 -22
## 6 CUMHURİYET MAH. ÇEKMEKÖY 8930 3640 3663 -23
## 7 PINAR MAH. SARIYER 8093 3355 3322 33
## 8 BULGURLU MAH. ÜSKÜDAR 23573 9935 9975 -40
## 9 ORHAN GAZİ MAH. ESENYURT 7333 2897 2940 -43
## 10 GÜRPINAR MAH. BEYLİKDÜZÜ 13878 5597 5551 46
## # … with 593 more rows
## # A tibble: 39 x 2
## ilce_ADI yuksek_katilim
## <chr> <dbl>
## 1 ŞİLE 0.258
## 2 SİLİVRİ 0.222
## 3 ÇATALCA 0.154
## 4 MALTEPE 0.143
## 5 BEYKOZ 0.0667
## 6 BAKIRKÖY 0.0625
## 7 ESENLER 0.0588
## 8 TUZLA 0.0588
## 9 ÜMRANİYE 0.0513
## 10 ÇEKMEKÖY 0.0476
## # … with 29 more rows
## # A tibble: 39 x 2
## ilce_ADI oran
## <chr> <dbl>
## 1 BEŞİKTAŞ 0.73
## 2 KADIKÖY 0.7
## 3 BAKIRKÖY 0.59
## 4 ŞİŞLİ 0.36
## 5 SARIYER 0.17
## 6 ATAŞEHİR 0.15
## 7 MALTEPE 0.1
## 8 ADALAR 0.08
## 9 BEYLİKDÜZÜ 0.08
## 10 BAŞAKŞEHİR 0.07
## # … with 29 more rows
Fundamental dplyr
functions provide very convenient tools for data analysis. It can also be used to generate the features required for modelling. You can process few million rows of data without breaking a sweat (for larger data sets you can use data.table
), you can prepare functions instead of manual Excel operations. With R Markdown system, which this tutorial is prepared in, you can create reproducible documents and automatize the reports. You can use ggplot2
for visualizations, which is also part of the tidyverse package ecosystem.
Following topics are slightly advanced and not covered in this document.
_all
, _at
, _if
suffixes to many fundamental commands.lead
,lag
functionspivot_wider
,pivot_longer
functionscase_when
function.left_join
, inner_join
, full_join
etc.