Introduction

Main purpose of this document is to introduce a major data manipulation package, dplyr, with a contemporary subject. There are seven fundamental dplyr functions: select/rename, filter, distinct, arrange, mutate/transmute, group_by and summarise. In this document, we will cover every one of them and there will be supplementary functions to carry out data operations. Also, pipe operator (%>%) will be briefly introduced.

Preparations

There are two prerequisites to start: Install tidyverse package and putting the relevant data set into the working directory (write getwd() in the console to locate your working directory). In this document, topic of the data set is the ballot box level result data of Istanbul Mayoral Elections of June 23, 2019 (p.s. it is a re-run of March 2019 election).

To install the package run install.packages("tidyverse") in the console and select a mirror (first one is quite ok). Once you install the library you can always call it with library(tidyverse) command (no need to reinstall). You can download the data set from its GitHub Repository.

library(tidyverse) #tidyverse is a package group which includes dplyr as well
raw_df <- readRDS("local_election_20190623_education.rds")

First of those above commands calls the package (different from installing) The second command assigns the data to raw_df variable. There are two types of assignment operators in R: <- and =. No operation in R is permanent unless you assign it to somewhere (There are exceptions, though. See data.table package for instance.). We will benefit from this property in this document as well. No matter how many operations we do on each example we will always start from the original data frame.

Let’s do a simple sanity check. The output of the following command reads “31,186 x 14”" in the first line, which means there are 31,186 rows and 14 columns in the data frame. There are abbreviations of data types under each column name. These are usually character/string (), numeric (, if integer ), logical (TRUE/FALSE, logical) (), factor (). In the introduction phase we will only use character, numeric and logical data types.

print(raw_df,n=3)

## # A tibble: 31,186 x 14
##   sandik_ID sandik_NO ilce_ID ilce_ADI muhtarlik_ID muhtarlik_ADI
##   <chr>     <chr>     <chr>   <chr>    <chr>        <chr>        
## 1 3445931   1025      3       ADALAR   95627        MADEN MAH.   
## 2 3445970   1020      3       ADALAR   95627        MADEN MAH.   
## 3 3445991   1032      3       ADALAR   95628        NİZAM MAH.   
## # … with 3.118e+04 more rows, and 8 more variables: secmen_SAYISI <dbl>,
## #   oy_KULLANAN_SECMEN_SAYISI <dbl>, gecerli_OY_TOPLAMI <dbl>, CHP <dbl>,
## #   AK_PARTI <dbl>, SAADET <dbl>, VATAN_PARTISI <dbl>,
## #   bagimsiz_TOPLAM_OY <dbl>

Also we can use glimpse function to inspect. Using glimpse each column is represented in a row with its data type and first few entries.

raw_df %>% glimpse()

## Observations: 31,186
## Variables: 14
## $ sandik_ID                 <chr> "3445931", "3445970", "3445991", "3445…
## $ sandik_NO                 <chr> "1025", "1020", "1032", "1019", "1029"…
## $ ilce_ID                   <chr> "3", "3", "3", "3", "3", "3", "3", "3"…
## $ ilce_ADI                  <chr> "ADALAR", "ADALAR", "ADALAR", "ADALAR"…
## $ muhtarlik_ID              <chr> "95627", "95627", "95628", "95627", "9…
## $ muhtarlik_ADI             <chr> "MADEN MAH.", "MADEN MAH.", "NİZAM MAH…
## $ secmen_SAYISI             <dbl> 329, 283, 338, 283, 292, 336, 330, 328…
## $ oy_KULLANAN_SECMEN_SAYISI <dbl> 257, 242, 271, 248, 248, 282, 271, 270…
## $ gecerli_OY_TOPLAMI        <dbl> 252, 234, 271, 246, 244, 272, 269, 265…
## $ CHP                       <dbl> 160, 176, 189, 189, 168, 175, 187, 179…
## $ AK_PARTI                  <dbl> 90, 56, 81, 57, 76, 97, 82, 86, 75, 93…
## $ SAADET                    <dbl> 2, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0,…
## $ VATAN_PARTISI             <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0,…
## $ bagimsiz_TOPLAM_OY        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Did you notice the expression we used this time? Pipe operator makes data analysis and transformation very easy and civilized. We will use pipes frequently in this document and in the future.

We can connect many functions without calling the variable multiple times with the help of the pipe operator.

Fundamentals

In this sections basic functions of dplyr are introduced. Every function will be used in the following examples after it has been introduced. To limit the number of displayed rows, print command is used. You can ignore that part in your exercises.

`select`/`rename`

Select, as the name suggests, is used to select columns. For instance, suppose we only use district (ilçe) and neighborhood (muhtarlık) names, and the number of electorates (seçmen sayısı).

raw_df %>% select(ilce_ADI,muhtarlik_ADI,secmen_SAYISI) %>% print(n=3)

## # A tibble: 31,186 x 3
##   ilce_ADI muhtarlik_ADI secmen_SAYISI
##   <chr>    <chr>                 <dbl>
## 1 ADALAR   MADEN MAH.              329
## 2 ADALAR   MADEN MAH.              283
## 3 ADALAR   NİZAM MAH.              338
## # … with 3.118e+04 more rows

If we wanted to write the above expression without the pipe operator, we could go with the sad expression below. You can extrapolate how complicated things can be without the pipe.

print(select(raw_df,ilce_ADI,muhtarlik_ADI,secmen_SAYISI),n=3)

We can use rename to rename columns (again as the name suggests). Let’s change sandik_ID to sandik_id.

raw_df %>% rename(sandik_id = sandik_ID) %>% print(n=3)

## # A tibble: 31,186 x 14
##   sandik_id sandik_NO ilce_ID ilce_ADI muhtarlik_ID muhtarlik_ADI
##   <chr>     <chr>     <chr>   <chr>    <chr>        <chr>        
## 1 3445931   1025      3       ADALAR   95627        MADEN MAH.   
## 2 3445970   1020      3       ADALAR   95627        MADEN MAH.   
## 3 3445991   1032      3       ADALAR   95628        NİZAM MAH.   
## # … with 3.118e+04 more rows, and 8 more variables: secmen_SAYISI <dbl>,
## #   oy_KULLANAN_SECMEN_SAYISI <dbl>, gecerli_OY_TOPLAMI <dbl>, CHP <dbl>,
## #   AK_PARTI <dbl>, SAADET <dbl>, VATAN_PARTISI <dbl>,
## #   bagimsiz_TOPLAM_OY <dbl>

p.s. We can rename columns inside select function.

Select has many convenient sub operators and special expressions. If we know the order of columns, we can use the scope (:) expression to get all the columns determined by the scope. Suppose, we want ballot box id (sandik_ID) and number of votes per party.

raw_df %>% select(sandik_id=sandik_ID,CHP:bagimsiz_TOPLAM_OY) %>% print(n=3)

## # A tibble: 31,186 x 6
##   sandik_id   CHP AK_PARTI SAADET VATAN_PARTISI bagimsiz_TOPLAM_OY
##   <chr>     <dbl>    <dbl>  <dbl>         <dbl>              <dbl>
## 1 3445931     160       90      2             0                  0
## 2 3445970     176       56      2             0                  0
## 3 3445991     189       81      0             1                  0
## # … with 3.118e+04 more rows

We can eliminate unwanted columns by putting - before the names. Suppose I am not interested in ID values, want to remove all ID related columns from the data set, and all ID related column names end with "_ID". We can do it using - and ends_with.

raw_df %>% select(-ends_with("_ID")) %>% print(n=3)

## # A tibble: 31,186 x 11
##   sandik_NO ilce_ADI muhtarlik_ADI secmen_SAYISI oy_KULLANAN_SEC…
##   <chr>     <chr>    <chr>                 <dbl>            <dbl>
## 1 1025      ADALAR   MADEN MAH.              329              257
## 2 1020      ADALAR   MADEN MAH.              283              242
## 3 1032      ADALAR   NİZAM MAH.              338              271
## # … with 3.118e+04 more rows, and 6 more variables:
## #   gecerli_OY_TOPLAMI <dbl>, CHP <dbl>, AK_PARTI <dbl>, SAADET <dbl>,
## #   VATAN_PARTISI <dbl>, bagimsiz_TOPLAM_OY <dbl>

There are similar expressions for other purposes, such as starts_with and contains. You can see all the expressions in the Cheat Sheet link given at the end of this document.

`filter`

Filter (no more “as the name suggests”, as you already figured it out) helps filter rows according to given criteria. It is highly similar with Excel’s filter functionality (but much much more flexible and reliable).

Let’s say we only want the ballot boxes in SARIYER district.

raw_df %>% filter(ilce_ADI == "SARIYER") %>% print(n=3)

## # A tibble: 753 x 14
##   sandik_ID sandik_NO ilce_ID ilce_ADI muhtarlik_ID muhtarlik_ADI
##   <chr>     <chr>     <chr>   <chr>    <chr>        <chr>        
## 1 3471320   1009      497     SARIYER  96135        BAHÇEKÖY MER…
## 2 3471321   1010      497     SARIYER  96135        BAHÇEKÖY MER…
## 3 3471317   1006      497     SARIYER  96135        BAHÇEKÖY MER…
## # … with 750 more rows, and 8 more variables: secmen_SAYISI <dbl>,
## #   oy_KULLANAN_SECMEN_SAYISI <dbl>, gecerli_OY_TOPLAMI <dbl>, CHP <dbl>,
## #   AK_PARTI <dbl>, SAADET <dbl>, VATAN_PARTISI <dbl>,
## #   bagimsiz_TOPLAM_OY <dbl>

By using == operator, we bring the values in ilce_ADI (district) column which are exactly equal to “SARIYER”. There are other expressions such as not equal to (!=), greater than (or equal to) (>,>=), smaller than (or equal to) (<,<=), in (%in%) and some more.

Suppose we want the rows that are not in Beşiktaş or Beyoğlu.

raw_df %>% filter(!(ilce_ADI %in% c("BEŞİKTAŞ","BEYOĞLU"))) %>% print(n=3)

## # A tibble: 30,269 x 14
##   sandik_ID sandik_NO ilce_ID ilce_ADI muhtarlik_ID muhtarlik_ADI
##   <chr>     <chr>     <chr>   <chr>    <chr>        <chr>        
## 1 3445931   1025      3       ADALAR   95627        MADEN MAH.   
## 2 3445970   1020      3       ADALAR   95627        MADEN MAH.   
## 3 3445991   1032      3       ADALAR   95628        NİZAM MAH.   
## # … with 3.027e+04 more rows, and 8 more variables: secmen_SAYISI <dbl>,
## #   oy_KULLANAN_SECMEN_SAYISI <dbl>, gecerli_OY_TOPLAMI <dbl>, CHP <dbl>,
## #   AK_PARTI <dbl>, SAADET <dbl>, VATAN_PARTISI <dbl>,
## #   bagimsiz_TOPLAM_OY <dbl>

At the same time we can make comparisons between columns and combine multiple criteria to create more complex filters. Here we use AND (&) and OR (|) operators to combine criteria.

Suppose we want to see neighborhood names (muhtarlik_ADI) of the rows in ZEYTİNBURNU where also CHP had more votes than AK_PARTI.

raw_df %>% filter(ilce_ADI == "ZEYTİNBURNU" & CHP > AK_PARTI) %>% select(muhtarlik_ADI)  %>% print(n=3)

## # A tibble: 345 x 1
##   muhtarlik_ADI  
##   <chr>          
## 1 KAZLIÇEŞME MAH.
## 2 TELSİZ MAH.    
## 3 GÖKALP MAH.    
## # … with 342 more rows

`distinct`

Distinct returns unique rows, based on the values of one or a combination of several columns. In the above example we listed all rows that match the given criteria and saw 345 names. Instead, we need the name of each eligible neighborhood only once.

raw_df %>% filter(ilce_ADI == "ZEYTİNBURNU" & CHP > AK_PARTI) %>% distinct(muhtarlik_ADI)  %>% print(n=3)

## # A tibble: 13 x 1
##   muhtarlik_ADI  
##   <chr>          
## 1 KAZLIÇEŞME MAH.
## 2 TELSİZ MAH.    
## 3 GÖKALP MAH.    
## # … with 10 more rows

Suppose we don’t want to lose other columns, then we change .keep_all parameter to TRUE (default is FALSE.

raw_df %>% filter(ilce_ADI == "ZEYTİNBURNU" & CHP > AK_PARTI) %>% distinct(muhtarlik_ADI,.keep_all = TRUE)  %>% print(n=3)

## # A tibble: 13 x 14
##   sandik_ID sandik_NO ilce_ID ilce_ADI muhtarlik_ID muhtarlik_ADI
##   <chr>     <chr>     <chr>   <chr>    <chr>        <chr>        
## 1 3464008   1046      632     ZEYTİNB… 96320        KAZLIÇEŞME M…
## 2 3466714   2273      632     ZEYTİNB… 96326        TELSİZ MAH.  
## 3 3463809   1010      632     ZEYTİNB… 96319        GÖKALP MAH.  
## # … with 10 more rows, and 8 more variables: secmen_SAYISI <dbl>,
## #   oy_KULLANAN_SECMEN_SAYISI <dbl>, gecerli_OY_TOPLAMI <dbl>, CHP <dbl>,
## #   AK_PARTI <dbl>, SAADET <dbl>, VATAN_PARTISI <dbl>,
## #   bagimsiz_TOPLAM_OY <dbl>

Distinct keeps the first occurance and removes any duplicates. For instance if the same value is seen on 3rd and 5th rows, it keeps 3rd row and removes 5th row.

`arrange`

Arrange sorts rows from A to Z or smallest to largest. It has great similarity with Excel’s Sort functionality.

Let’s arrange by neighborhood names (muhtarlik_ADI).

raw_df %>% arrange(muhtarlik_ADI) %>% distinct(muhtarlik_ADI,ilce_ADI) %>% print(n=3)

## # A tibble: 980 x 2
##   muhtarlik_ADI  ilce_ADI    
##   <chr>          <chr>       
## 1 100. YIL MAH.  BAĞCILAR    
## 2 15 TEMMUZ MAH. BAĞCILAR    
## 3 19 MAYIS MAH.  BÜYÜKÇEKMECE
## # … with 977 more rows

To reverse the order (Z-A, largest to smallest), we should wrap columns with desc. We can also do operations inside arrange. For instance let’s order Saadet Party’s (SAADET) highest voting rates with respect to valid votes (gecerli_OY_TOPLAMI).

raw_df %>% arrange(desc(SAADET/gecerli_OY_TOPLAMI)) %>% select(sandik_ID,ilce_ADI,muhtarlik_ADI,gecerli_OY_TOPLAMI,SAADET) %>% print(n=3)

## # A tibble: 31,186 x 5
##   sandik_ID ilce_ADI   muhtarlik_ADI gecerli_OY_TOPLAMI SAADET
##   <chr>     <chr>      <chr>                      <dbl>  <dbl>
## 1 3452129   FATİH      BALAT MAH.                   286     14
## 2 3446252   ATAŞEHİR   BARBAROS MAH.                267     13
## 3 3469136   SULTANGAZİ CEBECİ MAH.                  290     14
## # … with 3.118e+04 more rows

We can use multiple columns inside arrange which every sort will take place where values of the previous sort are equal. For instance let’s sort by number of voters and valid votes count.

raw_df %>% select(sandik_ID,ilce_ADI,muhtarlik_ADI,secmen_SAYISI,gecerli_OY_TOPLAMI) %>% arrange(desc(secmen_SAYISI),desc(gecerli_OY_TOPLAMI)) %>% print(n=3)

## # A tibble: 31,186 x 5
##   sandik_ID ilce_ADI   muhtarlik_ADI    secmen_SAYISI gecerli_OY_TOPLAMI
##   <chr>     <chr>      <chr>                    <dbl>              <dbl>
## 1 3471189   BEYLİKDÜZÜ BARIŞ MAH.                 355                306
## 2 3470764   ÜMRANİYE   MEHMET AKİF MAH.           355                306
## 3 3471191   BEYLİKDÜZÜ BARIŞ MAH.                 355                298
## # … with 3.118e+04 more rows

`mutate`/`transmute`

Mutate is the function when we do operations and calculations using other columns. For instance let’s calculate AK PARTI’s vote percentages. (round function is used for rounding)

raw_df %>% select(sandik_ID,muhtarlik_ADI,ilce_ADI,gecerli_OY_TOPLAMI,AK_PARTI) %>% mutate(AK_PARTI_oran = round(AK_PARTI/gecerli_OY_TOPLAMI,2)) %>% print(n=3)

## # A tibble: 31,186 x 6
##   sandik_ID muhtarlik_ADI ilce_ADI gecerli_OY_TOPLA… AK_PARTI AK_PARTI_oran
##   <chr>     <chr>         <chr>                <dbl>    <dbl>         <dbl>
## 1 3445931   MADEN MAH.    ADALAR                 252       90          0.36
## 2 3445970   MADEN MAH.    ADALAR                 234       56          0.24
## 3 3445991   NİZAM MAH.    ADALAR                 271       81          0.3 
## # … with 3.118e+04 more rows

You can use many R functions (from both base functions and other packages). For instance to calculate competition between two major parties, we can use the following expression.

raw_df %>% mutate(winner = ifelse(AK_PARTI > CHP, "AK Parti", "CHP"),winner=ifelse(AK_PARTI==CHP,"Tie",winner)) %>% select(sandik_ID,muhtarlik_ADI,ilce_ADI,gecerli_OY_TOPLAMI,winner) %>% print(n=3)

## # A tibble: 31,186 x 5
##   sandik_ID muhtarlik_ADI ilce_ADI gecerli_OY_TOPLAMI winner
##   <chr>     <chr>         <chr>                 <dbl> <chr> 
## 1 3445931   MADEN MAH.    ADALAR                  252 CHP   
## 2 3445970   MADEN MAH.    ADALAR                  234 CHP   
## 3 3445991   NİZAM MAH.    ADALAR                  271 CHP   
## # … with 3.118e+04 more rows

Transmute has the same functionality as mutate with the additional property similar to select. Transmute returns only the columns included in the function. Suppose we want to calculate percentage vote differences between AK PARTI and CHP.

raw_df %>% transmute(sandik_ID,muhtarlik_ADI,ilce_ADI,gecerli_OY_TOPLAMI,percentage_diff=round((AK_PARTI-CHP)/gecerli_OY_TOPLAMI,2)) %>% arrange(desc(percentage_diff)) %>% print(n=3)

## # A tibble: 31,186 x 5
##   sandik_ID muhtarlik_ADI ilce_ADI gecerli_OY_TOPLAMI percentage_diff
##   <chr>     <chr>         <chr>                 <dbl>           <dbl>
## 1 3454863   AKFIRAT MAH.  TUZLA                   313            0.95
## 2 3454874   AKFIRAT MAH.  TUZLA                   314            0.92
## 3 3454884   AKFIRAT MAH.  TUZLA                   310            0.9 
## # … with 3.118e+04 more rows

`group_by`/`summarise`

Finally we will learn how to calculate summary tables. It is similar to Pivot Tables in Excel. group_by is the grouping function, summarise is the summarising function.

For instance let’s calculate number of ballot boxes (i.e. rows) at each district. We will use a special function n() to calculate number of rows.

raw_df %>% group_by(ilce_ADI) %>% summarise(n_of_boxes=n()) %>% print(n=3)

## # A tibble: 39 x 2
##   ilce_ADI   n_of_boxes
##   <chr>           <int>
## 1 ADALAR             39
## 2 ARNAVUTKÖY        524
## 3 ATAŞEHİR          885
## # … with 36 more rows

There are some other summarizing functions for instance total (sum), minimum (min), maximum (max), average/mean (mean), median (median) that can be used with summarise.

We can make groups with the combination of multiple columns. Let’s find neighborhoods with the highest number of voters. Some functions can be used inside groups, so we need to remove grouping attribute using ungroup function after we summarise.

raw_df %>% group_by(muhtarlik_ADI,ilce_ADI) %>% summarise(toplam_secmen=sum(secmen_SAYISI)) %>% ungroup() %>% arrange(desc(toplam_secmen)) %>% print(n=3)

## # A tibble: 980 x 3
##   muhtarlik_ADI      ilce_ADI     toplam_secmen
##   <chr>              <chr>                <dbl>
## 1 ATAKENT MAH.       KÜÇÜKÇEKMECE         64836
## 2 ADNAN KAHVECİ MAH. BEYLİKDÜZÜ           61582
## 3 ZAFER MAH.         BAHÇELİEVLER         58084
## # … with 977 more rows

Exercises

Basic

List the districts where Vatan Partisi (VATAN_PARTISI) got 7 or more votes in at least one ballot box.

## # A tibble: 2 x 1
##   ilce_ADI  
##   <chr>     
## 1 EYÜPSULTAN
## 2 GÜNGÖREN

List the neighborhoods ordered by CHP has the highest percentage of valid votes and the highest number of valid votes. (warning: There can be same neighborhood names in different districts.)

## # A tibble: 798 x 4
##    muhtarlik_ADI                           ilce_ADI gecerli_OY_TOPLA…   CHP
##    <chr>                                   <chr>                <dbl> <dbl>
##  1 SİLİVRİ 3 NOLU L TİPİ KAPALI CEZA İNFA… SİLİVRİ                166   161
##  2 HUZUR MAH.                              SARIYER                297   288
##  3 SİLİVRİ 6 NOLU L TİPİ KAPALI CEZA İNFA… SİLİVRİ                155   150
##  4 AKAT MAH.                               BEŞİKTAŞ               267   258
##  5 SİLİVRİ 1 NOLU L TİPİ KAPALI CEZA İNFA… SİLİVRİ                108   104
##  6 GÖZTEPE MAH.                            KADIKÖY                268   258
##  7 SİLİVRİ 2 NOLU L TİPİ KAPALI CEZA İNFA… SİLİVRİ                163   156
##  8 CADDEBOSTAN MAH.                        KADIKÖY                276   264
##  9 FENERBAHÇE MAH.                         KADIKÖY                275   263
## 10 BOSTANCI MAH.                           KADIKÖY                272   260
## # … with 788 more rows

Display the vote distribution of a neighborhood which you like. You can filter the neighborhood by its ID.

## # A tibble: 1 x 7
## # Groups:   muhtarlik_ID [1]
##   muhtarlik_ID muhtarlik_ADI   CHP AK_PARTI SAADET VATAN_PARTISI Diger
##   <chr>        <chr>         <dbl>    <dbl>  <dbl>         <dbl> <dbl>
## 1 96035        ÇAĞLAYAN MAH. 0.563    0.431  0.004         0.001 0.001

Intermediate

Arrange the districts from the lowest valid vote ratio to the highest.

## # A tibble: 39 x 2
##    ilce_ADI      gecerli_oran
##    <chr>                <dbl>
##  1 ARNAVUTKÖY           0.973
##  2 SULTANBEYLİ          0.974
##  3 ESENYURT             0.976
##  4 ÇATALCA              0.977
##  5 SİLİVRİ              0.977
##  6 ZEYTİNBURNU          0.977
##  7 SULTANGAZİ           0.977
##  8 ESENLER              0.978
##  9 BAĞCILAR             0.978
## 10 GAZİOSMANPAŞA        0.978
## # … with 29 more rows

List the percentage of neighborhoods won by AK PARTİ (vote percentage of neighborhood > 0.5) and arrange by from the highest ratio to lowest.

## # A tibble: 39 x 2
##    ilce_ADI    kazanilan_muht_orani
##    <chr>                      <dbl>
##  1 SULTANBEYLİ                1    
##  2 ESENLER                    0.941
##  3 BAĞCILAR                   0.909
##  4 SULTANGAZİ                 0.733
##  5 ARNAVUTKÖY                 0.710
##  6 PENDİK                     0.649
##  7 ŞİLE                       0.645
##  8 ÜMRANİYE                   0.590
##  9 SANCAKTEPE                 0.579
## 10 BEYKOZ                     0.556
## # … with 29 more rows

List the districts where CHP and AK PARTI have very close vote percentages. (You can use abs function to get absolute values e.g. abs(CHP-AK_PARTI))

## # A tibble: 39 x 4
##    ilce_ADI   toplam_oy   CHP AK_PARTI
##    <chr>          <dbl> <dbl>    <dbl>
##  1 FATİH         221380 0.495    0.494
##  2 ŞİLE           25022 0.498    0.494
##  3 BEYKOZ        153583 0.498    0.493
##  4 GÜNGÖREN      169957 0.488    0.503
##  5 KAĞITHANE     259726 0.486    0.506
##  6 ÇEKMEKÖY      145908 0.507    0.486
##  7 BAYRAMPAŞA    166233 0.507    0.485
##  8 BAŞAKŞEHİR    219934 0.476    0.514
##  9 BEYOĞLU       135354 0.516    0.475
## 10 ÜMRANİYE      406548 0.475    0.516
## # … with 29 more rows

Advanced

Arrange the neighborhoods by CHP and AK Parti are closest to a tie and where number of voters are greater than or equal to 5000. (You can use abs function inside mutate)

## # A tibble: 603 x 6
##    muhtarlik_ADI   ilce_ADI   secmen_SAYISI AK_PARTI   CHP  fark
##    <chr>           <chr>              <dbl>    <dbl> <dbl> <dbl>
##  1 YAKUPLU MAH.    BEYLİKDÜZÜ         29865    11829 11824     5
##  2 BALIKYOLU MAH.  ESENYURT           17660     6850  6838    12
##  3 CUMHURİYET MAH. ÜSKÜDAR            27713    11689 11671    18
##  4 HARMANTEPE MAH. KAĞITHANE          15577     6201  6180    21
##  5 TAHTAKALE MAH.  AVCILAR            34244    14312 14334   -22
##  6 CUMHURİYET MAH. ÇEKMEKÖY            8930     3640  3663   -23
##  7 PINAR MAH.      SARIYER             8093     3355  3322    33
##  8 BULGURLU  MAH.  ÜSKÜDAR            23573     9935  9975   -40
##  9 ORHAN GAZİ MAH. ESENYURT            7333     2897  2940   -43
## 10 GÜRPINAR MAH.   BEYLİKDÜZÜ         13878     5597  5551    46
## # … with 593 more rows

Find the ratio of “high turnout” neighborhoods in every district and arrange descending. High turnout is defined by the ratio of valid votes to total voters greater than or equal to 90%.

## # A tibble: 39 x 2
##    ilce_ADI yuksek_katilim
##    <chr>             <dbl>
##  1 ŞİLE             0.258 
##  2 SİLİVRİ          0.222 
##  3 ÇATALCA          0.154 
##  4 MALTEPE          0.143 
##  5 BEYKOZ           0.0667
##  6 BAKIRKÖY         0.0625
##  7 ESENLER          0.0588
##  8 TUZLA            0.0588
##  9 ÜMRANİYE         0.0513
## 10 ÇEKMEKÖY         0.0476
## # … with 29 more rows

Calculate the ratio of total voters where in their ballot boxes either AK PARTI or CHP got 80% or more of the votes, to the total number of voters in each districts and find the most heterogeneous 5 districts (i.e. with the highest ratio). (Tip: You can use TRUE/FALSE as 1/0 automatically.)

## # A tibble: 39 x 2
##    ilce_ADI    oran
##    <chr>      <dbl>
##  1 BEŞİKTAŞ    0.73
##  2 KADIKÖY     0.7 
##  3 BAKIRKÖY    0.59
##  4 ŞİŞLİ       0.36
##  5 SARIYER     0.17
##  6 ATAŞEHİR    0.15
##  7 MALTEPE     0.1 
##  8 ADALAR      0.08
##  9 BEYLİKDÜZÜ  0.08
## 10 BAŞAKŞEHİR  0.07
## # … with 29 more rows

Conclusion

Fundamental dplyr functions provide very convenient tools for data analysis. It can also be used to generate the features required for modelling. You can process few million rows of data without breaking a sweat (for larger data sets you can use data.table), you can prepare functions instead of manual Excel operations. With R Markdown system, which this tutorial is prepared in, you can create reproducible documents and automatize the reports. You can use ggplot2 for visualizations, which is also part of the tidyverse package ecosystem.

Appendix

Advanced Topics

Following topics are slightly advanced and not covered in this document.

Collective operations: _all, _at, _if suffixes to many fundamental commands.
Offsetting rows: lead,lag functions
Long/wide format: pivot_wider,pivot_longer functions
Multiple conditions: case_when function.
Join functions: left_join, inner_join, full_join etc.

References

Data Set: 23 Haziran 2019 İstanbul Büyükşehir Belediye Başkanlığı Yenileme Seçimi Sonuçları
Tidyverse: https://www.tidyverse.org/
R for Data Science: https://r4ds.had.co.nz/
Cheatsheet (Data Transformation) (Turkish Version)

Introduction to dplyr: Istanbul Local Elections June 2019

Berk Orbay

Introduction

Preparations

Fundamentals

`select`/`rename`

`filter`

`distinct`

`arrange`

`mutate`/`transmute`

`group_by`/`summarise`

Exercises

Basic

Intermediate

Advanced

Conclusion

Appendix

Advanced Topics

References

Introduction to dplyr: Istanbul Local Elections June 2019

Berk Orbay

Introduction

Preparations

Fundamentals

select/rename

filter

distinct

arrange

mutate/transmute

group_by/summarise

Exercises

Basic

Intermediate

Advanced

Conclusion

Appendix

Advanced Topics

References

`select`/`rename`

`filter`

`distinct`

`arrange`

`mutate`/`transmute`

`group_by`/`summarise`