Introduction

Main purpose of this document is to introduce a major data manipulation package, dplyr, with a contemporary subject. There are seven fundamental dplyr functions: select/rename, filter, distinct, arrange, mutate/transmute, group_by and summarise with a number of assisting functions used either in these functions or separately. In this document, we will cover every one of them and there will be supplementary functions to carry out data operations. Also, pipe operator (%>%) will be briefly introduced. This document is updated for dplyr 1.0.0 and R 4.0+ to show you new stuff as well. We will also use lubridate package for date time operations but we will not cover it except for quick explanations (check appendix at the end of this document for a mini tutorial). You can use advanced features in the “Advance Usage” subsections.

Preparations

There are two prerequisites to start: Install tidyverse package and putting the relevant data set into the working directory (write getwd() in the console to locate your working directory). In this document, topic of the data set is the hourly licensed and unlicensed renewable energy production data between January 1, 2018 and May 31, 2020.

To install the package run install.packages("tidyverse") in the console and select a mirror (first one is quite ok). Once you install the library you can always call it with library(tidyverse) command (no need to reinstall). You can download the data set from its GitHub Repository.

library(tidyverse) #tidyverse is a package group which includes dplyr as well
library(lubridate)
raw_df <- readRDS("rp_201801_202005_df.rds")

First of those above commands calls the package (different from installing) The second command assigns the data to raw_df variable. There are two types of assignment operators in R: <- and =. No operation in R is permanent unless you assign it to somewhere (There are exceptions, though. See data.table package for instance.). We will benefit from this property in this document as well. No matter how many operations we do on each example we will always start from the original data frame.

Let’s do a simple sanity check. The output of the following command reads “21,168 x 17” in the first line, which means there are 21,168 rows and 17 columns in the tibble. Tibble is the name of the data.frame type of dplyr. It usually is data frame plus some advanced features. There are abbreviations of data types under each column name. These are usually character/string (), numeric (, if integer ), logical (TRUE/FALSE, logical) (), factor (), date () and datetime (). In the introduction phase we will only use character, numeric and logical data types.

print(raw_df,n=3)
## # A tibble: 21,168 x 17
##   dt                  wind_lic geothermal_lic biogas_lic canalType_lic
##   <dttm>                 <dbl>          <dbl>      <dbl>         <dbl>
## 1 2020-05-31 23:00:00    1434.           913.       75.8         2585.
## 2 2020-05-31 22:00:00    1577.           908.       75.6         2631.
## 3 2020-05-31 21:00:00    1858.           901.       75.4         2585.
## # … with 21,165 more rows, and 12 more variables: riverType_lic <dbl>,
## #   biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## #   reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## #   canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>

Also we can use glimpse function to inspect. Using glimpse each column is represented in a row with its data type and first few entries. Our data consists of hourly renewable electricity production of YEKDEM plants from different origins and license types. YEKDEM is a type of feed-in-tariff incentive framework. Suffixes with "_lic" represents licensed (larger scale) plants and "_ul" represents unlicensed (smaller scale) production. canalType, riverType and reservoir columns represent hydro power.

raw_df %>% glimpse()
## Rows: 21,168
## Columns: 17
## $ dt              <dttm> 2020-05-31 23:00:00, 2020-05-31 22:00:00, 2020-05-31…
## $ wind_lic        <dbl> 1433.8132, 1577.1419, 1857.5492, 1933.0142, 2031.7862…
## $ geothermal_lic  <dbl> 912.7882, 907.9303, 900.5844, 888.4561, 864.5402, 847…
## $ biogas_lic      <dbl> 75.8047, 75.6163, 75.3906, 76.7873, 76.9707, 77.5750,…
## $ canalType_lic   <dbl> 2584.930, 2630.602, 2585.038, 2542.381, 2594.459, 262…
## $ riverType_lic   <dbl> 316.5538, 316.6800, 356.7637, 350.1544, 377.5312, 379…
## $ biomass_lic     <dbl> 262.4994, 253.0814, 246.9268, 249.9152, 248.2336, 246…
## $ landfillGas_lic <dbl> 100.3971, 101.1378, 100.4442, 100.7307, 102.2474, 102…
## $ sun_lic         <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 2.0594, 14.2800, 48.0…
## $ reservoir_lic   <dbl> 2306.303, 2296.045, 2279.266, 2308.918, 2792.313, 318…
## $ others_lic      <dbl> 48.3833, 48.4011, 48.4041, 48.4199, 48.4653, 48.5485,…
## $ wind_ul         <dbl> 3.7751, 4.8375, 7.6659, 11.8121, 13.1070, 13.1830, 10…
## $ biogas_ul       <dbl> 16.9293, 16.9227, 16.9052, 16.7517, 16.2928, 16.5989,…
## $ canalType_ul    <dbl> 4.1749, 4.4221, 4.4658, 4.6020, 4.6195, 4.5146, 4.661…
## $ biomass_ul      <dbl> 15.4698, 15.3609, 16.0483, 15.2271, 15.5563, 15.5007,…
## $ sun_ul          <dbl> 0.0582, 0.0320, 0.0335, 1.3121, 103.3267, 555.5787, 1…
## $ others_ul       <dbl> 0.0610, 0.0395, 0.4136, 0.5508, 0.7106, 1.3775, 2.746…

Did you notice the expression we used this time? Pipe operator makes data analysis and transformation very easy and civilized. We will use pipes frequently in this document and in the future.

We can connect many functions without calling the variable multiple times with the help of the pipe operator.

Fundamentals

In this chapter fundamental functions of dplyr are introduced. Every function will be used in the following examples after it has been introduced. To limit the number of displayed rows, we use the following global option. You can ignore this part in your exercises.

options(tibble.print_max = 3, tibble.print_min = 3)

select/rename

Select, as the name suggests, is used to select columns. For instance, suppose we only want licensed wind production (wind_lic) and date-time (dt) columns.

raw_df %>% select(dt,wind_lic)
## # A tibble: 21,168 x 2
##   dt                  wind_lic
##   <dttm>                 <dbl>
## 1 2020-05-31 23:00:00    1434.
## 2 2020-05-31 22:00:00    1577.
## 3 2020-05-31 21:00:00    1858.
## # … with 21,165 more rows

If we wanted to write the above expression without the pipe operator, we could go with the sad expression below. You can extrapolate how complicated things can be without the pipe.

select(raw_df,dt,wind_lic)

We can use rename to rename columns (again as the name suggests). Let’s change dt to date_time.

raw_df %>% rename(date_time = dt)
## # A tibble: 21,168 x 17
##   date_time           wind_lic geothermal_lic biogas_lic canalType_lic
##   <dttm>                 <dbl>          <dbl>      <dbl>         <dbl>
## 1 2020-05-31 23:00:00    1434.           913.       75.8         2585.
## 2 2020-05-31 22:00:00    1577.           908.       75.6         2631.
## 3 2020-05-31 21:00:00    1858.           901.       75.4         2585.
## # … with 21,165 more rows, and 12 more variables: riverType_lic <dbl>,
## #   biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## #   reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## #   canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>

p.s. We can rename columns inside select function.

Select has many convenient sub operators and special expressions. If we know the order of columns, we can use the scope (:) expression to get all the columns determined by the scope. Suppose, we want date-time (dt) and licensed production.

raw_df %>% select(date_time=dt,wind_lic:others_lic)
## # A tibble: 21,168 x 11
##   date_time           wind_lic geothermal_lic biogas_lic canalType_lic
##   <dttm>                 <dbl>          <dbl>      <dbl>         <dbl>
## 1 2020-05-31 23:00:00    1434.           913.       75.8         2585.
## 2 2020-05-31 22:00:00    1577.           908.       75.6         2631.
## 3 2020-05-31 21:00:00    1858.           901.       75.4         2585.
## # … with 21,165 more rows, and 6 more variables: riverType_lic <dbl>,
## #   biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## #   reservoir_lic <dbl>, others_lic <dbl>

We can eliminate unwanted columns by putting - before the names. Suppose I am not interested in wind values, want to remove all other related columns from the data set, and all other related column names start with “wind_”. We can do it using - and starts_with.

raw_df %>% select(-starts_with("wind_")) 
## # A tibble: 21,168 x 15
##   dt                  geothermal_lic biogas_lic canalType_lic riverType_lic
##   <dttm>                       <dbl>      <dbl>         <dbl>         <dbl>
## 1 2020-05-31 23:00:00           913.       75.8         2585.          317.
## 2 2020-05-31 22:00:00           908.       75.6         2631.          317.
## 3 2020-05-31 21:00:00           901.       75.4         2585.          357.
## # … with 21,165 more rows, and 10 more variables: biomass_lic <dbl>,
## #   landfillGas_lic <dbl>, sun_lic <dbl>, reservoir_lic <dbl>,
## #   others_lic <dbl>, biogas_ul <dbl>, canalType_ul <dbl>, biomass_ul <dbl>,
## #   sun_ul <dbl>, others_ul <dbl>

There are similar expressions for other purposes, such as starts_with, everything and contains. You can see all the expressions in the Cheat Sheet link given at the end of this document.

dplyr 1.0.0 Feature: Sometimes you just want to change the order of the columns. Then use relocate. Suppose we want to show solar and wind production with date-time. But we want to get licensed wind together with licensed solar.

raw_df %>% select(dt,starts_with("sun_"),starts_with("wind")) %>% relocate(wind_lic,.before=sun_ul)
## # A tibble: 21,168 x 5
##   dt                  sun_lic wind_lic sun_ul wind_ul
##   <dttm>                <dbl>    <dbl>  <dbl>   <dbl>
## 1 2020-05-31 23:00:00       0    1434. 0.0582    3.78
## 2 2020-05-31 22:00:00       0    1577. 0.032     4.84
## 3 2020-05-31 21:00:00       0    1858. 0.0335    7.67
## # … with 21,165 more rows

If we specify nothing, it will be in the first place.

raw_df %>% select(dt,starts_with("sun_"),starts_with("wind")) %>% relocate(wind_lic)
## # A tibble: 21,168 x 5
##   wind_lic dt                  sun_lic sun_ul wind_ul
##      <dbl> <dttm>                <dbl>  <dbl>   <dbl>
## 1    1434. 2020-05-31 23:00:00       0 0.0582    3.78
## 2    1577. 2020-05-31 22:00:00       0 0.032     4.84
## 3    1858. 2020-05-31 21:00:00       0 0.0335    7.67
## # … with 21,165 more rows

We use last_col() if we want to take a column to the end.

raw_df %>% select(dt,starts_with("sun_"),starts_with("wind")) %>% relocate(dt,.after=last_col())
## # A tibble: 21,168 x 5
##   sun_lic sun_ul wind_lic wind_ul dt                 
##     <dbl>  <dbl>    <dbl>   <dbl> <dttm>             
## 1       0 0.0582    1434.    3.78 2020-05-31 23:00:00
## 2       0 0.032     1577.    4.84 2020-05-31 22:00:00
## 3       0 0.0335    1858.    7.67 2020-05-31 21:00:00
## # … with 21,165 more rows

Honestly, relocate is a very convenient function.

select/rename advanced usage

Advanced usage subsection introduces extra functionality which can be a bit confusing at the first phase. But, once you get a grasp on the fundamentals check back here as reference. There are several features not available for versions before dplyr 1.0.0.

We can use rename_with function to rename columns with given criteria. In pipe version of the function, first parameter is the function and the second parameter is the criterion. Let’s replace all “Type” with "_type“. For instance it should change”canalType" to “canal_type”.

raw_df %>% rename_with(~gsub("Type","_type",.),contains("Type")) %>% glimpse()
## Rows: 21,168
## Columns: 17
## $ dt              <dttm> 2020-05-31 23:00:00, 2020-05-31 22:00:00, 2020-05-31…
## $ wind_lic        <dbl> 1433.8132, 1577.1419, 1857.5492, 1933.0142, 2031.7862…
## $ geothermal_lic  <dbl> 912.7882, 907.9303, 900.5844, 888.4561, 864.5402, 847…
## $ biogas_lic      <dbl> 75.8047, 75.6163, 75.3906, 76.7873, 76.9707, 77.5750,…
## $ canal_type_lic  <dbl> 2584.930, 2630.602, 2585.038, 2542.381, 2594.459, 262…
## $ river_type_lic  <dbl> 316.5538, 316.6800, 356.7637, 350.1544, 377.5312, 379…
## $ biomass_lic     <dbl> 262.4994, 253.0814, 246.9268, 249.9152, 248.2336, 246…
## $ landfillGas_lic <dbl> 100.3971, 101.1378, 100.4442, 100.7307, 102.2474, 102…
## $ sun_lic         <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 2.0594, 14.2800, 48.0…
## $ reservoir_lic   <dbl> 2306.303, 2296.045, 2279.266, 2308.918, 2792.313, 318…
## $ others_lic      <dbl> 48.3833, 48.4011, 48.4041, 48.4199, 48.4653, 48.5485,…
## $ wind_ul         <dbl> 3.7751, 4.8375, 7.6659, 11.8121, 13.1070, 13.1830, 10…
## $ biogas_ul       <dbl> 16.9293, 16.9227, 16.9052, 16.7517, 16.2928, 16.5989,…
## $ canal_type_ul   <dbl> 4.1749, 4.4221, 4.4658, 4.6020, 4.6195, 4.5146, 4.661…
## $ biomass_ul      <dbl> 15.4698, 15.3609, 16.0483, 15.2271, 15.5563, 15.5007,…
## $ sun_ul          <dbl> 0.0582, 0.0320, 0.0335, 1.3121, 103.3267, 555.5787, 1…
## $ others_ul       <dbl> 0.0610, 0.0395, 0.4136, 0.5508, 0.7106, 1.3775, 2.746…

Did you notice ~ and . in the function? Dot (.) is a representation of the entity. Depending on the situation it can be the latest version of the tibble in the pipe chain, a specific column or something else. ~ is a special character notifying that function evaluation will be done using the dot notation. We will see more examples of that.

Let’s introduce where. If is a function from tidyselect package to select variables with a function where it returns TRUE. It is quite handy.

raw_df %>% select(dt,starts_with("sun_"),starts_with("wind")) %>% relocate(where(is.numeric))
## # A tibble: 21,168 x 5
##   sun_lic sun_ul wind_lic wind_ul dt                 
##     <dbl>  <dbl>    <dbl>   <dbl> <dttm>             
## 1       0 0.0582    1434.    3.78 2020-05-31 23:00:00
## 2       0 0.032     1577.    4.84 2020-05-31 22:00:00
## 3       0 0.0335    1858.    7.67 2020-05-31 21:00:00
## # … with 21,165 more rows

We can also use any_of or all_of functions in select. The main difference is while the former returns as much as it can, the latter will throw an error if any one of the criteria is not fulfilled. Let’s try to select “dt”, “others_lic” and “nuclear_lic”. Since this data does not include nuclear power production we should not see it.

raw_df %>% select(any_of(c("dt","others_lic","nuclear_lic")))
## # A tibble: 21,168 x 2
##   dt                  others_lic
##   <dttm>                   <dbl>
## 1 2020-05-31 23:00:00       48.4
## 2 2020-05-31 22:00:00       48.4
## 3 2020-05-31 21:00:00       48.4
## # … with 21,165 more rows

In order not to break our notebook, we wrap it around try (error handling is another topic).

try(raw_df %>% select(all_of(c("dt","others_lic","nuclear_lic"))))
## Error : Can't subset columns that don't exist.
## x Column `nuclear_lic` doesn't exist.

filter/distinct

Filter (no more “as the name suggests”, as you already figured it out) helps filter rows according to given criteria. It is highly similar with Excel’s filter functionality (but much much more flexible and reliable).

Let’s see the production at 2020-05-08 16:00:00.

raw_df %>% filter(dt == "2020-05-08 16:00:00")
## # A tibble: 1 x 17
##   dt                  wind_lic geothermal_lic biogas_lic canalType_lic
##   <dttm>                 <dbl>          <dbl>      <dbl>         <dbl>
## 1 2020-05-08 16:00:00    2618.           856.       79.8         3896.
## # … with 12 more variables: riverType_lic <dbl>, biomass_lic <dbl>,
## #   landfillGas_lic <dbl>, sun_lic <dbl>, reservoir_lic <dbl>,
## #   others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>, canalType_ul <dbl>,
## #   biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>

By using == operator, we bring the values in dt column which are equal to the hour we desired. There are other expressions such as not equal to (!=), greater than (or equal to) (>,>=), smaller than (or equal to) (<,<=), in (%in%) and some more.

At the same time we can make comparisons between columns and combine multiple criteria to create more complex filters. Here we use AND (&) and OR (|) operators to combine criteria.

Suppose we want to find our the times when licensed wind production is greater than all of hydro type licensed production.

raw_df %>% filter(wind_lic > canalType_lic & wind_lic > reservoir_lic & wind_lic > riverType_lic)
## # A tibble: 11,287 x 17
##   dt                  wind_lic geothermal_lic biogas_lic canalType_lic
##   <dttm>                 <dbl>          <dbl>      <dbl>         <dbl>
## 1 2020-05-27 19:00:00    3303.           930.       74.7         2969.
## 2 2020-05-27 18:00:00    3596.           914.       75.0         2953.
## 3 2020-05-27 17:00:00    3551.           900.       76.3         2954.
## # … with 11,284 more rows, and 12 more variables: riverType_lic <dbl>,
## #   biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## #   reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## #   canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>

We can add numeric operations as well. Suppose we want to find the total solar production is greater than total wind production.

raw_df %>% filter(wind_lic + wind_ul < sun_lic + sun_ul)
## # A tibble: 4,949 x 17
##   dt                  wind_lic geothermal_lic biogas_lic canalType_lic
##   <dttm>                 <dbl>          <dbl>      <dbl>         <dbl>
## 1 2020-05-31 16:00:00    2036.           843.       76.8         2616.
## 2 2020-05-31 15:00:00    1875.           845.       77.4         2685.
## 3 2020-05-31 14:00:00    1755.           853.       77.2         2715.
## # … with 4,946 more rows, and 12 more variables: riverType_lic <dbl>,
## #   biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## #   reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## #   canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>

Suppose we want to filter only the unique values. Then we simply use distinct command. Let’s get unique rounded licensed wind production values.

raw_df %>% distinct(round(wind_lic))
## # A tibble: 4,893 x 1
##   `round(wind_lic)`
##               <dbl>
## 1              1434
## 2              1577
## 3              1858
## # … with 4,890 more rows

If we want to keep all columns we simply make the parameter .keep=TRUE.

raw_df %>% distinct(round(wind_lic),.keep=TRUE)
## # A tibble: 4,893 x 2
##   `round(wind_lic)` .keep
##               <dbl> <lgl>
## 1              1434 TRUE 
## 2              1577 TRUE 
## 3              1858 TRUE 
## # … with 4,890 more rows

filter/distinct advanced usage

Let’s introduce slice. This function helps return rows by its row number. Suppose we want the top 5 rows.

raw_df %>% slice(1:5) %>% print(n=5)
## # A tibble: 5 x 17
##   dt                  wind_lic geothermal_lic biogas_lic canalType_lic
##   <dttm>                 <dbl>          <dbl>      <dbl>         <dbl>
## 1 2020-05-31 23:00:00    1434.           913.       75.8         2585.
## 2 2020-05-31 22:00:00    1577.           908.       75.6         2631.
## 3 2020-05-31 21:00:00    1858.           901.       75.4         2585.
## 4 2020-05-31 20:00:00    1933.           888.       76.8         2542.
## 5 2020-05-31 19:00:00    2032.           865.       77.0         2594.
## # … with 12 more variables: riverType_lic <dbl>, biomass_lic <dbl>,
## #   landfillGas_lic <dbl>, sun_lic <dbl>, reservoir_lic <dbl>,
## #   others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>, canalType_ul <dbl>,
## #   biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>

If we want to return random rows we have slice_sample. Let’s bring 5 random rows.

raw_df %>% slice_sample(n=5)
## # A tibble: 5 x 17
##   dt                  wind_lic geothermal_lic biogas_lic canalType_lic
##   <dttm>                 <dbl>          <dbl>      <dbl>         <dbl>
## 1 2020-01-28 12:00:00    1363.           946.       78.0         1077.
## 2 2019-07-08 05:00:00    1229.           813.       77.8         1073.
## 3 2019-10-03 21:00:00    3178.           729.       79.1          679.
## # … with 2 more rows, and 12 more variables: riverType_lic <dbl>,
## #   biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## #   reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## #   canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>

If we want to do it proportionately, we have the prop parameter. Let’s say we want 0.1% of the data frame.

raw_df %>% slice_sample(prop=0.001)
## # A tibble: 21 x 17
##   dt                  wind_lic geothermal_lic biogas_lic canalType_lic
##   <dttm>                 <dbl>          <dbl>      <dbl>         <dbl>
## 1 2018-02-22 05:00:00    2059.           723.       68.8         1324.
## 2 2020-02-29 09:00:00    4043.           972.       78.5         2319.
## 3 2019-05-29 16:00:00    1660.           671.       77.3         3120.
## # … with 18 more rows, and 12 more variables: riverType_lic <dbl>,
## #   biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## #   reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## #   canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>

There are other slice_* type functions. These are slice_head/slice_tail for first/last n or percentage of rows. slice_min/slice_max for the top/bottom n rows according to an ordering criteria.

arrange

Arrange sorts rows from A to Z or smallest to largest. It has great similarity with Excel’s Sort functionality.

Let’s sort licensed reservoir production from largest to smallest.

raw_df %>% select(dt,reservoir_lic) %>% arrange(desc(reservoir_lic))
## # A tibble: 21,168 x 2
##   dt                  reservoir_lic
##   <dttm>                      <dbl>
## 1 2019-05-10 23:00:00         5058.
## 2 2019-05-10 21:00:00         5035.
## 3 2019-05-15 02:00:00         5019.
## # … with 21,165 more rows

Do you see desc() function inside arrange? By default arrange sorts a column by first to last or A-Z. desc reverses this.

You can also use multiple sorting criteria and use operations inside arrange. Let’s arrange by licensed wind production rounded down (floor) in 100s range (e.g. 5634 = 5600 and 5693 = 5600 as well). Then we sort by date time to see the first time the production entered a 100-range in the data time period.

raw_df %>% arrange(desc(floor(wind_lic/100)*100),dt)
## # A tibble: 21,168 x 17
##   dt                  wind_lic geothermal_lic biogas_lic canalType_lic
##   <dttm>                 <dbl>          <dbl>      <dbl>         <dbl>
## 1 2019-09-15 17:00:00    5767.           691.       83.0          922.
## 2 2018-09-26 19:00:00    5622.           672.       67.8          951.
## 3 2019-09-15 15:00:00    5628.           692.       81.4          901.
## # … with 21,165 more rows, and 12 more variables: riverType_lic <dbl>,
## #   biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## #   reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## #   canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>

mutate/transmute

Mutate is the function when we do operations and calculations using other columns.

For instance let’s calculate wind power’s share in total renewables production at each hour.

raw_df %>% mutate(wind_lic_perc = wind_lic / (wind_lic + geothermal_lic + biogas_lic + canalType_lic + riverType_lic + biomass_lic + landfillGas_lic + sun_lic + reservoir_lic + others_lic + wind_ul + biogas_ul + canalType_ul + biomass_ul + sun_ul + others_ul)) %>% select(dt, wind_lic_perc)
## # A tibble: 21,168 x 2
##   dt                  wind_lic_perc
##   <dttm>                      <dbl>
## 1 2020-05-31 23:00:00         0.177
## 2 2020-05-31 22:00:00         0.191
## 3 2020-05-31 21:00:00         0.219
## # … with 21,165 more rows

You can use many R functions (from both base functions and other packages). For instance to calculate “competition” wind and solar we can use the following expression.

raw_df %>% mutate(wind_or_solar = ifelse(wind_lic + wind_ul > sun_lic + sun_ul, "wind", "solar")) %>% select(dt,wind_or_solar)
## # A tibble: 21,168 x 2
##   dt                  wind_or_solar
##   <dttm>              <chr>        
## 1 2020-05-31 23:00:00 wind         
## 2 2020-05-31 22:00:00 wind         
## 3 2020-05-31 21:00:00 wind         
## # … with 21,165 more rows

Transmute has the same functionality as mutate with the additional property similar to select. Transmute returns only the columns included in the function. Suppose we also want to calculate the difference between total wind and total solar.

raw_df %>% transmute(dt, wind_or_solar = ifelse(wind_lic + wind_ul > sun_lic + sun_ul, "wind", "solar"), absdiff = abs(wind_lic + wind_ul - sun_lic - sun_ul))
## # A tibble: 21,168 x 3
##   dt                  wind_or_solar absdiff
##   <dttm>              <chr>           <dbl>
## 1 2020-05-31 23:00:00 wind            1438.
## 2 2020-05-31 22:00:00 wind            1582.
## 3 2020-05-31 21:00:00 wind            1865.
## # … with 21,165 more rows

mutate/transmute advanced usage

Suppose we want to see the difference between the previous and next hour’s production. We offset rows using lead and lag functions. But remember lead and lag does not actually give you “next/previous hour’s” values, just the rows. You may need to arrange your data.

raw_df %>% transmute(dt, wind_lic, wind_lic_prev_h = lead(wind_lic,1), wind_lic_next_h = lag(wind_lic,1))
## # A tibble: 21,168 x 4
##   dt                  wind_lic wind_lic_prev_h wind_lic_next_h
##   <dttm>                 <dbl>           <dbl>           <dbl>
## 1 2020-05-31 23:00:00    1434.           1577.             NA 
## 2 2020-05-31 22:00:00    1577.           1858.           1434.
## 3 2020-05-31 21:00:00    1858.           1933.           1577.
## # … with 21,165 more rows

If you want to use the same function over several columns, you can use the new across function. Let’s round every numeric column to one significant digit.

raw_df %>% mutate(across(where(is.numeric),~round(.,1)))
## # A tibble: 21,168 x 17
##   dt                  wind_lic geothermal_lic biogas_lic canalType_lic
##   <dttm>                 <dbl>          <dbl>      <dbl>         <dbl>
## 1 2020-05-31 23:00:00    1434.           913.       75.8         2585.
## 2 2020-05-31 22:00:00    1577.           908.       75.6         2631.
## 3 2020-05-31 21:00:00    1858.           901.       75.4         2585 
## # … with 21,165 more rows, and 12 more variables: riverType_lic <dbl>,
## #   biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## #   reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## #   canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>

We can also specify columns. Let’s see the comparative production of wind and reservoir hydro against the unlicensed solar production. We just increment solar production by 1 to prevent any infinity values.

raw_df %>% mutate(sun_ul = sun_ul + 1) %>% transmute(dt,across(c(wind_lic,reservoir_lic),~round(./sun_ul,2)))
## # A tibble: 21,168 x 3
##   dt                  wind_lic reservoir_lic
##   <dttm>                 <dbl>         <dbl>
## 1 2020-05-31 23:00:00    1355.         2179.
## 2 2020-05-31 22:00:00    1528.         2225.
## 3 2020-05-31 21:00:00    1797.         2205.
## # … with 21,165 more rows

If there are multiple conditions ifelse is not enough. It is possible to use case_when to specify multiple conditions and outcomes.

raw_df %>% transmute(dt, solar_production_level = case_when(sun_ul > quantile(sun_ul,0.9) ~ "very high", sun_ul > quantile(sun_ul, 0.75) ~ "high", sun_ul > quantile(sun_ul, 0.5) ~ "above median", TRUE ~ "below median"))  %>% slice(7:9)
## # A tibble: 3 x 2
##   dt                  solar_production_level
##   <dttm>              <chr>                 
## 1 2020-05-31 17:00:00 above median          
## 2 2020-05-31 16:00:00 high                  
## 3 2020-05-31 15:00:00 very high

rowwise is actually a type of group_by/summarise function but as the name suggests it allows us to do row-wise operations. Let’s calculate row sums by using c_across function and rowwise. Oh and also now, experimentally, you can use relocate functionality in mutate/transmute. So, conveniently we can place it after date time.

raw_df %>% slice_head(n=5) %>% rowwise() %>% mutate(total_prod = sum(c_across(where(is.numeric))),.after=dt)
## # A tibble: 5 x 18
## # Rowwise: 
##   dt                  total_prod wind_lic geothermal_lic biogas_lic
##   <dttm>                   <dbl>    <dbl>          <dbl>      <dbl>
## 1 2020-05-31 23:00:00      8082.    1434.           913.       75.8
## 2 2020-05-31 22:00:00      8248.    1577.           908.       75.6
## 3 2020-05-31 21:00:00      8496.    1858.           901.       75.4
## # … with 2 more rows, and 13 more variables: canalType_lic <dbl>,
## #   riverType_lic <dbl>, biomass_lic <dbl>, landfillGas_lic <dbl>,
## #   sun_lic <dbl>, reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>,
## #   biogas_ul <dbl>, canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>,
## #   others_ul <dbl>

group_by/summarise

Finally we will learn how to calculate summary tables. It is similar to Pivot Tables in Excel. group_by is the grouping function, summarise is the summarising function.

For instance let’s calculate number of hours where wind production is above 3000 MWh. We will use a special function n() to calculate number of rows. We can define groupings just like mutate.

raw_df %>% group_by(production_group = cut(wind_lic + wind_ul,breaks=c(0,1000,2000,3000,4000,5000,6000),include.lowest = TRUE)) %>% summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 2
##   production_group count
##   <fct>            <int>
## 1 [0,1e+03]         4294
## 2 (1e+03,2e+03]     5733
## 3 (2e+03,3e+03]     4725
## # … with 3 more rows

Normally, we get one result for each group and summary function. From dplyr 1.0.0 we can have multiple row summarise for each group. Let’s say we want to find the minimum and maximum licensed wind production ranges for each year. But, be warned, it can be a little confusing.

raw_df %>% group_by(year = lubridate::year(dt)) %>% summarise(wind_lic_range = range(wind_lic))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
## # A tibble: 6 x 2
## # Groups:   year [3]
##    year wind_lic_range
##   <dbl>          <dbl>
## 1  2018           46.2
## 2  2018         5622. 
## 3  2019           32.8
## # … with 3 more rows

group_by/summarise advanced usage

Just like mutate/transmute you can use across in summarise as well. Let’s see median production of each year and each source. In this example we can also use the named list version of the functions and additional names structure.

raw_df %>% group_by(year = lubridate::year(dt)) %>% summarise(across(where(is.numeric),list(med=median),.names="{fn}_{col}"))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 17
##    year med_wind_lic med_geothermal_… med_biogas_lic med_canalType_l…
##   <dbl>        <dbl>            <dbl>          <dbl>            <dbl>
## 1  2018        1989.             694.           66.4            1480.
## 2  2019        2136.             814.           80.9            1575.
## 3  2020        2297.             951.           78.5            2779.
## # … with 12 more variables: med_riverType_lic <dbl>, med_biomass_lic <dbl>,
## #   med_landfillGas_lic <dbl>, med_sun_lic <dbl>, med_reservoir_lic <dbl>,
## #   med_others_lic <dbl>, med_wind_ul <dbl>, med_biogas_ul <dbl>,
## #   med_canalType_ul <dbl>, med_biomass_ul <dbl>, med_sun_ul <dbl>,
## #   med_others_ul <dbl>

Exercises

Solve the following exercises. Outputs are given below, you are expected write code to match the outputs.

  1. Find the mean and standard deviation of licensed geothermal productions in all years. (Tip: Use lubridate::year to get years from date data.)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 3
##    year mean_geo sd_geo
##   <dbl>    <dbl>  <dbl>
## 1  2018     681.   65.2
## 2  2019     799.   74.2
## 3  2020     935.   59.0
  1. Find the hourly average unlicensed solar (sun_ul) production levels for May 2020.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 24 x 2
##    hour avg_prod
##   <int>    <dbl>
## 1     0     0.17
## 2     1     0.37
## 3     2     0.7 
## # … with 21 more rows
  1. Find the average daily percentage change of licensed biomass (biomass_lic) in 2019. (e.g. Suppose daily production is 50 in day 1 and 53 in day 2, then the change should be (53-50)/50 -1 = 0.06) (Tip: Use lubridate::as_date to convert date time to date. Use lag and lead functions to offset values.)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1 x 1
##   average_change
##            <dbl>
## 1        0.00282
  1. Find the yearly total production levels in TWh (Current values are in MWh. 1 GWh is 1000 MWh and 1 TWh is 1000 GWh). (Tip: In order to avoid a lengthy summation you can use tidyr::pivot_longer to get a long format.)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##    year total_production
##   <dbl>            <dbl>
## 1  2018             62.6
## 2  2019             76.7
## 3  2020             37.3

Conclusion

Fundamental dplyr functions provide very convenient tools for data analysis. It can also be used to generate the features required for modelling. You can process few million rows of data without breaking a sweat (for larger data sets you can use data.table), you can prepare functions instead of manual Excel operations. With R Markdown system, which this tutorial is prepared in, you can create reproducible documents and automatize the reports. You can use ggplot2 for visualizations, which is also part of the tidyverse package ecosystem.

Appendix

Mini lubridate tutorial

In this tutorial we use a small portion of a very powerful package, lubridate. You can see the official website here.

Let’s take just 3 dates at random from our data set.

set.seed(5)
lub_df <- 
raw_df %>% 
  select(dt) %>%
  sample_n(3)

print(lub_df)
## # A tibble: 3 x 1
##   dt                 
##   <dttm>             
## 1 2018-12-02 06:00:00
## 2 2019-01-12 05:00:00
## 3 2018-04-16 09:00:00

Since we called lubridate at the beginning of this tutorial we do not need to call by package reference (lubridate::) but it is generally good practice.

lub_df %>% 
  mutate(
    year = lubridate::year(dt),
    month = lubridate::month(dt),
    day = lubridate::day(dt),
    week_day = lubridate::wday(dt),
    wday_label = lubridate::wday(dt,label=TRUE),
    hour = lubridate::hour(dt),
    minute = lubridate::minute(dt),
    second = lubridate::second(dt)
  )
## # A tibble: 3 x 9
##   dt                   year month   day week_day wday_label  hour minute second
##   <dttm>              <dbl> <dbl> <int>    <dbl> <ord>      <int>  <int>  <dbl>
## 1 2018-12-02 06:00:00  2018    12     2        1 Sun            6      0      0
## 2 2019-01-12 05:00:00  2019     1    12        7 Sat            5      0      0
## 3 2018-04-16 09:00:00  2018     4    16        2 Mon            9      0      0

References