The main purpose of this document is to introduce a major data manipulation package, dplyr, with a contemporary data set. There are seven fundamental dplyr functions: select/rename, filter, distinct, arrange, mutate/transmute, group_by and summarise, plus a number of assisting functions used either inside them or separately. In this document we will cover every one of them, together with some supplementary functions for carrying out data operations. The pipe operator (%>%) will also be briefly introduced. This document is updated for dplyr 1.0.0 and R 4.0+ so that it can show you the new features as well. We will also use the lubridate package for date-time operations, but we will not cover it beyond quick explanations (check the appendix at the end of this document for a mini tutorial). You can find advanced features in the “Advanced Usage” subsections.
There are two prerequisites to start: installing the tidyverse package and putting the relevant data set into the working directory (type getwd() in the console to locate your working directory). The data set used in this document is the hourly licensed and unlicensed renewable energy production between January 1, 2018 and May 31, 2020.
To install the package, run install.packages("tidyverse") in the console and select a mirror (the first one is quite OK). Once you install the package you can always load it with the library(tidyverse) command (no need to reinstall). You can download the data set from its GitHub repository.
library(tidyverse) #tidyverse is a package group which includes dplyr as well
library(lubridate)
raw_df <- readRDS("rp_201801_202005_df.rds")
The first two commands above load the packages (loading is different from installing). The third command reads the data and assigns it to the raw_df variable. There are two assignment operators in R: <- and =. No operation in R is permanent unless you assign its result to something (there are exceptions, though; see the data.table package, for instance). We will benefit from this property throughout this document: no matter how many operations we apply in each example, we always start from the original data frame.
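For example (a minimal sketch, not from the original text; raw_df_copy is just an illustrative name), printing never changes the data, only an assignment keeps a result:
head(raw_df) # just prints the first rows; raw_df itself is unchanged
raw_df_copy <- raw_df # an explicit assignment is needed to keep a result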
Let’s do a simple sanity check. The output of the following command reads “21,168 x 17” in the first line, which means there are 21,168 rows and 17 columns in the tibble. Tibble is the name of dplyr’s data.frame type; it is essentially a data frame plus some advanced features. Under each column name there is an abbreviation of its data type, such as character/string (<chr>), double/numeric (<dbl>) or date-time (<dttm>).
print(raw_df,n=3)
## # A tibble: 21,168 x 17
## dt wind_lic geothermal_lic biogas_lic canalType_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-31 23:00:00 1434. 913. 75.8 2585.
## 2 2020-05-31 22:00:00 1577. 908. 75.6 2631.
## 3 2020-05-31 21:00:00 1858. 901. 75.4 2585.
## # … with 21,165 more rows, and 12 more variables: riverType_lic <dbl>,
## # biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## # reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## # canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>
We can also use the glimpse function to inspect the data. With glimpse, each column is shown on its own row together with its data type and first few entries. Our data consists of the hourly renewable electricity production of YEKDEM plants of different origins and license types. YEKDEM is a type of feed-in-tariff incentive framework. The suffix "_lic" denotes licensed (larger scale) plants and "_ul" denotes unlicensed (smaller scale) production. The canalType, riverType and reservoir columns represent hydro power.
raw_df %>% glimpse()
## Rows: 21,168
## Columns: 17
## $ dt <dttm> 2020-05-31 23:00:00, 2020-05-31 22:00:00, 2020-05-31…
## $ wind_lic <dbl> 1433.8132, 1577.1419, 1857.5492, 1933.0142, 2031.7862…
## $ geothermal_lic <dbl> 912.7882, 907.9303, 900.5844, 888.4561, 864.5402, 847…
## $ biogas_lic <dbl> 75.8047, 75.6163, 75.3906, 76.7873, 76.9707, 77.5750,…
## $ canalType_lic <dbl> 2584.930, 2630.602, 2585.038, 2542.381, 2594.459, 262…
## $ riverType_lic <dbl> 316.5538, 316.6800, 356.7637, 350.1544, 377.5312, 379…
## $ biomass_lic <dbl> 262.4994, 253.0814, 246.9268, 249.9152, 248.2336, 246…
## $ landfillGas_lic <dbl> 100.3971, 101.1378, 100.4442, 100.7307, 102.2474, 102…
## $ sun_lic <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 2.0594, 14.2800, 48.0…
## $ reservoir_lic <dbl> 2306.303, 2296.045, 2279.266, 2308.918, 2792.313, 318…
## $ others_lic <dbl> 48.3833, 48.4011, 48.4041, 48.4199, 48.4653, 48.5485,…
## $ wind_ul <dbl> 3.7751, 4.8375, 7.6659, 11.8121, 13.1070, 13.1830, 10…
## $ biogas_ul <dbl> 16.9293, 16.9227, 16.9052, 16.7517, 16.2928, 16.5989,…
## $ canalType_ul <dbl> 4.1749, 4.4221, 4.4658, 4.6020, 4.6195, 4.5146, 4.661…
## $ biomass_ul <dbl> 15.4698, 15.3609, 16.0483, 15.2271, 15.5563, 15.5007,…
## $ sun_ul <dbl> 0.0582, 0.0320, 0.0335, 1.3121, 103.3267, 555.5787, 1…
## $ others_ul <dbl> 0.0610, 0.0395, 0.4136, 0.5508, 0.7106, 1.3775, 2.746…
Did you notice the expression we used this time? The pipe operator makes data analysis and transformation very easy and civilized. We will use pipes frequently in this document and in the future. With the pipe operator we can chain many functions without calling the variable multiple times.
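As a quick sketch of the idea, the piped glimpse call above is equivalent to the nested call, and several steps can be chained without intermediate variables:
glimpse(raw_df) # same as raw_df %>% glimpse()
raw_df %>% head(3) %>% glimpse() # chain: take the first 3 rows, then glimpse them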
In this chapter the fundamental functions of dplyr are introduced. Each function will be used in the examples that follow once it has been introduced. To limit the number of displayed rows, we set the following global option; you can ignore this part in your own exercises.
options(tibble.print_max = 3, tibble.print_min = 3)
select/rename
Select, as the name suggests, is used to select columns. For instance, suppose we only want the licensed wind production (wind_lic) and date-time (dt) columns.
raw_df %>% select(dt,wind_lic)
## # A tibble: 21,168 x 2
## dt wind_lic
## <dttm> <dbl>
## 1 2020-05-31 23:00:00 1434.
## 2 2020-05-31 22:00:00 1577.
## 3 2020-05-31 21:00:00 1858.
## # … with 21,165 more rows
If we wanted to write the above expression without the pipe operator, we would end up with the sad expression below. You can extrapolate how complicated things can get without the pipe.
select(raw_df,dt,wind_lic)
We can use rename to rename columns (again, as the name suggests). Let’s change dt to date_time.
raw_df %>% rename(date_time = dt)
## # A tibble: 21,168 x 17
## date_time wind_lic geothermal_lic biogas_lic canalType_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-31 23:00:00 1434. 913. 75.8 2585.
## 2 2020-05-31 22:00:00 1577. 908. 75.6 2631.
## 3 2020-05-31 21:00:00 1858. 901. 75.4 2585.
## # … with 21,165 more rows, and 12 more variables: riverType_lic <dbl>,
## # biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## # reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## # canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>
P.S. We can also rename columns inside the select function, as the next example shows.
Select has many convenient helpers and special expressions. If we know the order of the columns, we can use the colon (:) operator to select every column in a range. Suppose we want date-time (dt) and all licensed production columns.
raw_df %>% select(date_time=dt,wind_lic:others_lic)
## # A tibble: 21,168 x 11
## date_time wind_lic geothermal_lic biogas_lic canalType_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-31 23:00:00 1434. 913. 75.8 2585.
## 2 2020-05-31 22:00:00 1577. 908. 75.6 2631.
## 3 2020-05-31 21:00:00 1858. 901. 75.4 2585.
## # … with 21,165 more rows, and 6 more variables: riverType_lic <dbl>,
## # biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## # reservoir_lic <dbl>, others_lic <dbl>
We can drop unwanted columns by putting - before their names. Suppose I am not interested in wind values and want to remove all wind-related columns from the data set; conveniently, all of those column names start with “wind_”. We can do this by combining - with starts_with.
raw_df %>% select(-starts_with("wind_"))
## # A tibble: 21,168 x 15
## dt geothermal_lic biogas_lic canalType_lic riverType_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-31 23:00:00 913. 75.8 2585. 317.
## 2 2020-05-31 22:00:00 908. 75.6 2631. 317.
## 3 2020-05-31 21:00:00 901. 75.4 2585. 357.
## # … with 21,165 more rows, and 10 more variables: biomass_lic <dbl>,
## # landfillGas_lic <dbl>, sun_lic <dbl>, reservoir_lic <dbl>,
## # others_lic <dbl>, biogas_ul <dbl>, canalType_ul <dbl>, biomass_ul <dbl>,
## # sun_ul <dbl>, others_ul <dbl>
There are similar helper expressions for other purposes, such as starts_with, ends_with, everything and contains. You can see all of them in the Cheat Sheet link given at the end of this document.
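As a small sketch of two of these helpers (the chosen columns are arbitrary), contains matches a substring anywhere in the name and everything selects the remaining columns:
raw_df %>% select(dt, contains("sun")) # every column whose name contains "sun"
raw_df %>% select(dt, starts_with("sun_"), everything()) # solar columns first, the rest keep their order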
dplyr 1.0.0 Feature: Sometimes you just want to change the order of the columns; for that, use relocate. Suppose we want to show solar and wind production together with date-time, but with licensed wind placed right next to licensed solar.
raw_df %>% select(dt,starts_with("sun_"),starts_with("wind")) %>% relocate(wind_lic,.before=sun_ul)
## # A tibble: 21,168 x 5
## dt sun_lic wind_lic sun_ul wind_ul
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-31 23:00:00 0 1434. 0.0582 3.78
## 2 2020-05-31 22:00:00 0 1577. 0.032 4.84
## 3 2020-05-31 21:00:00 0 1858. 0.0335 7.67
## # … with 21,165 more rows
If we specify neither .before nor .after, the column is moved to the first position.
raw_df %>% select(dt,starts_with("sun_"),starts_with("wind")) %>% relocate(wind_lic)
## # A tibble: 21,168 x 5
## wind_lic dt sun_lic sun_ul wind_ul
## <dbl> <dttm> <dbl> <dbl> <dbl>
## 1 1434. 2020-05-31 23:00:00 0 0.0582 3.78
## 2 1577. 2020-05-31 22:00:00 0 0.032 4.84
## 3 1858. 2020-05-31 21:00:00 0 0.0335 7.67
## # … with 21,165 more rows
We use last_col() if we want to move a column to the end.
raw_df %>% select(dt,starts_with("sun_"),starts_with("wind")) %>% relocate(dt,.after=last_col())
## # A tibble: 21,168 x 5
## sun_lic sun_ul wind_lic wind_ul dt
## <dbl> <dbl> <dbl> <dbl> <dttm>
## 1 0 0.0582 1434. 3.78 2020-05-31 23:00:00
## 2 0 0.032 1577. 4.84 2020-05-31 22:00:00
## 3 0 0.0335 1858. 7.67 2020-05-31 21:00:00
## # … with 21,165 more rows
Honestly, relocate is a very convenient function.
select/rename advanced usage
The Advanced Usage subsections introduce extra functionality that can be a bit confusing at first. Once you get a grasp of the fundamentals, check back here as a reference. Several of the features here are not available in versions before dplyr 1.0.0.
We can use the rename_with function to rename columns according to a given criterion. In the piped version of the function, the first parameter is the renaming function and the second parameter is the selection criterion. Let’s replace every “Type” with “_type”; for instance, it should change “canalType” to “canal_type”.
raw_df %>% rename_with(~gsub("Type","_type",.),contains("Type")) %>% glimpse()
## Rows: 21,168
## Columns: 17
## $ dt <dttm> 2020-05-31 23:00:00, 2020-05-31 22:00:00, 2020-05-31…
## $ wind_lic <dbl> 1433.8132, 1577.1419, 1857.5492, 1933.0142, 2031.7862…
## $ geothermal_lic <dbl> 912.7882, 907.9303, 900.5844, 888.4561, 864.5402, 847…
## $ biogas_lic <dbl> 75.8047, 75.6163, 75.3906, 76.7873, 76.9707, 77.5750,…
## $ canal_type_lic <dbl> 2584.930, 2630.602, 2585.038, 2542.381, 2594.459, 262…
## $ river_type_lic <dbl> 316.5538, 316.6800, 356.7637, 350.1544, 377.5312, 379…
## $ biomass_lic <dbl> 262.4994, 253.0814, 246.9268, 249.9152, 248.2336, 246…
## $ landfillGas_lic <dbl> 100.3971, 101.1378, 100.4442, 100.7307, 102.2474, 102…
## $ sun_lic <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 2.0594, 14.2800, 48.0…
## $ reservoir_lic <dbl> 2306.303, 2296.045, 2279.266, 2308.918, 2792.313, 318…
## $ others_lic <dbl> 48.3833, 48.4011, 48.4041, 48.4199, 48.4653, 48.5485,…
## $ wind_ul <dbl> 3.7751, 4.8375, 7.6659, 11.8121, 13.1070, 13.1830, 10…
## $ biogas_ul <dbl> 16.9293, 16.9227, 16.9052, 16.7517, 16.2928, 16.5989,…
## $ canal_type_ul <dbl> 4.1749, 4.4221, 4.4658, 4.6020, 4.6195, 4.5146, 4.661…
## $ biomass_ul <dbl> 15.4698, 15.3609, 16.0483, 15.2271, 15.5563, 15.5007,…
## $ sun_ul <dbl> 0.0582, 0.0320, 0.0335, 1.3121, 103.3267, 555.5787, 1…
## $ others_ul <dbl> 0.0610, 0.0395, 0.4136, 0.5508, 0.7106, 1.3775, 2.746…
Did you notice the ~ and . in the function? The dot (.) is a placeholder for the current entity; depending on the situation it can be the latest version of the tibble in the pipe chain, a specific column, or something else. The tilde (~) is a special character signalling that the expression is a function to be evaluated using the dot notation. We will see more examples of this.
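For instance, here is a sketch showing that the ~/. lambda form is equivalent to passing a function directly (toupper is only an illustrative choice):
raw_df %>% rename_with(toupper, starts_with("sun_")) # pass a function by name
raw_df %>% rename_with(~toupper(.), starts_with("sun_")) # same result with the ~ and . notation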
Let’s introduce where. It is a function from the tidyselect package that selects the variables for which a given predicate function returns TRUE. It is quite handy.
raw_df %>% select(dt,starts_with("sun_"),starts_with("wind")) %>% relocate(where(is.numeric))
## # A tibble: 21,168 x 5
## sun_lic sun_ul wind_lic wind_ul dt
## <dbl> <dbl> <dbl> <dbl> <dttm>
## 1 0 0.0582 1434. 3.78 2020-05-31 23:00:00
## 2 0 0.032 1577. 4.84 2020-05-31 22:00:00
## 3 0 0.0335 1858. 7.67 2020-05-31 21:00:00
## # … with 21,165 more rows
We can also use the any_of and all_of functions in select. The main difference is that while the former selects as many of the requested columns as it can find, the latter throws an error if any of them is missing. Let’s try to select “dt”, “others_lic” and “nuclear_lic”. Since this data does not include nuclear power production, we should not see it.
raw_df %>% select(any_of(c("dt","others_lic","nuclear_lic")))
## # A tibble: 21,168 x 2
## dt others_lic
## <dttm> <dbl>
## 1 2020-05-31 23:00:00 48.4
## 2 2020-05-31 22:00:00 48.4
## 3 2020-05-31 21:00:00 48.4
## # … with 21,165 more rows
In order not to break our notebook, we wrap the failing call in try (error handling is another topic).
try(raw_df %>% select(all_of(c("dt","others_lic","nuclear_lic"))))
## Error : Can't subset columns that don't exist.
## x Column `nuclear_lic` doesn't exist.
filter/distinct
Filter (no more “as the name suggests”, as you have already figured it out) helps us filter rows according to given criteria. It is very similar to Excel’s filter functionality (but much, much more flexible and reliable).
Let’s see the production at 2020-05-08 16:00:00.
raw_df %>% filter(dt == "2020-05-08 16:00:00")
## # A tibble: 1 x 17
## dt wind_lic geothermal_lic biogas_lic canalType_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-08 16:00:00 2618. 856. 79.8 3896.
## # … with 12 more variables: riverType_lic <dbl>, biomass_lic <dbl>,
## # landfillGas_lic <dbl>, sun_lic <dbl>, reservoir_lic <dbl>,
## # others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>, canalType_ul <dbl>,
## # biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>
By using the == operator, we keep the rows whose dt value equals the hour we want. There are other comparison operators such as not equal to (!=), greater than (or equal to) (>, >=), less than (or equal to) (<, <=), in (%in%) and some more.
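For example, a small sketch combining %in% with a lubridate helper (the chosen hours are arbitrary):
raw_df %>% filter(lubridate::hour(dt) %in% c(0, 12)) %>% select(dt, wind_lic) # keep only midnight and noon rows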
At the same time, we can make comparisons between columns and combine multiple criteria to create more complex filters. Here we use the AND (&) and OR (|) operators to combine criteria.
Suppose we want to find out the times when licensed wind production is greater than each type of licensed hydro production.
raw_df %>% filter(wind_lic > canalType_lic & wind_lic > reservoir_lic & wind_lic > riverType_lic)
## # A tibble: 11,287 x 17
## dt wind_lic geothermal_lic biogas_lic canalType_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-27 19:00:00 3303. 930. 74.7 2969.
## 2 2020-05-27 18:00:00 3596. 914. 75.0 2953.
## 3 2020-05-27 17:00:00 3551. 900. 76.3 2954.
## # … with 11,284 more rows, and 12 more variables: riverType_lic <dbl>,
## # biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## # reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## # canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>
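The OR operator works the same way; here is a minimal sketch keeping the hours where either licensed or unlicensed solar production is zero:
raw_df %>% filter(sun_lic == 0 | sun_ul == 0) %>% select(dt, sun_lic, sun_ul) # | keeps a row if at least one condition holds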
We can add numeric operations as well. Suppose we want to find the times when total solar production is greater than total wind production.
raw_df %>% filter(wind_lic + wind_ul < sun_lic + sun_ul)
## # A tibble: 4,949 x 17
## dt wind_lic geothermal_lic biogas_lic canalType_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-31 16:00:00 2036. 843. 76.8 2616.
## 2 2020-05-31 15:00:00 1875. 845. 77.4 2685.
## 3 2020-05-31 14:00:00 1755. 853. 77.2 2715.
## # … with 4,946 more rows, and 12 more variables: riverType_lic <dbl>,
## # biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## # reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## # canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>
Suppose we want to keep only the unique values. Then we simply use the distinct command. Let’s get the unique rounded licensed wind production values.
raw_df %>% distinct(round(wind_lic))
## # A tibble: 4,893 x 1
## `round(wind_lic)`
## <dbl>
## 1 1434
## 2 1577
## 3 1858
## # … with 4,890 more rows
If we want to keep all the other columns as well, we set the parameter .keep_all = TRUE.
raw_df %>% distinct(round(wind_lic),.keep_all=TRUE)
## # A tibble: 4,893 x 18
## dt wind_lic geothermal_lic biogas_lic canalType_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-31 23:00:00 1434. 913. 75.8 2585.
## 2 2020-05-31 22:00:00 1577. 908. 75.6 2631.
## 3 2020-05-31 21:00:00 1858. 901. 75.4 2585.
## # … with 4,890 more rows, and 13 more variables: riverType_lic <dbl>,
## # biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## # reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## # canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>,
## # `round(wind_lic)` <dbl>
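distinct also accepts other computed expressions; as a quick sketch (not part of the original example), we can list the distinct calendar years present in the data using lubridate::year:
raw_df %>% distinct(year = lubridate::year(dt)) # one row per calendar year found in dt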
filter/distinct advanced usage
Let’s introduce slice. This function returns rows by their row numbers. Suppose we want the first 5 rows.
raw_df %>% slice(1:5) %>% print(n=5)
## # A tibble: 5 x 17
## dt wind_lic geothermal_lic biogas_lic canalType_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-31 23:00:00 1434. 913. 75.8 2585.
## 2 2020-05-31 22:00:00 1577. 908. 75.6 2631.
## 3 2020-05-31 21:00:00 1858. 901. 75.4 2585.
## 4 2020-05-31 20:00:00 1933. 888. 76.8 2542.
## 5 2020-05-31 19:00:00 2032. 865. 77.0 2594.
## # … with 12 more variables: riverType_lic <dbl>, biomass_lic <dbl>,
## # landfillGas_lic <dbl>, sun_lic <dbl>, reservoir_lic <dbl>,
## # others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>, canalType_ul <dbl>,
## # biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>
If we want to return random rows, we have slice_sample. Let’s get 5 random rows.
raw_df %>% slice_sample(n=5)
## # A tibble: 5 x 17
## dt wind_lic geothermal_lic biogas_lic canalType_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-01-28 12:00:00 1363. 946. 78.0 1077.
## 2 2019-07-08 05:00:00 1229. 813. 77.8 1073.
## 3 2019-10-03 21:00:00 3178. 729. 79.1 679.
## # … with 2 more rows, and 12 more variables: riverType_lic <dbl>,
## # biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## # reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## # canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>
If we want to sample proportionally instead, we have the prop parameter. Let’s say we want 0.1% of the data frame.
raw_df %>% slice_sample(prop=0.001)
## # A tibble: 21 x 17
## dt wind_lic geothermal_lic biogas_lic canalType_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2018-02-22 05:00:00 2059. 723. 68.8 1324.
## 2 2020-02-29 09:00:00 4043. 972. 78.5 2319.
## 3 2019-05-29 16:00:00 1660. 671. 77.3 3120.
## # … with 18 more rows, and 12 more variables: riverType_lic <dbl>,
## # biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## # reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## # canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>
There are other slice_* functions as well: slice_head/slice_tail return the first/last n rows (or a given percentage of rows), while slice_min/slice_max return the rows with the smallest/largest values of an ordering column.
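Here is a short sketch of these variants (the n values are arbitrary):
raw_df %>% slice_max(wind_lic, n = 3) %>% select(dt, wind_lic) # the 3 hours with the highest licensed wind production
raw_df %>% slice_head(n = 2) # first two rows
raw_df %>% slice_tail(n = 2) # last two rows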
arrange
Arrange sorts rows from A to Z or from smallest to largest. It is very similar to Excel’s Sort functionality.
Let’s sort licensed reservoir production from largest to smallest.
raw_df %>% select(dt,reservoir_lic) %>% arrange(desc(reservoir_lic))
## # A tibble: 21,168 x 2
## dt reservoir_lic
## <dttm> <dbl>
## 1 2019-05-10 23:00:00 5058.
## 2 2019-05-10 21:00:00 5035.
## 3 2019-05-15 02:00:00 5019.
## # … with 21,165 more rows
Do you see the desc() function inside arrange? By default arrange sorts a column from first to last (A to Z, smallest to largest); desc reverses this order.
You can also use multiple sorting criteria and perform operations inside arrange. Let’s arrange by licensed wind production rounded down (with floor) to the nearest 100 (e.g. both 5634 and 5693 become 5600), and then by date-time, so we can see the first time production reached each 100 MWh range within the data period.
raw_df %>% arrange(desc(floor(wind_lic/100)*100),dt)
## # A tibble: 21,168 x 17
## dt wind_lic geothermal_lic biogas_lic canalType_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2019-09-15 17:00:00 5767. 691. 83.0 922.
## 2 2018-09-26 19:00:00 5622. 672. 67.8 951.
## 3 2019-09-15 15:00:00 5628. 692. 81.4 901.
## # … with 21,165 more rows, and 12 more variables: riverType_lic <dbl>,
## # biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## # reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## # canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>
mutate/transmute
Mutate is the function we use to create new columns from operations and calculations on other columns.
For instance, let’s calculate wind power’s share of total renewable production at each hour.
raw_df %>% mutate(wind_lic_perc = wind_lic / (wind_lic + geothermal_lic + biogas_lic + canalType_lic + riverType_lic + biomass_lic + landfillGas_lic + sun_lic + reservoir_lic + others_lic + wind_ul + biogas_ul + canalType_ul + biomass_ul + sun_ul + others_ul)) %>% select(dt, wind_lic_perc)
## # A tibble: 21,168 x 2
## dt wind_lic_perc
## <dttm> <dbl>
## 1 2020-05-31 23:00:00 0.177
## 2 2020-05-31 22:00:00 0.191
## 3 2020-05-31 21:00:00 0.219
## # … with 21,165 more rows
You can use many R functions inside mutate (both base functions and those from other packages). For instance, to set up a “competition” between wind and solar we can use the following expression.
raw_df %>% mutate(wind_or_solar = ifelse(wind_lic + wind_ul > sun_lic + sun_ul, "wind", "solar")) %>% select(dt,wind_or_solar)
## # A tibble: 21,168 x 2
## dt wind_or_solar
## <dttm> <chr>
## 1 2020-05-31 23:00:00 wind
## 2 2020-05-31 22:00:00 wind
## 3 2020-05-31 21:00:00 wind
## # … with 21,165 more rows
Transmute has the same functionality as mutate, with the additional, select-like property that it returns only the columns included in the call. Suppose we also want to calculate the difference between total wind and total solar.
raw_df %>% transmute(dt, wind_or_solar = ifelse(wind_lic + wind_ul > sun_lic + sun_ul, "wind", "solar"), absdiff = abs(wind_lic + wind_ul - sun_lic - sun_ul))
## # A tibble: 21,168 x 3
## dt wind_or_solar absdiff
## <dttm> <chr> <dbl>
## 1 2020-05-31 23:00:00 wind 1438.
## 2 2020-05-31 22:00:00 wind 1582.
## 3 2020-05-31 21:00:00 wind 1865.
## # … with 21,165 more rows
mutate/transmute advanced usage
Suppose we want to see the difference from the previous and the next hour’s production. We can offset rows using the lead and lag functions. But remember that lead and lag do not actually give you the “next/previous hour’s” values, just the neighbouring rows; you may need to arrange your data first.
raw_df %>% transmute(dt, wind_lic, wind_lic_prev_h = lead(wind_lic,1), wind_lic_next_h = lag(wind_lic,1))
## # A tibble: 21,168 x 4
## dt wind_lic wind_lic_prev_h wind_lic_next_h
## <dttm> <dbl> <dbl> <dbl>
## 1 2020-05-31 23:00:00 1434. 1577. NA
## 2 2020-05-31 22:00:00 1577. 1858. 1434.
## 3 2020-05-31 21:00:00 1858. 1933. 1577.
## # … with 21,165 more rows
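Since this data set is ordered from newest to oldest, a sketch of an actual hour-over-hour change would first sort by dt (the new column name is illustrative):
raw_df %>% arrange(dt) %>% transmute(dt, wind_lic, change_from_prev_h = wind_lic - lag(wind_lic, 1)) # sort oldest to newest, then subtract the previous hour's value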
If you want to apply the same function over several columns, you can use the new across function. Let’s round every numeric column to one decimal place.
raw_df %>% mutate(across(where(is.numeric),~round(.,1)))
## # A tibble: 21,168 x 17
## dt wind_lic geothermal_lic biogas_lic canalType_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-31 23:00:00 1434. 913. 75.8 2585.
## 2 2020-05-31 22:00:00 1577. 908. 75.6 2631.
## 3 2020-05-31 21:00:00 1858. 901. 75.4 2585
## # … with 21,165 more rows, and 12 more variables: riverType_lic <dbl>,
## # biomass_lic <dbl>, landfillGas_lic <dbl>, sun_lic <dbl>,
## # reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>, biogas_ul <dbl>,
## # canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>, others_ul <dbl>
We can also specify particular columns. Let’s see the production of licensed wind and reservoir hydro relative to unlicensed solar production. We increment solar production by 1 to avoid dividing by zero (which would give infinite values).
raw_df %>% mutate(sun_ul = sun_ul + 1) %>% transmute(dt,across(c(wind_lic,reservoir_lic),~round(./sun_ul,2)))
## # A tibble: 21,168 x 3
## dt wind_lic reservoir_lic
## <dttm> <dbl> <dbl>
## 1 2020-05-31 23:00:00 1355. 2179.
## 2 2020-05-31 22:00:00 1528. 2225.
## 3 2020-05-31 21:00:00 1797. 2205.
## # … with 21,165 more rows
If there are multiple conditions, ifelse is not enough. In that case we can use case_when to specify multiple conditions and their outcomes.
raw_df %>% transmute(dt, solar_production_level = case_when(sun_ul > quantile(sun_ul,0.9) ~ "very high", sun_ul > quantile(sun_ul, 0.75) ~ "high", sun_ul > quantile(sun_ul, 0.5) ~ "above median", TRUE ~ "below median")) %>% slice(7:9)
## # A tibble: 3 x 2
## dt solar_production_level
## <dttm> <chr>
## 1 2020-05-31 17:00:00 above median
## 2 2020-05-31 16:00:00 high
## 3 2020-05-31 15:00:00 very high
rowwise is actually a kind of group_by/summarise operation but, as the name suggests, it allows us to do row-wise calculations. Let’s calculate row sums by using the c_across function together with rowwise. Also, you can now (experimentally) use relocate-style placement in mutate/transmute, so we can conveniently place the new column right after the date-time.
raw_df %>% slice_head(n=5) %>% rowwise() %>% mutate(total_prod = sum(c_across(where(is.numeric))),.after=dt)
## # A tibble: 5 x 18
## # Rowwise:
## dt total_prod wind_lic geothermal_lic biogas_lic
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-31 23:00:00 8082. 1434. 913. 75.8
## 2 2020-05-31 22:00:00 8248. 1577. 908. 75.6
## 3 2020-05-31 21:00:00 8496. 1858. 901. 75.4
## # … with 2 more rows, and 13 more variables: canalType_lic <dbl>,
## # riverType_lic <dbl>, biomass_lic <dbl>, landfillGas_lic <dbl>,
## # sun_lic <dbl>, reservoir_lic <dbl>, others_lic <dbl>, wind_ul <dbl>,
## # biogas_ul <dbl>, canalType_ul <dbl>, biomass_ul <dbl>, sun_ul <dbl>,
## # others_ul <dbl>
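As a side note (a sketch of an alternative, not part of the original example), the same row totals can be computed without rowwise by passing across to rowSums:
raw_df %>% slice_head(n=5) %>% mutate(total_prod = rowSums(across(where(is.numeric))), .after = dt) # rowSums works on the data frame returned by across()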
group_by/summarise
Finally, we will learn how to calculate summary tables. This is similar to Pivot Tables in Excel: group_by is the grouping function and summarise is the summarising function.
For instance, let’s count the hours falling into each total wind production range (1,000 MWh bins), which also shows how many hours production exceeds 3,000 MWh. We use the special function n() to count the rows in each group. We can define groupings on the fly, just like in mutate.
raw_df %>% group_by(production_group = cut(wind_lic + wind_ul,breaks=c(0,1000,2000,3000,4000,5000,6000),include.lowest = TRUE)) %>% summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 2
## production_group count
## <fct> <int>
## 1 [0,1e+03] 4294
## 2 (1e+03,2e+03] 5733
## 3 (2e+03,3e+03] 4725
## # … with 3 more rows
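For a more typical one-row-per-group summary, here is a sketch of a few yearly statistics for licensed wind production (the new column names are illustrative):
raw_df %>% group_by(year = lubridate::year(dt)) %>% summarise(hours = n(), mean_wind_lic = mean(wind_lic), max_wind_lic = max(wind_lic))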
Normally, we get one row per group for each summary function. Starting with dplyr 1.0.0, summarise can return multiple rows per group. Let’s say we want to find the minimum and maximum (the range) of licensed wind production for each year. But be warned, the output can be a little confusing.
raw_df %>% group_by(year = lubridate::year(dt)) %>% summarise(wind_lic_range = range(wind_lic))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
## # A tibble: 6 x 2
## # Groups: year [3]
## year wind_lic_range
## <dbl> <dbl>
## 1 2018 46.2
## 2 2018 5622.
## 3 2019 32.8
## # … with 3 more rows
group_by/summarise advanced usage
Just like in mutate/transmute, you can use across in summarise as well. Let’s see the median production of each source for each year. In this example we also use the named-list form for the functions and the .names argument to control the resulting column names.
raw_df %>% group_by(year = lubridate::year(dt)) %>% summarise(across(where(is.numeric),list(med=median),.names="{fn}_{col}"))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 17
## year med_wind_lic med_geothermal_… med_biogas_lic med_canalType_l…
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2018 1989. 694. 66.4 1480.
## 2 2019 2136. 814. 80.9 1575.
## 3 2020 2297. 951. 78.5 2779.
## # … with 12 more variables: med_riverType_lic <dbl>, med_biomass_lic <dbl>,
## # med_landfillGas_lic <dbl>, med_sun_lic <dbl>, med_reservoir_lic <dbl>,
## # med_others_lic <dbl>, med_wind_ul <dbl>, med_biogas_ul <dbl>,
## # med_canalType_ul <dbl>, med_biomass_ul <dbl>, med_sun_ul <dbl>,
## # med_others_ul <dbl>
Solve the following exercises. The expected outputs are given below; write code that matches each output.
(Hint: Use lubridate::year to get years from the date data.)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 3
## year mean_geo sd_geo
## <dbl> <dbl> <dbl>
## 1 2018 681. 65.2
## 2 2019 799. 74.2
## 3 2020 935. 59.0
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 24 x 2
## hour avg_prod
## <int> <dbl>
## 1 0 0.17
## 2 1 0.37
## 3 2 0.7
## # … with 21 more rows
(Hint: Use lubridate::as_date to convert date-times to dates. Use the lag and lead functions to offset values.)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1 x 1
## average_change
## <dbl>
## 1 0.00282
(Hint: Use tidyr::pivot_longer to get a long format.)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## year total_production
## <dbl> <dbl>
## 1 2018 62.6
## 2 2019 76.7
## 3 2020 37.3
The fundamental dplyr functions provide very convenient tools for data analysis. They can also be used to generate the features required for modelling. You can process a few million rows of data without breaking a sweat (for larger data sets you can use data.table), and you can write functions instead of repeating manual Excel operations. With the R Markdown system, in which this tutorial was prepared, you can create reproducible documents and automate your reports. For visualisation you can use ggplot2, which is also part of the tidyverse package ecosystem.
In this tutorial we used only a small portion of a very powerful package, lubridate. You can see its official website here.
Let’s take just 3 dates at random from our data set.
set.seed(5)
lub_df <-
raw_df %>%
select(dt) %>%
sample_n(3)
print(lub_df)
## # A tibble: 3 x 1
## dt
## <dttm>
## 1 2018-12-02 06:00:00
## 2 2019-01-12 05:00:00
## 3 2018-04-16 09:00:00
Since we loaded lubridate at the beginning of this tutorial we do not need to use the package prefix (lubridate::), but using it is generally good practice.
lub_df %>%
mutate(
year = lubridate::year(dt),
month = lubridate::month(dt),
day = lubridate::day(dt),
week_day = lubridate::wday(dt),
wday_label = lubridate::wday(dt,label=TRUE),
hour = lubridate::hour(dt),
minute = lubridate::minute(dt),
second = lubridate::second(dt)
)
## # A tibble: 3 x 9
## dt year month day week_day wday_label hour minute second
## <dttm> <dbl> <dbl> <int> <dbl> <ord> <int> <int> <dbl>
## 1 2018-12-02 06:00:00 2018 12 2 1 Sun 6 0 0
## 2 2019-01-12 05:00:00 2019 1 12 7 Sat 5 0 0
## 3 2018-04-16 09:00:00 2018 4 16 2 Mon 9 0 0
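Beyond extracting components, lubridate can also round and convert date-times; here is a small sketch (not part of the original appendix) using floor_date and as_date:
lub_df %>%
  mutate(
    day_start = lubridate::floor_date(dt, unit = "day"), # truncate to the start of the day
    month_start = lubridate::floor_date(dt, unit = "month"), # truncate to the start of the month
    date_only = lubridate::as_date(dt) # drop the time component entirely
  )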