Week2: Dplyr Exercises

Exercises

Solve the following exercises. Outputs are given below, you are expected write code to match the outputs.

Q1. Find the mean and standard deviation of licensed geothermal productions in all years. (Tip: Use lubridate::year to get years from date data.)

## # A tibble: 3 x 3
##    year mean_geo sd_geo
##   <dbl>    <dbl>  <dbl>
## 1  2018     681.   65.2
## 2  2019     799.   74.2
## 3  2020     935.   59.0

Solution 1

Let’s take the time and licensed geothermal production column.

raw_df %>% select(date_time=dt,geothermal_lic)

## # A tibble: 21,168 x 2
##    date_time           geothermal_lic
##    <dttm>                       <dbl>
##  1 2020-05-31 23:00:00           913.
##  2 2020-05-31 22:00:00           908.
##  3 2020-05-31 21:00:00           901.
##  4 2020-05-31 20:00:00           888.
##  5 2020-05-31 19:00:00           865.
##  6 2020-05-31 18:00:00           848.
##  7 2020-05-31 17:00:00           848.
##  8 2020-05-31 16:00:00           843.
##  9 2020-05-31 15:00:00           845.
## 10 2020-05-31 14:00:00           853.
## # … with 21,158 more rows

Then, add a year column with lubridate function and pivot the data on year. Calculate the mean and sd with the summarise function.

raw_df %>% 
  group_by(year = lubridate::year(dt))  %>% 
  summarise(mean_geo = round(mean(geothermal_lic)), sd_geo = round(sd(geothermal_lic),1))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 3
##    year mean_geo sd_geo
##   <dbl>    <dbl>  <dbl>
## 1  2018      681   65.2
## 2  2019      799   74.2
## 3  2020      935   59

Q2. Find the hourly average unlicensed solar (sun_ul) production levels for May 2020.

## # A tibble: 24 x 2
##    hour avg_prod
##   <int>    <dbl>
## 1     0     0.17
## 2     1     0.37
## 3     2     0.7 
## # … with 21 more rows

Solution 2

raw_df %>% 
  mutate(hour = lubridate::hour(dt), year=lubridate::year(dt), month=lubridate::month(dt)) %>%
  filter(year== 2020, month== 05) %>%
  group_by(hour) %>%
  summarise(avg_prod = round(mean(sun_ul),2)) %>%
  select(hour, avg_prod)

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 24 x 2
##     hour avg_prod
##    <int>    <dbl>
##  1     0     0.17
##  2     1     0.37
##  3     2     0.7 
##  4     3     0.91
##  5     4     1.26
##  6     5    22.7 
##  7     6   305.  
##  8     7  1156.  
##  9     8  2316.  
## 10     9  3330.  
## # … with 14 more rows

Q3. Find the average daily percentage change of licensed biomass (biomass_lic) in 2019. (e.g. Suppose daily production is 50 in day 1 and 53 in day 2, then the change should be (53-50)/50 -1 = 0.06) (Tip: Use lubridate::as_date to convert date time to date. Use lag and lead functions to offset values.)

## # A tibble: 1 x 1
##   average_change
##            <dbl>
## 1        0.00282

Solution 3

raw_df %>% 
  mutate(date = lubridate::date(dt), year=lubridate::year(dt)) %>%
  filter(year == 2019) %>%
  select(date,year,biomass_lic) %>%
  arrange(date) %>% 
  group_by(date) %>% 
  summarise(sum_bl=sum(biomass_lic)) %>%
  transmute(date, sum_bl, biomass_lic_next = lag(sum_bl,1)) %>%
  summarise(sum_bl, biomass_lic_next, p = ((sum_bl-biomass_lic_next)/biomass_lic_next) -1) %>%
  summarise(average_change = mean(c(NA, diff(p)),na.rm=TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 1 x 1
##   average_change
##            <dbl>
## 1       0.000176

Q4. Find the yearly total production levels in TWh (Current values are in MWh. 1 GWh is 1000 MWh and 1 TWh is 1000 GWh). (Tip: In order to avoid a lengthy summation you can use tidyr::pivot_longer to get a long format.)

## # A tibble: 3 x 2
##    year total_production
##   <dbl>            <dbl>
## 1  2018             62.6
## 2  2019             76.7
## 3  2020             37.3

Solution 4

raw_df %>%
  group_by(year=lubridate::year(dt)) %>%
  rowwise() %>%
  summarise(total_prod = sum(c_across(where(is.numeric)))) %>%
  summarise(tp = sum(total_prod)/1000000) %>%
  transmute(year, total_production = round(tp,digits=1))

## `summarise()` regrouping output by 'year' (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##    year total_production
##   <dbl>            <dbl>
## 1  2018             62.6
## 2  2019             76.7
## 3  2020             37.3