The main purpose of this document is to introduce a major data visualization package, ggplot2, using a contemporary subject. The Grammar of Graphics (hence the "gg" in the name) is a book written by Leland Wilkinson, and it heavily influenced ggplot2.
A few basic concepts are enough to create fundamental graphs. It is possible to do much more with additional functions and parameters, but let’s first focus on the fundamentals: the ggplot object, the + operator to connect layers, the aes function to set aesthetics (e.g. x and y), and the geom_* functions to define the type of plot.
There are two prerequisites to start: installing the tidyverse package and putting the relevant data set into the working directory (run getwd() in the console to locate your working directory). In this document, the data set contains hourly licensed and unlicensed renewable energy production between January 1, 2018 and May 31, 2020.
To install the package, run install.packages("tidyverse") in the console and select a mirror (the first one is fine). Once the package is installed, you can load it in any session with the library(tidyverse) command (no need to reinstall). You can download the data set from its GitHub Repository.
library(tidyverse) # tidyverse is a collection of packages that includes ggplot2 and dplyr
library(lubridate)
raw_df <- readRDS("rp_201801_202005_df.rds")
Note: We will use the same data and preparation process as the dplyr tutorial, along with some dplyr functionality, so it is recommended to have an understanding of dplyr first.
We are going to use all these elements in plot examples.
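As a quick orientation before the real data, here is a minimal sketch of the four fundamentals on a made-up data frame. The names toy_df, x_val, y_val and toy_plot are assumptions for illustration only and are not part of the energy data set.

```r
library(ggplot2)

# Hypothetical toy data, not part of the energy data set
toy_df <- data.frame(
  x_val = c(1, 2, 3, 4, 5),
  y_val = c(2.1, 3.9, 6.2, 8.1, 9.8)
)

# 1. ggplot() creates the plot object (the canvas) from the data
# 2. + connects the canvas to a layer
# 3. aes() maps columns to aesthetics (here x and y)
# 4. geom_point() decides how the data is drawn (as points)
toy_plot <- ggplot(toy_df) +
  geom_point(aes(x = x_val, y = y_val))

toy_plot
```

Swapping geom_point for another geom_* function changes the type of plot while the data and the aesthetic mapping stay the same.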
We use geom_point for scatter plots. Let’s create a scatter plot of licensed wind versus unlicensed solar production in May 2020, between hours 10 and 17. Let’s first create our data set plot_df1.
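To make the hour condition concrete, the following sketch applies the same kind of filter to a generated sequence of hourly timestamps instead of the real data. The vector may_hours and the UTC time zone are assumptions made for this illustration.

```r
library(lubridate)

# Hypothetical hourly timestamps covering May 2020 (UTC assumed)
may_hours <- seq(
  from = as.POSIXct("2020-05-01 00:00:00", tz = "UTC"),
  to   = as.POSIXct("2020-05-31 23:00:00", tz = "UTC"),
  by   = "hour"
)

# hour() extracts the hour of day (0-23) from each timestamp
in_window <- hour(may_hours) >= 10 & hour(may_hours) <= 17

length(may_hours)  # 31 days x 24 hours = 744
sum(in_window)     # 31 days x 8 hours  = 248
```

Hours 10 through 17 inclusive make 8 hours per day, so a complete month of 31 days should give 248 rows after filtering.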
plot_df1 <- raw_df %>%
  filter(dt >= "2020-05-01" & dt < "2020-06-01" &
           lubridate::hour(dt) >= 10 & lubridate::hour(dt) <= 17) %>%
  transmute(hour_of_day = lubridate::hour(dt), wind_lic, sun_ul)
print(plot_df1)
## # A tibble: 248 x 3
##   hour_of_day wind_lic sun_ul
##         <int>    <dbl>  <dbl>
## 1          17    2055.  1442.
## 2          16    2036.  2523.
## 3          15    1875.  3402.
## # … with 245 more rows
We will start with the canvas (the ggplot object), then add the geom_point layer with +. The rest will be handled by ggplot2.
ggplot(plot_df1) +
  geom_point(aes(x = wind_lic, y = sun_ul, color = as.character(hour_of_day)))
We could also have defined the aesthetics in the ggplot call itself, where they apply to every layer. ggplot2 is quite flexible about this.
ggplot(plot_df1, aes(x = wind_lic, y = sun_ul, color = as.character(hour_of_day))) +
  geom_point()
Time series can be beautifully represented with a line plot. ggplot2 handles date-time columns with dedicated scales (such as scale_x_datetime), so axis breaks and labels are chosen sensibly.
plot_df2 <- raw_df %>% filter(dt >= "2020-05-01" & dt < "2020-06-01") %>% select(dt,wind_lic,sun_ul) 
print(plot_df2)
## # A tibble: 744 x 3
##   dt                  wind_lic sun_ul
##   <dttm>                 <dbl>  <dbl>
## 1 2020-05-31 23:00:00    1434. 0.0582
## 2 2020-05-31 22:00:00    1577. 0.032 
## 3 2020-05-31 21:00:00    1858. 0.0335
## # … with 741 more rows
It is also possible to pipe dplyr output directly into ggplot2, although excessive use is not recommended. We are going to use the pivot_longer function (from tidyr, which is also part of the tidyverse) to convert the data from wide to long format.
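To see what the long format looks like before plotting, here is a minimal sketch on a two-row tibble. The object wide_df and its values are made up for illustration; only the column names mirror the real data.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per timestamp, one column per source
wide_df <- tibble(
  dt = as.POSIXct(c("2020-05-01 10:00:00", "2020-05-01 11:00:00"), tz = "UTC"),
  wind_lic = c(2000, 2100),
  sun_ul = c(1500, 2500)
)

# -dt keeps dt as an identifier and stacks every other column;
# the default output columns are called "name" and "value"
long_df <- pivot_longer(wide_df, -dt)

long_df
```

Each original row becomes one row per production column, which is why name can be mapped to color and value to y in the plot.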
plot_df2 %>%
  pivot_longer(., -dt) %>%
  ggplot(., aes(x = dt, y = value, color = name)) +
  geom_line()
You can create bar charts in a very similar manner. First, let’s create a data set that holds the total production for May 2020, with the production type (Licensed or Unlicensed) stored in a separate column.
plot_df3 <- raw_df %>%
  filter(dt >= "2020-05-01" & dt < "2020-06-01") %>%
  summarise(across(-dt, sum)) %>%
  pivot_longer(., everything()) %>%
  mutate(type = ifelse(grepl("_lic$", name), "Licensed", "Unlicensed"))
print(plot_df3)
## # A tibble: 16 x 3
##   name              value type    
##   <chr>             <dbl> <chr>   
## 1 wind_lic       1346260. Licensed
## 2 geothermal_lic  654089. Licensed
## 3 biogas_lic       56839. Licensed
## # … with 13 more rows
We use geom_bar for bar charts, and we use fill instead of color because color only changes the outline of the bars (try color to see the difference). We also order the columns by value, largest first, with reorder.
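The effect of reorder is easier to see on its own. The sketch below uses made-up names and totals (src_name and src_value are hypothetical) and prints the resulting factor levels, which is the order ggplot2 uses on a discrete axis.

```r
# Hypothetical category names and totals
src_name  <- c("biogas", "wind", "geothermal")
src_value <- c(50, 1300, 650)

# reorder() turns the names into a factor whose levels are sorted by the
# second argument in ascending order; negating the values puts the
# largest total first
ordered_name <- reorder(src_name, -src_value)

levels(ordered_name)  # "wind" "geothermal" "biogas"
```

Dropping the minus sign reverses the order, so the smallest total comes first.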
ggplot(plot_df3, aes(x = reorder(name, -value), y = value)) +
  geom_bar(stat = "identity", aes(fill = type))
A pie chart is actually a stacked bar chart with one extra step (coord_polar).
ggplot(plot_df3 %>% filter(type == "Licensed"), aes(x = "", y = value, fill = name)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y")
Creating plots with some beautiful colors is easy. Publication-grade plots need several more steps. Consider the scatter plot. Let’s assign it to a variable.
sc_plot <- ggplot(plot_df1) + geom_point(aes(x = wind_lic, y = sun_ul, color=as.character(hour_of_day)))
sc_plot
ggplot2 comes with some ready-to-use themes. One of them is theme_minimal.
sc_plot2 <- sc_plot + theme_minimal()
sc_plot2
It already looks better. Let’s improve the plot with some label changes using labs.
sc_plot3 <- sc_plot2 +
  labs(x = "Licensed Wind Production (MWh)",
       y = "Unlicensed Solar Production (MWh)",
       color = "Hour of Day",
       title = "Licensed Wind vs Unlicensed Solar",
       subtitle = "Renewable production in May 2020, between 10:00-17:00 each day")
sc_plot3
Finally, let’s adjust the axes a little (a thousands separator, plus angle and position adjustment of the x-axis labels) and move the legend to the top. See this Stack Overflow post for the inspiration.
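The thousands separator in the next step relies on base R's format function. As a small sketch, the same call can be tried on plain numbers first; the helper name dot_thousands is hypothetical, since the plot code passes the function inline.

```r
# Hypothetical helper name; the plot code passes the same function inline
dot_thousands <- function(x) {
  format(x, big.mark = ".", decimal.mark = ",", scientific = FALSE)
}

dot_thousands(1500)     # "1.500"
dot_thousands(2500000)  # "2.500.000"
```

Setting decimal.mark to a comma matters here, because the dot is already used as the thousands separator.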
sc_plot3 +
  theme(legend.position = "top", axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ".", decimal.mark = ",", scientific = FALSE)) +
  scale_x_continuous(labels = function(x) format(x, big.mark = ".", decimal.mark = ",", scientific = FALSE))
Once you have learned how to manipulate data with dplyr, ggplot2 gives you the opportunity to tell great stories with polished visualizations. There are also packages that extend ggplot2, such as ggnetwork, ggiraph, tvthemes, and gghighlight. You can find many more on CRAN, GitHub and elsewhere.