The main purpose of this document is to introduce a major data visualization package, ggplot2, using a contemporary subject. The Grammar of Graphics (hence the gg in the name) is a book written by Leland Wilkinson, and it heavily influences ggplot2.
There are several basic concepts needed to create fundamental graphs. It is possible to do much more with additional functions and parameters, but let’s first focus on the fundamentals: the ggplot object, the + operator to connect layers, the aes function to set aesthetics (e.g. x and y), and the geom_* functions to define the type of plot.
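As a minimal sketch of how these four pieces fit together, the following uses R's built-in mtcars data set instead of the energy data used in the rest of this document. The variable name p is only illustrative.

```r
library(ggplot2)

# ggplot() creates the plot object (the canvas) from a data frame
p <- ggplot(mtcars) +
  # `+` adds a layer; aes() maps columns to aesthetics; geom_point() draws points
  geom_point(aes(x = wt, y = mpg))

# Printing the object renders the plot
print(p)
```

Nothing is drawn until the object is printed, which is why a plot can be stored in a variable and extended later.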
There are two prerequisites to start: installing the tidyverse package and putting the relevant data set into the working directory (type getwd() in the console to locate your working directory). In this document, the data set contains hourly licensed and unlicensed renewable energy production between January 1, 2018 and May 31, 2020.
To install the package, run install.packages("tidyverse") in the console and select a mirror (the first one is fine). Once the package is installed, you can load it in any session with the library(tidyverse) command (no need to reinstall). You can download the data set from its GitHub Repository.
library(tidyverse) # tidyverse is a collection of packages that includes ggplot2 and dplyr
library(lubridate)
raw_df <- readRDS("rp_201801_202005_df.rds")
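If readRDS fails, the usual cause is that the file is not in the working directory. A small defensive sketch, assuming the file name used above, could look like this:

```r
# File name taken from the loading step above
rds_file <- "rp_201801_202005_df.rds"

if (file.exists(rds_file)) {
  raw_df <- readRDS(rds_file)
  # Quick look at the column names and types
  str(raw_df)
} else {
  message("Could not find ", rds_file, " in ", getwd())
}
```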
Note: We will use the same data and preparation process as in the dplyr tutorial, along with some dplyr functionality. So, it is recommended to have an understanding of dplyr first.
We are going to use all these elements in plot examples.
We use geom_point for scatter plots. Let’s create a scatter plot of licensed wind versus unlicensed solar production in May 2020, between hours 10 and 17. Let’s first create our data set, plot_df1.
plot_df1 <- raw_df %>%
  filter(dt >= "2020-05-01" & dt < "2020-06-01" &
           lubridate::hour(dt) >= 10 & lubridate::hour(dt) <= 17) %>%
  transmute(hour_of_day = lubridate::hour(dt), wind_lic, sun_ul)
print(plot_df1)
## # A tibble: 248 x 3
## hour_of_day wind_lic sun_ul
## <int> <dbl> <dbl>
## 1 17 2055. 1442.
## 2 16 2036. 2523.
## 3 15 1875. 3402.
## # … with 245 more rows
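To see why the result has 248 rows (31 days times 8 hours), here is a sketch of the same filtering logic on synthetic data. The data frame toy_df and its values are made up for illustration; only the column names follow the real data set. Note that comparing a date-time column with a plain string relies on R converting the string in the session's time zone, so this sketch spells out the boundaries explicitly in UTC.

```r
library(dplyr)
library(lubridate)

# Synthetic hourly data (made-up values) covering slightly more than May 2020
toy_df <- tibble(
  dt = seq(ymd_hm("2020-04-30 00:00"), ymd_hm("2020-06-01 23:00"), by = "hour"),
  wind_lic = seq_along(dt) * 1.5,
  sun_ul = seq_along(dt) * 0.5
)

# Explicit boundaries in the same time zone as dt (UTC)
may_start <- ymd("2020-05-01", tz = "UTC")
june_start <- ymd("2020-06-01", tz = "UTC")

toy_plot_df <- toy_df %>%
  filter(dt >= may_start, dt < june_start,
         hour(dt) >= 10, hour(dt) <= 17) %>%
  transmute(hour_of_day = hour(dt), wind_lic, sun_ul)

nrow(toy_plot_df)
```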
We will start with the canvas (the ggplot object) and then add the geom_point layer with +. The rest will be handled by ggplot2.
ggplot(plot_df1) + geom_point(aes(x = wind_lic, y = sun_ul, color=as.character(hour_of_day)))
We could also have defined the aesthetics in the ggplot object itself, in which case every layer inherits them. ggplot2 is quite flexible about this.
ggplot(plot_df1,aes(x = wind_lic, y = sun_ul, color=as.character(hour_of_day))) + geom_point()
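The difference between the two placements becomes visible once a plot has more than one layer. The following sketch uses the built-in mtcars data set (an assumption for illustration, not the energy data) and hypothetical object names p_shared and p_local.

```r
library(ggplot2)

# Aesthetics set in ggplot() are shared by every layer added afterwards
p_shared <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_line()

# Aesthetics set inside a geom apply to that layer only,
# so each layer has to declare its own mapping
p_local <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg)) +
  geom_line(aes(x = wt, y = mpg), color = "grey50")
```

Both objects draw the same points and line; the first form avoids repeating the mapping, while the second lets each layer use different columns.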
Time series can be beautifully represented with a line plot. ggplot2 recognizes date-time columns and chooses suitable axis breaks and labels for them.
plot_df2 <- raw_df %>% filter(dt >= "2020-05-01" & dt < "2020-06-01") %>% select(dt,wind_lic,sun_ul)
print(plot_df2)
## # A tibble: 744 x 3
## dt wind_lic sun_ul
## <dttm> <dbl> <dbl>
## 1 2020-05-31 23:00:00 1434. 0.0582
## 2 2020-05-31 22:00:00 1577. 0.032
## 3 2020-05-31 21:00:00 1858. 0.0335
## # … with 741 more rows
It is also possible to pipe dplyr output directly into ggplot2, although excessive use is not recommended. We are going to use the pivot_longer function (from tidyr, also part of the tidyverse) to convert the data from wide to long format.
plot_df2 %>%
  pivot_longer(., -dt) %>%
  ggplot(., aes(x = dt, y = value, color = name)) +
  geom_line()
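The columns name and value used in aes above are the defaults that pivot_longer creates. A small sketch with a made-up two-row data frame (wide_df is hypothetical) shows the reshaping:

```r
library(tidyr)
library(tibble)

# Two hours of made-up data in wide format: one column per production type
wide_df <- tibble(
  dt = as.POSIXct(c("2020-05-01 00:00:00", "2020-05-01 01:00:00"), tz = "UTC"),
  wind_lic = c(100, 110),
  sun_ul = c(0, 5)
)

# Every column except dt is stacked into a name/value pair
long_df <- pivot_longer(wide_df, -dt)
long_df
```

Each original row becomes one row per pivoted column, so the long format lets a single color = name mapping draw one line per production type.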
You can create bar charts in a very similar manner. First, let’s create a data set that holds the total production for May 2020, with the production type (Licensed or Unlicensed) in a separate column.
plot_df3 <- raw_df %>%
  filter(dt >= "2020-05-01" & dt < "2020-06-01") %>%
  summarise(across(-dt, sum)) %>%
  pivot_longer(., everything()) %>%
  # column names ending in _lic are licensed, the rest are unlicensed
  mutate(type = ifelse(grepl("_lic$", name), "Licensed", "Unlicensed"))
print(plot_df3)
## # A tibble: 16 x 3
## name value type
## <chr> <dbl> <chr>
## 1 wind_lic 1346260. Licensed
## 2 geothermal_lic 654089. Licensed
## 3 biogas_lic 56839. Licensed
## # … with 13 more rows
We use geom_bar for bar charts and fill instead of color, because color only changes the outline of the bars while fill changes their interior (try color to see the difference). We also order the bars by descending value with reorder.
ggplot(plot_df3,aes(x=reorder(name,-value),y=value)) + geom_bar(stat="identity",aes(fill=type))
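To make the role of reorder more concrete, here is a sketch on a made-up three-row data frame (toy_bar_df is hypothetical). It also uses geom_col, which is ggplot2's shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

toy_bar_df <- data.frame(
  name = c("a", "b", "c"),
  value = c(1, 3, 2)
)

# reorder() turns name into a factor whose levels are sorted by -value,
# i.e. from the largest value to the smallest
bar_levels <- levels(reorder(toy_bar_df$name, -toy_bar_df$value))
bar_levels

# geom_col() uses the y values as bar heights directly
p_bar <- ggplot(toy_bar_df, aes(x = reorder(name, -value), y = value)) +
  geom_col(fill = "steelblue")
```

Because the x-axis follows the factor levels, the bars appear in descending order of value rather than alphabetically.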
A pie chart is actually a stacked bar chart with one extra step: polar coordinates (coord_polar).
ggplot(plot_df3 %>% filter(type=="Licensed"),aes(x="",y=value,fill=name)) + geom_bar(stat="identity",width=1) + coord_polar("y")
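A common follow-up is to plot shares instead of raw totals and to remove the axes, which carry little meaning on a pie chart. The sketch below assumes made-up values in a hypothetical toy_pie_df and uses theme_void to drop the axes and grid.

```r
library(dplyr)
library(ggplot2)

# Made-up totals, converted to shares of the whole
toy_pie_df <- tibble(name = c("a", "b", "c"), value = c(50, 30, 20)) %>%
  mutate(share = value / sum(value))

p_pie <- ggplot(toy_pie_df, aes(x = "", y = share, fill = name)) +
  geom_col(width = 1) +   # one stacked bar
  coord_polar("y") +      # bend the y axis into a circle
  theme_void()            # remove axes, ticks and grid lines
```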
Creating plots with some beautiful colors is easy, but publication-grade plots need several more steps. Consider the scatter plot. Let’s assign it to a variable.
sc_plot <- ggplot(plot_df1) + geom_point(aes(x = wind_lic, y = sun_ul, color=as.character(hour_of_day)))
sc_plot
In ggplot2 there are some ready-to-use themes. One of them is theme_minimal.
sc_plot2 <- sc_plot + theme_minimal()
sc_plot2
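Instead of adding a theme to each plot, a theme can also be made the default for the whole session with theme_set. The sketch below uses the built-in mtcars data set and illustrative names (old_theme, p_themed); the base_size value is an arbitrary choice.

```r
library(ggplot2)

# theme_set() changes the session default and returns the previous theme
old_theme <- theme_set(theme_minimal(base_size = 12))

# No theme is added here; the plot picks up the default when it is drawn
p_themed <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
print(p_themed)

# Restore the previous default so other plots are unaffected
theme_set(old_theme)
```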
It already looks better. Let’s improve the plot with some label changes using labs.
sc_plot3 <- sc_plot2 +
  labs(x = "Licensed Wind Production (MWh)",
       y = "Unlicensed Solar Production (MWh)",
       color = "Hour of Day",
       title = "Licensed Wind vs Unlicensed Solar",
       subtitle = "Renewable production in May 2020, between 10:00-17:00 each day")
sc_plot3
Finally, let’s adjust the axes a little (a thousands separator, plus the angle and position of the x-axis labels) and move the legend to the top. See this Stack Overflow post for the inspiration.
sc_plot3 +
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ".", decimal.mark = ",", scientific = FALSE)) +
  scale_x_continuous(labels = function(x) format(x, big.mark = ".", decimal.mark = ",", scientific = FALSE))
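Since the same labelling function is written twice above, one option is to give it a name. The helper format_thousands below is hypothetical (it is not part of ggplot2), the data frame toy_axes_df is made up, and trim = TRUE is an addition that stops format from padding the labels to a common width.

```r
library(ggplot2)

# Hypothetical helper: "." as thousands separator, "," as decimal mark
format_thousands <- function(x) {
  format(x, big.mark = ".", decimal.mark = ",", scientific = FALSE, trim = TRUE)
}

format_thousands(c(1000, 25000))

# Made-up data to show the helper on both axes
toy_axes_df <- data.frame(x = c(1000, 25000, 50000), y = c(2000, 4000, 8000))

p_axes <- ggplot(toy_axes_df, aes(x = x, y = y)) +
  geom_point() +
  scale_x_continuous(labels = format_thousands) +
  scale_y_continuous(labels = format_thousands)
```

Any function that takes the numeric breaks and returns a character vector can be passed to labels in this way.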
Once you have learned how to manipulate data with dplyr, ggplot2 gives you the opportunity to tell great stories with high-quality visualizations. There are also packages that extend ggplot2, such as ggnetwork, ggiraph, tvthemes, and gghighlight. You can find many more on CRAN, GitHub and elsewhere.