In this part, we do some exploratory data analysis on the data that we previously gathered, cleaned, and processed. That procedure is described on this page.
As usual, we start by reading the RDS file that was created earlier.
df <- readRDS("natural_gas_data.rds")
This RDS file contains all the data we need, already gathered, cleaned, and processed. It can be downloaded here.
Let’s look at the structure of the data frame loaded from natural_gas_data.rds.
str(df)
'data.frame': 1188 obs. of 8 variables:
$ Date : Date, format: "2018-09-01" "2018-09-02" ...
$ Total_Trade_Volume : num 378501 680025 1371846 1018608 1317451 ...
$ week_num : num 35 35 36 36 36 36 36 36 36 37 ...
$ month_num : num 9 9 9 9 9 9 9 9 9 9 ...
$ day_of_week : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 7 1 2 3 4 5 6 7 1 2 ...
$ year_num : num 2018 2018 2018 2018 2018 ...
$ season : chr "Fall" "Fall" "Fall" "Fall" ...
$ Gas_Reference_Price: num 1650 1650 1656 1644 1656 ...
As seen above, there are 8 columns with 1188 rows, and all the values are in an appropriate format and ready for analysis.
The summary function gives us a five-number summary, plus the mean, of each numeric column in our dataset. The median identifies the centre of a data set; the upper and lower quartiles span the middle half of the data; and the highest and lowest observations provide additional information about its actual dispersion. That is why the five-number summary is so often used to get an overview of the spread.
summary(df)
      Date            Total_Trade_Volume    week_num       month_num
 Min.   :2018-09-01   Min.   :  176662    Min.   : 1.00   Min.   : 1.000
 1st Qu.:2019-06-24   1st Qu.: 3641882    1st Qu.:15.00   1st Qu.: 4.000
 Median :2020-04-16   Median : 5310350    Median :29.00   Median : 7.000
 Mean   :2020-04-16   Mean   : 6653710    Mean   :27.79   Mean   : 6.793
 3rd Qu.:2021-02-07   3rd Qu.: 8139018    3rd Qu.:41.00   3rd Qu.:10.000
 Max.   :2021-12-01   Max.   :40030842    Max.   :53.00   Max.   :12.000
 day_of_week   year_num       season             Gas_Reference_Price
 Sun:170       Min.   :2018   Length:1188        Min.   :1205
 Mon:170       1st Qu.:2019   Class :character   1st Qu.:1409
 Tue:170       Median :2020   Mode  :character   Median :1471
 Wed:170       Mean   :2020                      Mean   :1645
 Thu:169       3rd Qu.:2021                      3rd Qu.:1571
 Fri:169       Max.   :2021                      Max.   :6903
 Sat:170
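To make the link between summary() and the five-number summary concrete, here is a minimal sketch on a small made-up vector (the values are purely illustrative and are not taken from the gas data). It also shows that Tukey's hinges from fivenum() can differ slightly from the quartiles that summary() and quantile() report by default.

```r
# Toy vector, invented for illustration only
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

# Minimum, quartiles, median and maximum, computed the way summary() does
five_num <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# Interquartile range: the width of the middle half of the data
iqr <- IQR(x)

# Tukey's hinges; the upper hinge differs from the third quartile here
tukey <- fivenum(x)

print(five_num)
print(iqr)
print(tukey)
```

For this vector the third quartile is 9.25 while the upper hinge is 9.5; both are valid conventions, so it helps to know which one a given function uses.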
The code chunk below gives a general overview of the Gas Reference Price using the ggplot2 package.
library(ggplot2)

# Daily reference price as bars, with the overall mean drawn as a red line
ggplot(df,
       aes(x=Date,
           y=Gas_Reference_Price)) +
  geom_bar(stat = "identity",
           aes(fill=Gas_Reference_Price)) +
  theme_light() +
  geom_hline(yintercept = mean(df$Gas_Reference_Price),
             size=1,
             color="red") +
  scale_fill_gradient(name="Gas Reference Price") +
  labs(title="Daily Gas Reference Prices",
       x="Date",
       y="Gas Reference Price")
You may wonder what the red line indicates. It marks the mean of the variable on the y-axis, here the Gas Reference Price over the whole period.
With the same approach, the code chunk below gives a general overview of the Total Trade Volume using the ggplot2 package.
# Daily trade volume as bars, with the overall mean drawn as a red line
ggplot(df,
       aes(x=Date,
           y=Total_Trade_Volume)) +
  geom_bar(stat = "identity",
           aes(fill=Total_Trade_Volume)) +
  theme_light() +
  geom_hline(yintercept = mean(df$Total_Trade_Volume),
             size=1,
             color="red") +
  scale_fill_gradient(name="Total Trade Volume") +
  labs(title="Daily Total Trade Volume",
       x="Date",
       y="Total Trade Volume")
As seen above, total trade volume is much more volatile than the reference price and varies over a far wider range.
The standard deviations of the two columns point the same way, although the two series are on very different scales, so the raw values are not directly comparable.
sd(df$Total_Trade_Volume)
[1] 4872730
sd(df$Gas_Reference_Price)
[1] 616.1736
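Because the standard deviation grows with the scale of the data, a scale-free measure such as the coefficient of variation (standard deviation divided by mean) makes the comparison fairer. The sketch below is not part of the original analysis; the helper name cv and the toy vectors are assumptions introduced for illustration.

```r
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

# Toy series on very different scales, invented for illustration only
small_scale <- c(10, 20, 30)
large_scale <- c(1000, 2000, 3000)

# The raw standard deviations differ by a factor of 100 ...
print(sd(small_scale))
print(sd(large_scale))

# ... yet the relative spread of the two series is identical
print(cv(small_scale))
print(cv(large_scale))
```

On the real data this would be cv(df$Total_Trade_Volume) and cv(df$Gas_Reference_Price). Using the standard deviations and means reported above, these come out at roughly 0.73 and 0.37, so trade volume remains the more volatile series even after accounting for scale.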
In this part, we use the columns that we created during preprocessing. These columns let us look at the data over different time periods.
We start by visualizing our data frame on a yearly basis.
library(dplyr)      # %>%, group_by(), summarise()
library(gridExtra)  # grid.arrange()

# Average trade volume per year; the red line marks the overall mean
tv_years_average <- df %>%
  group_by(year_num) %>%
  summarise(mean_yearly_tv = mean(Total_Trade_Volume)) %>%
  ggplot(aes(x=year_num, y=mean_yearly_tv, fill=mean_yearly_tv)) +
  geom_bar(stat="identity") +
  geom_hline(yintercept = mean(df$Total_Trade_Volume),
             size=1,
             color="red") +
  labs(
    x="Years",
    y="Average Yearly Total TV",
    title="Average TV over the Years",
    fill="Average Yearly Total TV"
  )

# Total trade volume per year
tv_years_total <- df %>%
  group_by(year_num) %>%
  summarise(total_yearly_tv = sum(Total_Trade_Volume)) %>%
  ggplot(aes(x=year_num, y=total_yearly_tv, fill=total_yearly_tv)) +
  geom_bar(stat="identity") +
  labs(
    x="Years",
    y="Total Yearly TV",
    title="Total TV over the Years",
    fill="Total Yearly TV"
  )

# Average reference price per year; the red line marks the overall mean
gp_years_average <- df %>%
  group_by(year_num) %>%
  summarise(mean_yearly_gp = mean(Gas_Reference_Price)) %>%
  ggplot(aes(x=year_num, y=mean_yearly_gp, fill=mean_yearly_gp)) +
  geom_bar(stat="identity") +
  geom_hline(yintercept = mean(df$Gas_Reference_Price),
             size=1,
             color="red") +
  labs(
    x="Years",
    y="Average Yearly GRP",
    title="Average GRP over the Years",
    fill="Average Yearly GRP"
  )

# Sum of daily reference prices per year
gp_years_total <- df %>%
  group_by(year_num) %>%
  summarise(total_yearly_gp = sum(Gas_Reference_Price)) %>%
  ggplot(aes(x=year_num, y=total_yearly_gp, fill=total_yearly_gp)) +
  geom_bar(stat="identity") +
  labs(
    x="Years",
    y="Total Yearly GRP",
    title="Total GRP over the Years",
    fill="Total Yearly GRP"
  )

# Arrange the four plots in a 2 x 2 grid
grid.arrange(gp_years_average, gp_years_total, tv_years_average, tv_years_total, ncol=2)
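One caveat when reading the yearly totals: the data starts on 2018-09-01, so 2018 contributes far fewer days than the other years, and a sum over fewer days is naturally smaller. A quick check is to put the number of observations next to the sum and the mean for each group. The sketch below uses base R's aggregate() on a small made-up data frame; the column names mirror those in df, but the values are invented.

```r
# Toy data frame, invented for illustration; 2018 has fewer rows than 2019
toy <- data.frame(
  year_num = c(2018, 2018, 2019, 2019, 2019, 2019),
  Total_Trade_Volume = c(100, 300, 200, 200, 200, 200)
)

# Number of observations, sum and mean per year
yearly <- aggregate(
  Total_Trade_Volume ~ year_num,
  data = toy,
  FUN = function(v) c(n = length(v), total = sum(v), average = mean(v))
)

# aggregate() returns the three statistics as one matrix column; flatten it
yearly <- do.call(data.frame, yearly)
names(yearly) <- c("year_num", "n", "total", "average")

print(yearly)
```

In this toy example both years have the same average, but the total for 2019 is twice as large simply because it has twice as many rows.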
Then we continue by visualizing the data on a monthly basis.
# Average trade volume per calendar month; the red line marks the overall mean
tv_months_average <- df %>%
  group_by(month_num) %>%
  summarise(mean_monthly_tv = mean(Total_Trade_Volume)) %>%
  ggplot(aes(x=month_num, y=mean_monthly_tv, fill=mean_monthly_tv)) +
  geom_bar(stat="identity") +
  geom_hline(yintercept = mean(df$Total_Trade_Volume),
             size=1,
             color="red") +
  labs(
    x="Months",
    y="Average Monthly Total TV",
    title="Average TV over the Months",
    fill="Average Monthly Total TV"
  )

# Total trade volume per calendar month
tv_months_total <- df %>%
  group_by(month_num) %>%
  summarise(total_monthly_tv = sum(Total_Trade_Volume)) %>%
  ggplot(aes(x=month_num, y=total_monthly_tv, fill=total_monthly_tv)) +
  geom_bar(stat="identity") +
  labs(
    x="Months",
    y="Total Monthly TV",
    title="Total TV over the Months",
    fill="Total Monthly TV"
  )

# Average reference price per calendar month; the red line marks the overall mean
gp_months_average <- df %>%
  group_by(month_num) %>%
  summarise(mean_monthly_gp = mean(Gas_Reference_Price)) %>%
  ggplot(aes(x=month_num, y=mean_monthly_gp, fill=mean_monthly_gp)) +
  geom_bar(stat="identity") +
  geom_hline(yintercept = mean(df$Gas_Reference_Price),
             size=1,
             color="red") +
  labs(
    x="Months",
    y="Average Monthly GRP",
    title="Average GRP over the Months",
    fill="Average Monthly GRP"
  )

# Sum of daily reference prices per calendar month
gp_months_total <- df %>%
  group_by(month_num) %>%
  summarise(total_monthly_gp = sum(Gas_Reference_Price)) %>%
  ggplot(aes(x=month_num, y=total_monthly_gp, fill=total_monthly_gp)) +
  geom_bar(stat="identity") +
  labs(
    x="Months",
    y="Total Monthly GRP",
    title="Total GRP over the Months",
    fill="Total Monthly GRP"
  )

# Arrange the four plots in a 2 x 2 grid
grid.arrange(gp_months_average, gp_months_total, tv_months_average, tv_months_total, ncol=2)
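Because month_num is numeric, ggplot2 treats the x-axis as continuous and may place ticks at values that are not months. One possible fix, shown here as a sketch, is to turn the month number into a factor with month names as labels before plotting. The constant month.abb ships with base R; the name month_label and the toy values are assumptions introduced for this example.

```r
# Month numbers as they might appear in month_num (toy values for illustration)
month_num <- c(9, 10, 11, 12, 1, 2)

# Map 1..12 to "Jan".."Dec" while keeping calendar order in the factor levels
month_label <- factor(month_num, levels = 1:12, labels = month.abb)

print(month_label)
print(levels(month_label))
```

In the pipelines above, the same conversion could be applied to df before group_by(), with the new column mapped to x in place of month_num.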
We visualize the data on a weekly basis as well.
# Average trade volume per week of the year; the red line marks the overall mean
tv_weeks_average <- df %>%
  group_by(week_num) %>%
  summarise(mean_weekly_tv = mean(Total_Trade_Volume)) %>%
  ggplot(aes(x=week_num, y=mean_weekly_tv, fill=mean_weekly_tv)) +
  geom_bar(stat="identity") +
  geom_hline(yintercept = mean(df$Total_Trade_Volume),
             size=1,
             color="red") +
  labs(
    x="Weeks",
    y="Average Weekly Total TV",
    title="Average TV over the Weeks",
    fill="Average Weekly Total TV"
  )

# Total trade volume per week of the year
tv_weeks_total <- df %>%
  group_by(week_num) %>%
  summarise(total_weekly_tv = sum(Total_Trade_Volume)) %>%
  ggplot(aes(x=week_num, y=total_weekly_tv, fill=total_weekly_tv)) +
  geom_bar(stat="identity") +
  labs(
    x="Weeks",
    y="Total Weekly TV",
    title="Total TV over the Weeks",
    fill="Total Weekly TV"
  )

# Average reference price per week of the year; the red line marks the overall mean
gp_weeks_average <- df %>%
  group_by(week_num) %>%
  summarise(mean_weekly_gp = mean(Gas_Reference_Price)) %>%
  ggplot(aes(x=week_num, y=mean_weekly_gp, fill=mean_weekly_gp)) +
  geom_bar(stat="identity") +
  geom_hline(yintercept = mean(df$Gas_Reference_Price),
             size=1,
             color="red") +
  labs(
    x="Weeks",
    y="Average Weekly GRP",
    title="Average GRP over the Weeks",
    fill="Average Weekly GRP"
  )

# Sum of daily reference prices per week of the year
gp_weeks_total <- df %>%
  group_by(week_num) %>%
  summarise(total_weekly_gp = sum(Gas_Reference_Price)) %>%
  ggplot(aes(x=week_num, y=total_weekly_gp, fill=total_weekly_gp)) +
  geom_bar(stat="identity") +
  labs(
    x="Weeks",
    y="Total Weekly GRP",
    title="Total GRP over the Weeks",
    fill="Total Weekly GRP"
  )

# Arrange the four plots in a 2 x 2 grid
grid.arrange(gp_weeks_average, gp_weeks_total, tv_weeks_average, tv_weeks_total, ncol=2)
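The yearly, monthly and weekly chunks repeat the same group, summarise and plot pattern with only the column names changed. As a sketch of how that repetition could be factored out, the hypothetical helpers summarise_by_period() and plot_by_period() below take the grouping column, the value column and the summary function as arguments. Neither helper is part of the original analysis, the toy data frame is invented, and geom_col() is used as a shorthand for geom_bar(stat = "identity").

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: one summary value per period
summarise_by_period <- function(data, group_col, value_col, fun = mean) {
  data %>%
    group_by(period = .data[[group_col]]) %>%
    summarise(value = fun(.data[[value_col]]), .groups = "drop")
}

# Hypothetical helper: bar chart of the summarised values
plot_by_period <- function(data, group_col, value_col, fun = mean, title = NULL) {
  summarise_by_period(data, group_col, value_col, fun) %>%
    ggplot(aes(x = period, y = value, fill = value)) +
    geom_col() +
    labs(x = group_col, y = value_col, fill = value_col, title = title)
}

# Toy data, invented for illustration only
toy <- data.frame(
  year_num = c(2018, 2018, 2019, 2019),
  Total_Trade_Volume = c(100, 300, 200, 400)
)

yearly_mean <- summarise_by_period(toy, "year_num", "Total_Trade_Volume", mean)
yearly_plot <- plot_by_period(toy, "year_num", "Total_Trade_Volume", sum,
                              title = "Total TV over the Years")

print(yearly_mean)
```

With helpers like these, each of the plots above would reduce to a single call, for example plot_by_period(df, "week_num", "Gas_Reference_Price", mean). The red mean line would still need to be added separately with geom_hline().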
Next, we group by day of the week. Since prices did not vary with the day of the week, we left them out of this visualization and show only the trade volume.
# Average trade volume per day of the week
tv_days_average <- df %>%
  group_by(day_of_week) %>%
  summarise(mean_daily_tv = mean(Total_Trade_Volume)) %>%
  ggplot(aes(x=day_of_week, y=mean_daily_tv, fill=mean_daily_tv)) +
  geom_bar(stat="identity") +
  labs(
    x="Days of the Week",
    y="Average Daily Total TV",
    title="Average TV over the Weekdays",
    fill="Average Daily Total TV"
  )

# Total trade volume per day of the week
tv_days_total <- df %>%
  group_by(day_of_week) %>%
  summarise(total_daily_tv = sum(Total_Trade_Volume)) %>%
  ggplot(aes(x=day_of_week, y=total_daily_tv, fill=total_daily_tv)) +
  geom_bar(stat="identity") +
  labs(
    x="Days of the Week",
    y="Daily Total TV",
    title="Total TV over the Weekdays",
    fill="Daily Total TV"
  )

# Stack the two plots vertically
grid.arrange(tv_days_average, tv_days_total)
Lastly, we visualize our data based on the season of the year.
# Average trade volume per season; the red line marks the overall mean
tv_seasons_average <- df %>%
  group_by(season) %>%
  summarise(mean_seasonal_tv = mean(Total_Trade_Volume)) %>%
  ggplot(aes(x=season, y=mean_seasonal_tv, fill=mean_seasonal_tv)) +
  geom_bar(stat="identity") +
  geom_hline(yintercept = mean(df$Total_Trade_Volume),
             size=1,
             color="red") +
  labs(
    x="Seasons",
    y="Average Seasonal TV",
    title="Average TV over the Seasons",
    fill="Average Seasonal TV"
  )

# Total trade volume per season
tv_seasons_total <- df %>%
  group_by(season) %>%
  summarise(total_seasonal_tv = sum(Total_Trade_Volume)) %>%
  ggplot(aes(x=season, y=total_seasonal_tv, fill=total_seasonal_tv)) +
  geom_bar(stat="identity") +
  labs(
    x="Seasons",
    y="Total Seasonal TV",
    title="Total TV over the Seasons",
    fill="Total Seasonal TV"
  )

# Average reference price per season; the red line marks the overall mean
gp_seasons_average <- df %>%
  group_by(season) %>%
  summarise(mean_seasonal_gp = mean(Gas_Reference_Price)) %>%
  ggplot(aes(x=season, y=mean_seasonal_gp, fill=mean_seasonal_gp)) +
  geom_bar(stat="identity") +
  geom_hline(yintercept = mean(df$Gas_Reference_Price),
             size=1,
             color="red") +
  labs(
    x="Seasons",
    y="Average Seasonal GRP",
    title="Average GRP over the Seasons",
    fill="Average Seasonal GRP"
  )

# Sum of daily reference prices per season
gp_seasons_total <- df %>%
  group_by(season) %>%
  summarise(total_seasonal_gp = sum(Gas_Reference_Price)) %>%
  ggplot(aes(x=season, y=total_seasonal_gp, fill=total_seasonal_gp)) +
  geom_bar(stat="identity") +
  labs(
    x="Seasons",
    y="Total Seasonal GRP",
    title="Total GRP over the Seasons",
    fill="Total Seasonal GRP"
  )

# Arrange the four plots in a 2 x 2 grid
grid.arrange(gp_seasons_average, gp_seasons_total, tv_seasons_average, tv_seasons_total, ncol=2)
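Since season is stored as a character column, ggplot2 orders the bars alphabetically rather than chronologically. One way to control the order, sketched below, is to convert the column to a factor with explicit levels. Only "Fall" is visible in the str output above; the other three labels ("Winter", "Spring", "Summer") are assumed spellings, and the names season_order and season_fct are introduced for this example.

```r
# Toy season values; labels other than "Fall" are assumed spellings
season <- c("Fall", "Winter", "Spring", "Summer", "Fall")

# Alphabetical order, which is what a character column produces on the axis
alphabetical <- sort(unique(season))

# Explicit chronological order via factor levels
season_order <- c("Winter", "Spring", "Summer", "Fall")
season_fct <- factor(season, levels = season_order)

print(alphabetical)
print(levels(season_fct))
```

Applied to df before grouping, the same conversion would make the seasonal bars appear in the order given by season_order.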
Thanks for reading…