1  Sun Forest: Assignment 1

Published

November 21, 2022

In this assignment, we analyse a dataset taken from the Global Dietary Database website. It contains vitamin B12 intake estimates for former Soviet Union countries. Each row is the estimated B12 intake for one country in one of the reported years between 1990 and 2020 (every five years, plus 2018), broken down by gender, residence (urban or rural), age group and education level.

1.1 | IMPORTING THE DATASET AND THE PACKAGES

Code
gdd <- read.csv("https://raw.githubusercontent.com/berkorbay/datasets/master/gdd/gdd_b12_levels.csv") 
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(ggplot2)

1.2 | PREPROCESSING

Code
#1.1) Gender
gdd_g <- gdd %>%
  mutate(gender = case_when(female == 1 ~ 'female', female == 0 ~ 'male', female == 999 ~ 'all genders'))
gdd_g %>%
  head(10)
   iso3 age female urban edu year   median lowerci_95 upperci_95 gender
1   ALB 999      0   999 999 1990 3.634332   2.303403   6.054550   male
2   ALB 999      0   999 999 1995 3.737716   2.433315   5.981505   male
3   ALB 999      0   999 999 2000 3.824925   2.513833   6.216503   male
4   ALB 999      0   999 999 2005 3.938703   2.665851   6.267153   male
5   ALB 999      0   999 999 2010 4.067680   2.738054   6.253339   male
6   ALB 999      0   999 999 2015 4.092445   2.777237   6.344306   male
7   ALB 999      0   999 999 2018 4.070591   2.761252   6.266195   male
8   ALB 999      0   999 999 2020 4.087936   2.801015   6.307502   male
9   ALB 999      1   999 999 1990 3.292622   2.004336   5.514846 female
10  ALB 999      1   999 999 1995 3.399279   2.134920   5.449213 female
Code
#1.2) Urban / Rural
gdd_r <- gdd_g %>%
  mutate(residence = case_when(urban == 1 ~ 'urban', urban == 0 ~ 'rural', urban == 999 ~ 'all residences'))
gdd_r %>%
  head(10)
   iso3 age female urban edu year   median lowerci_95 upperci_95 gender
1   ALB 999      0   999 999 1990 3.634332   2.303403   6.054550   male
2   ALB 999      0   999 999 1995 3.737716   2.433315   5.981505   male
3   ALB 999      0   999 999 2000 3.824925   2.513833   6.216503   male
4   ALB 999      0   999 999 2005 3.938703   2.665851   6.267153   male
5   ALB 999      0   999 999 2010 4.067680   2.738054   6.253339   male
6   ALB 999      0   999 999 2015 4.092445   2.777237   6.344306   male
7   ALB 999      0   999 999 2018 4.070591   2.761252   6.266195   male
8   ALB 999      0   999 999 2020 4.087936   2.801015   6.307502   male
9   ALB 999      1   999 999 1990 3.292622   2.004336   5.514846 female
10  ALB 999      1   999 999 1995 3.399279   2.134920   5.449213 female
        residence
1  all residences
2  all residences
3  all residences
4  all residences
5  all residences
6  all residences
7  all residences
8  all residences
9  all residences
10 all residences
Code
#1.3) Education Level
gdd_f <- gdd_r %>%
  mutate(edu_level = case_when(edu == 1 ~ 'low', edu == 2 ~ 'medium', edu == 3 ~ 'high',edu == 999 ~ 'all levels'))
gdd_f %>%
  head(10)
   iso3 age female urban edu year   median lowerci_95 upperci_95 gender
1   ALB 999      0   999 999 1990 3.634332   2.303403   6.054550   male
2   ALB 999      0   999 999 1995 3.737716   2.433315   5.981505   male
3   ALB 999      0   999 999 2000 3.824925   2.513833   6.216503   male
4   ALB 999      0   999 999 2005 3.938703   2.665851   6.267153   male
5   ALB 999      0   999 999 2010 4.067680   2.738054   6.253339   male
6   ALB 999      0   999 999 2015 4.092445   2.777237   6.344306   male
7   ALB 999      0   999 999 2018 4.070591   2.761252   6.266195   male
8   ALB 999      0   999 999 2020 4.087936   2.801015   6.307502   male
9   ALB 999      1   999 999 1990 3.292622   2.004336   5.514846 female
10  ALB 999      1   999 999 1995 3.399279   2.134920   5.449213 female
        residence  edu_level
1  all residences all levels
2  all residences all levels
3  all residences all levels
4  all residences all levels
5  all residences all levels
6  all residences all levels
7  all residences all levels
8  all residences all levels
9  all residences all levels
10 all residences all levels
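
The three recoding steps above can also be written as a single mutate() call. The following is a minimal sketch on a small made-up data frame: the toy rows and the name gdd_labelled are assumptions for illustration only, and only the 0/1/999 coding is taken from the output above.

```r
library(dplyr)

# Toy rows using the same coding as the dataset (999 = not broken down by that variable)
toy <- data.frame(
  female = c(0, 1, 999),
  urban  = c(1, 0, 999),
  edu    = c(1, 3, 999)
)

# One mutate() call that adds all three label columns at once
gdd_labelled <- toy %>%
  mutate(
    gender    = case_when(female == 1 ~ "female", female == 0 ~ "male", female == 999 ~ "all genders"),
    residence = case_when(urban == 1 ~ "urban", urban == 0 ~ "rural", urban == 999 ~ "all residences"),
    edu_level = case_when(edu == 1 ~ "low", edu == 2 ~ "medium", edu == 3 ~ "high", edu == 999 ~ "all levels")
  )

gdd_labelled
```

Any code that case_when() does not match becomes NA, so an unexpected value in the raw columns would show up as a missing label rather than being silently mislabelled.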

1.3 | DATA ANALYSIS

1.3.1 | Mean and Median Change With Time for Each Country for General Population:

1.3.1.1 | Prepare The Data:

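To compare B12 intake across countries for the general population, we keep only the rows where every stratification variable is 999 (the code for the whole population). For each country and year we then keep the median and compute mean_values as the midpoint of the 95% interval, (lowerci_95 + upperci_95) / 2: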

Code
gdd_1 <- gdd %>%
  filter(age== 999, female == 999, urban == 999, edu == 999) %>%
  group_by(iso3, year) %>%
  summarise(mean_values = (lowerci_95+upperci_95)/2, median) %>%
  arrange(desc(median))
`summarise()` has grouped output by 'iso3'. You can override using the
`.groups` argument.
Code
gdd_1
# A tibble: 232 × 4
# Groups:   iso3 [29]
   iso3   year mean_values median
   <chr> <int>       <dbl>  <dbl>
 1 EST    2010        7.16   7.10
 2 EST    1990        7.15   7.09
 3 EST    2018        7.14   7.08
 4 EST    2005        7.13   7.08
 5 EST    2020        7.13   7.07
 6 EST    2015        7.12   7.07
 7 EST    2000        7.10   7.05
 8 EST    1995        7.07   7.00
 9 BLR    2020        4.50   4.10
10 BLR    2018        4.51   4.07
# ℹ 222 more rows
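
Note that mean_values is the midpoint of the interval bounds, not an arithmetic mean of observations. The short sketch below, using the values from the first row of the head() output in the preprocessing section (ALB, male, 1990), illustrates why this midpoint can sit above the median when the interval is not symmetric. The variable names are chosen here for illustration.

```r
# Values copied from the first row of the head() output (ALB, male, 1990)
median_est <- 3.634332
lower_ci   <- 2.303403
upper_ci   <- 6.054550

# Midpoint of the 95% interval, i.e. the quantity the analysis calls mean_values
midpoint <- (lower_ci + upper_ci) / 2
midpoint

# The upper bound is further from the median than the lower bound,
# so the midpoint is larger than the median
midpoint - median_est
```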

1.3.1.2 | Median of Countries Across Years:

Code
gdd_2 <- ggplot(gdd_1, aes(x = year, y = median, color = iso3)) + geom_line()
gdd_2

1.3.1.3 | Result for Country Medians Across Years:

None of the countries shows an extreme change in median B12 intake over the years; TJK changes the most. For the general population, BGR and, to a lesser extent, TJK are outliers with lower median intake, while EST is an outlier with higher intake. It is difficult to identify a clear trend for any single country.
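
One way to back up a statement such as "changes the most" is to compute the spread of the median per country. The sketch below does this on made-up values for two hypothetical countries (AAA and BBB are not in the dataset); with the real data, the same pipeline could be applied to gdd_1.

```r
library(dplyr)

# Made-up medians for two hypothetical countries, for illustration only
toy <- data.frame(
  iso3   = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  year   = c(1990, 2005, 2020, 1990, 2005, 2020),
  median = c(4.0, 4.1, 4.2, 3.0, 3.6, 3.3)
)

# Spread = largest minus smallest median over the years, per country
change_by_country <- toy %>%
  group_by(iso3) %>%
  summarise(
    min_median = min(median),
    max_median = max(median),
    spread     = max_median - min_median
  ) %>%
  arrange(desc(spread))

change_by_country
```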

1.3.1.4 | Mean of Countries Across Years:

Code
gdd_3 <- ggplot(gdd_1, aes(x = year, y = mean_values, color = iso3)) + geom_line()
gdd_3

1.3.1.5 | Result for Country Means Across Years:

For the countries close to the average mean value, there is a strong upward trend from 2000 to 2010, followed by a decline towards 2015 and roughly constant values between 2015 and 2020.

In terms of the mean, there are outliers below 4 µg, but they are hard to identify because the line colours are similar. Consistent with the median graph, EST is the upper outlier with higher intake.
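
When many lines share similar colours, one option is to draw most countries in grey and colour only the ones of interest. This is a minimal sketch on made-up values; the country codes, the highlight vector and the colour choices are all assumptions for illustration.

```r
library(ggplot2)

# Made-up yearly values for three hypothetical countries
toy <- data.frame(
  iso3        = rep(c("AAA", "BBB", "CCC"), each = 3),
  year        = rep(c(1990, 2005, 2020), times = 3),
  mean_values = c(4.3, 4.4, 4.4, 4.2, 4.5, 4.3, 7.1, 7.1, 7.1)
)

# Countries to emphasise; every other line is drawn in grey
highlight <- "CCC"
toy$highlighted <- toy$iso3 %in% highlight

p <- ggplot(toy, aes(x = year, y = mean_values, group = iso3, color = highlighted)) +
  geom_line() +
  scale_color_manual(values = c("FALSE" = "grey70", "TRUE" = "red"))
p
```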

1.3.2 | Comparison of Average B12 Intake of General Population For All The Years In Terms of Country

To show the mean B12 intake of the general population for each country in a bar chart, we first summarise the table by averaging over all years instead of keeping one value per year:

Code
gdd_4 <- gdd %>%
  filter(age== 999, female == 999, urban == 999, edu == 999) %>%
  group_by(iso3) %>%
  summarise(mean_values = mean(lowerci_95+upperci_95)/2) %>%
  arrange(desc(mean_values))
gdd_4
# A tibble: 29 × 2
   iso3  mean_values
   <chr>       <dbl>
 1 EST          7.12
 2 CZE          4.45
 3 BLR          4.43
 4 RUS          4.39
 5 MNG          4.37
 6 LTU          4.37
 7 LVA          4.36
 8 HUN          4.33
 9 SVN          4.33
10 MNE          4.32
# ℹ 19 more rows
Code
#bar chart to compare country average intake:
gdd_5 <- ggplot(gdd_4, aes(x = iso3, y = mean_values, fill = iso3)) + geom_col()
gdd_5

1.3.2.1 | Result:

This chart shows more clearly that three countries, BGR, ROU and POL, have lower average intake and that EST has higher intake.
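
Sorting the bars by value can make the low and high ends easier to read than the default alphabetical order. The sketch below uses only the first five rows printed in the gdd_4 output; the object names are chosen here for illustration.

```r
library(ggplot2)

# Values copied from the first rows of the gdd_4 output above
top_countries <- data.frame(
  iso3        = c("EST", "CZE", "BLR", "RUS", "MNG"),
  mean_values = c(7.12, 4.45, 4.43, 4.39, 4.37)
)

# reorder() sorts the x axis by mean_values instead of alphabetically
p_sorted <- ggplot(top_countries, aes(x = reorder(iso3, mean_values), y = mean_values)) +
  geom_col() +
  labs(x = "iso3")
p_sorted
```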

1.3.3 | Try To Understand if Any Specific Age Group Affects Lower Intake Countries’ Values:

1.3.3.1 | For Lower Intake Countries:

We can analyse the countries with outlier values in more detail. To see the effect of age group on B12 intake, we select the lower-intake countries BGR, ROU and POL, keep all residences, all education levels and all genders combined, vary only the age group, and visualise the result:

Code
gdd_6 <- gdd %>%
  filter(female == 999, urban == 999, edu == 999, age < 999, iso3 %in% c("BGR", "ROU", "POL")) %>%
  group_by(age) %>%
  #average the interval midpoints over all years and the selected countries for each age group:
  summarise(mean_values = mean(lowerci_95+upperci_95)/2) %>%
  arrange(desc(mean_values))
gdd_6
# A tibble: 22 × 2
     age mean_values
   <dbl>       <dbl>
 1  22.5        3.61
 2  27.5        3.60
 3  32.5        3.57
 4  17.5        3.57
 5  37.5        3.53
 6  42.5        3.48
 7  12.5        3.43
 8  47.5        3.42
 9  52.5        3.37
10  57.5        3.31
# ℹ 12 more rows
Code
#visualise to see if there is any major change by different age group:
gdd_7 <- ggplot(gdd_6, aes(x = age, y = mean_values, fill = age)) + geom_col() + geom_text(aes(label = round(mean_values, 1))) 
gdd_7

1.3.3.2 | Result for Lower Intake Countries:

We can see that B12 intake for children up to 10 years old is very low (ranging from 1.5 µg to 2.5 µg), whereas for the age groups with higher intake the average is roughly above 3 µg. There does not seem to be an age group with extremely low intake pulling down the average of these countries.

1.3.3.3 | All Countries:

To see whether the age profile of the lower-intake countries is similar to that of the other countries, we build the same table for all countries and draw another bar chart:

Code
gdd_8 <- gdd %>%
  filter(female == 999, urban == 999, edu == 999, age < 999) %>%
  group_by(age) %>%
  #average the interval midpoints over all years and countries for each age group:
  summarise(mean_values = mean(lowerci_95+upperci_95)/2) %>%
  arrange(desc(mean_values))
gdd_8
# A tibble: 22 × 2
     age mean_values
   <dbl>       <dbl>
 1  22.5        4.75
 2  27.5        4.74
 3  17.5        4.70
 4  32.5        4.70
 5  37.5        4.64
 6  42.5        4.58
 7  47.5        4.51
 8  52.5        4.44
 9  12.5        4.44
10  57.5        4.37
# ℹ 12 more rows
Code
gdd_9 <- ggplot(gdd_8, aes(x = age, y = mean_values, fill=age)) + geom_col() + geom_text(aes(label = round(mean_values, 1)))
gdd_9

1.3.3.4 | Result for All Countries and Comparison:

The pattern of children's intake relative to adolescent and adult intake in the lower-average countries is in line with the pattern for all countries. A specific age group with low intake dragging down the average is therefore not the explanation here.
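
The comparison can also be expressed as a ratio per age group. The sketch below uses only the age groups that appear in both printed tables (the rows for children are not shown in the truncated output, so they are not covered here); the object names are chosen for illustration.

```r
library(dplyr)

# Values copied from the printed rows of gdd_6 (BGR, ROU, POL) and gdd_8 (all countries)
low_intake <- data.frame(
  age      = c(17.5, 22.5, 27.5, 32.5, 37.5),
  mean_low = c(3.57, 3.61, 3.60, 3.57, 3.53)
)
all_countries <- data.frame(
  age      = c(17.5, 22.5, 27.5, 32.5, 37.5),
  mean_all = c(4.70, 4.75, 4.74, 4.70, 4.64)
)

# Ratio of the lower-intake average to the all-country average per age group
age_ratio <- inner_join(low_intake, all_countries, by = "age") %>%
  mutate(ratio = mean_low / mean_all)
age_ratio
```

For these age groups the ratio stays close to 0.76, which is consistent with the two age profiles having a similar shape at different levels.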

1.3.4 | Comparison of Lower-Intake / Higher-Intake / All Countries In Terms of Gender and Residence:

1.3.4.1 | All Countries:

Code
gdd_10 <- gdd_f %>%
  filter(age == 999, gender %in% c("female", "male"), residence %in% c("urban", "rural"), edu == 999) %>%
  group_by(gender, residence) %>%
  #average the interval midpoints over all years and countries within each group:
  summarise(mean_values = mean(lowerci_95+upperci_95)/2) %>%
  arrange(desc(mean_values))
`summarise()` has grouped output by 'gender'. You can override using the
`.groups` argument.
Code
gdd_10
# A tibble: 4 × 3
# Groups:   gender [2]
  gender residence mean_values
  <chr>  <chr>           <dbl>
1 male   urban            4.75
2 female urban            4.27
3 male   rural            4.17
4 female rural            3.75
Code
gdd_11 <- ggplot(gdd_10, aes(x=residence, y=mean_values, fill=gender)) + geom_bar(stat="identity") + expand_limits(x=0) + expand_limits(y=0) + geom_text(aes(label = round(mean_values, 1)), position = position_stack(vjust = 0.5)) 
gdd_11

1.3.4.2 | Lower Intake Countries (BGR, ROU, POL):

Code
gdd_12 <- gdd_f %>%
  filter(age == 999, gender %in% c("female", "male"), residence %in% c("urban", "rural"), edu == 999, iso3 %in% c("BGR", "ROU", "POL")) %>%
  group_by(gender, residence) %>%
  #average the interval midpoints over all years and the selected countries within each group:
  summarise(mean_values = mean(lowerci_95+upperci_95)/2) %>%
  arrange(desc(mean_values))
`summarise()` has grouped output by 'gender'. You can override using the
`.groups` argument.
Code
gdd_12
# A tibble: 4 × 3
# Groups:   gender [2]
  gender residence mean_values
  <chr>  <chr>           <dbl>
1 male   urban            3.84
2 male   rural            3.37
3 female urban            3.21
4 female rural            2.82
Code
gdd_13 <- ggplot(gdd_12, aes(x=residence, y=mean_values, fill=gender)) + geom_bar(stat="identity") + expand_limits(x=0) + expand_limits(y=0) + geom_text(aes(label = round(mean_values, 1)), position = position_stack(vjust = 0.5)) 
gdd_13

1.3.4.3 | Higher Intake Country (EST):

Code
gdd_14 <- gdd_f %>%
  filter(age == 999, gender %in% c("female", "male"), residence %in% c("urban", "rural"), edu == 999, iso3 == "EST") %>%
  group_by(gender, residence) %>%
  #average the interval midpoints over all years within each group:
  summarise(mean_values = mean(lowerci_95+upperci_95)/2) %>%
  arrange(desc(mean_values))
`summarise()` has grouped output by 'gender'. You can override using the
`.groups` argument.
Code
gdd_14
# A tibble: 4 × 3
# Groups:   gender [2]
  gender residence mean_values
  <chr>  <chr>           <dbl>
1 male   urban            7.69
2 female urban            7.20
3 male   rural            6.73
4 female rural            6.30
Code
gdd_15 <- ggplot(gdd_14, aes(x=residence, y=mean_values, fill=gender)) + geom_bar(stat="identity") + expand_limits(x=0) + expand_limits(y=0) + geom_text(aes(label = round(mean_values, 1)), position = position_stack(vjust = 0.5)) 
gdd_15

1.3.4.4 | Result:

In relative terms, the gap between males and females is smallest for EST, the high-intake country (urban: 7.69 vs 7.20, about 6%), and largest for the lower-intake countries (urban: 3.84 vs 3.21, about 16%), so female intake is noticeably lower than male intake in the low-intake countries. The gap between rural and urban areas is similar in relative terms across the three groups (about 12% for males), with urban areas having higher intake.
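
The relative gaps can be checked directly from the three printed tables (gdd_10, gdd_12 and gdd_14). The sketch below copies those values into one small data frame; the object name gaps and the group labels are chosen here for illustration.

```r
library(dplyr)

# Values copied from the gdd_10, gdd_12 and gdd_14 outputs above
gaps <- data.frame(
  group        = c("all countries", "BGR/ROU/POL", "EST"),
  male_urban   = c(4.75, 3.84, 7.69),
  female_urban = c(4.27, 3.21, 7.20),
  male_rural   = c(4.17, 3.37, 6.73),
  female_rural = c(3.75, 2.82, 6.30)
) %>%
  mutate(
    # Male-female gap as a percentage of the male value
    urban_gap_pct = 100 * (male_urban - female_urban) / male_urban,
    rural_gap_pct = 100 * (male_rural - female_rural) / male_rural
  )

gaps %>% select(group, urban_gap_pct, rural_gap_pct)
```

Expressing the gap relative to the male value is one possible choice; absolute differences would rank the groups differently because EST has much higher intake overall.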

1.3.5 | Comparison of Lower-Intake / Higher-Intake / All Countries In Terms of Education Level and Gender:

1.3.5.1 | All Countries:

Code
gdd_16 <- gdd_f %>%
  filter(age == 999, urban == 999, edu_level %in% c("low", "medium", "high"), gender %in% c("male", "female")) %>%
  group_by(edu_level, gender) %>%
  #average the interval midpoints over all years and countries within each group:
  summarise(mean_values = mean(lowerci_95+upperci_95)/2) %>%
  arrange(desc(mean_values))
`summarise()` has grouped output by 'edu_level'. You can override using the
`.groups` argument.
Code
gdd_16
# A tibble: 6 × 3
# Groups:   edu_level [3]
  edu_level gender mean_values
  <chr>     <chr>        <dbl>
1 medium    male          4.58
2 high      male          4.49
3 low       male          4.27
4 medium    female        4.12
5 high      female        4.04
6 low       female        3.84
Code
gdd_17 <- ggplot(gdd_16, aes(x=gender, y=mean_values, fill=edu_level)) + geom_bar(stat="identity") + expand_limits(x=0) + expand_limits(y=0) + geom_text(aes(label = round(mean_values, 1)), position = position_stack(vjust = 0.5))
gdd_17

1.3.5.2 | Lower Intake Countries (BGR, ROU, POL):

Code
gdd_18 <- gdd_f %>%
  filter(age == 999, urban == 999, edu_level %in% c("low", "medium", "high"), gender %in% c("male", "female"), iso3 %in% c("BGR", "ROU", "POL")) %>%
  group_by(edu_level, gender) %>%
  #average the interval midpoints over all years and the selected countries within each group:
  summarise(mean_values = mean(lowerci_95+upperci_95)/2) %>%
  arrange(desc(mean_values))
`summarise()` has grouped output by 'edu_level'. You can override using the
`.groups` argument.
Code
gdd_18
# A tibble: 6 × 3
# Groups:   edu_level [3]
  edu_level gender mean_values
  <chr>     <chr>        <dbl>
1 medium    male          3.73
2 high      male          3.64
3 low       male          3.48
4 medium    female        3.12
5 high      female        3.05
6 low       female        2.90
Code
gdd_19 <- ggplot(gdd_18, aes(x=gender, y=mean_values, fill=edu_level)) + geom_bar(stat="identity") + expand_limits(x=0) + expand_limits(y=0) + geom_text(aes(label = round(mean_values, 1)), position = position_stack(vjust = 0.5))
gdd_19

1.3.5.3 | Higher Intake Country (EST):

Code
gdd_20 <- gdd_f %>%
  filter(age == 999, urban == 999, edu_level %in% c("low", "medium", "high"), gender %in% c("male", "female"), iso3 == "EST") %>%
  group_by(edu_level, gender) %>%
  #average the interval midpoints over all years within each group:
  summarise(mean_values = mean(lowerci_95+upperci_95)/2) %>%
  arrange(desc(mean_values))
`summarise()` has grouped output by 'edu_level'. You can override using the
`.groups` argument.
Code
gdd_20
# A tibble: 6 × 3
# Groups:   edu_level [3]
  edu_level gender mean_values
  <chr>     <chr>        <dbl>
1 medium    male          7.50
2 high      male          7.33
3 medium    female        7.02
4 low       male          6.98
5 high      female        6.86
6 low       female        6.53
Code
gdd_21 <- ggplot(gdd_20, aes(x=gender, y=mean_values, fill=edu_level)) + geom_bar( stat="identity") + expand_limits(x=0) + expand_limits(y=0) + geom_text(aes(label = round(mean_values, 1)), position = position_stack(vjust = 0.5))
gdd_21

1.3.5.4 | Result:

Regardless of the country group, the difference in B12 intake between males and females does not vary noticeably with education level: within each group, the male-female gap differs by only about 0.03 between education levels.
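
This can be verified from the printed tables gdd_16, gdd_18 and gdd_20. The sketch below copies their values, computes the male-female gap per education level and then the spread of that gap within each country group; the object names are chosen here for illustration.

```r
library(dplyr)

# Values copied from the gdd_16, gdd_18 and gdd_20 outputs above
edu_gap <- data.frame(
  group     = rep(c("all countries", "BGR/ROU/POL", "EST"), each = 3),
  edu_level = rep(c("low", "medium", "high"), times = 3),
  male      = c(4.27, 4.58, 4.49, 3.48, 3.73, 3.64, 6.98, 7.50, 7.33),
  female    = c(3.84, 4.12, 4.04, 2.90, 3.12, 3.05, 6.53, 7.02, 6.86)
) %>%
  mutate(gap = male - female)

# How much the male-female gap varies across education levels in each group
gap_spread <- edu_gap %>%
  group_by(group) %>%
  summarise(min_gap = min(gap), max_gap = max(gap), spread = max_gap - min_gap)

gap_spread
```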