## Installing dplyr
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
In this in-class assignment, we run three simple analyses using the dplyr package on the datasets we proposed in Assignment 1. My dataset was “Daily Trending YouTube Videos”.
## Importing the dataset
youtube <- read.csv("/Users/nidadonmez/Downloads/trending_yt_videos_113_countries.csv")
Here I check the validity of the dataset by grouping the total number of records by country and filtering them by randomly selected dates:
## To check if each country on a given date really contains 50 videos:
youtube %>%
  filter(snapshot_date == "2023-11-10") %>%
  group_by(country) %>%
  summarise(total_count = n(), .groups = "drop")
# A tibble: 113 × 2
country total_count
<chr> <int>
1 AE 50
2 AL 50
3 AM 50
4 AR 50
5 AT 50
6 AU 50
7 AZ 50
8 BA 50
9 BD 50
10 BE 50
# ℹ 103 more rows
## To continue the analysis above with another random date:
youtube %>%
  filter(snapshot_date == "2023-11-05") %>%
  group_by(country) %>%
  summarise(total_count = n(), .groups = "drop")
# A tibble: 113 × 2
country total_count
<chr> <int>
1 AE 50
2 AL 50
3 AM 50
4 AR 50
5 AT 50
6 AU 50
7 AZ 50
8 BA 50
9 BD 50
10 BE 50
# ℹ 103 more rows
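Rather than spot-checking individual dates one at a time, the same validity check can be run over every date at once. A minimal sketch, assuming the `country` and `snapshot_date` columns shown above (the helper name `check_counts` is hypothetical):

```r
library(dplyr)

# Count records per country-date pair, then keep only the pairs whose
# count deviates from the expected number of videos (50 in this dataset).
check_counts <- function(df, expected = 50) {
  df %>%
    count(country, snapshot_date, name = "total_count") %>%
    filter(total_count != expected)
}

# A zero-row result means every country has exactly `expected` videos
# on every snapshot date, e.g.: check_counts(youtube)
```

This avoids relying on a handful of randomly chosen dates to vouch for the whole table.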
Next, I retrieve the top 5 videos with the highest total number of comments worldwide during the dataset's time range:
youtube %>%
  group_by(title) %>%
  summarise(total_comment = sum(comment_count)) %>%
  ungroup() %>%
  arrange(desc(total_comment)) %>%
  head(5)
# A tibble: 5 × 2
title total_comment
<chr> <int>
1 정국 (Jung Kook) 'Standing Next to You' Official MV 79671285
2 skibidi toilet 67 (part 2) 50070393
3 KALAASTAR - Full Video | Honey 3.0 | Yo Yo Honey Singh & Sonaks… 44020289
4 I Built 100 Wells In Africa 36163225
5 World’s Deadliest Laser Maze! 32026005
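The `arrange(desc(...))` plus `head(5)` pair can also be written with `slice_max()`, dplyr's dedicated verb for "top n by a column". A sketch under the same column names (`title`, `comment_count`); the wrapper name `top_commented` is hypothetical:

```r
library(dplyr)

# Top n titles by total comments; slice_max() replaces the
# arrange(desc(...)) %>% head(n) idiom and makes tie handling explicit.
top_commented <- function(df, n = 5) {
  df %>%
    group_by(title) %>%
    summarise(total_comment = sum(comment_count, na.rm = TRUE), .groups = "drop") %>%
    slice_max(total_comment, n = n, with_ties = FALSE)
}

# Usage: top_commented(youtube, n = 5)
```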
To check engagement rates by country, I compute each country's ratio of total likes to total views over the dataset's time range and list the top 10 countries:
youtube %>%
  group_by(country) %>%
  summarise(total_view = sum(view_count), like_engagement = sum(like_count)/sum(view_count)) %>%
  ungroup() %>%
  arrange(desc(like_engagement)) %>%
  head(10)
# A tibble: 10 × 3
country total_view like_engagement
<chr> <dbl> <dbl>
1 JO 1397804063 0.0683
2 IQ 1487839862 0.0675
3 LY 1580401023 0.0668
4 LB 1838351618 0.0664
5 DZ 1191019610 0.0662
6 YE 1682891866 0.0650
7 BR 785158849 0.0649
8 TN 1800458954 0.0648
9 MA 1191177698 0.0643
10 FR 885176519 0.0643
And the bottom 10 countries with the lowest engagement rates:
youtube %>%
  group_by(country) %>%
  summarise(total_view = sum(view_count), like_engagement = sum(like_count)/sum(view_count)) %>%
  ungroup() %>%
  arrange(like_engagement) %>%
  head(10)
# A tibble: 10 × 3
country total_view like_engagement
<chr> <dbl> <dbl>
1 LA 16969834067 0.0210
2 TZ 3709647151 0.0214
3 KH 27864624810 0.0216
4 SN 10110132491 0.0228
5 AL 12484143847 0.0230
6 MK 14753659905 0.0234
7 BA 10629627209 0.0235
8 ME 10120936040 0.0241
9 AM 14818966845 0.0242
10 AZ 9488748280 0.0248
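Since the top-10 and bottom-10 queries differ only in sort direction, the per-country engagement table can be aggregated once and sliced both ways. A sketch, assuming the `view_count` and `like_count` columns as above (the helper name `engagement_by_country` is hypothetical):

```r
library(dplyr)

# Per-country like engagement (total likes / total views), computed once.
engagement_by_country <- function(df) {
  df %>%
    group_by(country) %>%
    summarise(total_view = sum(view_count),
              like_engagement = sum(like_count) / sum(view_count),
              .groups = "drop")
}

# Highest and lowest engagement without repeating the aggregation:
# engagement_by_country(youtube) %>% slice_max(like_engagement, n = 10)
# engagement_by_country(youtube) %>% slice_min(like_engagement, n = 10)
```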