2  InClass1

Published

November 10, 2023

In this in-class assignment, we run three simple analyses with the dplyr package on the datasets we proposed in Assignment 1. My dataset is “Daily Trending YouTube Videos”.

2.1 | Preparation

## Loading the dplyr package
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
## Importing the dataset
youtube <- read.csv("/Users/nidadonmez/Downloads/trending_yt_videos_113_countries.csv") 
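Before the analyses, a quick look at the structure of the imported data frame confirms that the columns used below (country, snapshot_date, title, view_count, like_count, comment_count) are present. This is a minimal sketch of my own and not part of the assignment pipeline:

## Sketch: inspect the imported data (column names, types, and a few values)
glimpse(youtube)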

2.2 | Analysis 1

Here I check the validity of the dataset by counting the number of records per country, filtered to randomly selected dates:

## To check if each country on a given date really contains 50 videos:
youtube %>%
  filter(snapshot_date == "2023-11-10") %>%
  group_by(country) %>%
  summarise(total_count = n(), .groups = "drop")
# A tibble: 113 × 2
   country total_count
   <chr>         <int>
 1 AE               50
 2 AL               50
 3 AM               50
 4 AR               50
 5 AT               50
 6 AU               50
 7 AZ               50
 8 BA               50
 9 BD               50
10 BE               50
# ℹ 103 more rows
## To continue the above analysis with another random date:
youtube %>%
  filter(snapshot_date == "2023-11-05") %>%
  group_by(country) %>%
  summarise(total_count = n(), .groups = "drop")
# A tibble: 113 × 2
   country total_count
   <chr>         <int>
 1 AE               50
 2 AL               50
 3 AM               50
 4 AR               50
 5 AT               50
 6 AU               50
 7 AZ               50
 8 BA               50
 9 BD               50
10 BE               50
# ℹ 103 more rows
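The same check can be extended to every snapshot date at once. The sketch below (my own addition, assuming only the columns used above) counts records per country and date and keeps the combinations that do not have exactly 50 videos, so an empty result would indicate the dataset is consistent:

## Sketch: flag any country/date combination that does not contain exactly 50 records
youtube %>%
  count(country, snapshot_date, name = "total_count") %>%
  filter(total_count != 50)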

2.3 | Analysis 2

Next, I retrieve the five videos with the highest total number of comments worldwide over the time range of the dataset:

youtube %>%
  group_by(title) %>%
  summarise(total_comment = sum(comment_count)) %>%
  ungroup() %>%
  arrange(desc(total_comment)) %>%
  head(5)
# A tibble: 5 × 2
  title                                                            total_comment
  <chr>                                                                    <int>
1 정국 (Jung Kook) 'Standing Next to You' Official MV                   79671285
2 skibidi toilet 67 (part 2)                                            50070393
3 KALAASTAR - Full Video | Honey 3.0 | Yo Yo Honey Singh & Sonaks…      44020289
4 I Built 100 Wells In Africa                                           36163225
5 World’s Deadliest Laser Maze!                                         32026005
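Because each video appears once per country per snapshot date, sum(comment_count) adds the same running comment total many times. If instead we wanted the largest single observed comment count per video, one possible sketch (my own variation, assuming comment_count is a cumulative count at snapshot time) would be:

## Sketch: largest observed comment count per title instead of summed snapshots
youtube %>%
  group_by(title) %>%
  summarise(max_comment = max(comment_count), .groups = "drop") %>%
  arrange(desc(max_comment)) %>%
  head(5)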

2.4 | Analysis 3

To check engagement rates by country, I group the data by country, compute the ratio of total likes to total views over the time range of the dataset, and list the top 10 countries:

youtube %>%
  group_by(country) %>%
  summarise(total_view = sum(view_count),
            like_engagement = sum(like_count) / sum(view_count)) %>%
  ungroup() %>%
  arrange(desc(like_engagement)) %>%
  head(10)
# A tibble: 10 × 3
   country total_view like_engagement
   <chr>        <dbl>           <dbl>
 1 JO      1397804063          0.0683
 2 IQ      1487839862          0.0675
 3 LY      1580401023          0.0668
 4 LB      1838351618          0.0664
 5 DZ      1191019610          0.0662
 6 YE      1682891866          0.0650
 7 BR       785158849          0.0649
 8 TN      1800458954          0.0648
 9 MA      1191177698          0.0643
10 FR       885176519          0.0643

And the bottom 10 countries with the lowest engagement rates:

youtube %>%
  group_by(country) %>%
  summarise(total_view = sum(view_count),
            like_engagement = sum(like_count) / sum(view_count)) %>%
  ungroup() %>%
  arrange(like_engagement) %>%
  head(10)
# A tibble: 10 × 3
   country  total_view like_engagement
   <chr>         <dbl>           <dbl>
 1 LA      16969834067          0.0210
 2 TZ       3709647151          0.0214
 3 KH      27864624810          0.0216
 4 SN      10110132491          0.0228
 5 AL      12484143847          0.0230
 6 MK      14753659905          0.0234
 7 BA      10629627209          0.0235
 8 ME      10120936040          0.0241
 9 AM      14818966845          0.0242
10 AZ       9488748280          0.0248
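Both ends of the ranking can also be pulled in a single pipeline. The sketch below (my own variation on the code above) summarises the engagement rates once and then stacks the top and bottom 10 countries with slice_max() and slice_min():

## Sketch: top and bottom 10 countries by like engagement in one result
engagement <- youtube %>%
  group_by(country) %>%
  summarise(total_view = sum(view_count),
            like_engagement = sum(like_count) / sum(view_count),
            .groups = "drop")

bind_rows(
  slice_max(engagement, like_engagement, n = 10),
  slice_min(engagement, like_engagement, n = 10)
)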