Assignment 3

Analysis Of Netflix

Data Manipulation

We load the packages we need for this analysis.

library(tidyverse)
library(lubridate)
library(ggplot2)
library(dplyr)
library(readr)
library(knitr)
library(kableExtra)
library(rmdformats)

We read the file and check the data types of the columns. To show the table in a compact format, we can use the kable function.

df_raw <- read.csv("C:/Users/bmire/Desktop/MEF BDA/R Course/R Studio/netflix_titles.csv", stringsAsFactors = FALSE, na.strings=c("","NA"))
str(df_raw)
## 'data.frame':    6234 obs. of  12 variables:
##  $ show_id     : int  81145628 80117401 70234439 80058654 80125979 80163890 70304989 80164077 80117902 70304990 ...
##  $ type        : chr  "Movie" "Movie" "TV Show" "TV Show" ...
##  $ title       : chr  "Norm of the North: King Sized Adventure" "Jandino: Whatever it Takes" "Transformers Prime" "Transformers: Robots in Disguise" ...
##  $ director    : chr  "Richard Finn, Tim Maltby" NA NA NA ...
##  $ cast        : chr  "Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Duru"| __truncated__ "Jandino Asporaat" "Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton"| __truncated__ "Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen" ...
##  $ country     : chr  "United States, India, South Korea, China" "United Kingdom" "United States" "United States" ...
##  $ date_added  : chr  "September 9, 2019" "September 9, 2016" "September 8, 2018" "September 8, 2018" ...
##  $ release_year: int  2019 2016 2013 2016 2017 2016 2014 2017 2017 2014 ...
##  $ rating      : chr  "TV-PG" "TV-MA" "TV-Y7-FV" "TV-Y7" ...
##  $ duration    : chr  "90 min" "94 min" "1 Season" "1 Season" ...
##  $ listed_in   : chr  "Children & Family Movies, Comedies" "Stand-Up Comedy" "Kids' TV" "Kids' TV" ...
##  $ description : chr  "Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from"| __truncated__ "Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of"| __truncated__ "With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticon"| __truncated__ "When a prison ship crash unleashes hundreds of Decepticons on Earth, Bumblebee leads a new Autobot force to protect humankind." ...
kable(df_raw[1:3,], "simple")
show_id type title director cast country date_added release_year rating duration listed_in description
81145628 Movie Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from an evil archaeologist first.
80117401 Movie Jandino: Whatever it Takes NA Jandino Asporaat United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of “Sex on Fire” in his comedy show.
70234439 TV Show Transformers Prime NA Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle United States September 8, 2018 2013 TV-Y7-FV 1 Season Kids’ TV With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticons and their leader, Megatron.

We check whether there are any missing values. For now, we do not delete the rows with missing values, because we might need them in the next steps.

colSums(is.na(df_raw))
##      show_id         type        title     director         cast      country 
##            0            0            0         1969          570          476 
##   date_added release_year       rating     duration    listed_in  description 
##           11            0           10            0            0            0
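As a small illustration of the check above, the following sketch computes both the count and the percentage of missing values per column. The data frame `toy` and its values are hypothetical and only stand in for df_raw.

```r
# Hypothetical toy data standing in for df_raw
toy <- data.frame(
  title    = c("A", "B", "C", "D"),
  director = c("X", NA, NA, "Y"),
  rating   = c("TV-MA", "TV-PG", NA, "TV-Y7"),
  stringsAsFactors = FALSE
)

# Number of missing values per column, as in the report
na_count <- colSums(is.na(toy))

# Percentage of missing values per column
na_share <- round(100 * colMeans(is.na(toy)), 1)

print(na_count)
print(na_share)
```

The percentage is often easier to judge than the raw count when deciding whether a column can still be used.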

Then, we delete the columns that we do not need in the analysis.

df_raw$show_id <- NULL
df_raw$description <- NULL

We convert the categorical and date columns to the appropriate data types. Then, we check the structure again.

df_raw$type <- as.factor(df_raw$type)
df_raw$rating <- as.factor(df_raw$rating)
df_raw$listed_in <- as.factor(df_raw$listed_in)
df_raw$date_added <- mdy(df_raw$date_added)
str(df_raw) #check datatype
## 'data.frame':    6234 obs. of  10 variables:
##  $ type        : Factor w/ 2 levels "Movie","TV Show": 1 1 2 2 1 2 1 1 2 1 ...
##  $ title       : chr  "Norm of the North: King Sized Adventure" "Jandino: Whatever it Takes" "Transformers Prime" "Transformers: Robots in Disguise" ...
##  $ director    : chr  "Richard Finn, Tim Maltby" NA NA NA ...
##  $ cast        : chr  "Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Duru"| __truncated__ "Jandino Asporaat" "Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton"| __truncated__ "Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen" ...
##  $ country     : chr  "United States, India, South Korea, China" "United Kingdom" "United States" "United States" ...
##  $ date_added  : Date, format: "2019-09-09" "2016-09-09" ...
##  $ release_year: int  2019 2016 2013 2016 2017 2016 2014 2017 2017 2014 ...
##  $ rating      : Factor w/ 14 levels "G","NC-17","NR",..: 10 9 13 12 7 9 6 9 9 6 ...
##  $ duration    : chr  "90 min" "94 min" "1 Season" "1 Season" ...
##  $ listed_in   : Factor w/ 461 levels "Action & Adventure",..: 111 421 382 382 168 218 338 421 273 58 ...
kable(df_raw[1:3,], "simple")
type title director cast country date_added release_year rating duration listed_in
Movie Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson United States, India, South Korea, China 2019-09-09 2019 TV-PG 90 min Children & Family Movies, Comedies
Movie Jandino: Whatever it Takes NA Jandino Asporaat United Kingdom 2016-09-09 2016 TV-MA 94 min Stand-Up Comedy
TV Show Transformers Prime NA Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle United States 2018-09-08 2013 TV-Y7-FV 1 Season Kids’ TV
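The date conversion can be tried on a few values in isolation. This is a minimal sketch, assuming an English locale for month names; the vector `raw_dates` is a hypothetical sample in the same format as date_added.

```r
library(lubridate)

# Hypothetical sample in the "Month day, year" format used by date_added
raw_dates <- c("September 9, 2019", "January 1, 2016", NA)

# mdy() parses month-day-year text into Date values; a missing input stays NA
parsed <- mdy(raw_dates)

print(class(parsed))
print(year(parsed))
```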

We remove duplicate rows. Two rows count as duplicates when they have the same type, title, director, cast, and country.

df_raw <- distinct(df_raw, type, title, director, cast, country, .keep_all = TRUE)
kable(df_raw[1:3,], "simple")
type title director cast country date_added release_year rating duration listed_in
Movie Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson United States, India, South Korea, China 2019-09-09 2019 TV-PG 90 min Children & Family Movies, Comedies
Movie Jandino: Whatever it Takes NA Jandino Asporaat United Kingdom 2016-09-09 2016 TV-MA 94 min Stand-Up Comedy
TV Show Transformers Prime NA Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle United States 2018-09-08 2013 TV-Y7-FV 1 Season Kids’ TV
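To see what distinct does with .keep_all = TRUE, here is a small sketch on hypothetical data. It keeps the first row of each duplicate group together with all other columns, and counting rows before and after shows how many rows were removed.

```r
library(dplyr)

# Hypothetical toy data: the first two rows share type and title
toy <- data.frame(
  type       = c("Movie", "Movie", "TV Show"),
  title      = c("A", "A", "B"),
  date_added = as.Date(c("2019-01-01", "2019-06-01", "2018-03-01")),
  stringsAsFactors = FALSE
)

# Rows are compared on type and title only; the other columns are kept
deduped <- distinct(toy, type, title, .keep_all = TRUE)

# Number of rows removed as duplicates
n_removed <- nrow(toy) - nrow(deduped)

print(deduped)
print(n_removed)
```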

Data Visualization

Now, we can start the analysis and visualization.

Number Of Types

We show the number of titles of each type. There are more movies than TV shows on Netflix.

df_type <- df_raw %>% group_by(type) %>% summarise(number_of_type = n())

ggplot(df_type,aes(x=type,y=number_of_type)) + geom_bar(stat="identity",aes(fill=type))

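We also show the top 3 listed_in values for each type. I think Netflix is mostly known as a TV show platform. However, according to the result, movie categories such as "Dramas, International Movies" have much higher counts than any TV show category. Note that na.omit drops every row with a missing value in any column, including director, which likely explains why the TV Show counts are so low.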

df_listed <- df_raw %>%
  na.omit() %>%
  group_by(type, listed_in) %>%
  summarise(number_of_type = n()) %>%
  slice_max(order_by = number_of_type, n = 3)

kable(df_listed, "simple")
type     listed_in                                                   number_of_type
-------  ----------------------------------------------------------  --------------
Movie    Dramas, International Movies                                           237
Movie    Stand-Up Comedy                                                        234
Movie    Dramas, Independent Movies, International Movies                       184
TV Show  Crime TV Shows, International TV Shows, TV Dramas                        8
TV Show  Anime Series, International TV Shows                                     5
TV Show  International TV Shows, Korean TV Shows, Romantic TV Shows               5
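The effect of na.omit on these counts can be sketched with hypothetical data. na.omit removes a row when any column is missing, while filtering on listed_in alone keeps rows that only lack a director. The second variant is an alternative we did not use in the table above.

```r
library(dplyr)

# Hypothetical toy data: two TV shows have no director
toy <- data.frame(
  type      = c("Movie", "TV Show", "TV Show", "TV Show"),
  director  = c("X", NA, NA, "Y"),
  listed_in = c("Dramas", "Kids' TV", "Kids' TV", NA),
  stringsAsFactors = FALSE
)

# na.omit() drops a row if ANY column is NA
all_cols <- toy %>%
  na.omit() %>%
  count(type, listed_in, name = "number_of_type")

# Alternative: drop only the rows where the counted column is missing
one_col <- toy %>%
  filter(!is.na(listed_in)) %>%
  count(type, listed_in, name = "number_of_type")

print(all_cols)
print(one_col)
```

With the first variant the TV shows disappear completely from the toy result, and with the second they are counted.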

Countries Analysis Based On Number Of Types

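We show the number of titles of each type by country, for the top 10 countries of each type. Australia, Japan, South Korea, and Taiwan appear only in the TV Show top 10, while China, Germany, and Hong Kong appear only in the Movie top 10. This does not mean that these countries have only one type of title, only that the other type does not reach its top 10.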

# Split multi-country entries so that each country gets its own row
k <- strsplit(df_raw$country, split = ", ")
df_countries <- data.frame(type = rep(df_raw$type, lengths(k)), country = unlist(k))
df_countries$country <- as.character(df_countries$country)

# Top 10 countries for each type
df_top10 <- df_countries %>%
  na.omit() %>%
  group_by(type, country) %>%
  summarise(number_of_type = n()) %>%
  slice_max(order_by = number_of_type, n = 10)

ggplot(df_top10, aes(x = country, y = number_of_type)) +
  geom_bar(stat = "identity", position = position_dodge(), aes(fill = type)) +
  theme(axis.text.x = element_text(angle = 90))
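The country split can also be written with separate_rows from tidyr instead of strsplit and rep. This is an alternative sketch on hypothetical data, not the code used for the plot above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with a multi-country entry and a missing country
toy <- data.frame(
  type    = c("Movie", "TV Show", "Movie"),
  country = c("United States, India", "Japan", NA),
  stringsAsFactors = FALSE
)

# One row per (title, country) pair, then count titles per type and country
by_country <- toy %>%
  separate_rows(country, sep = ", ") %>%
  filter(!is.na(country)) %>%
  count(type, country, name = "number_of_type")

print(by_country)
```

A title produced in several countries is counted once for each of them in both approaches.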

Change Of Number Of Types Based On Year

According to the graph, the number of titles added to Netflix started to grow after 2015, and more movies than TV shows were added. The sharp drop in 2020 occurs because the dataset only contains data until the beginning of 2020.

df_raw$year <- year(df_raw$date_added)

df_ts <- df_raw %>%
  group_by(year, type) %>%
  summarise(number_of_type = n()) %>%
  na.omit()

ggplot(df_ts, aes(x=year, y = number_of_type)) + 
  geom_line(aes(colour = type), size = 2)+ 
  geom_point() 
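Since the last year in the data is incomplete, one option is to leave it out before plotting. The sketch below shows this on hypothetical data. The column name `year_added` and the data frame `complete_years` are names we introduce only for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data; the last row has no date_added
toy <- data.frame(
  type       = c("Movie", "Movie", "TV Show", "Movie", "TV Show"),
  date_added = as.Date(c("2018-05-01", "2019-02-10", "2019-07-04", "2020-01-15", NA)),
  stringsAsFactors = FALSE
)

# Titles added per year and type, ignoring rows without a date
per_year <- toy %>%
  mutate(year_added = year(date_added)) %>%
  filter(!is.na(year_added)) %>%
  count(year_added, type, name = "number_of_type")

# Drop the most recent year, which is assumed to be incomplete
complete_years <- per_year %>%
  filter(year_added < max(year_added))

print(per_year)
print(complete_years)
```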

Top 10 Listed_In and Director

Finally, we show the top 10 listed_in values and directors. Because slice_max keeps ties by default, the director table has more than 10 rows.

df_genre10 <- df_raw %>%
  na.omit() %>%
  group_by(listed_in) %>%
  summarise(number_of_type = n()) %>%
  slice_max(order_by = number_of_type, n = 10)

kable(df_genre10, "simple")
listed_in                                         number_of_type
------------------------------------------------  --------------
Dramas, International Movies                                 237
Stand-Up Comedy                                              234
Dramas, Independent Movies, International Movies             184
Comedies, Dramas, International Movies                       168
Documentaries                                                136
Children & Family Movies, Comedies                           118
Comedies, International Movies                               110
Dramas, International Movies, Romantic Movies                103
Action & Adventure, Dramas, International Movies              98
Comedies, International Movies, Romantic Movies               93

df_dir10 <- df_raw %>%
  na.omit() %>%
  group_by(director) %>%
  summarise(number_of_type = n()) %>%
  slice_max(order_by = number_of_type, n = 10)

kable(df_dir10, "simple")
director                number_of_type
----------------------  --------------
Raúl Campos, Jan Suter              18
Jay Karas                           13
Jay Chapman                         12
Marcus Raboy                        12
Martin Scorsese                      9
Steven Spielberg                     9
David Dhawan                         8
Johnnie To                           8
Cathy Garcia-Molina                  7
Hakan Algül                          7
Lance Bangs                          7
Quentin Tarantino                    7
Ryan Polito                          7
S.S. Rajamouli                       7
Shannon Hartman                      7
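The director table has 15 rows although we asked for 10, because several directors share the last count. The following sketch on hypothetical data shows how slice_max treats ties, and how the with_ties argument limits the result to exactly n rows.

```r
library(dplyr)

# Hypothetical toy data: three directors are tied at 7
toy <- data.frame(
  director       = c("A", "B", "C", "D"),
  number_of_type = c(9, 7, 7, 7),
  stringsAsFactors = FALSE
)

# Default behaviour: all rows tied with the n-th value are kept
with_ties <- slice_max(toy, order_by = number_of_type, n = 2)

# with_ties = FALSE returns exactly n rows
without_ties <- slice_max(toy, order_by = number_of_type, n = 2, with_ties = FALSE)

print(with_ties)
print(without_ties)
```

Keeping the ties is arguably fairer for a ranking, since cutting them picks among equal counts by row order.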
