Assignment 3
Analysis Of Netflix
Data Manipulation
We load the packages we need for this analysis.
library(tidyverse)
library(lubridate)
library(ggplot2)
library(dplyr)
library(readr)
library(knitr)
library(kableExtra)
library(rmdformats)
We read the file and check the data types of the columns. To show the table in a compact format, we can use the kable function.
df_raw <- read.csv("C:/Users/bmire/Desktop/MEF BDA/R Course/R Studio/netflix_titles.csv", stringsAsFactors = FALSE, na.strings = c("", "NA"))
str(df_raw)
## 'data.frame': 6234 obs. of 12 variables:
## $ show_id : int 81145628 80117401 70234439 80058654 80125979 80163890 70304989 80164077 80117902 70304990 ...
## $ type : chr "Movie" "Movie" "TV Show" "TV Show" ...
## $ title : chr "Norm of the North: King Sized Adventure" "Jandino: Whatever it Takes" "Transformers Prime" "Transformers: Robots in Disguise" ...
## $ director : chr "Richard Finn, Tim Maltby" NA NA NA ...
## $ cast : chr "Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Duru"| __truncated__ "Jandino Asporaat" "Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton"| __truncated__ "Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen" ...
## $ country : chr "United States, India, South Korea, China" "United Kingdom" "United States" "United States" ...
## $ date_added : chr "September 9, 2019" "September 9, 2016" "September 8, 2018" "September 8, 2018" ...
## $ release_year: int 2019 2016 2013 2016 2017 2016 2014 2017 2017 2014 ...
## $ rating : chr "TV-PG" "TV-MA" "TV-Y7-FV" "TV-Y7" ...
## $ duration : chr "90 min" "94 min" "1 Season" "1 Season" ...
## $ listed_in : chr "Children & Family Movies, Comedies" "Stand-Up Comedy" "Kids' TV" "Kids' TV" ...
## $ description : chr "Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from"| __truncated__ "Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of"| __truncated__ "With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticon"| __truncated__ "When a prison ship crash unleashes hundreds of Decepticons on Earth, Bumblebee leads a new Autobot force to protect humankind." ...
kable(df_raw[1:3,], "simple")
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|
81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from an evil archaeologist first. |
80117401 | Movie | Jandino: Whatever it Takes | NA | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of “Sex on Fire” in his comedy show. |
70234439 | TV Show | Transformers Prime | NA | Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle | United States | September 8, 2018 | 2013 | TV-Y7-FV | 1 Season | Kids’ TV | With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticons and their leader, Megatron. |
We check whether there are any missing values. For now, we do not delete the rows with missing values, because we might need them in the next steps.
colSums(is.na(df_raw))
## show_id type title director cast country
## 0 0 0 1969 570 476
## date_added release_year rating duration listed_in description
## 11 0 10 0 0 0
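The raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a made-up data frame (`toy` and its values are illustrative assumptions, not part of the Netflix data) that computes both the count and the percentage of missing values per column.

```r
# Toy data frame standing in for df_raw (values are made up for illustration)
toy <- data.frame(
  title    = c("A", "B", "C", "D"),
  director = c("X", NA, NA, "Y"),
  country  = c("US", "UK", NA, "US"),
  stringsAsFactors = FALSE
)

# Count of missing values per column, as in the report
na_count <- colSums(is.na(toy))

# Share of missing values per column, in percent
na_share <- round(100 * colMeans(is.na(toy)), 1)

print(na_count)
print(na_share)
```

Applied to df_raw, the same two calls would show at a glance which columns are too sparse to rely on.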
Then, we delete the columns that we do not need in the analysis.
df_raw$show_id <- NULL
df_raw$description <- NULL
We convert the categorical columns to factors and the date column to Date. Then, we check the structure again.
df_raw$type <- as.factor(df_raw$type)
df_raw$rating <- as.factor(df_raw$rating)
df_raw$listed_in <- as.factor(df_raw$listed_in)
df_raw$date_added <- mdy(df_raw$date_added)
str(df_raw) # check data types
## 'data.frame': 6234 obs. of 10 variables:
## $ type : Factor w/ 2 levels "Movie","TV Show": 1 1 2 2 1 2 1 1 2 1 ...
## $ title : chr "Norm of the North: King Sized Adventure" "Jandino: Whatever it Takes" "Transformers Prime" "Transformers: Robots in Disguise" ...
## $ director : chr "Richard Finn, Tim Maltby" NA NA NA ...
## $ cast : chr "Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Duru"| __truncated__ "Jandino Asporaat" "Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton"| __truncated__ "Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen" ...
## $ country : chr "United States, India, South Korea, China" "United Kingdom" "United States" "United States" ...
## $ date_added : Date, format: "2019-09-09" "2016-09-09" ...
## $ release_year: int 2019 2016 2013 2016 2017 2016 2014 2017 2017 2014 ...
## $ rating : Factor w/ 14 levels "G","NC-17","NR",..: 10 9 13 12 7 9 6 9 9 6 ...
## $ duration : chr "90 min" "94 min" "1 Season" "1 Season" ...
## $ listed_in : Factor w/ 461 levels "Action & Adventure",..: 111 421 382 382 168 218 338 421 273 58 ...
kable(df_raw[1:3,], "simple")
type | title | director | cast | country | date_added | release_year | rating | duration | listed_in |
---|---|---|---|---|---|---|---|---|---|
Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson | United States, India, South Korea, China | 2019-09-09 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies |
Movie | Jandino: Whatever it Takes | NA | Jandino Asporaat | United Kingdom | 2016-09-09 | 2016 | TV-MA | 94 min | Stand-Up Comedy |
TV Show | Transformers Prime | NA | Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle | United States | 2018-09-08 | 2013 | TV-Y7-FV | 1 Season | Kids’ TV |
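The date conversion relies on lubridate's mdy function, which reads text in month-day-year order. The sketch below shows the behavior on a few example strings in the same layout as date_added; it assumes an English locale for month names, and the vector name `raw_dates` is illustrative.

```r
library(lubridate)

# Example strings in the same "Month day, year" layout as date_added
raw_dates <- c("September 9, 2019", "September 8, 2018", NA)

# mdy() parses month-day-year text into Date values; a missing input stays NA
parsed <- mdy(raw_dates)

print(class(parsed))
print(year(parsed))
```

Because missing inputs stay missing, the 11 rows without date_added remain NA after the conversion rather than causing an error.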
We remove duplicate rows. Two rows count as duplicates when they share the same type, title, director, cast, and country.
df_raw = distinct(df_raw, type, title, director, cast, country, .keep_all = TRUE)
kable(df_raw[1:3,], "simple")
type | title | director | cast | country | date_added | release_year | rating | duration | listed_in |
---|---|---|---|---|---|---|---|---|---|
Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson | United States, India, South Korea, China | 2019-09-09 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies |
Movie | Jandino: Whatever it Takes | NA | Jandino Asporaat | United Kingdom | 2016-09-09 | 2016 | TV-MA | 94 min | Stand-Up Comedy |
TV Show | Transformers Prime | NA | Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle | United States | 2018-09-08 | 2013 | TV-Y7-FV | 1 Season | Kids’ TV |
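To see what distinct does with .keep_all = TRUE, here is a small sketch on made-up rows (the `toy` data frame is an assumption for illustration). Only the listed columns decide whether a row is a duplicate, and the first matching row is the one that is kept, along with all of its other columns.

```r
library(dplyr)

# Made-up rows: the first two share type and title but differ in rating
toy <- data.frame(
  type   = c("Movie", "Movie", "TV Show"),
  title  = c("A", "A", "B"),
  rating = c("TV-PG", "TV-MA", "TV-Y7"),
  stringsAsFactors = FALSE
)

# Rows 1 and 2 match on type and title, so only the first of them is kept
deduped <- distinct(toy, type, title, .keep_all = TRUE)

# Number of rows removed as duplicates
removed <- nrow(toy) - nrow(deduped)
print(removed)
```

Comparing nrow before and after the call in the same way would show how many duplicates were dropped from df_raw.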
Data Visualization
Now we can start the analysis and visualize the data.
Number Of Types
We show the number of titles of each type. There are more movies than TV shows on Netflix.
df_type <- df_raw %>% group_by(type) %>% summarise(number_of_type = n())
ggplot(df_type,aes(x=type,y=number_of_type)) + geom_bar(stat="identity",aes(fill=type))
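We also show the top 3 listed_in values within each type. I think Netflix is mostly known as a TV show platform. However, according to the result, "Dramas, International Movies" has the highest count of all categories. Note that na.omit drops every row with a missing value in any column, so the counts below cover only fully complete rows.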
df_listed <- df_raw %>% na.omit() %>% group_by(type, listed_in) %>% summarise(number_of_type = n()) %>% slice_max(order_by = number_of_type, n = 3)
kable(df_listed, "simple")
type | listed_in | number_of_type |
---|---|---|
Movie | Dramas, International Movies | 237 |
Movie | Stand-Up Comedy | 234 |
Movie | Dramas, Independent Movies, International Movies | 184 |
TV Show | Crime TV Shows, International TV Shows, TV Dramas | 8 |
TV Show | Anime Series, International TV Shows | 5 |
TV Show | International TV Shows, Korean TV Shows, Romantic TV Shows | 5 |
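The TV Show counts in this table are small compared with the Movie counts. One possible reason is that na.omit removes a row when any column is missing, including director, which is NA for many titles. The sketch below is a variation and not the method used in this report: on made-up data (`toy` is illustrative), it compares na.omit with a filter that only checks the columns being grouped.

```r
library(dplyr)

# Made-up rows: two TV shows have no director recorded
toy <- data.frame(
  type      = c("TV Show", "TV Show", "TV Show", "Movie"),
  listed_in = c("Kids' TV", "Kids' TV", "TV Dramas", "Dramas"),
  director  = c(NA, NA, "X", "Y"),
  stringsAsFactors = FALSE
)

# na.omit() removes every row with an NA in any column
complete_rows <- toy %>%
  na.omit() %>%
  count(type, listed_in, name = "number_of_type")

# Filtering only on the grouped columns keeps rows with a missing director
grouped_cols <- toy %>%
  filter(!is.na(type), !is.na(listed_in)) %>%
  count(type, listed_in, name = "number_of_type")

print(complete_rows)
print(grouped_cols)
```

In the toy data the "Kids' TV" group disappears entirely under na.omit but has two titles under the narrower filter, so the choice of filter can change which categories rank highest.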
Countries Analysis Based On Number Of Types
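We show the number of titles of each type by country, keeping the top 10 countries for each type. Australia, Japan, South Korea, and Taiwan appear only in the TV Show top 10, while China, Germany, and Hong Kong appear only in the Movie top 10. Since a title can list several countries, we first split the country column so that each country is counted separately.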
k <- strsplit(df_raw$country, split = ", ")
df_countries <- data.frame(type = rep(df_raw$type, sapply(k, length)), country = unlist(k))
df_countries$country <- as.character(df_countries$country)
df_top10 <- df_countries %>% na.omit() %>% group_by(type, country) %>% summarise(number_of_type = n()) %>%
  slice_max(order_by = number_of_type, n = 10)
ggplot(df_top10, aes(x=country, y = number_of_type)) +
geom_bar(stat = "identity",position = position_dodge(), aes(fill = type)) +
theme(axis.text.x = element_text(angle = 90))
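The split-and-repeat step used to build df_countries can also be written with tidyr's separate_rows, which creates one row per country and repeats the other columns automatically. This is an alternative sketch on made-up data (`toy` is illustrative), not the code used in this report.

```r
library(dplyr)
library(tidyr)

# Made-up rows: one title lists two countries, one has no country
toy <- data.frame(
  type    = c("Movie", "TV Show", "Movie"),
  country = c("United States, India", "Japan", NA),
  stringsAsFactors = FALSE
)

# One output row per country; the type column is repeated as needed
long <- toy %>% separate_rows(country, sep = ", ")

# Count titles per type and country, ignoring missing countries
counts <- long %>%
  filter(!is.na(country)) %>%
  count(type, country, name = "number_of_type")

print(counts)
```

With either approach, a title that lists several countries is counted once for each of them, so the per-country totals add up to more than the number of titles.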
Change Of Number Of Types Based On Year
According to the graph, the Netflix catalog started to grow after 2015, with more movies added than TV shows. The sharp drop in 2020 occurs because the dataset only contains titles added up to the beginning of 2020.
df_raw$year <- year(df_raw$date_added)
df_ts <- df_raw %>% group_by(year, type) %>% summarise(number_of_type = n()) %>% na.omit()
ggplot(df_ts, aes(x=year, y = number_of_type)) +
geom_line(aes(colour = type), size = 2)+
geom_point()
Top 10 Listed_In and Director
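Finally, we look at the top 10 listed_in values and the top 10 directors. Because slice_max keeps ties, the director table contains 15 rows instead of 10.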
df_genre10 <- df_raw %>% na.omit() %>% group_by(listed_in) %>% summarise(number_of_type = n()) %>% slice_max(order_by = number_of_type, n = 10)
kable(df_genre10, "simple")
listed_in | number_of_type |
---|---|
Dramas, International Movies | 237 |
Stand-Up Comedy | 234 |
Dramas, Independent Movies, International Movies | 184 |
Comedies, Dramas, International Movies | 168 |
Documentaries | 136 |
Children & Family Movies, Comedies | 118 |
Comedies, International Movies | 110 |
Dramas, International Movies, Romantic Movies | 103 |
Action & Adventure, Dramas, International Movies | 98 |
Comedies, International Movies, Romantic Movies | 93 |
df_dir10 <- df_raw %>% na.omit() %>% group_by(director) %>% summarise(number_of_type = n()) %>% slice_max(order_by = number_of_type, n = 10)
kable(df_dir10, "simple")
director | number_of_type |
---|---|
Raúl Campos, Jan Suter | 18 |
Jay Karas | 13 |
Jay Chapman | 12 |
Marcus Raboy | 12 |
Martin Scorsese | 9 |
Steven Spielberg | 9 |
David Dhawan | 8 |
Johnnie To | 8 |
Cathy Garcia-Molina | 7 |
Hakan Algül | 7 |
Lance Bangs | 7 |
Quentin Tarantino | 7 |
Ryan Polito | 7 |
S.S. Rajamouli | 7 |
Shannon Hartman | 7 |
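If exactly 10 rows are wanted, slice_max accepts with_ties = FALSE. The sketch below shows the difference on made-up counts (the `toy` data frame is an assumption for illustration).

```r
library(dplyr)

# Made-up counts: three directors are tied at 7
toy <- data.frame(
  director       = c("A", "B", "C", "D"),
  number_of_type = c(9, 7, 7, 7),
  stringsAsFactors = FALSE
)

# Default with_ties = TRUE: every row tied at the cutoff is kept
with_ties <- slice_max(toy, order_by = number_of_type, n = 2)

# with_ties = FALSE: exactly n rows are returned
exact <- slice_max(toy, order_by = number_of_type, n = 2, with_ties = FALSE)

print(nrow(with_ties))
print(nrow(exact))
```

Keeping ties is the more transparent choice for a ranking like this one, since cutting at exactly 10 rows would drop some directors with 7 titles and keep others with the same count.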