In this assignment I will be exploring the Netflix data.
The data set consists of TV shows and movies available on Netflix as of 2019 and part of 2020.
This analysis will be based only on comedy movies.
Let's start by importing the necessary libraries.
library(tidyverse)  # data manipulation and visualization
library(readr)      # CSV reading helpers (already attached by tidyverse)
library(kableExtra) # styled printing of data frames
Read the CSV file with the read.csv() function. The fileEncoding parameter is specified as "utf-8" because the file contains Unicode characters.
df <- read.csv("netflix_titles.csv", fileEncoding = "utf-8")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## invalid input found on input connection 'netflix_titles.csv'
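The warning above suggests that some bytes could not be converted under the declared encoding, which may leave characters mangled or the read incomplete. As a minimal sketch, assuming the file really is UTF-8, readr::read_csv() with an explicit locale is one alternative worth trying. The temporary file below is a hypothetical stand-in for netflix_titles.csv, not the real data.

```r
library(readr)

# Hypothetical stand-in for netflix_titles.csv: a tiny UTF-8 file with accented names.
tmp <- tempfile(fileext = ".csv")
sample_lines <- enc2utf8(c(
  "title,cast",
  "Apaches,\"Eloy Azor\u00edn, Luc\u00eda Jim\u00e9nez\""
))
writeLines(sample_lines, tmp, useBytes = TRUE)

# read_csv() assumes UTF-8 by default; the locale argument makes that explicit.
sample_df <- read_csv(tmp, locale = locale(encoding = "UTF-8"), show_col_types = FALSE)
```

If the accented names come back intact, the same call can be pointed at the real file.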
Summary statistics of the data.
summary(df)
## show_id type title director
## Min. :60003155 Length:233 Length:233 Length:233
## 1st Qu.:80106966 Class :character Class :character Class :character
## Median :80182483 Mode :character Mode :character Mode :character
## Mean :79192107
## 3rd Qu.:81002866
## Max. :81186758
##
## cast country date_added release_year
## Length:233 Length:233 Length:233 Min. :1982
## Class :character Class :character Class :character 1st Qu.:2015
## Mode :character Mode :character Mode :character Median :2017
## Mean :2016
## 3rd Qu.:2018
## Max. :2019
## NA's :1
## rating duration listed_in description
## Length:233 Length:233 Length:233 Length:233
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
I will get rid of columns that are not useful for my analysis.
df_filtered <- df %>% select(type, title, director, cast, country, release_year, rating, duration, listed_in)
First 6 rows of the data (the default for head()).
kable(head(df_filtered)) %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = "100%", height = "800px")
type | title | director | cast | country | release_year | rating | duration | listed_in |
---|---|---|---|---|---|---|---|---|
Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson | United States, India, South Korea, China | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies |
Movie | Jandino: Whatever it Takes | | Jandino Asporaat | United Kingdom | 2016 | TV-MA | 94 min | Stand-Up Comedy |
TV Show | Transformers Prime | | Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle | United States | 2013 | TV-Y7-FV | 1 Season | Kids’ TV |
TV Show | Transformers: Robots in Disguise | | Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen | United States | 2016 | TV-Y7 | 1 Season | Kids’ TV |
Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins, Keith Powers, Alicia Sanz, Jake Borelli, Kid Ink, Yousef Erakat, Rebekah Graf, Anne Winters, Peter Gilroy, Patrick Davis | United States | 2017 | TV-14 | 99 min | Comedies |
TV Show | Apaches | | Alberto Ammann, Eloy Azorín, Verónica Echegui, Lucía Jiménez, Claudia Traisac | Spain | 2016 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, Spanish-Language TV Shows |
Keep only the movies whose genre list contains a comedy genre.
df_comedy_movies <- df_filtered %>% filter(type == "Movie", grepl("Comed", listed_in))
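To make the genre filter concrete, here is a small sketch of what the "Comed" pattern matches. The strings are toy values in the same format as the listed_in column; the point is that the substring catches both "Comedies" and "Stand-Up Comedy".

```r
# Toy genre strings in the same format as the listed_in column.
listed_in <- c(
  "Children & Family Movies, Comedies",
  "Stand-Up Comedy",
  "Kids' TV",
  "Dramas, Romantic Movies"
)

# grepl() returns TRUE wherever the literal substring "Comed" occurs.
is_comedy <- grepl("Comed", listed_in, fixed = TRUE)

# Keep only the matching entries.
comedy_only <- listed_in[is_comedy]
```

Whether stand-up specials should count as comedy movies is a judgment call; a stricter pattern such as "Comedies" would exclude them.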
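Replace empty strings with NA (R's missing value). Only character columns are touched, since comparing a numeric column with "" fails in recent dplyr versions.
df_comedy_movies <- df_comedy_movies %>% mutate(across(where(is.character), ~ na_if(.x, "")))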
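After this step it is useful to check how many values are actually missing in each column. The sketch below does that on a hypothetical toy tibble (the rows are made up for illustration), converting empty strings in character columns to NA and then counting them.

```r
library(dplyr)

# Hypothetical toy rows; the empty string marks a missing director.
toy <- tibble(
  director = c("Fernando Lebrija", "", ""),
  release_year = c(2017L, 2016L, 2013L)
)

# Convert empty strings to NA in character columns only.
toy_clean <- toy %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))

# Count the missing values in each column.
na_counts <- colSums(is.na(toy_clean))
```

The same colSums(is.na(...)) call can be run on df_comedy_movies to see which columns have gaps.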
Clean the duration column: replace the "min" keyword with an empty string, then convert the result to numeric.
df_comedy_movies <- df_comedy_movies %>% mutate(duration = str_replace(duration, "min", ""))
df_comedy_movies$duration <- as.numeric(df_comedy_movies$duration)
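As an alternative to the replace-then-convert approach, readr::parse_number() extracts the first number from a string in one step. This is a sketch on toy values, not the method used above. Note that it would also turn "1 Season" into 1, which is why the data is filtered to movies before the conversion.

```r
library(readr)

# Toy duration strings in the same format as the duration column.
duration_raw <- c("90 min", "94 min", "1 Season")

# parse_number() drops the non-numeric text around the first number it finds.
duration_num <- parse_number(duration_raw)
```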
Print summary again.
summary(df_comedy_movies)
## type title director cast
## Length:56 Length:56 Length:56 Length:56
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## country release_year rating duration
## Length:56 Min. :2009 Length:56 Min. : 58.00
## Class :character 1st Qu.:2015 Class :character 1st Qu.: 77.50
## Mode :character Median :2016 Mode :character Median : 97.00
## Mean :2016 Mean : 96.91
## 3rd Qu.:2018 3rd Qu.:112.25
## Max. :2019 Max. :152.00
## listed_in
## Length:56
## Class :character
## Mode :character
##
##
##
Comedy Movie Counts by Country
comedy_country <- df_comedy_movies %>%
  mutate(country = strsplit(as.character(country), ",")) %>%
  unnest(country) %>%
  select(country) %>%
  group_by(country_name = trimws(country)) %>%
  summarise(total = n())
It looks like the US is by far the largest producer of comedy movies, followed by India, Canada, and others.
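The country counts can be sorted and drawn as a bar chart along these lines. This is only a sketch: the toy tibble below is hypothetical and stands in for df_comedy_movies, and separate_rows() from tidyr is used here as a shorter substitute for the strsplit() and unnest() steps.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy rows in the same shape as the country column.
toy <- tibble(
  country = c("United States, India", "United States", "Canada, United States", NA, "India")
)

country_counts <- toy %>%
  filter(!is.na(country)) %>%                 # drop titles with no country
  separate_rows(country, sep = ",") %>%       # one row per country
  mutate(country = trimws(country)) %>%       # remove spaces left by the split
  count(country, name = "total", sort = TRUE) # largest producers first

# Horizontal bars, ordered by count.
p <- ggplot(country_counts, aes(x = reorder(country, total), y = total)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country", y = "Number of comedy movies")
```

Printing p draws the chart. A co-production is counted once for each country it lists.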
Comedy Movie Count by Rating
We can see that most of the movies are rated TV-14 or TV-MA. That means these movies are most suitable for viewers aged 14 and over and 17 and over, respectively.
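The counts per rating can be computed and plotted roughly as follows. The ratings in this sketch are toy values, not the real distribution; on the real data the same pipeline would start from df_comedy_movies.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy ratings standing in for df_comedy_movies$rating.
toy_ratings <- tibble(
  rating = c("TV-14", "TV-MA", "TV-MA", "TV-PG", "TV-14", "TV-MA")
)

# One row per rating, most frequent first.
rating_counts <- toy_ratings %>%
  count(rating, name = "total", sort = TRUE)

# Bars ordered from most to least frequent rating.
p <- ggplot(rating_counts, aes(x = reorder(rating, -total), y = total)) +
  geom_col() +
  labs(x = "Rating", y = "Number of comedy movies")
```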
Boxplot of Movie Duration
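We can see that about 50% of the movies run between roughly 76 and 112.5 minutes (the box of the plot); the summary above reports the first and third quartiles as 77.50 and 112.25.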
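Values read off a boxplot can differ slightly from the quartiles printed by summary(). One possible reason, assuming a base R boxplot() (the plotting code is not shown here), is that it draws Tukey's hinges from fivenum(), while summary() uses quantile(). A sketch on a toy vector of durations:

```r
# Hypothetical toy durations in minutes (not the real data).
durations <- c(58, 73, 79, 90, 97, 110, 113, 152)

# Quartiles as reported by summary() and quantile() (default type 7).
q <- quantile(durations, c(0.25, 0.75))

# Tukey's hinges, which base R boxplot() uses for the edges of the box.
hinges <- fivenum(durations)[c(2, 4)]

# Width of the middle 50% under each definition.
iqr_quantile <- unname(q[2] - q[1])
iqr_hinges <- hinges[2] - hinges[1]
```

For this toy vector the lower quartile is 77.5 while the lower hinge is 76, so the two definitions can disagree by a minute or two on small samples.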
When we look for comedy movies to watch, it's common to find movie properties like,
This is a short analysis of the comedy movies on Netflix. If we had more data and features such as user ratings and total watch counts, we could expand the analysis further to gain useful information.
Thanks to Yigit Erol, I have been able to analyze the Netflix data briefly.
You can check his analysis and GitHub repo in the links below,