Netflix Analysis

Introduction

In this assignment I will explore the Netflix data.

The data set consists of TV shows and movies available on Netflix as of 2019 and part of 2020.

This analysis is based only on comedy movies.

Let's start by importing the necessary libraries.

library(tidyverse) # data manipulation and visualization
library(readr) # CSV reading functions (already attached by tidyverse)
library(kableExtra) # styled HTML tables

Data Exploration

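Read the CSV file with the read.csv() function. The fileEncoding parameter is specified as “utf-8” because the file contains Unicode characters. Note that the warning below means some input could not be converted from that encoding; in that case R may stop reading early, so the 233 rows reported in the following summary may not be the whole file.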

df <- read.csv("netflix_titles.csv", fileEncoding = "utf-8")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## invalid input found on input connection 'netflix_titles.csv'
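
One way to check whether a read was cut short is to compare the number of parsed rows with the number of data lines in the file. The following is only a minimal sketch under stated assumptions: it uses a small temporary file as a stand-in for netflix_titles.csv (the file contents and the names demo_path, demo_df and n_expected are made up for illustration), and it assumes no field contains an embedded line break.

```r
# Toy stand-in for netflix_titles.csv, containing a non-ASCII character.
demo_path <- tempfile(fileext = ".csv")
demo_lines <- c(
  "title,country",
  "Apaches,Spain",
  "Azor\u00edn,Spain"
)
writeLines(enc2utf8(demo_lines), demo_path, useBytes = TRUE)

# encoding = "UTF-8" marks the strings as UTF-8 instead of re-encoding the input.
demo_df <- read.csv(demo_path, encoding = "UTF-8")

# Compare parsed rows with the number of data lines (header excluded).
n_expected <- length(readLines(demo_path, warn = FALSE)) - 1
nrow(demo_df) == n_expected
```

If the comparison is FALSE for the real file, the encoding argument is a likely place to start looking.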

Summary statistics of the data.

summary(df)
##     show_id             type              title             director        
##  Min.   :60003155   Length:233         Length:233         Length:233        
##  1st Qu.:80106966   Class :character   Class :character   Class :character  
##  Median :80182483   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :79192107                                                           
##  3rd Qu.:81002866                                                           
##  Max.   :81186758                                                           
##                                                                             
##      cast             country           date_added         release_year 
##  Length:233         Length:233         Length:233         Min.   :1982  
##  Class :character   Class :character   Class :character   1st Qu.:2015  
##  Mode  :character   Mode  :character   Mode  :character   Median :2017  
##                                                           Mean   :2016  
##                                                           3rd Qu.:2018  
##                                                           Max.   :2019  
##                                                           NA's   :1     
##     rating            duration          listed_in         description       
##  Length:233         Length:233         Length:233         Length:233        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
## 

I will get rid of columns that are not useful for my analysis.

df_filtered <- df %>% select(type,title,director,cast,country,release_year,rating,duration,listed_in)

First 6 rows of the data (the default for head()).

kable(head(df_filtered)) %>%
  kable_styling("striped", full_width = F) %>%
  scroll_box(width = "100%", height = "800px")
| type | title | director | cast | country | release_year | rating | duration | listed_in |
|---|---|---|---|---|---|---|---|---|
| Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson | United States, India, South Korea, China | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies |
| Movie | Jandino: Whatever it Takes | | Jandino Asporaat | United Kingdom | 2016 | TV-MA | 94 min | Stand-Up Comedy |
| TV Show | Transformers Prime | | Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle | United States | 2013 | TV-Y7-FV | 1 Season | Kids’ TV |
| TV Show | Transformers: Robots in Disguise | | Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen | United States | 2016 | TV-Y7 | 1 Season | Kids’ TV |
| Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins, Keith Powers, Alicia Sanz, Jake Borelli, Kid Ink, Yousef Erakat, Rebekah Graf, Anne Winters, Peter Gilroy, Patrick Davis | United States | 2017 | TV-14 | 99 min | Comedies |
| TV Show | Apaches | | Alberto Ammann, Eloy Azorín, Verónica Echegui, Lucía Jiménez, Claudia Traisac | Spain | 2016 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, Spanish-Language TV Shows |

Keep only the rows whose type is Movie and whose genre list contains a comedy genre.

df_comedy_movies <- df_filtered %>% filter(type == "Movie", grepl("Comed", listed_in))
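
To illustrate what the "Comed" pattern matches, here is a small sketch on toy genre strings written in the style of the listed_in column (the vector genres and the name is_comedy are made up for this example). Because "Comed" is a substring of both "Comedies" and "Comedy", stand-up specials are kept as well.

```r
# Genre strings in the style of the listed_in column (toy values).
genres <- c(
  "Children & Family Movies, Comedies",
  "Stand-Up Comedy",
  "Kids' TV",
  "Crime TV Shows, International TV Shows"
)

# "Comed" is a substring of both "Comedies" and "Comedy".
is_comedy <- grepl("Comed", genres)
is_comedy  # TRUE TRUE FALSE FALSE
```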

Replace empty strings with NA values in the character columns.

df_comedy_movies <- df_comedy_movies %>% mutate(across(where(is.character), ~ na_if(.x, "")))

Clean the duration column: remove the ‘min’ keyword, then convert the column to numeric.

df_comedy_movies <- df_comedy_movies %>% mutate(duration = str_replace(duration, "min", ""))
df_comedy_movies$duration <- as.numeric(df_comedy_movies$duration)
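
As a quick illustration of this conversion, the sketch below applies the same idea to three toy duration strings using base R only (the vectors durations and minutes are made up for this example). A value such as "1 Season" cannot be converted and becomes NA, which is why the step only makes sense after the TV shows have been filtered out.

```r
# Toy duration strings in the style of the duration column.
durations <- c("90 min", "94 min", "1 Season")

# Drop the "min" unit; as.numeric() tolerates the leftover trailing space.
# suppressWarnings() hides the coercion warning raised for "1 Season".
minutes <- suppressWarnings(as.numeric(sub("min", "", durations, fixed = TRUE)))
minutes
```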

Print the summary again.

summary(df_comedy_movies)
##      type              title             director             cast          
##  Length:56          Length:56          Length:56          Length:56         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    country           release_year     rating             duration     
##  Length:56          Min.   :2009   Length:56          Min.   : 58.00  
##  Class :character   1st Qu.:2015   Class :character   1st Qu.: 77.50  
##  Mode  :character   Median :2016   Mode  :character   Median : 97.00  
##                     Mean   :2016                      Mean   : 96.91  
##                     3rd Qu.:2018                      3rd Qu.:112.25  
##                     Max.   :2019                      Max.   :152.00  
##   listed_in        
##  Length:56         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Data Visualization



Comedy Movie Counts by Country

comedy_country <- df_comedy_movies %>%
  mutate(country = strsplit(as.character(country), ",")) %>%
  unnest(country) %>%
  select(country) %>%
  group_by(country_name = trimws(country)) %>%
  summarise(total = n())
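
The chart for this section is not reproduced in the text, and the plotting code is not shown. The sketch below is therefore only a guess at how such counts could be drawn, assuming a ggplot2 bar chart. It runs on a toy data frame rather than the Netflix data; the names toy, countries, toy_counts and p are made up for illustration.

```r
library(ggplot2)

# Toy rows in the style of the country column; one title can list several countries.
toy <- data.frame(
  country = c("United States, India", "United States", "Canada, United States"),
  stringsAsFactors = FALSE
)

# Split on commas, trim the spaces, and tally the resulting country names.
countries <- trimws(unlist(strsplit(toy$country, ",")))
toy_counts <- as.data.frame(table(country_name = countries), stringsAsFactors = FALSE)
names(toy_counts)[2] <- "total"

# Horizontal bar chart with the largest count at the top.
p <- ggplot(toy_counts, aes(x = reorder(country_name, total), y = total)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country", y = "Number of comedy movies")
p
```

Replacing toy_counts with comedy_country should give the same kind of chart for the real data, since both have the columns country_name and total.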

The US appears to be by far the largest producer of comedy movies, followed by India, Canada, and others.

Comedy Movie Count by Rating

We can see that most of the movies are rated TV-14 or TV-MA. That means these movies are mostly aimed at viewers aged 14 and over (TV-14) or at mature audiences aged 17 and over (TV-MA).
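
The code behind this count is not shown. A minimal sketch of how ratings could be tallied, using base R on a toy vector (the values in toy_ratings are invented, and the real ones would come from df_comedy_movies$rating):

```r
# Toy ratings, not taken from the Netflix data.
toy_ratings <- c("TV-MA", "TV-14", "TV-MA", "TV-PG", "TV-14", "TV-MA")

# Tally each rating and sort from most to least frequent.
rating_counts <- sort(table(toy_ratings), decreasing = TRUE)
rating_counts
```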

Boxplot of Movie Duration

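We can see that about 50% of the movie durations fall roughly between 76 and 112.5 minutes (the box of the boxplot), which is close to the first and third quartiles of 77.50 and 112.25 reported in the summary above.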
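
Box edges read off a plot need not match summary() exactly. Base R's boxplot() draws the box at Tukey's hinges, as computed by fivenum(), while summary() and quantile() interpolate differently by default; ggplot2's geom_boxplot() uses the quantile version. Since the plotting code is not shown here, which of these applies is an assumption. The sketch below shows the difference on toy values (not the Netflix durations):

```r
# Toy values, not the Netflix durations.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Quartiles as reported by summary() and quantile() (default type 7).
q <- unname(quantile(x, c(0.25, 0.75)))

# Hinges used by boxplot(): elements 2 and 4 of Tukey's five-number summary.
h <- fivenum(x)[c(2, 4)]

q  # 2.75 6.25
h  # 2.50 6.50
```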

Conclusion

When we look for comedy movies to watch on Netflix, the typical title has the following properties:

  • rated TV-14 or TV-MA
  • produced in the US
  • a duration roughly between 76 and 112.5 minutes

This is a short analysis of the comedy movies on Netflix. With more data and features, such as user ratings and total watch counts, we could have expanded the analysis further to gain more useful information.

Thanks to Yigit Erol, I have been able to analyze the Netflix data briefly.

You can check his analysis and GitHub repo at the links below:

Medium Github