Netflix Assignment

Netflix Analysis

Introduction

In this assignment I will explore the Netflix data.

The data set consists of TV shows and movies available on Netflix as of 2019 and part of 2020.

This analysis is based only on comedy movies.

Let's start by importing the necessary libraries.

library(tidyverse)  # data manipulation and visualization
library(readr)      # CSV reading helpers (already attached by tidyverse)
library(kableExtra) # styled HTML tables for data frames

Data Exploration

Read the CSV file using the read.csv() function. The fileEncoding parameter is specified as “utf-8” because the file contains Unicode characters.

df <- read.csv("netflix_titles.csv", fileEncoding = "utf-8")

## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## invalid input found on input connection 'netflix_titles.csv'
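The warning above suggests that some bytes in the file could not be converted. In that situation read.csv() can stop reading early, so it is worth checking the row count against the expected size of the file. As a minimal sketch of one alternative, assuming readr's read_csv() handles the file better, the snippet below reads a small hypothetical stand-in file (the temporary file and the `df_check` name are illustrative, not part of the original analysis) and inspects any parsing problems.

```r
library(readr)

# Hypothetical stand-in for netflix_titles.csv, written as UTF-8 text
tmp <- tempfile(fileext = ".csv")
rows <- c("title,cast",
          "Apaches,Eloy Azor\u00edn",
          "Norm of the North,Lee Tockar")
writeLines(enc2utf8(rows), tmp, useBytes = TRUE)

# Declare the encoding through the locale instead of fileEncoding
df_check <- read_csv(tmp,
                     locale = locale(encoding = "UTF-8"),
                     show_col_types = FALSE)

# problems() lists rows that readr could not parse (empty if all went well)
problems(df_check)

# Compare this with the number of data rows you expect in the file
nrow(df_check)
```

If the row count is lower than expected, the summaries that follow describe only the part of the file that was read.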

Summary statistics of the data.

summary(df)

##     show_id             type              title             director        
##  Min.   :60003155   Length:233         Length:233         Length:233        
##  1st Qu.:80106966   Class :character   Class :character   Class :character  
##  Median :80182483   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :79192107                                                           
##  3rd Qu.:81002866                                                           
##  Max.   :81186758                                                           
##                                                                             
##      cast             country           date_added         release_year 
##  Length:233         Length:233         Length:233         Min.   :1982  
##  Class :character   Class :character   Class :character   1st Qu.:2015  
##  Mode  :character   Mode  :character   Mode  :character   Median :2017  
##                                                           Mean   :2016  
##                                                           3rd Qu.:2018  
##                                                           Max.   :2019  
##                                                           NA's   :1     
##     rating            duration          listed_in         description       
##  Length:233         Length:233         Length:233         Length:233        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##

I will get rid of columns that are not useful for my analysis.

df_filtered <- df %>% select(type,title,director,cast,country,release_year,rating,duration,listed_in)

First 6 rows of the data (the default for head()).

kable(head(df_filtered)) %>%
  kable_styling("striped", full_width = F) %>%
  scroll_box(width = "100%", height = "800px")

type	title	director	cast	country	release_year	rating	duration	listed_in
Movie	Norm of the North: King Sized Adventure	Richard Finn, Tim Maltby	Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson	United States, India, South Korea, China	2019	TV-PG	90 min	Children & Family Movies, Comedies
Movie	Jandino: Whatever it Takes		Jandino Asporaat	United Kingdom	2016	TV-MA	94 min	Stand-Up Comedy
TV Show	Transformers Prime		Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle	United States	2013	TV-Y7-FV	1 Season	Kids’ TV
TV Show	Transformers: Robots in Disguise		Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen	United States	2016	TV-Y7	1 Season	Kids’ TV
Movie	#realityhigh	Fernando Lebrija	Nesta Cooper, Kate Walsh, John Michael Higgins, Keith Powers, Alicia Sanz, Jake Borelli, Kid Ink, Yousef Erakat, Rebekah Graf, Anne Winters, Peter Gilroy, Patrick Davis	United States	2017	TV-14	99 min	Comedies
TV Show	Apaches		Alberto Ammann, Eloy Azorín, Verónica Echegui, Lucía Jiménez, Claudia Traisac	Spain	2016	TV-MA	1 Season	Crime TV Shows, International TV Shows, Spanish-Language TV Shows

Keep only the rows that are movies and whose genre list contains a comedy genre.

df_comedy_movies <- df_filtered %>% subset(., grepl("Comed", listed_in)) %>% filter(type=="Movie")
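Note that the pattern "Comed" is a substring match, so it keeps both "Comedies" and "Stand-Up Comedy". The sketch below illustrates this on a few made-up rows shaped like df_filtered (the `toy` data and its values are assumptions for illustration only), using filter() with str_detect() as an equivalent way to express the same condition.

```r
library(dplyr)
library(stringr)

# Toy rows shaped like df_filtered (values are illustrative only)
toy <- tibble(
  type      = c("Movie", "Movie", "TV Show", "Movie"),
  title     = c("A", "B", "C", "D"),
  listed_in = c("Children & Family Movies, Comedies",
                "Stand-Up Comedy",
                "TV Comedies",
                "Dramas")
)

# "Comed" matches "Comedies" as well as "Comedy"; the TV show is dropped by type
toy_comedy <- toy %>%
  filter(type == "Movie", str_detect(listed_in, "Comed"))

toy_comedy$title
```

Here titles A and B are kept, which means stand-up specials count as comedy movies in this analysis.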

Replace empty strings with NA (missing) values.

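# Apply na_if() to character columns only; recent dplyr versions reject
# comparing a numeric column such as release_year with ""
df_comedy_movies <- df_comedy_movies %>% mutate(across(where(is.character), ~ na_if(.x, "")))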

Clean the duration column: replace the ‘min’ keyword with an empty string, then convert the result to numeric.

df_comedy_movies <- df_comedy_movies %>% mutate(duration = str_replace(duration, "min", ""))
df_comedy_movies$duration <- as.numeric(df_comedy_movies$duration)
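The two cleaning steps can be checked on a small example. This is a minimal sketch on made-up rows (the `toy` data is hypothetical); it uses readr's parse_number() as an alternative to the str_replace() plus as.numeric() combination above, not as the method used in this analysis.

```r
library(dplyr)
library(readr)

# Toy rows with an empty director and durations in "NN min" form
toy <- tibble(
  director     = c("Fernando Lebrija", ""),
  release_year = c(2017L, 2016L),
  duration     = c("99 min", "94 min")
)

toy_clean <- toy %>%
  # Empty strings become NA in character columns; release_year is untouched
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  # parse_number() keeps the number and drops the " min" suffix
  mutate(duration = parse_number(duration))

toy_clean
```

After this step the duration column is numeric, so summary() reports quartiles for it instead of a length and class.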

Print summary again.

summary(df_comedy_movies)

##      type              title             director             cast          
##  Length:56          Length:56          Length:56          Length:56         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    country           release_year     rating             duration     
##  Length:56          Min.   :2009   Length:56          Min.   : 58.00  
##  Class :character   1st Qu.:2015   Class :character   1st Qu.: 77.50  
##  Mode  :character   Median :2016   Mode  :character   Median : 97.00  
##                     Mean   :2016                      Mean   : 96.91  
##                     3rd Qu.:2018                      3rd Qu.:112.25  
##                     Max.   :2019                      Max.   :152.00  
##   listed_in        
##  Length:56         
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Data Visualization

Comedy Movie Counts by Country

comedy_country <- df_comedy_movies %>%
  mutate(country = strsplit(as.character(country), ",")) %>%
  unnest(country) %>%
  select(country) %>%
  group_by(country_name = trimws(country)) %>%
  summarise(total = n())

The US is by far the largest producer of comedy movies, followed by India, Canada, and others.
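The plotting code for this chart is not shown, so the following is only a sketch of how such a bar chart could be drawn with ggplot2. The `toy` data, the `toy_country` and `p_country` names, and the axis labels are assumptions for illustration; separate_rows() is used here as a shorter alternative to the strsplit() and unnest() steps.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy rows: a co-production lists several countries in one string
toy <- tibble(
  title   = c("A", "B", "C"),
  country = c("United States, India", "United States", "Canada")
)

# One row per (movie, country) pair, then count movies per country
toy_country <- toy %>%
  separate_rows(country, sep = ",\\s*") %>%
  count(country, name = "total", sort = TRUE)

# Horizontal bars ordered by count
p_country <- ggplot(toy_country, aes(x = reorder(country, total), y = total)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country", y = "Number of comedy movies")
```

Printing `p_country` draws the chart. A co-production is counted once for each country it lists, so the totals can add up to more than the number of movies.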

Comedy Movie Count by Rating

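Most of the movies are rated TV-14 or TV-MA. Under the US TV Parental Guidelines, TV-14 content may be unsuitable for children under 14 and TV-MA content is intended for mature audiences (17 and older), so these comedies are aimed mostly at older teenagers and adults.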
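The code behind this chart is also not shown. As a hedged sketch, counts per rating could be computed and plotted as follows (the `toy` ratings and the `toy_rating` and `p_rating` names are hypothetical).

```r
library(dplyr)
library(ggplot2)

# Toy ratings, including one missing value
toy <- tibble(rating = c("TV-14", "TV-MA", "TV-MA", "TV-PG", NA))

# Drop missing ratings and count movies per rating
toy_rating <- toy %>%
  filter(!is.na(rating)) %>%
  count(rating, name = "total", sort = TRUE)

# Bars ordered from the most to the least common rating
p_rating <- ggplot(toy_rating, aes(x = reorder(rating, -total), y = total)) +
  geom_col() +
  labs(x = "Rating", y = "Number of comedy movies")
```

Printing `p_rating` draws the chart.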

Boxplot of Movie Duration

About 50% of the movies have a duration between roughly 76 and 112.5 minutes.
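The summary above reports quartiles of 77.50 and 112.25, while the values read from the boxplot are 76 and 112.5. One possible explanation, assuming the plot was drawn with base R's boxplot(), is that the box edges are Tukey hinges computed by fivenum(), which can differ slightly from the interpolated quartiles returned by quantile() and summary(). The sketch below shows the difference on a made-up vector (`x` is illustrative, not taken from the Netflix data).

```r
# Toy values (illustrative only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Quartiles as reported by summary(): default quantile() interpolation
q <- unname(quantile(x, c(0.25, 0.75)))

# Tukey hinges: second and fourth elements of the five-number summary
h <- fivenum(x)[c(2, 4)]

# boxplot(plot = FALSE) returns the box statistics without drawing
b <- boxplot(x, plot = FALSE)$stats[c(2, 4), 1]

q  # 2.75 6.25
h  # 2.50 6.50
b  # 2.50 6.50
```

Both pairs describe the middle half of the data; they only use slightly different rules for placing the cut points.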

Conclusion

When we look for comedy movies to watch on Netflix, they commonly share the following properties:

Rated TV-14 or TV-MA
Produced in the US
Duration between roughly 76 and 112.5 minutes

This is a short analysis of the comedy movies on Netflix. With more data and features, such as user ratings and total watch counts, we could have expanded the analysis further to gain more useful information.

Thanks to Yigit Erol, I was able to analyze the Netflix data briefly.

You can check his analysis and GitHub repo in the links below:

Medium Github