Exploratory Analysis of Netflix Dataset

Introduction

Netflix, Inc. is an American pay television over-the-top media service and original programming production company. It offers subscription-based video on demand from a library of films and television series, 40% of which is Netflix original programming produced in-house.

The Netflix dataset includes details of the movies and TV shows available on the Netflix platform as of September 2021. The purpose of this assignment is to extract useful insights and findings from the dataset using R libraries.

Loading Libraries & Importing Dataset

The tidyverse includes all the libraries needed here, so loading it alone is enough for the analysis. I also loaded hrbrthemes and ggthemes to get more ggplot themes.

I downloaded the Netflix dataset to my desktop and imported it with the read_csv() function. The head() function shows the first rows of the data.

library(tidyverse)
library(hrbrthemes)
library(ggthemes)

netflix <- read_csv("C:/Users/tceme/Desktop/Data/R projects/netflix_titles.csv/netflix_titles.csv") 

head(netflix)

Let’s check what kind of data we will work with. glimpse() is a handy function for checking the data type of each column.

glimpse(netflix)

## Rows: 8,807
## Columns: 12
## $ show_id      <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", ...
## $ type         <chr> "Movie", "TV Show", "TV Show", "TV Show", "TV Show", "...
## $ title        <chr> "Dick Johnson Is Dead", "Blood & Water", "Ganglands", ...
## $ director     <chr> "Kirsten Johnson", NA, "Julien Leclercq", NA, NA, "Mik...
## $ cast         <chr> NA, "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang M...
## $ country      <chr> "United States", "South Africa", NA, NA, "India", NA, ...
## $ date_added   <chr> "September 25, 2021", "September 24, 2021", "September...
## $ release_year <dbl> 2020, 2021, 2021, 2021, 2021, 2021, 2021, 1993, 2021, ...
## $ rating       <chr> "PG-13", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "TV-MA", ...
## $ duration     <chr> "90 min", "2 Seasons", "1 Season", "1 Season", "2 Seas...
## $ listed_in    <chr> "Documentaries", "International TV Shows, TV Dramas, T...
## $ description  <chr> "As her father nears the end of his life, filmmaker Ki...

There are 12 columns and 8,807 rows. As seen above, the date_added column is of character type. I converted it to the date type with the as.Date() function below.

netflix$date_added <- as.Date(netflix$date_added, format = "%B %d, %Y")
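As a minimal sketch of what the format string does, the toy vector below (made-up values shaped like date_added, not read from the real file) is parsed with the same call. Since %B matches full month names in the current locale, the sketch switches LC_TIME to the C locale as an assumption; the variable names are illustrative only.

```r
# Assumption: use the C locale so %B matches English month names.
# This changes LC_TIME for the rest of the session.
invisible(Sys.setlocale("LC_TIME", "C"))

# Toy values shaped like the date_added column (illustrative only)
sample_dates <- c("September 25, 2021", "January 1, 2020", NA)

# %B = full month name, %d = day of month, %Y = four-digit year
parsed <- as.Date(sample_dates, format = "%B %d, %Y")

class(parsed)   # the result is a Date vector
format(parsed)  # ISO-style strings; the NA input stays NA
```

If a value does not match the format, as.Date() returns NA for it rather than raising an error, so it can be worth comparing NA counts before and after the conversion.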

The rating and type columns are also of character type. These columns are more useful as categorical variables, so I converted them to factors with the as.factor() function.

netflix$rating <- as.factor(netflix$rating)
netflix$type <- as.factor(netflix$type)
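To illustrate what the conversion changes, here is a small sketch on a toy character vector (illustrative values, not the full rating column). It shows the levels a factor stores and how NA is treated.

```r
# Toy ratings (illustrative values only)
rating_chr <- c("PG-13", "TV-MA", "TV-MA", "PG", NA)
rating_fct <- as.factor(rating_chr)

levels(rating_fct)      # distinct categories, sorted
nlevels(rating_fct)     # number of categories; NA is not a level
as.integer(rating_fct)  # underlying integer codes, NA stays NA
table(rating_fct)       # frequency of each level
```

Because a factor carries a fixed set of levels, functions such as table() and ggplot2's discrete scales can treat the column as categorical instead of free text.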

Let’s glimpse() again to check if the columns were converted.

glimpse(netflix)

## Rows: 8,807
## Columns: 12
## $ show_id      <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", ...
## $ type         <fct> Movie, TV Show, TV Show, TV Show, TV Show, TV Show, Mo...
## $ title        <chr> "Dick Johnson Is Dead", "Blood & Water", "Ganglands", ...
## $ director     <chr> "Kirsten Johnson", NA, "Julien Leclercq", NA, NA, "Mik...
## $ cast         <chr> NA, "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang M...
## $ country      <chr> "United States", "South Africa", NA, NA, "India", NA, ...
## $ date_added   <date> 2021-09-25, 2021-09-24, 2021-09-24, 2021-09-24, 2021-...
## $ release_year <dbl> 2020, 2021, 2021, 2021, 2021, 2021, 2021, 1993, 2021, ...
## $ rating       <fct> PG-13, TV-MA, TV-MA, TV-MA, TV-MA, TV-MA, PG, TV-MA, T...
## $ duration     <chr> "90 min", "2 Seasons", "1 Season", "1 Season", "2 Seas...
## $ listed_in    <chr> "Documentaries", "International TV Shows, TV Dramas, T...
## $ description  <chr> "As her father nears the end of his life, filmmaker Ki...

Our work with the columns is done. As a last step, let’s check for NA values in the dataset by using the sapply() function to count the NA values in each column.

NA_values <- sapply(netflix, function(y) sum(is.na(y)))
NA_values

##      show_id         type        title     director         cast      country 
##            0            0            0         2634          825          831 
##   date_added release_year       rating     duration    listed_in  description 
##           10            0            4            3            0            0
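An equivalent way to get the same per-column counts is colSums(is.na(...)). The sketch below uses a small made-up data frame (the `toy` name and its values are assumptions for illustration) and also shows the share of missing values per column.

```r
# Toy data frame with a few missing values (illustrative only)
toy <- data.frame(
  director = c("Kirsten Johnson", NA, "Julien Leclercq", NA),
  country  = c("United States", "South Africa", NA, NA),
  title    = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts TRUE per column
na_counts <- colSums(is.na(toy))
na_counts

# Keep only the columns that actually contain NA values
na_counts[na_counts > 0]

# Percentage of missing values per column
round(100 * colMeans(is.na(toy)), 1)
```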

The director, cast, country, date_added, rating and duration columns have NA values. That is fine for now; these rows will be removed or filled in if needed.

Handling Data & Visualizing the Results

The table() function tabulates a categorical variable, returning each category and its frequency.

table(netflix$type)

## 
##   Movie TV Show 
##    6131    2676
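The raw counts can also be read as shares of the catalogue. As a small sketch, the two counts printed above are typed in by hand here (the `type_counts` name is illustrative) and passed to prop.table().

```r
# Counts copied from the table() output above
type_counts <- c(Movie = 6131, `TV Show` = 2676)

# prop.table() divides each count by the total
type_shares <- round(100 * prop.table(type_counts), 1)
type_shares
```

On these counts, movies make up about 69.6% of the titles and TV shows about 30.4%. With the real data, prop.table(table(netflix$type)) would do the same in one step.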

Now that table() has given us the frequencies in the type column, the result is much easier to visualize.

production_by_type <- netflix %>% 
  group_by(type) %>% 
  mutate(count = n()) %>% 
  ggplot(aes(x =reorder(type, count))) +
  geom_bar(fill = c("yellow","red"), width = 0.5) +
  labs(x = "Type", y = "Number of Productions", title = 'Content Produced by Type')+
  coord_flip()+
  theme_calc()

production_by_type + theme(
  plot.title = element_text(color = "Black", size = 12, face = "italic"))

It is easy to guess that the US is the country that produces the most content, but seeing the other countries in the ranking helps us understand Netflix's popularity across countries.

Remember that the country column has NA values. Filtering with the !is.na() function keeps only the rows where country is not NA.

production_by_country <- netflix %>%
  filter(!is.na(country)) %>%
  group_by(country)%>%
  count()%>%
  arrange(desc(n))%>%
  head(15)%>%
  ggplot()+
  geom_col(aes(y=reorder(country,n),x=n), width = 0.5, fill = '#a678de')+
  geom_label(aes(y=reorder(country,n),x=n,label=n))+
  labs(x = "Number of Productions" , y = "Country", title = "Content Produced by Countries")+
  theme_get()

production_by_country + theme(
  plot.title = element_text(color = "Black", size = 12, face = "italic"))

The other 14 countries in the graph together add up to almost the same amount of content as the US alone.

Let’s look at content production by genre. The listed_in column holds character values separated by commas. The strsplit() function splits them, and the unnest() function then puts each item on its own row.

production_by_genre <- netflix %>% 
  mutate(listed_in = strsplit(as.character(listed_in), ", ")) %>%
  unnest(listed_in)%>% 
  group_by(listed_in) %>% 
  summarise(count = n()) %>%
  arrange(desc(count)) %>% 
  top_n(20, count) %>%
  ggplot( aes(x = count, y = reorder(listed_in,count)))+
  geom_col(color = "light blue", width = 0.5) +
  labs(y = "Genre", x = "Number of Productions" , title = 'Content Produced by Genre')

production_by_genre + theme(
  plot.title = element_text(color = "Black", size = 12, face = "italic"))
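The split-and-count step can be sketched in base R on a toy vector, which makes it easier to see what strsplit() returns before unnest() lengthens the data. The values below are made up in the style of listed_in, and unlist() plus table() stand in for the unnest(), group_by() and summarise() steps of the pipeline.

```r
# Toy values shaped like the listed_in column (illustrative only)
listed_in <- c("Documentaries",
               "International TV Shows, TV Dramas",
               "TV Dramas, Crime TV Shows")

# strsplit() returns a list with one character vector per title
genre_list <- strsplit(listed_in, ", ", fixed = TRUE)
lengths(genre_list)  # number of genres attached to each title

# Flatten the list and count each genre, most frequent first
genre_counts <- sort(table(unlist(genre_list)), decreasing = TRUE)
genre_counts
```

Note that a title with several genres is counted once per genre, so the counts sum to more than the number of titles.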

Who is the director with the most productions on Netflix? Let’s take a look, but remember that the director column has NA values too. As above, the !is.na() function returns only the rows that are not NA.

most_directors <- netflix[!is.na(netflix$director), ]

The 15 directors with the most productions on Netflix are shown below. It is nice to see Yilmaz Erdogan on the same list as Steven Spielberg and Martin Scorsese.

most_directors %>% 
  mutate(director = strsplit(as.character(director), ", ")) %>%
  unnest(director)%>%
  group_by(director) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(15, count) %>%
  ggplot(aes(reorder(director,count),count))+
  geom_col(fill = '#c00000', width = 0.5)+
  theme(plot.title = element_text(color = "Black", size = 12, face = "italic")) +
  coord_flip() +
  labs(x = "Directors", y = "Number of Productions" , title = 'Directors with Most Movies')

Which Turkish cast members appear most often on Netflix? The filter() and group_by() functions give us the answer.

tr_cast <- netflix %>% 
  filter(country == "Turkey") %>%
  separate_rows(cast,sep=', ') %>% 
  group_by(cast) %>% 
  count(sort=TRUE)

top_15_tr <- tr_cast$cast[1:15]
top_15_tr

##  [1] "Demet Akbag"    "Cezmi Baskin"   "Cengiz Bozkurt" "Salih Kalyon"  
##  [5] "Ata Demirer"    "Büsra Pekin"    "Fatih Artman"   "Yilmaz Erdogan"
##  [9] "Cem Yilmaz"     "Devrim Yakut"   "Eda Ece"        "Erdal Tosun"   
## [13] "Tarik Ünlüoglu" "Belçim Bilgin"  "Ersin Korkut"
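One caveat worth sketching: an exact comparison such as country == "Turkey" only matches titles whose country value is exactly that string. Assuming some titles list several countries in one comma-separated value (as listed_in does for genres), a pattern match also picks up co-productions. The `toy` data frame below is hypothetical.

```r
# Hypothetical titles; one of them is a co-production
toy <- data.frame(
  title   = c("A", "B", "C", "D"),
  country = c("Turkey", "Turkey, United States", "India", NA),
  stringsAsFactors = FALSE
)

# Exact match: only rows where country is exactly "Turkey"
exact_match <- toy[which(toy$country == "Turkey"), ]

# Pattern match: any row whose country string contains "Turkey";
# grepl() returns FALSE for NA, so missing countries are dropped
any_turkey <- toy[grepl("Turkey", toy$country, fixed = TRUE), ]

nrow(exact_match)
nrow(any_turkey)
```

Whether co-productions should count is a judgment call; the ranking above uses the stricter exact match.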

I found this plotly chart on Kaggle and wanted to add it to my analysis. Turkish theater actress Demet Akbag has the most appearances among Turkish cast members on Netflix.

library(plotly)

netflix %>% 
  separate_rows(cast, sep=', ') %>% 
  filter(cast %in% top_15_tr) %>% 
  mutate(cast=factor(cast,levels = top_15_tr)) %>% 
  group_by(cast,type) %>% 
  count(sort=TRUE) %>% 
  plot_ly(y=~cast,
          x=~n,
          type='bar',
          orientation='h',
          color=~type,
          text=~n,
          textposition='outside') %>%
  layout(title = 'Turkish Cast in Netflix', plot_bgcolor = "whitesmoke", xaxis = list(title = 'Caption : Kaggle / @rohithkannan17'), 
         yaxis = list(title = 'Turkish Cast'), legend = list(title=list(text='Type of Production')))

Insights from Netflix Data

It’s normal for Hollywood to be the clear leader in content production, but producing content for Bollywood doesn’t seem like a bad idea after all.

As people are increasingly drawn to digital media, we can see that famous theater actors in Turkey also have many appearances on Netflix.

Finally, many thanks to Yigit Erol for providing such a useful study as an example.

Cem Eldemir

Last edited: November 02, 2021 02:43
