Netflix, Inc. is an American pay television over-the-top media service and original programming production company. It offers subscription-based video on demand from a library of films and television series, 40% of which is Netflix original programming produced in-house.
The Netflix dataset includes details of the movies and TV shows available on the Netflix platform as of September 2021. The purpose of this assignment is to extract useful insights and findings from the dataset using R libraries.
Since tidyverse bundles the packages needed here (readr, dplyr, tidyr, ggplot2), loading it alone is enough for the analysis. I also loaded hrbrthemes and ggthemes to get more ggplot themes.
I downloaded the Netflix dataset to my desktop and imported it with the read_csv() function. The head() function then shows the first rows of the data.
library(tidyverse)
library(hrbrthemes)
library(ggthemes)
netflix <- read_csv("C:/Users/tceme/Desktop/Data/R projects/netflix_titles.csv/netflix_titles.csv")
head(netflix)
Let’s check what kind of data we will work with. glimpse() is a handy function for checking the data type of each column.
glimpse(netflix)
## Rows: 8,807
## Columns: 12
## $ show_id <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", ...
## $ type <chr> "Movie", "TV Show", "TV Show", "TV Show", "TV Show", "...
## $ title <chr> "Dick Johnson Is Dead", "Blood & Water", "Ganglands", ...
## $ director <chr> "Kirsten Johnson", NA, "Julien Leclercq", NA, NA, "Mik...
## $ cast <chr> NA, "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang M...
## $ country <chr> "United States", "South Africa", NA, NA, "India", NA, ...
## $ date_added <chr> "September 25, 2021", "September 24, 2021", "September...
## $ release_year <dbl> 2020, 2021, 2021, 2021, 2021, 2021, 2021, 1993, 2021, ...
## $ rating <chr> "PG-13", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "TV-MA", ...
## $ duration <chr> "90 min", "2 Seasons", "1 Season", "1 Season", "2 Seas...
## $ listed_in <chr> "Documentaries", "International TV Shows, TV Dramas, T...
## $ description <chr> "As her father nears the end of his life, filmmaker Ki...
There are 12 columns and 8,807 rows. As seen above, the date_added column is of character type, so I converted it to the Date type with the as.Date() function below.
netflix$date_added <- as.Date(netflix$date_added, format = "%B %d, %Y")
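As a small check of the format string, the sketch below converts a few made-up strings in the same "Month day, year" shape. The %B specifier matches full month names in the current locale, so the sketch first switches the time locale to "C" (English month names). That call and the sample values are assumptions for illustration, not part of the original analysis.

```r
# Use English month names regardless of the system locale (assumption for this sketch)
invisible(Sys.setlocale("LC_TIME", "C"))

# Made-up sample values in the same shape as date_added
raw_dates <- c("September 25, 2021", "January 1, 2020", NA)

# %B = full month name, %d = day of the month, %Y = four-digit year
parsed <- as.Date(raw_dates, format = "%B %d, %Y")

class(parsed)   # the result is a Date vector
parsed          # missing input stays NA
```

If a string does not match the format, as.Date() returns NA for it rather than raising an error, so counting NA values before and after the conversion is a cheap way to spot parsing problems.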
The rating and type columns are also character type. They are more useful as categories, so I converted them to factors with the as.factor() function.
netflix$rating <- as.factor(netflix$rating)
netflix$type <- as.factor(netflix$type)
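To see what the conversion buys us, here is a minimal sketch on a made-up vector of ratings (the values are illustrative, not taken from the dataset). A factor stores the distinct categories as levels, which makes counting per category straightforward.

```r
# Made-up ratings in the same shape as the rating column
rating <- c("PG-13", "TV-MA", "TV-MA", "PG", NA)

# Convert the character vector to a factor
rating_f <- as.factor(rating)

levels(rating_f)    # the distinct categories, NA is not a level
summary(rating_f)   # count per level, with NA reported separately
```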
Let’s run glimpse() again to check that the columns were converted.
glimpse(netflix)
## Rows: 8,807
## Columns: 12
## $ show_id <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", ...
## $ type <fct> Movie, TV Show, TV Show, TV Show, TV Show, TV Show, Mo...
## $ title <chr> "Dick Johnson Is Dead", "Blood & Water", "Ganglands", ...
## $ director <chr> "Kirsten Johnson", NA, "Julien Leclercq", NA, NA, "Mik...
## $ cast <chr> NA, "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang M...
## $ country <chr> "United States", "South Africa", NA, NA, "India", NA, ...
## $ date_added <date> 2021-09-25, 2021-09-24, 2021-09-24, 2021-09-24, 2021-...
## $ release_year <dbl> 2020, 2021, 2021, 2021, 2021, 2021, 2021, 1993, 2021, ...
## $ rating <fct> PG-13, TV-MA, TV-MA, TV-MA, TV-MA, TV-MA, PG, TV-MA, T...
## $ duration <chr> "90 min", "2 Seasons", "1 Season", "1 Season", "2 Seas...
## $ listed_in <chr> "Documentaries", "International TV Shows, TV Dramas, T...
## $ description <chr> "As her father nears the end of his life, filmmaker Ki...
Our work with the columns is done. As a last step, let’s check for NA values in the dataset. The first part of this process is counting the NA values in each column by applying sapply() to the dataset, and then printing the result.
NA_values <- sapply(netflix, function(y) sum(is.na(y)))
NA_values
## show_id type title director cast country
## 0 0 0 2634 825 831
## date_added release_year rating duration listed_in description
## 10 0 4 3 0 0
The director, cast, country, date_added, rating and duration columns have NA values. That is fine for now; these rows will be removed or filled in if needed.
The table() function tabulates a categorical variable, returning each value together with its frequency.
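The per-column NA count can also be written with colSums(), which some readers find easier to scan. The sketch below compares both forms on a small made-up data frame; the column names mirror the dataset but the values are invented.

```r
# Toy data frame with a few missing values (made-up values)
toy <- data.frame(
  director = c("A", NA, NA),
  country  = c("Turkey", "India", NA),
  title    = c("x", "y", "z")
)

# Same approach as in the analysis: apply a counting function to each column
na_sapply <- sapply(toy, function(col) sum(is.na(col)))

# Alternative: is.na() gives a logical matrix, colSums() adds up each column
na_colsums <- colSums(is.na(toy))

na_sapply
na_colsums
```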
table(netflix$type)
##
## Movie TV Show
## 6131 2676
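Counts are often easier to read as shares. As a minimal sketch, the block below rebuilds a type vector from the two counts printed above and passes the table to prop.table(); the rebuilt vector stands in for netflix$type so the example runs without the CSV file.

```r
# Stand-in for netflix$type, rebuilt from the counts shown above
type <- factor(rep(c("Movie", "TV Show"), times = c(6131, 2676)))

counts <- table(type)

# prop.table() turns counts into proportions; scale to percent and round
shares <- round(100 * prop.table(counts), 1)
shares
```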
Now that table() has given us the frequencies in the type column, it is much easier to visualize the result.
production_by_type <- netflix %>%
  group_by(type) %>%
  mutate(count = n()) %>%
  ggplot(aes(x = reorder(type, count))) +
  geom_bar(fill = c("yellow", "red"), width = 0.5) +
  labs(x = "Type", y = "Number of Productions", title = 'Content Produced by Type') +
  coord_flip() +
  theme_calc()

production_by_type + theme(
  plot.title = element_text(color = "Black", size = 12, face = "italic"))
It is easy to guess that the US is the country that produces the most content, but seeing the other countries in the ranking gives a better sense of Netflix's popularity across countries.
Remember that the country column has NA values. Filtering with !is.na() keeps only the rows where country is not NA.
production_by_country <- netflix %>%
  filter(!is.na(country)) %>%
  group_by(country) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(15) %>%
  ggplot() +
  geom_col(aes(y = reorder(country, n), x = n), width = 0.5, fill = '#a678de') +
  geom_label(aes(y = reorder(country, n), x = n, label = n)) +
  labs(x = "Number of Productions", y = "Country", title = "Content Produced by Countries") +
  theme_get()

production_by_country + theme(
  plot.title = element_text(color = "Black", size = 12, face = "italic"))
The other 14 entries in the chart together add up to almost the same amount of content as the US alone.
Let’s look at content production by genre. The listed_in column holds character values separated by commas. The strsplit() function splits them, and unnest() then puts each item on its own row.
production_by_genre <- netflix %>%
  mutate(listed_in = strsplit(as.character(listed_in), ", ")) %>%
  unnest(listed_in) %>%
  group_by(listed_in) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(20) %>%
  ggplot(aes(x = count, y = reorder(listed_in, count))) +
  geom_col(color = "light blue", width = 0.5) +
  labs(y = "Genre", x = "Number of Productions", title = 'Content Produced by Genre')

production_by_genre + theme(
  plot.title = element_text(color = "Black", size = 12, face = "italic"))
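The split-and-count step behind this chart can be sketched in base R on a made-up vector, which makes it easy to inspect what each step returns. This is an illustration of the idea only, not the pipeline used above: unlist() plays the role of unnest(), and table() replaces group_by() plus summarise().

```r
# Made-up genre strings in the same shape as listed_in
listed_in <- c("Dramas, Comedies", "Dramas", "Comedies, Documentaries, Dramas")

# strsplit() returns a list with one character vector per title;
# unlist() flattens it so every genre becomes its own element
genres <- unlist(strsplit(listed_in, ", ", fixed = TRUE))

# Count each genre and sort from most to least frequent
genre_counts <- sort(table(genres), decreasing = TRUE)

# Keep the two most frequent genres
head(genre_counts, 2)
```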
Who is the director with the most productions on Netflix? Let’s take a look, remembering that the director column has NA values too. As above, !is.na() keeps only the rows where director is not NA.
most_directors <- netflix[!is.na(netflix$director), ]
The 15 directors with the most productions on Netflix are below. It’s nice to see Yilmaz Erdogan in the same list as Steven Spielberg and Martin Scorsese.
most_directors %>%
  mutate(director = strsplit(as.character(director), ", ")) %>%
  unnest(director) %>%
  group_by(director) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(15) %>%
  ggplot(aes(reorder(director, count), count)) +
  geom_col(fill = '#c00000', width = 0.5) +
  theme(plot.title = element_text(color = "Black", size = 12, face = "italic")) +
  coord_flip() +
  labs(x = "Directors", y = "Number of Productions", title = 'Directors with Most Movies')
Who are the most common Turkish cast members on Netflix? The filter() and group_by() functions will give us the answer.
tr_cast <- netflix %>%
  filter(country == "Turkey") %>%
  separate_rows(cast, sep = ', ') %>%
  group_by(cast) %>%
  count(sort = TRUE)

top_15_tr <- tr_cast$cast[1:15]
top_15_tr
## [1] "Demet Akbag" "Cezmi Baskin" "Cengiz Bozkurt" "Salih Kalyon"
## [5] "Ata Demirer" "Büsra Pekin" "Fatih Artman" "Yilmaz Erdogan"
## [9] "Cem Yilmaz" "Devrim Yakut" "Eda Ece" "Erdal Tosun"
## [13] "Tarik Ünlüoglu" "Belçim Bilgin" "Ersin Korkut"
I found this plotly chart on Kaggle and wanted to add it to my analysis. Turkish theater actress Demet Akbag has the most appearances among Turkish cast members on Netflix.
library(plotly)
netflix %>%
  separate_rows(cast, sep = ', ') %>%
  filter(cast %in% top_15_tr) %>%
  mutate(cast = factor(cast, levels = top_15_tr)) %>%
  group_by(cast, type) %>%
  count(sort = TRUE) %>%
  plot_ly(y = ~cast,
          x = ~n,
          type = 'bar',
          orientation = 'h',
          color = ~type,
          text = ~n,
          textposition = 'outside') %>%
  layout(title = 'Turkish Cast in Netflix', plot_bgcolor = "whitesmoke", xaxis = list(title = 'Caption : Kaggle / @rohithkannan17'),
         yaxis = list(title = 'Turkish Cast'), legend = list(title = list(text = 'Type of Production')))
It’s normal for Hollywood to be the clear leader in content production, but producing content for Bollywood doesn’t seem like a bad idea after all.
As people are increasingly hooked on digital media, we can see that famous theater actors in Turkey also have many appearances on Netflix.
Finally, many thanks to Yigit Erol for providing me with an example through such a useful study.