About Netflix
Netflix, Inc. is an American subscription streaming service and production company. Launched on August 29, 1997, it offers a library of films and television series through distribution deals as well as its own productions, known as Netflix Originals. Netflix can be accessed via internet browser on computers, or via application software installed on smart TVs, set-top boxes connected to televisions, tablet computers, smartphones, digital media players, Blu-ray Disc players, video game consoles, and virtual reality headsets on the list of Netflix-compatible devices.
Data Reading and Cleaning
Importing Data
This analysis explores the Netflix titles dataset. First, let’s load the data and see what information it contains. As the output below shows, there are 6234 observations of 12 variables describing TV shows and movies.
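library(dplyr)    # distinct(), %>%, group_by(), summarise() are used below
library(ggplot2)  # plotting functions used in the visualization section
netdf <- read.csv("https://raw.githubusercontent.com/ygterl/EDA-Netflix-2020-in-R/master/netflix_titles.csv", stringsAsFactors = FALSE, na.strings = c("", "NA"))
str(netdf)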
## 'data.frame': 6234 obs. of 12 variables:
## $ show_id : int 81145628 80117401 70234439 80058654 80125979 80163890 70304989 80164077 80117902 70304990 ...
## $ type : chr "Movie" "Movie" "TV Show" "TV Show" ...
## $ title : chr "Norm of the North: King Sized Adventure" "Jandino: Whatever it Takes" "Transformers Prime" "Transformers: Robots in Disguise" ...
## $ director : chr "Richard Finn, Tim Maltby" NA NA NA ...
## $ cast : chr "Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Duru"| __truncated__ "Jandino Asporaat" "Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton"| __truncated__ "Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen" ...
## $ country : chr "United States, India, South Korea, China" "United Kingdom" "United States" "United States" ...
## $ date_added : chr "September 9, 2019" "September 9, 2016" "September 8, 2018" "September 8, 2018" ...
## $ release_year: int 2019 2016 2013 2016 2017 2016 2014 2017 2017 2014 ...
## $ rating : chr "TV-PG" "TV-MA" "TV-Y7-FV" "TV-Y7" ...
## $ duration : chr "90 min" "94 min" "1 Season" "1 Season" ...
## $ listed_in : chr "Children & Family Movies, Comedies" "Stand-Up Comedy" "Kids' TV" "Kids' TV" ...
## $ description : chr "Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from"| __truncated__ "Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of"| __truncated__ "With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticon"| __truncated__ "When a prison ship crash unleashes hundreds of Decepticons on Earth, Bumblebee leads a new Autobot force to protect humankind." ...
Data Cleaning
The format of some variables needs to be changed. The “date_added” variable is converted to a date with the as.Date() function, and the “rating”, “listed_in”, and “type” variables are converted to factors with the as.factor() function.
netdf$date_added <- as.Date(netdf$date_added, format = "%B %d, %Y")
netdf$rating <- as.factor(netdf$rating)
netdf$listed_in <- as.factor(netdf$listed_in)
netdf$type <- as.factor(netdf$type)
str(netdf)
## 'data.frame': 6234 obs. of 12 variables:
## $ show_id : int 81145628 80117401 70234439 80058654 80125979 80163890 70304989 80164077 80117902 70304990 ...
## $ type : Factor w/ 2 levels "Movie","TV Show": 1 1 2 2 1 2 1 1 2 1 ...
## $ title : chr "Norm of the North: King Sized Adventure" "Jandino: Whatever it Takes" "Transformers Prime" "Transformers: Robots in Disguise" ...
## $ director : chr "Richard Finn, Tim Maltby" NA NA NA ...
## $ cast : chr "Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Duru"| __truncated__ "Jandino Asporaat" "Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton"| __truncated__ "Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen" ...
## $ country : chr "United States, India, South Korea, China" "United Kingdom" "United States" "United States" ...
## $ date_added : Date, format: "2019-09-09" "2016-09-09" ...
## $ release_year: int 2019 2016 2013 2016 2017 2016 2014 2017 2017 2014 ...
## $ rating : Factor w/ 14 levels "G","NC-17","NR",..: 10 9 13 12 7 9 6 9 9 6 ...
## $ duration : chr "90 min" "94 min" "1 Season" "1 Season" ...
## $ listed_in : Factor w/ 461 levels "Action & Adventure",..: 111 421 382 382 168 218 338 421 273 58 ...
## $ description : chr "Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from"| __truncated__ "Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of"| __truncated__ "With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticon"| __truncated__ "When a prison ship crash unleashes hundreds of Decepticons on Earth, Bumblebee leads a new Autobot force to protect humankind." ...
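One caveat about the date conversion: as.Date() returns NA for every string that does not match the given format, so an NA in the converted column can mean either a value that was missing in the raw data or a value that failed to parse (for example because of stray whitespace, or because %B is matched against month names in the current locale). The sketch below separates the two cases on made-up strings. date_na_breakdown and raw_dates are hypothetical names, not part of the original analysis, and the helper needs the raw character column as it was before the conversion.

```r
# Hypothetical helper: count NAs that were already missing in the raw
# strings versus NAs introduced because a string did not match the format.
date_na_breakdown <- function(raw, format = "%B %d, %Y") {
  parsed <- as.Date(trimws(raw), format = format)  # trim stray whitespace first
  c(raw_missing  = sum(is.na(raw)),
    parse_failed = sum(!is.na(raw) & is.na(parsed)))
}

# Made-up examples: a clean value, a value with a leading space,
# a missing value, and a value in a different date format.
raw_dates <- c("September 9, 2019", " September 8, 2018", NA, "2018-09-08")
breakdown <- date_na_breakdown(raw_dates)
print(breakdown)
```

If parse_failed is greater than zero on the real column, those rows are worth inspecting before treating them as missing.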
Besides these adjustments, it is important to know whether the data has missing values. If it does, we can either fill them in or drop the affected rows. Dropping rows also discards the useful information they hold, so missing values should be filled in whenever that is possible.
na_values <- data.frame("Variable"=c(colnames(netdf)), "Missing Values"=sapply(netdf, function(x) sum(is.na(x))), row.names=NULL)
na_values
## Variable Missing.Values
## 1 show_id 0
## 2 type 0
## 3 title 0
## 4 director 1969
## 5 cast 570
## 6 country 476
## 7 date_added 651
## 8 release_year 0
## 9 rating 10
## 10 duration 0
## 11 listed_in 0
## 12 description 0
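The same per-column count can also be obtained more compactly with colSums() over is.na(). A minimal sketch on a made-up data frame (toy is a hypothetical stand-in for netdf):

```r
# Made-up data frame with one missing value in each of two columns
toy <- data.frame(
  title    = c("A", "B", "C"),
  director = c("X", NA, "Z"),
  rating   = c(NA, "PG", "PG"),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```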
Since “rating” is a categorical variable, we can fill in (approximate) its missing values with the mode, that is, the most frequent value.
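# Return the most frequent non-missing value of a vector.
# Named get_mode so that it does not mask base::mode().
get_mode <- function(v) {
  uniqv <- unique(v[!is.na(v)])
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
netdf$rating[is.na(netdf$rating)] <- get_mode(netdf$rating)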
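As a cross-check on the imputation idea, the most frequent level can also be read off a frequency table. The sketch below uses a made-up factor (ratings is hypothetical data, not the real column):

```r
# Made-up ratings with one missing value
ratings <- factor(c("TV-MA", "TV-14", NA, "TV-MA", "PG"))

# table() ignores NA by default; sort so the most frequent level comes first
freq <- sort(table(ratings), decreasing = TRUE)
most_common <- names(freq)[1]

# Assigning an existing level name into a factor is allowed
ratings[is.na(ratings)] <- most_common
print(ratings)
```

Note that mode imputation inflates the share of the most frequent category slightly, which matters little here because only 10 ratings are missing.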
We also drop duplicated rows in the data set based on the “title”, “country”, “type”, and “release_year” variables.
netdf <- distinct(netdf, title, country, type, release_year, .keep_all = TRUE)
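To make the effect of this step concrete, here is a minimal base R sketch of the same idea on made-up rows: keep the first row for each combination of the key columns. It uses duplicated() rather than dplyr, and toy and key are hypothetical names.

```r
# Made-up rows: the first two share the same key columns
toy <- data.frame(
  show_id      = 1:3,
  title        = c("A", "A", "B"),
  country      = c("US", "US", NA),
  type         = c("Movie", "Movie", "TV Show"),
  release_year = c(2019, 2019, 2018),
  stringsAsFactors = FALSE
)

key <- c("title", "country", "type", "release_year")

# duplicated() flags every row whose key already appeared earlier
deduped <- toy[!duplicated(toy[, key]), ]
print(deduped)
```

All other columns of the first occurrence are kept, which is what .keep_all = TRUE does in the dplyr call.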
Now we can look at the missing values again.
na_values <- data.frame("Variable"=c(colnames(netdf)), "Missing Values"=sapply(netdf, function(x) sum(is.na(x))), row.names=NULL)
na_values
## Variable Missing.Values
## 1 show_id 0
## 2 type 0
## 3 title 0
## 4 director 1968
## 5 cast 570
## 6 country 476
## 7 date_added 650
## 8 release_year 0
## 9 rating 0
## 10 duration 0
## 11 listed_in 0
## 12 description 0
Data Visualization
The percentage of Netflix content by type
As seen below, movies make up more than half of Netflix content.
content_by_type <- netdf %>%
  group_by(type) %>%
  summarise(count = n()) %>%
  # Share of each type in percent, plus a text label for the chart
  mutate(perc = round(100 * count / sum(count), 1),
         percent = paste(perc, "%"))
# y must be numeric (count) so that slice sizes reflect the data;
# the character label is only used for the text
ggplot(content_by_type, aes(x = "", y = count, fill = type)) +
  geom_col() +
  geom_text(aes(label = percent),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  ggtitle("Netflix Content by Type") +
  theme_void()
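The percentages behind the chart can be checked without any plotting. A small sketch with prop.table() on a made-up vector (type_toy is hypothetical data, not the real column):

```r
# Made-up content types: three movies and one TV show
type_toy <- c("Movie", "Movie", "TV Show", "Movie")

# table() counts each type; prop.table() turns counts into shares
perc <- round(100 * prop.table(table(type_toy)), 1)
print(perc)
```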
Top 10 Actors who played in Movies and TV Shows
# Split the comma-separated cast string into one actor per element
actors10 <- strsplit(netdf$cast, split = ", ")
# Build one row per (content type, actor) pair
titles_actors <- data.frame(type = rep(netdf$type, lengths(actors10)), actor = unlist(actors10))
titles_actors$actor <- as.character(gsub(",", " ", titles_actors$actor))
top_actors_movie <- titles_actors %>%
  na.omit() %>%
  filter(type == "Movie") %>%
  group_by(actor) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(10, count)
# reorder() sorts the bars by count instead of alphabetically
ggplot(top_actors_movie, aes(x = reorder(actor, count), y = count)) +
  geom_col(fill = '#5fabba') +
  labs(x = "Actors", y = "Num of Movies", title = "Top 10 Actors in Movies") +
  theme_minimal() +
  coord_flip()
top_actors_tv <- titles_actors %>%
  na.omit() %>%
  filter(type == "TV Show") %>%
  group_by(actor) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(10, count)
ggplot(top_actors_tv, aes(x = reorder(actor, count), y = count)) +
  geom_col(fill = '#5fabba') +
  labs(x = "Actors", y = "Num of TV Shows", title = "Top 10 Actors in TV Shows") +
  theme_minimal() +
  coord_flip()
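The reshaping step in this section (one row per title with a comma-separated cast, turned into one row per actor) is easier to follow on a tiny example. The sketch below uses base R only and made-up names; toy, cast_list, and long are hypothetical.

```r
# Made-up titles; the second one has no cast information
toy <- data.frame(
  type = c("Movie", "TV Show", "Movie"),
  cast = c("Ann Lee, Bob Ray", NA, "Ann Lee"),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector per title;
# a missing cast stays a single NA
cast_list <- strsplit(toy$cast, split = ", ")

# Repeat each title's type once per actor so both columns line up
long <- data.frame(
  type  = rep(toy$type, lengths(cast_list)),
  actor = unlist(cast_list),
  stringsAsFactors = FALSE
)

# Drop titles without cast information, then count movie appearances
long <- long[!is.na(long$actor), ]
movie_counts <- sort(table(long$actor[long$type == "Movie"]), decreasing = TRUE)
print(movie_counts)
```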
Turkish TV show releases in Netflix by time
The number of Turkish TV series on Netflix varies by release year. There is an overall upward trend in recent years, but the increase is not steady.
tv_show_turkey <- netdf %>%
  filter(country == "Turkey" & type == "TV Show") %>%
  group_by(release_year) %>%
  summarise(num_production = n()) %>%
  arrange(desc(num_production))
ggplot(tv_show_turkey, aes(x = release_year, y = num_production)) +
  geom_line(color = "#3476e0", size = 1.5) +
  labs(x = "Release Year", y = "Num of TV Shows", title = "Turkish TV show releases in Netflix by time") +
  theme_bw()
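Note that the filter country == "Turkey" only keeps titles whose country field is exactly "Turkey"; co-productions listed as, say, "Turkey, United States" are left out. Whether that is intended depends on the question being asked. The sketch below contrasts exact matching with substring matching on made-up values (country_toy is hypothetical data):

```r
# Made-up country fields, including a co-production and a missing value
country_toy <- c("Turkey", "Turkey, United States", "United States", NA)

# Exact match: NA stays NA, and the co-production is not matched
exact <- country_toy == "Turkey"

# Substring match: grepl() returns FALSE for NA and TRUE for the co-production
any_turkey <- grepl("Turkey", country_toy, fixed = TRUE)

n_exact <- sum(exact, na.rm = TRUE)
n_any   <- sum(any_turkey)
print(c(exact = n_exact, any = n_any))
```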
Movie & TV Show Releases by time
tv_show <- netdf %>%
  filter(type == "TV Show") %>%
  group_by(release_year) %>%
  summarise(num_production = n()) %>%
  arrange(desc(num_production))
movie <- netdf %>%
  filter(type == "Movie") %>%
  group_by(release_year) %>%
  summarise(num_production = n()) %>%
  arrange(desc(num_production))
movie_tv <- merge(x = movie, y = tv_show, by = "release_year", suffixes = c("_movie", "_tvshows")) %>%
  arrange(desc(release_year))
colors <- c("Movie" = "blue", "TV Shows" = "red")
ggplot(movie_tv, aes(x = release_year)) +
  geom_line(aes(y = num_production_movie, color = "Movie"), size = 1.5) +
  geom_line(aes(y = num_production_tvshows, color = "TV Shows"), size = 1.5) +
  labs(x = "Year",
       y = "Amount of Releases",
       title = "Amount of Releases by Time",
       color = "Legend") +
  scale_color_manual(values = colors) +
  theme_bw()
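One detail of the merge() call above: by default it performs an inner join, so a release year that has only movies or only TV shows is dropped from the combined table. If those years should be kept, all = TRUE gives a full join. The sketch below shows the difference on made-up counts (movie_toy and tv_toy are hypothetical data); filling the gaps with 0 is an assumption that a missing year means no titles.

```r
# Made-up yearly counts; 2018 has only movies and 2021 only TV shows
movie_toy <- data.frame(release_year = c(2018, 2019, 2020),
                        num_production = c(5, 7, 9))
tv_toy    <- data.frame(release_year = c(2019, 2020, 2021),
                        num_production = c(2, 3, 4))

# Inner join (the default): only years present in both tables
inner <- merge(movie_toy, tv_toy, by = "release_year",
               suffixes = c("_movie", "_tvshows"))

# Full join: every year from either table, with NA where a type is absent
full <- merge(movie_toy, tv_toy, by = "release_year",
              suffixes = c("_movie", "_tvshows"), all = TRUE)

# Assumption: a missing count means zero titles for that year
full[is.na(full)] <- 0
print(full)
```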
Conclusion
As seen above, Netflix content has diversified, and more productions have been released in recent years. However, the number of movies has grown more than the number of TV shows.
References
Yiğit Erol (2020, April 27). Exploration of Netflix 2020 Dataset with R Markdown (EDA).