Exploratory Analysis of Netflix Data

Mine Kara

December 04, 2021

About Netflix

Netflix, Inc. is an American subscription streaming service and production company. Launched on August 29, 1997, it offers a library of films and television series through distribution deals as well as its own productions, known as Netflix Originals. Netflix can be accessed through a web browser on computers, or through application software installed on smart TVs, set-top boxes connected to televisions, tablet computers, smartphones, digital media players, Blu-ray Disc players, video game consoles, and virtual reality headsets.

Data Reading and Cleaning

Importing Data

This analysis explores the Netflix data set from several angles. First, let’s take a look at the data and see what information we have. As shown below, there are 6234 observations of 12 variables describing the TV shows and movies.

library(dplyr)    # distinct(), group_by(), summarise(), %>%
library(ggplot2)  # all plots below

netdf <- read.csv("https://raw.githubusercontent.com/ygterl/EDA-Netflix-2020-in-R/master/netflix_titles.csv", stringsAsFactors = FALSE, na.strings=c("","NA"))

str(netdf)

## 'data.frame':    6234 obs. of  12 variables:
##  $ show_id     : int  81145628 80117401 70234439 80058654 80125979 80163890 70304989 80164077 80117902 70304990 ...
##  $ type        : chr  "Movie" "Movie" "TV Show" "TV Show" ...
##  $ title       : chr  "Norm of the North: King Sized Adventure" "Jandino: Whatever it Takes" "Transformers Prime" "Transformers: Robots in Disguise" ...
##  $ director    : chr  "Richard Finn, Tim Maltby" NA NA NA ...
##  $ cast        : chr  "Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Duru"| __truncated__ "Jandino Asporaat" "Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton"| __truncated__ "Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen" ...
##  $ country     : chr  "United States, India, South Korea, China" "United Kingdom" "United States" "United States" ...
##  $ date_added  : chr  "September 9, 2019" "September 9, 2016" "September 8, 2018" "September 8, 2018" ...
##  $ release_year: int  2019 2016 2013 2016 2017 2016 2014 2017 2017 2014 ...
##  $ rating      : chr  "TV-PG" "TV-MA" "TV-Y7-FV" "TV-Y7" ...
##  $ duration    : chr  "90 min" "94 min" "1 Season" "1 Season" ...
##  $ listed_in   : chr  "Children & Family Movies, Comedies" "Stand-Up Comedy" "Kids' TV" "Kids' TV" ...
##  $ description : chr  "Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from"| __truncated__ "Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of"| __truncated__ "With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticon"| __truncated__ "When a prison ship crash unleashes hundreds of Decepticons on Earth, Bumblebee leads a new Autobot force to protect humankind." ...

Data Cleaning

The format of some variables needs to be changed. The “date_added” variable is converted to the Date class with the as.Date() function, and the “rating”, “listed_in”, and “type” variables are converted to factors with the as.factor() function.

netdf$date_added <- as.Date(netdf$date_added, format = "%B %d, %Y")

netdf$rating <- as.factor(netdf$rating)

netdf$listed_in <- as.factor(netdf$listed_in)

netdf$type <- as.factor(netdf$type)

str(netdf)

## 'data.frame':    6234 obs. of  12 variables:
##  $ show_id     : int  81145628 80117401 70234439 80058654 80125979 80163890 70304989 80164077 80117902 70304990 ...
##  $ type        : Factor w/ 2 levels "Movie","TV Show": 1 1 2 2 1 2 1 1 2 1 ...
##  $ title       : chr  "Norm of the North: King Sized Adventure" "Jandino: Whatever it Takes" "Transformers Prime" "Transformers: Robots in Disguise" ...
##  $ director    : chr  "Richard Finn, Tim Maltby" NA NA NA ...
##  $ cast        : chr  "Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Duru"| __truncated__ "Jandino Asporaat" "Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton"| __truncated__ "Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen" ...
##  $ country     : chr  "United States, India, South Korea, China" "United Kingdom" "United States" "United States" ...
##  $ date_added  : Date, format: "2019-09-09" "2016-09-09" ...
##  $ release_year: int  2019 2016 2013 2016 2017 2016 2014 2017 2017 2014 ...
##  $ rating      : Factor w/ 14 levels "G","NC-17","NR",..: 10 9 13 12 7 9 6 9 9 6 ...
##  $ duration    : chr  "90 min" "94 min" "1 Season" "1 Season" ...
##  $ listed_in   : Factor w/ 461 levels "Action & Adventure",..: 111 421 382 382 168 218 338 421 273 58 ...
##  $ description : chr  "Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from"| __truncated__ "Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of"| __truncated__ "With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticon"| __truncated__ "When a prison ship crash unleashes hundreds of Decepticons on Earth, Bumblebee leads a new Autobot force to protect humankind." ...

Besides these adjustments, it is important to know whether the data have missing values. Missing values can either be filled in or dropped. Dropping them, however, also discards the useful information in the rest of those rows, so values that can reasonably be filled in should be filled in.

na_values <- data.frame("Variable"=c(colnames(netdf)), "Missing Values"=sapply(netdf, function(x) sum(is.na(x))), row.names=NULL)

na_values

##        Variable Missing.Values
## 1       show_id              0
## 2          type              0
## 3         title              0
## 4      director           1969
## 5          cast            570
## 6       country            476
## 7    date_added            651
## 8  release_year              0
## 9        rating             10
## 10     duration              0
## 11    listed_in              0
## 12  description              0
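The count for “date_added” above is taken after the as.Date() conversion, so it includes any strings that failed to parse as well as values that were missing in the source. A minimal sketch of how the two cases could be separated, assuming it is run on the raw character column (for example after re-reading the CSV). The vector raw_dates below is illustrative toy data, and "%B" assumes an English locale for month names.

```r
# Toy vector standing in for the raw date_added column (illustrative values)
raw_dates <- c("September 9, 2019", " September 8, 2018", NA, "not a date")

# Trim surrounding whitespace, then parse; "%B" matches month names
# in the session locale (assumed to be English here)
parsed <- as.Date(trimws(raw_dates), format = "%B %d, %Y")

# Present in the raw column but NA after parsing, i.e. lost in conversion
n_failed <- sum(!is.na(raw_dates) & is.na(parsed))
n_failed
```

If n_failed is large on the real column, the date format string or stray whitespace is worth a closer look before treating those rows as genuinely missing.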

Since “rating” is a categorical variable, we can fill in (approximate) its missing values with the mode, i.e. the most frequent category.

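# Most frequent value of v, ignoring NAs
# (named get_mode so that it does not mask base::mode)
get_mode <- function(v) {
   v <- v[!is.na(v)]
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

netdf$rating[is.na(netdf$rating)] <- get_mode(netdf$rating)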

We also drop duplicated rows, identified by the “title”, “country”, “type”, and “release_year” variables.

netdf <- distinct(netdf, title, country, type, release_year, .keep_all = TRUE)
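Before dropping rows, it can be useful to know how many will go. A minimal base R sketch on toy data (the toy data frame and the key_cols name are illustrative, not part of the original analysis); the same expression could be applied to netdf before calling distinct().

```r
# Toy data standing in for netdf (illustrative values only)
toy <- data.frame(
  title        = c("A", "A", "B"),
  country      = c("Turkey", "Turkey", "India"),
  type         = c("Movie", "Movie", "TV Show"),
  release_year = c(2019, 2019, 2020)
)

key_cols <- c("title", "country", "type", "release_year")

# duplicated() flags every repeat of an earlier row, so the sum is the
# number of rows that de-duplicating on these columns would remove
n_dupes <- sum(duplicated(toy[, key_cols]))
n_dupes
```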

Now we can look at the missing values again.

na_values <- data.frame("Variable"=c(colnames(netdf)), "Missing Values"=sapply(netdf, function(x) sum(is.na(x))), row.names=NULL)

na_values

##        Variable Missing.Values
## 1       show_id              0
## 2          type              0
## 3         title              0
## 4      director           1968
## 5          cast            570
## 6       country            476
## 7    date_added            650
## 8  release_year              0
## 9        rating              0
## 10     duration              0
## 11    listed_in              0
## 12  description              0

Data Visualization

The percentage of Netflix content by type

As seen below, movies make up more than half of Netflix content.

content_by_type <- netdf %>% group_by(type) %>% 
  summarise(count = n()) %>%
  # numeric share for the slice size, text label for the annotation
  mutate(perc = round(100 * count / sum(count), 1),
         label = paste0(perc, "%"))

# y must be numeric: mapping the "xx %" strings to y would be treated as categories
ggplot(content_by_type, aes(x = "", y = perc, fill = type)) +
  geom_col() +
  geom_text(aes(label = label),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") + 
  ggtitle("Netflix Content by Type") +
  theme_void()

Top 10 Actors who played in Movies and TV Shows

actors10 <- strsplit(netdf$cast, split = ", ")
titles_actors <-  data.frame(type= rep(netdf$type, sapply(actors10, length)), actor = unlist(actors10))
titles_actors$actor <- as.character(gsub(","," ", titles_actors$actor))


top_actors_movie <- titles_actors %>% 
  na.omit() %>%
  filter(type == "Movie") %>% 
  group_by(actor) %>% 
  summarise(count = n()) %>%
  arrange(desc(count)) %>% 
  head(10)  # first 10 rows after sorting, so ties cannot push the list past 10

# reorder() sorts the bars by count instead of alphabetically
ggplot(top_actors_movie, aes(x = reorder(actor, count), y = count)) + 
  geom_col(fill = '#5fabba') +
  labs(x = "Actors" , y = "Number of Movies", title = "Top 10 Actors in Movies") +
  theme_minimal() + 
  coord_flip()

top_actors_tv <- titles_actors %>% 
  na.omit() %>%
  filter(type == "TV Show") %>% 
  group_by(actor) %>% 
  summarise(count = n()) %>%
  arrange(desc(count)) %>% 
  head(10)  # first 10 rows after sorting, so ties cannot push the list past 10

# reorder() sorts the bars by count instead of alphabetically
ggplot(top_actors_tv, aes(x = reorder(actor, count), y = count)) + 
  geom_col(fill = '#5fabba') +
  labs(x = "Actors" , y = "Number of TV Shows", title = "Top 10 Actors in TV Shows") +
  theme_minimal() + 
  coord_flip()
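The movie and TV show pipelines above differ only in the content type, so they could share one helper. A minimal sketch in base R; top_actors() and its arguments are hypothetical names, not part of the original analysis, and the helper assumes cast members are separated by commas as in the str() output earlier.

```r
# Hypothetical helper: the n most frequent cast members for one content type
top_actors <- function(df, content_type, n = 10) {
  # rows of the requested type that have a cast listed
  keep <- which(df$type == content_type & !is.na(df$cast))
  # split each cast string on commas and trim the surrounding spaces
  actors <- trimws(unlist(strsplit(df$cast[keep], ",")))
  counts <- sort(table(actors), decreasing = TRUE)
  head(data.frame(actor = names(counts), count = as.integer(counts)), n)
}

# Toy data standing in for netdf (illustrative values only)
toy <- data.frame(
  type = c("Movie", "Movie", "TV Show", "Movie"),
  cast = c("A, B", "A, C", "A, D", NA),
  stringsAsFactors = FALSE
)

res <- top_actors(toy, "Movie", n = 2)
res
```

On the real data, top_actors(netdf, "Movie") and top_actors(netdf, "TV Show") would play the role of top_actors_movie and top_actors_tv.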

Turkish TV show releases on Netflix over time

The number of Turkish TV series on Netflix varies by release year. Although it has increased overall in recent years, the increase is not steady.

tv_show_turkey <- netdf %>% 
  filter(country == "Turkey" & type == "TV Show") %>% 
  group_by(release_year) %>%
  summarise(num_production = n()) %>% 
  arrange(desc(num_production))

ggplot(tv_show_turkey, aes(x =release_year, y = num_production)) + 
  geom_line(color = "#3476e0", size = 1.5) +
  labs(x = "Release Year" , y = "Number of TV Shows", title = "Turkish TV Show Releases on Netflix over Time") +
  theme_bw()
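Note that the filter country == "Turkey" keeps only titles whose country field is exactly "Turkey". The field can list several countries (for example "United States, India, South Korea, China" in the output above), so co-productions are left out. If they should be counted as well, a substring match is one option. A minimal sketch on toy data (the values are illustrative):

```r
# Toy country column standing in for netdf$country (illustrative values)
country <- c("Turkey", "Turkey, United States", NA, "Germany")

# Exact match: only rows where Turkey is the sole country listed
n_exact <- sum(country == "Turkey", na.rm = TRUE)

# Substring match: any row that mentions Turkey; grepl() returns FALSE for NA
n_any <- sum(grepl("Turkey", country, fixed = TRUE))

c(exact = n_exact, any = n_any)
```

In the dplyr pipeline this would correspond to filter(grepl("Turkey", country, fixed = TRUE) & type == "TV Show").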

Movie & TV Show Releases over Time

tv_show <- netdf %>% 
  filter(type == "TV Show") %>% 
  group_by(release_year) %>%
  summarise(num_production = n()) %>% 
  arrange(desc(num_production))

movie <- netdf %>% 
  filter(type == "Movie") %>% 
  group_by(release_year) %>%
  summarise(num_production = n()) %>% 
  arrange(desc(num_production))


movie_tv <- merge(x = movie, y = tv_show, by = "release_year", suffixes = c("_movie", "_tvshows")) %>% 
  arrange(desc(release_year))

colors <- c("Movie" = "blue", "TV Shows" = "red")

ggplot(movie_tv, aes(x = release_year)) +
    geom_line(aes(y = num_production_movie, color = "Movie"), size = 1.5) +
    geom_line(aes(y = num_production_tvshows, color = "TV Shows"), size = 1.5) +
    labs(x = "Year",
         y = "Number of Releases",
         title = "Number of Releases over Time",
         color = "Legend") +
    scale_color_manual(values = colors) +
  theme_bw()
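By default, merge() performs an inner join, so release years that have only movies or only TV shows do not appear in movie_tv. If those years should be kept, one option is a full join with the missing counts set to zero. A minimal sketch on toy data (the two small data frames are illustrative stand-ins for movie and tv_show):

```r
# Toy yearly counts standing in for movie and tv_show (illustrative values)
movie   <- data.frame(release_year = c(2000, 2001), num_production = c(5, 6))
tv_show <- data.frame(release_year = c(2001, 2002), num_production = c(1, 2))

# Inner join (the default): only years present in both tables
inner <- merge(movie, tv_show, by = "release_year",
               suffixes = c("_movie", "_tvshows"))

# Full join: all = TRUE keeps every year; unmatched counts come back as NA
full <- merge(movie, tv_show, by = "release_year",
              suffixes = c("_movie", "_tvshows"), all = TRUE)

# A year with no titles of one type means zero releases of that type
full[is.na(full)] <- 0
full
```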

Conclusion

As seen above, Netflix content has diversified, and more of its titles come from recent release years. However, the number of movies has grown more than the number of TV shows.

References

Yiğit Erol (2020, April 27). Exploration of Netflix 2020 Dataset with R Markdown (EDA). Retrieved from this link