Netflix

Overview

In this assignment, I explore the Netflix dataset.

This dataset consists of the TV shows and movies available on Netflix as of 2019 and part of 2020.

netds <- read.csv("netflix_titles.csv", na.strings = c("", "NA"), stringsAsFactors = FALSE)
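The effect of the na.strings argument can be illustrated on a small inline CSV. This is only a sketch: the rows and the names csv_text, raw and toy are made up for illustration and are not taken from netflix_titles.csv.

```r
# Toy CSV (hypothetical rows): one empty director field and one literal "NA"
csv_text <- "show_id,title,director
1,Alpha,Jane Doe
2,Beta,
3,Gamma,NA"

# Default behaviour: only the literal string "NA" is read as missing
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With na.strings = c("", "NA"), empty fields are read as missing too
toy <- read.csv(text = csv_text, na.strings = c("", "NA"), stringsAsFactors = FALSE)

sum(is.na(raw$director))
sum(is.na(toy$director))
```

Treating empty strings as NA at read time means that later missing-value counts cover blank fields as well.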

The dataset contains 6234 observations of the following 12 variables, which describe the TV shows and movies:

library(plotly)

## Loading required package: ggplot2

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

values_table1 <- rbind(c('Show_ID', 'Type', 'Title', 'Director', 'Cast', 'Country', 'Date_added', 'Release_year', 'Rating' , 'Duration', 'Listed_in', 'Description'), c("Netflix ID for Every Movie & Tv Shows", 
     "Identifier - A Movie or TV Show", 
     "Title of the Movie or TV Show", 
     "Director of the Movie /TV Show", 
    "Actors involved in the Movie / TV Show",
    "Country where the movie / show was produced",
    "Added date on Netflix",
    "Actual release year of the Movie / TV Show",
    "Rating type of the Movie or TV Show",
    "Total Duration - in minutes or number of seasons",
    "Genre",
    "The summary description"))

desc_table <- plot_ly(
  type = 'table',
  columnorder = c(1,2),
  columnwidth = c(12,12),
  header = list(
    values = c('<b>VARIABLES</b><br>', '<b>DESCRIPTION</b>'),
    line = list(color = '#506784'),
    fill = list(color = '#119DFF'),
    align = c('left','center'),
    font = list(color = 'white', size = 12),
    height = 40
  ),
  cells = list(
    values = values_table1,
    line = list(color = '#506784'),
    fill = list(color = c('#25FEFD', 'white')),
    align = c('left', 'left'),
    font = list(color = c('#506784'), size = 12),
    height = 30
    ))
desc_table

Data Cleaning

I will remove the cast column, which is not needed for this analysis, and convert the remaining columns to suitable types.

netds$cast <- NULL

netds$rating <- as.factor(netds$rating)

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

netds$date_added <- mdy(netds$date_added)

netds$listed_in <- as.factor(netds$listed_in)
netds$type <- as.factor(netds$type)

data.frame("Variable"=c(colnames(netds)), "Missing Values"=sapply(netds, function(x) sum(is.na(x))), row.names=NULL)

##        Variable Missing.Values
## 1       show_id              0
## 2          type              0
## 3         title              0
## 4      director           1969
## 5       country            476
## 6    date_added             11
## 7  release_year              0
## 8        rating             10
## 9      duration              0
## 10    listed_in              0
## 11  description              0
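The same per-column count of missing values can also be obtained with colSums(). The sketch below uses a made-up data frame (toy) rather than netds, so the numbers are illustrative only.

```r
# Hypothetical three-row data frame with some missing values
toy <- data.frame(
  title    = c("Alpha", "Beta", "Gamma"),
  director = c("Jane Doe", NA, NA),
  rating   = c("TV-MA", NA, "PG"),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
missing_counts <- colSums(is.na(toy))
missing_counts
```

The result is a named vector, which is convenient when only the counts are needed and not a formatted table.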

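# Return the most frequent non-missing value of v.
# Named stat_mode to avoid masking base::mode().
stat_mode <- function(v) {
   v <- v[!is.na(v)]
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Replace missing ratings with the most frequent rating
netds$rating[is.na(netds$rating)] <- stat_mode(netds$rating)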

data.frame("Variable"=c(colnames(netds)), "Missing Values"=sapply(netds, function(x) sum(is.na(x))), row.names=NULL)

##        Variable Missing.Values
## 1       show_id              0
## 2          type              0
## 3         title              0
## 4      director           1969
## 5       country            476
## 6    date_added             11
## 7  release_year              0
## 8        rating              0
## 9      duration              0
## 10    listed_in              0
## 11  description              0
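To make the imputation step easier to follow, here is a minimal sketch of filling missing values of a factor with its most frequent level. The helper most_frequent and the rating vector are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: most frequent non-missing value of a vector
most_frequent <- function(v) {
  v <- v[!is.na(v)]
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

# Toy factor with two missing entries
rating <- factor(c("TV-MA", "PG", NA, "TV-MA", NA))

# Fill the missing entries with the most frequent level
rating[is.na(rating)] <- most_frequent(rating)
table(rating)
```

Mode imputation keeps every row but inflates the most common category, which is acceptable here because only a handful of ratings are missing.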

I will use the title, country, type and release_year variables to identify and remove duplicate records.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

netds <- distinct(netds, title, country, type, release_year, .keep_all = TRUE)
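For readers less familiar with dplyr, the deduplication above can be sketched in base R with duplicated(). The data frame toy and the vector key_cols are made up for illustration. Like distinct() with .keep_all = TRUE, this keeps the first row of each duplicate group together with all of its columns.

```r
# Hypothetical data: rows 1 and 2 share the same key columns
toy <- data.frame(
  title        = c("Alpha", "Alpha", "Beta"),
  country      = c("Turkey", "Turkey", "India"),
  type         = c("Movie", "Movie", "TV Show"),
  release_year = c(2019, 2019, 2020),
  duration     = c("90 min", "95 min", "2 Seasons"),
  stringsAsFactors = FALSE
)

# Columns that define a duplicate
key_cols <- c("title", "country", "type", "release_year")

# duplicated() flags every repeat after the first occurrence of a key
deduped <- toy[!duplicated(toy[key_cols]), ]
deduped
```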

Data Visualisation

Which TV show and movie genres are the most common on Netflix?

library(crayon)

## 
## Attaching package: 'crayon'

## The following object is masked from 'package:plotly':
## 
##     style

## The following object is masked from 'package:ggplot2':
## 
##     %+%

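# Split each title's comma-separated genre string into individual genres
netds$listed_in <- as.character(netds$listed_in)
t20 <- strsplit(netds$listed_in, split = ", ")
# One row per (title, genre) pair, repeating the type for each genre
count_listed_in <- data.frame(type = rep(netds$type, 
                                         sapply(t20, length)), 
                              listed_in = unlist(t20))
# Remove any stray commas left over from the split
count_listed_in$listed_in <- as.character(gsub(",", "", count_listed_in$listed_in))
# Count titles per genre and keep the 20 most frequent genres
df_count_listed_in <- count_listed_in %>% 
                            group_by(listed_in) %>% 
                            summarise(count = n()) %>% 
                            top_n(20)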

## Selecting by count
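The split-and-count step can be traced on a tiny example. The sketch below uses base R only and a made-up data frame (toy); table() stands in for the group_by() and summarise() pipeline, so it is an illustration of the idea, not the exact code used above.

```r
# Hypothetical titles, each listed under two genres
toy <- data.frame(
  type      = c("Movie", "TV Show", "Movie"),
  listed_in = c("Dramas, International Movies",
                "International TV Shows, TV Dramas",
                "Dramas, Comedies"),
  stringsAsFactors = FALSE
)

# One character vector of genres per title
genres <- strsplit(toy$listed_in, split = ", ")

# Long format: one row per (title, genre) pair
long <- data.frame(
  type      = rep(toy$type, lengths(genres)),
  listed_in = unlist(genres),
  stringsAsFactors = FALSE
)

# Count how many titles fall under each genre, most frequent first
genre_counts <- sort(table(long$listed_in), decreasing = TRUE)
genre_counts
```

Because a title listed under several genres contributes one row per genre, the counts add up to more than the number of titles.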

fig <- plot_ly(df_count_listed_in, x = ~listed_in, y = ~count, type = "bar")
fig <- fig %>% layout(xaxis=list(categoryorder = "array", 
                                         categoryarray = df_count_listed_in$listed_in,
                                         title="Genre"), yaxis = list(title = 'Count'),
                                         title="20 Top Genres On Netflix")
fig

As the chart above shows, international movies and TV shows are the most common programs on Netflix.

This is a short analysis of which kinds of programs are the most common on Netflix. With more data and features, such as user ratings and total watch time, we could extend the analysis further and gain more useful information.

I based this analysis of the Netflix data on the work of Yigit Erol, to whom I am grateful.

You can see his analysis and GitHub Repo in the links below.

https://github.com/ygterl/EDA-Netflix-2020-in-R/blob/master/Output-%20Exploration%20of%20Netflix%20Dataset%20in%20R.pdf

https://github.com/ygterl/EDA-Netflix-2020-in-R/blob/master/analiz.Rmd

Netflix_EDA

Kadir Baver Kerimoglu

03 11 2021
