In this assignment i am going to be exploring the Netflix data.
This dataset consists of TV shows & movies suitable on Netflix as of 2019 and part of 2020.
netds <- read.csv("netflix_titles.csv", na.strings = c("", "NA"), stringsAsFactors =FALSE)
library(plotly)
## Zorunlu paket yükleniyor: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
values_table1 <- rbind(c('Show_ID', 'Type', 'Title', 'Director', 'Cast', 'Country', 'Date_added', 'Release_year', 'Rating' , 'Duration', 'Listed_in', 'Description'), c("Netflix ID for Every Movie & Tv Shows",
"Identifier - A Movie or TV Show",
"Title of the Movie or TV Show",
"Director of the Movie /TV Show",
"Actors involved in the Movie / TV Show",
"Country where the movie / show was produced",
"Added date on Netflix",
"Actual release year of the Movie / TV Show",
"Rating type of the Movie or TV Show",
"Total Duration - in minutes or number of seasons",
"Genere",
"The summary description"))
desc_table <- plot_ly(
type = 'table',
columnorder = c(1,2),
columnwidth = c(12,12),
header = list(
values = c('<b>VARIABLES</b><br>', '<b>DESCRIPTION</b>'),
line = list(color = '#506784'),
fill = list(color = '#119DFF'),
align = c('left','center'),
font = list(color = 'white', size = 12),
height = 40
),
cells = list(
values = values_table1,
line = list(color = '#506784'),
fill = list(color = c('#25FEFD', 'white')),
align = c('left', 'left'),
font = list(color = c('#506784'), size = 12),
height = 30
))
desc_table
I will get rid of columns that are not necessary for my Netflix analysis.
netds$cast <- NULL
netds$rating <- as.factor(netds$rating)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
netds$date_added <- mdy(netds$date_added)
netds$listed_in <- as.factor(netds$listed_in)
netds$type <- as.factor(netds$type)
data.frame("Variable"=c(colnames(netds)), "Missing Values"=sapply(netds, function(x) sum(is.na(x))), row.names=NULL)
## Variable Missing.Values
## 1 show_id 0
## 2 type 0
## 3 title 0
## 4 director 1969
## 5 country 476
## 6 date_added 11
## 7 release_year 0
## 8 rating 10
## 9 duration 0
## 10 listed_in 0
## 11 description 0
mode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
netds$rating[is.na(netds$rating)] <- mode(netds$rating)
data.frame("Variable"=c(colnames(netds)), "Missing Values"=sapply(netds, function(x) sum(is.na(x))), row.names=NULL)
## Variable Missing.Values
## 1 show_id 0
## 2 type 0
## 3 title 0
## 4 director 1969
## 5 country 476
## 6 date_added 11
## 7 release_year 0
## 8 rating 0
## 9 duration 0
## 10 listed_in 0
## 11 description 0
I will use title, country, type, duration variables for my analysis.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
netds=distinct(netds, title, country, type, release_year, .keep_all = TRUE)
I will discover Which TV show and movie genres are the most in Netflix?
library(crayon)
##
## Attaching package: 'crayon'
## The following object is masked from 'package:plotly':
##
## style
## The following object is masked from 'package:ggplot2':
##
## %+%
netds$listed_in<- as.character(netds$listed_in)
t20 <- strsplit(netds$listed_in, split = ", ")
count_listed_in<- data.frame(type = rep(netds$type,
sapply(t20, length)),
listed_in = unlist(t20))
count_listed_in$listed_in <- as.character(gsub(",","",count_listed_in$listed_in))
df_count_listed_in <- count_listed_in %>%
group_by(listed_in) %>%
summarise(count = n()) %>%
top_n(20)
## Selecting by count
fig <- plot_ly(df_count_listed_in, x= ~listed_in, y= ~df_count_listed_in$count, type = "bar" )
fig <- fig %>% layout(xaxis=list(categoryorder = "array",
categoryarray = df_count_listed_in$listed_in,
title="Genre"), yaxis = list(title = 'Count'),
title="20 Top Genres On Netflix")
fig
As we can see above, International movies&TV shows are the most common popular programs.
This is the short analysis of the which programs are the most popular in netflix. If we have more data and features like user rating, total watch etc. We could have been expanded our analysis further to gain useful information.
Thanks to Yigit Erol I performed my analysis about netflix data.
You can see his analysis and GitHub Repo in the links below.
https://github.com/ygterl/EDA-Netflix-2020-in-R/blob/master/analiz.Rmd