Netflix Data
In this assignment we will examine, clean, process, and visualize the Netflix data. First, we will extract the average duration of movies in each genre. Second, we will find out how many movies and TV shows were released in each year between 2000 and 2019. Lastly, we will look at summary statistics for TV shows: the minimum, maximum, median, and mean number of seasons.
This dataset consists of the id, type, title, director, cast, country, the date the content was added to the catalog, release year, rating, duration, the categories the content is listed in, and lastly the description of the content.
In this part, we simply read the CSV file via the link. For further details, you can also view the structure of the data.
library(tidyverse)   # readr (read_csv), dplyr, ggplot2
library(lubridate)   # mdy()

df <- read_csv("https://github.com/ygterl/EDA-Netflix-2020-in-R/raw/master/netflix_titles.csv")
str(df)
spec_tbl_df [6,234 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ show_id : num [1:6234] 81145628 80117401 70234439 80058654 80125979 ...
$ type : chr [1:6234] "Movie" "Movie" "TV Show" "TV Show" ...
$ title : chr [1:6234] "Norm of the North: King Sized Adventure" "Jandino: Whatever it Takes" "Transformers Prime" "Transformers: Robots in Disguise" ...
$ director : chr [1:6234] "Richard Finn, Tim Maltby" NA NA NA ...
$ cast : chr [1:6234] "Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Duru"| __truncated__ "Jandino Asporaat" "Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton"| __truncated__ "Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen" ...
$ country : chr [1:6234] "United States, India, South Korea, China" "United Kingdom" "United States" "United States" ...
$ date_added : chr [1:6234] "September 9, 2019" "September 9, 2016" "September 8, 2018" "September 8, 2018" ...
$ release_year: num [1:6234] 2019 2016 2013 2016 2017 ...
$ rating : chr [1:6234] "TV-PG" "TV-MA" "TV-Y7-FV" "TV-Y7" ...
$ duration : chr [1:6234] "90 min" "94 min" "1 Season" "1 Season" ...
$ listed_in : chr [1:6234] "Children & Family Movies, Comedies" "Stand-Up Comedy" "Kids' TV" "Kids' TV" ...
$ description : chr [1:6234] "Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from"| __truncated__ "Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of"| __truncated__ "With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticon"| __truncated__ "When a prison ship crash unleashes hundreds of Decepticons on Earth, Bumblebee leads a new Autobot force to protect humankind." ...
- attr(*, "spec")=
.. cols(
.. show_id = col_double(),
.. type = col_character(),
.. title = col_character(),
.. director = col_character(),
.. cast = col_character(),
.. country = col_character(),
.. date_added = col_character(),
.. release_year = col_double(),
.. rating = col_character(),
.. duration = col_character(),
.. listed_in = col_character(),
.. description = col_character()
.. )
- attr(*, "problems")=<externalptr>
As seen above, the data consists of 6,234 rows and 12 columns related to TV shows and movies. We can also see that some variables are unnecessary and some columns are improperly formatted.
As mentioned above, in this part we will clean and transform the data to get rid of the unnecessary variables and the improperly formatted columns. For starters, we eliminate the show_id column since we will not use it in our analysis.
df$show_id <- NULL
Then, we convert the date_added column from character to the Date class so we can easily filter or reorder the column as we desire.
df$date_added <- mdy(df$date_added)
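Under the hood, mdy() parses strings like "September 9, 2019". A base-R sketch of the same conversion (the sample strings here are made up in the style of date_added; note that "%B" is locale-dependent and assumes English month names, which lubridate handles for us):

```r
# Parse "Month day, Year" strings into Date objects with base R.
# "%B" matches full month names in the current locale, so this sketch
# assumes an English (or C) locale; lubridate::mdy() avoids that caveat.
added <- c("September 9, 2019", "September 8, 2018")
as.Date(added, format = "%B %d, %Y")
```

Once converted, the values sort chronologically rather than alphabetically.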
The rating, listed_in and type columns should be categorical values rather than plain characters. For that, we will use the factor function, which encodes a vector as a factor and returns an object of class "factor".
df$rating <- as.factor(df$rating)
df$listed_in <- as.factor(df$listed_in)
df$type <- as.factor(df$type)
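On a small toy vector (not taken from the dataset), factor() behaves like this:

```r
# factor() encodes a character vector as a categorical variable
x <- factor(c("Movie", "TV Show", "Movie"))
levels(x)  # the distinct categories, sorted alphabetically by default
table(x)   # counts per level
```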
Lastly, we will eliminate the duplicated rows based on the title, country, type, and release_year columns.
df <- distinct(df, title, country, type, release_year, .keep_all = TRUE)
With that last step, we conclude our data cleaning process.
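For reference, what distinct(..., .keep_all = TRUE) does can be sketched in base R with duplicated() on a made-up toy data frame (the column names here are illustrative, not the full dataset):

```r
# A base R sketch of distinct(df, title, country, .keep_all = TRUE):
# keep only the first occurrence of each (title, country) combination
toy <- data.frame(
  title   = c("A", "A", "B"),
  country = c("US", "US", "UK"),
  stringsAsFactors = FALSE
)
toy[!duplicated(toy[c("title", "country")]), ]
```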
In this part, first, we will use multiple packages to process and filter the data so we can have the necessary rows and columns. Then we will use the ggplot2 package to create some visualizations with the processed data.
For the first part of our data processing, we will find the mean duration in minutes for each movie genre.
df$duration <- gsub(" min", "", df$duration)  # gsub is vectorized, so no lapply is needed
durations_movie <- df %>%
filter(type == "Movie") %>%
mutate(genre = gsub("^(.*?),.*", "\\1", listed_in)) %>%
group_by(genre) %>%
summarise(average_duration = round(mean(as.integer(duration)))) %>%
ungroup()
durations_movie
As seen in the code, we used several functions to get the data into the form we need. First, we had to strip the " min" suffix from the duration column so we could calculate the mean. Then we filtered the data and grouped it by genre. But since a movie may be listed in multiple genres, a naive grouping yields 249 rows in which the same genre appears over and over, simply because each combination of genres is treated as a distinct value. We solved that with a regular expression: because the genres of a movie are comma-separated, the pattern in the second gsub captures everything up to the first comma and keeps only that first genre. The rest is routine.
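The regular expression can be checked in isolation (the sample strings below are made up in the style of listed_in values):

```r
# "^(.*?),.*" captures everything up to the first comma (non-greedy),
# and "\\1" keeps only that captured first genre
gsub("^(.*?),.*", "\\1", "Children & Family Movies, Comedies")

# a value with no comma does not match the pattern and is left unchanged
gsub("^(.*?),.*", "\\1", "Stand-Up Comedy")
```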
Now we will visualize the data with a bar chart, since it consists of a categorical variable.
ggplot(durations_movie, aes(x=reorder(genre, average_duration), y=average_duration)) +
geom_bar(colour="black", size=0.3, fill="#E3363f", stat="identity") +
ylim(0, 120) +
coord_flip() +
geom_text(aes(label = paste(format(average_duration, nsmall=0), "min")),
position = position_stack(vjust=0.5),
colour = "black", size = 3) +
xlab("Netflix Movie Genres") + ylab("Average Duration (min)") +
ggtitle("Average Duration in each Genre of the Movies")
As seen above, the "Action & Adventure" genre has the highest average duration.
In the second analysis, we will find the number of movies and TV shows on Netflix released in each year between 2000 and 2019. We exclude 20th-century content since there is very little of it per year. So let's start with processing the data.
number_movies_tvshows <- df %>%
filter(release_year >= 2000 & release_year < 2020) %>%
group_by(type, release_year) %>%
summarise(count = n()) %>%
arrange(desc(release_year)) %>%
ungroup()
number_movies_tvshows
As you can see, 2016, 2017, and 2018 each had hundreds of movies, whereas the year in which the most TV shows were released was 2019.
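The group_by/summarise count can be cross-checked with base R's table(); here is a sketch on made-up data with the same column names:

```r
# Counting type-by-year combinations with base R table() on toy data
toy <- data.frame(
  type         = c("Movie", "Movie", "TV Show", "Movie"),
  release_year = c(2017, 2017, 2019, 2018)
)
table(toy$type, toy$release_year)
```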
Now it’s time to visualize the data on a line chart.
ggplot(number_movies_tvshows, aes(release_year, count)) +
geom_line(aes(colour = type), size = 1.5, linetype = "dotted") +
geom_point(size = 2.5) +
ggtitle("Number of Movies and TV-shows in each year") +
theme(legend.position=c(0.2, 0.8),
legend.key.size = unit(1, 'cm'),
legend.title = element_text(size=14),
legend.text = element_text(size=10),
legend.key=element_blank(),
legend.background=element_rect(fill = "lightgrey"),
axis.text.x = element_text(angle = 90, vjust = 1)) +
labs(color="Type of the Content") +
scale_x_continuous(name="Year", breaks=seq(2000, 2019)) +
scale_y_continuous(name="Number of Movies and TV-Shows", breaks=seq(0, 700, 50))
The line chart shows that in only one year, 2019, did the number of TV shows surpass the number of movies. But the decline in the number of movies started in 2018, so we can expect to see more TV shows than movies in the future.
Lastly, we will process the data to get a statistical summary of the season length of TV shows. Even though we could use the summary function, which gives us all the values we need (minimum, 1st quartile, median, mean, 3rd quartile, and maximum), we will find them separately with their own functions in order to demonstrate how diverse and practical R is.
Well, we can all guess the minimum number of seasons for a TV show: 1. But what about the longest TV show in our data? Let's find out.
df$duration <- as.numeric(gsub("([0-9]+).*$", "\\1", df$duration))  # keep the leading number; gsub is vectorized
maximum_season <- df %>%
filter(type == "TV Show") %>%
slice(which.max(as.numeric(duration))) %>%
transmute(title, longest_season = as.numeric(duration))
maximum_season
We inferred that “Grey’s Anatomy” was the longest tv show in our data with 15 seasons.
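The regular expression used above to strip the units can again be checked on standalone strings (made up for illustration):

```r
# "([0-9]+).*$" captures the leading digits and drops everything after them,
# so "1 Season", "15 Seasons", and "90 min" all reduce to their numeric part
as.numeric(gsub("([0-9]+).*$", "\\1", c("1 Season", "15 Seasons", "90 min")))
```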
Now let’s find the median season length and its corresponding title.
median_season <- df %>%
filter(type == "TV Show") %>%
filter(as.numeric(duration) == median(as.numeric(duration))) %>%  # keep shows at the median length
slice(1) %>%
transmute(title, median_season = as.numeric(duration))
median_season
Since there are a lot of TV shows with only 1 season, our median is 1 season; "Transformers Prime" is one such show. With the minimum, maximum, and median in hand, we can already suspect that this is a right-skewed distribution.
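The rule of thumb that the mean exceeds the median in a right-skewed sample can be illustrated on a made-up vector of season counts:

```r
# In a right-skewed sample most values are small with a long tail of large ones,
# so the mean is pulled above the median by the outliers
seasons <- c(1, 1, 1, 1, 2, 2, 3, 15)
mean(seasons)                     # pulled up by the 15-season outlier
median(seasons)
mean(seasons) > median(seasons)   # TRUE for this right-skewed toy sample
```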
Let's not just assume; let's visualize the data we already have.
season_number <- df %>%
filter(type == "TV Show")
ggplot(season_number, aes(x=as.numeric(duration))) +
geom_histogram(fill="#E3363f", binwidth=1) +
geom_vline(aes(xintercept=mean(as.numeric(duration))), color="#366de3", linetype="dashed", size=0.9) +
annotate(
"text",
x = 2.5, y = 1200,
label = "The\naverage\nnumber\nof\nseasons",
vjust = 1, size = 3, color = "#366de3"
) +
geom_vline(aes(xintercept=median(as.numeric(duration))), color="grey40", linetype="dashed", size=0.9) +
annotate(
"text",
x = .5, y = 1200,
label = "The\nmedian\nnumber\nof\nseasons",
vjust = 1, size = 3, color = "grey40"
) +
xlab("Number of Season") +
ylab("Count of Number of Season") +
ggtitle("Distribution of Season Length") +
theme(legend.title = element_blank())
As expected, this is a right-skewed distribution. For better understanding, I also included the other statistics we have obtained.
In this part we will calculate the average season length for each genre of TV shows.
season_length_tvshows <- df %>%
filter(type == "TV Show") %>%
mutate(genre = gsub("^(.*?),.*", "\\1", listed_in)) %>%
group_by(genre) %>%
summarise(average_season_length = round(mean(as.integer(duration), na.rm = TRUE), 1)) %>%
ungroup()
season_length_tvshows
For this part, we will visualize the data on a horizontal bar chart.
ggplot(season_length_tvshows, aes(x=reorder(genre, average_season_length), y=average_season_length)) +
geom_bar(colour="black", size=0.3, fill="#E3363f", stat="identity") +
ylim(0, 6) +
coord_flip() +
geom_text(aes(label = paste(format(average_season_length, nsmall=0), " Season")),
position = position_stack(vjust=0.5),
colour = "black", size = 3) +
xlab("Netflix TV Show Genres") + ylab("Average Length (season)") +
ggtitle("Average Season Length in each Genre of the TV Shows")
It is clear that Classical & Cult TV Shows have the highest average season length with 5.8 seasons, followed by Romantic TV Shows.
Well, as we head toward the end of this report, we will also show how to locate the 1st and 3rd quartiles, mainly to demonstrate the relevant functions rather than to gain any new insight.
duration_quartiles <- df %>%
filter(type == "TV Show") %>%
group_by(type) %>%
summarise(Q1 = quantile(as.numeric(duration), 0.25),
Q3 = quantile(as.numeric(duration), 0.75)) %>%
ungroup()
duration_quartiles
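The quantile function itself is easy to sanity-check on a toy vector:

```r
# quantile() with the default type = 7 linearly interpolates between
# order statistics; for 1:9 the quartiles land exactly on data points
x <- 1:9
quantile(x, c(0.25, 0.5, 0.75))
```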
Today we have dealt with Netflix data. Nowadays people tend to get lost in daily problems, career goals, and social media, but sometimes it is necessary to retreat from it all and chill. Well, that brings us to Netflix. From our analyses we can say that movie production tends to diminish whereas TV shows are on the rise, especially with the help of streaming platforms, so there is a high chance we will see more and more TV shows. We also concluded that watching Grey's Anatomy from the beginning would take a lot of time in today's competitive world. Thankfully, as our histogram shows, TV shows tend to be as short as 1 season, which is the perfect amount to get the gist of the story and finish it in one go. Even though TV shows are evolving to be shorter, Romantic TV Shows still average 3.6 seasons. But being a documentary and docuseries lover saves me a lot of time to work more on data analytics, as seen in our outputs.
Disclaimer: The first part of this report is heavily inspired by Yigit Erol's Medium post and GitHub repository. All the excerpts and analyses are done for educational purposes.