Netflix Data

The Netflix dataset contains 6234 observations of 12 variables. The variables hold general information about each program, such as title, director, cast, and country.

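library(dplyr)    # provides %>% and the data-manipulation verbs used below
library(ggplot2)  # provides ggplot() for the charts
netflix_data=read.csv(file = 'https://github.com/ygterl/EDA-Netflix-2020-in-R/raw/master/netflix_titles.csv',
                    stringsAsFactors = FALSE, header = TRUE,sep = ",", encoding="UTF-8")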
str(netflix_data)
## 'data.frame':    6234 obs. of  12 variables:
##  $ show_id     : int  81145628 80117401 70234439 80058654 80125979 80163890 70304989 80164077 80117902 70304990 ...
##  $ type        : chr  "Movie" "Movie" "TV Show" "TV Show" ...
##  $ title       : chr  "Norm of the North: King Sized Adventure" "Jandino: Whatever it Takes" "Transformers Prime" "Transformers: Robots in Disguise" ...
##  $ director    : chr  "Richard Finn, Tim Maltby" "" "" "" ...
##  $ cast        : chr  "Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Duru"| __truncated__ "Jandino Asporaat" "Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton"| __truncated__ "Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen" ...
##  $ country     : chr  "United States, India, South Korea, China" "United Kingdom" "United States" "United States" ...
##  $ date_added  : chr  "September 9, 2019" "September 9, 2016" "September 8, 2018" "September 8, 2018" ...
##  $ release_year: int  2019 2016 2013 2016 2017 2016 2014 2017 2017 2014 ...
##  $ rating      : chr  "TV-PG" "TV-MA" "TV-Y7-FV" "TV-Y7" ...
##  $ duration    : chr  "90 min" "94 min" "1 Season" "1 Season" ...
##  $ listed_in   : chr  "Children & Family Movies, Comedies" "Stand-Up Comedy" "Kids' TV" "Kids' TV" ...
##  $ description : chr  "Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from"| __truncated__ "Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of"| __truncated__ "With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticon"| __truncated__ "When a prison ship crash unleashes hundreds of Decepticons on Earth, Bumblebee leads a new Autobot force to protect humankind." ...
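
The `str()` output shows empty strings in `director`, which suggests missing text fields arrive as `""` rather than `NA`. A minimal sketch of how one might count them per column, using a made-up toy data frame (`toy` and its values are assumptions, not rows from the real file):

```r
# Toy stand-in for netflix_data; the values are invented for illustration.
toy <- data.frame(
  title    = c("A", "B", "C"),
  director = c("Jane Doe", "", ""),
  country  = c("India", "", "United States"),
  stringsAsFactors = FALSE
)

# Comparing the data frame with "" gives a logical matrix;
# colSums() then counts the empty cells in each column.
empty_counts <- colSums(toy == "")
print(empty_counts)
```

Running the same `colSums(netflix_data == "")` call on the real data would show which columns need the `nchar(...) != 0` filters used later.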

Basic Questions

  1. How many programs are there by type or country?

The data contains two types of program: Movie and TV Show.
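
A count per type could be computed along these lines. This is only a sketch on a made-up toy data frame (`toy` is an assumption); on the real data the same pipeline would start from `netflix_data`.

```r
library(dplyr)

# Toy stand-in for the type column; values are invented for illustration.
toy <- data.frame(
  type = c("Movie", "TV Show", "Movie", "Movie"),
  stringsAsFactors = FALSE
)

# count() groups by type and tallies the rows in one step.
type_counts <- toy %>%
  count(type, name = "n") %>%
  arrange(desc(n))

print(type_counts)
```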

netflix_country=netflix_data %>%
  mutate(country=gsub("^(.*?),.*", "\\1", country)) %>% 
  filter(type=="Movie" & nchar(country)!=0 ) %>%
  group_by(country) %>%
  count() %>%
  arrange(desc(n)) %>% 
  head(5)

# Standard pie chart of the top 5 countries selected above
ggplot(netflix_country,aes(x='',y=n,fill=country))+
  ggtitle("Top 5 Countries by Movie Count")+
  geom_bar(stat = "identity",width = 2,color="white")+
  coord_polar("y",start = 0) +
  theme_void()

# Standard donut chart of the same data

ggplot(netflix_country, aes(x = 2, y = n, fill = country)) +
  geom_bar(stat = "identity", color = "white") +
  coord_polar(theta = "y", start = 0)+
  theme_void()+
  xlim(0.5, 2.5)

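Next, we count programs by the raw country field. It has 555 distinct values because co-productions list several countries in one string, and 476 programs have no country at all.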

netflix_data %>% 
  group_by(country) %>%
  count()
## # A tibble: 555 x 2
## # Groups:   country [555]
##    country                                                   n
##    <chr>                                                 <int>
##  1 ""                                                      476
##  2 "Argentina"                                              38
##  3 "Argentina, Brazil, France, Poland, Germany, Denmark"     1
##  4 "Argentina, Chile"                                        1
##  5 "Argentina, Chile, Peru"                                  1
##  6 "Argentina, France"                                       1
##  7 "Argentina, France, Germany"                              1
##  8 "Argentina, Italy"                                        2
##  9 "Argentina, Spain"                                        7
## 10 "Argentina, United States"                                1
## # ... with 545 more rows
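
The rest of this post keeps only the first country of each string. As an alternative sketch, one could split the comma-separated strings so that every co-producing country is counted. The toy data below is made up, and `tidyr::separate_rows()` is a swapped-in technique, not the approach used elsewhere in this post.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the country column; values are invented for illustration.
toy <- data.frame(
  title   = c("A", "B", "C"),
  country = c("Argentina, Spain", "Argentina", ""),
  stringsAsFactors = FALSE
)

country_counts <- toy %>%
  filter(country != "") %>%                 # drop titles without a country
  separate_rows(country, sep = ", ") %>%    # one row per listed country
  count(country, sort = TRUE)

print(country_counts)
```

With this approach a co-production contributes one count to each of its countries, so the totals add up to more than the number of titles.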

  2. What kinds of movies are in this data?

The top three movie genres are “Dramas”, “Comedies”, and “Documentaries”, counting only the first genre listed for each title.

netflix_data %>%
  filter(type=="Movie") %>%
  mutate(kind=gsub("^(.*?),.*", "\\1", listed_in)) %>%
  group_by(kind) %>%
  count() %>% 
  arrange(desc(n))
## # A tibble: 18 x 2
## # Groups:   kind [18]
##    kind                         n
##    <chr>                    <int>
##  1 Dramas                    1077
##  2 Comedies                   803
##  3 Documentaries              644
##  4 Action & Adventure         597
##  5 Children & Family Movies   358
##  6 Stand-Up Comedy            273
##  7 Horror Movies              205
##  8 International Movies        85
##  9 Classic Movies              62
## 10 Movies                      56
## 11 Thrillers                   40
## 12 Independent Movies          18
## 13 Anime Features              12
## 14 Music & Musicals            12
## 15 Cult Movies                 10
## 16 Sci-Fi & Fantasy            10
## 17 Romantic Movies              2
## 18 Sports Movies                1
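
The `gsub("^(.*?),.*", "\\1", ...)` call used above keeps everything before the first comma and leaves strings without a comma unchanged. A small sketch on made-up genre strings (the `listed_in` values here are assumptions) shows the effect, together with a simpler `sub()` pattern that should behave the same way:

```r
# Invented examples in the same comma-separated style as listed_in.
listed_in <- c(
  "Dramas, International Movies",
  "Stand-Up Comedy",
  "Comedies, Dramas, Independent Movies"
)

# Lazy capture up to the first comma; the rest of the string is discarded.
first_genre <- gsub("^(.*?),.*", "\\1", listed_in)

# Alternative: delete the first comma and everything after it.
first_genre_alt <- sub(",.*", "", listed_in)

print(first_genre)
print(identical(first_genre, first_genre_alt))
```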

  3. How many movies are there by country?

During the analysis, I realized that 195 movies have no country information, so I excluded them.

netflix_country = netflix_data %>%
  mutate(country=gsub("^(.*?),.*", "\\1", country)) %>% 
  filter(type=="Movie" & nchar(country)!=0 ) %>%
  group_by(country) %>% 
  count() %>%
  summarize(total_count=sum(n)) %>%
  arrange(desc(total_count)) %>%
  head(10)


netflix_country_by_year = netflix_data %>% 
  mutate(country=gsub("^(.*?),.*", "\\1", country)) %>% 
  filter(type=="Movie" & nchar(country)!=0 ) %>%
  group_by(country,release_year) %>% count() %>%
  arrange(desc(country), release_year)  # desc() takes a single vector; year stays ascending

netflix_country_by_year2= netflix_country_by_year %>% inner_join(netflix_country,by="country")
netflix_country_by_year2
## # A tibble: 274 x 4
## # Groups:   country, release_year [274]
##    country       release_year     n total_count
##    <chr>                <int> <int>       <int>
##  1 United States         1942     2        1682
##  2 United States         1943     3        1682
##  3 United States         1944     3        1682
##  4 United States         1945     3        1682
##  5 United States         1946     2        1682
##  6 United States         1947     1        1682
##  7 United States         1954     1        1682
##  8 United States         1955     1        1682
##  9 United States         1956     1        1682
## 10 United States         1958     2        1682
## # ... with 264 more rows
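
The `inner_join()` above does two jobs: it attaches `total_count` to every yearly row, and it drops the countries that are not in the top-10 table. A minimal sketch with made-up numbers (`by_year` and `top` are hypothetical stand-ins for the two tables):

```r
library(dplyr)

# Invented yearly counts for two countries.
by_year <- data.frame(
  country      = c("India", "India", "Peru"),
  release_year = c(2017L, 2018L, 2018L),
  n            = c(2L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Invented "top" table that contains only one of them.
top <- data.frame(country = "India", total_count = 5L, stringsAsFactors = FALSE)

# Rows without a match in `top` (here: Peru) are dropped by the inner join.
joined <- inner_join(by_year, top, by = "country")
print(joined)
```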

Let’s see the graph.

ggplot(netflix_country_by_year2,aes(x=release_year,y=country,color=country,size=n)) + 
  geom_line() + 
  labs(title="Total Number of Movies by Country over the Years",caption="Source: Netflix")

  4. How many movies did the directors make?

This is the same analysis as above, but this time grouped by director.

netflix_director = netflix_data %>%
  mutate(director=gsub("^(.*?),.*", "\\1", director)) %>% 
  filter(type=="Movie" & nchar(director)!=0 ) %>%
  group_by(director) %>% 
  count() %>%
  summarize(total_count=sum(n)) %>%
  arrange(desc(total_count)) %>%
  head(5)


netflix_director_by_year = netflix_data %>% 
  mutate(director=gsub("^(.*?),.*", "\\1", director)) %>% 
  filter(type=="Movie" & nchar(director)!=0 ) %>%
  group_by(director,release_year) %>% count() %>%
  arrange(desc(director), release_year)  # desc() takes a single vector; year stays ascending

netflix_director_by_year2= netflix_director_by_year %>% inner_join(netflix_director,by="director")
netflix_director_by_year2
## # A tibble: 29 x 4
## # Groups:   director, release_year [29]
##    director        release_year     n total_count
##    <chr>                  <int> <int>       <int>
##  1 Raúl Campos             2016     3          18
##  2 Raúl Campos             2017     3          18
##  3 Raúl Campos             2018    12          18
##  4 Martin Scorsese         1967     1           9
##  5 Martin Scorsese         1973     1           9
##  6 Martin Scorsese         1974     1           9
##  7 Martin Scorsese         1976     1           9
##  8 Martin Scorsese         1980     1           9
##  9 Martin Scorsese         2002     1           9
## 10 Martin Scorsese         2006     1           9
## # ... with 19 more rows

Let’s see the graph.

ggplot(netflix_director_by_year2,aes(x=release_year,y=n,color=director)) + 
  geom_line() + 
  labs(title="Number of Movies by Director over the Years",caption="Source: Netflix")

The graph above shows that the directors with the most films have made more of them in the last 10 years. I therefore redraw the plot using only the data between 2005 and 2020.

netflix_director_by_year2=netflix_director_by_year2 %>% filter(release_year>=2005 & release_year<=2020)

ggplot(netflix_director_by_year2,aes(x=release_year,y=n,color=director)) + 
  geom_line() + 
  labs(title="Number of Movies by Director over the Years",caption="Source: Netflix") +
  expand_limits(y=0)
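
The year filter above could also be written with `dplyr::between()`, which is inclusive at both ends. A short sketch on made-up years (`toy` is an assumption):

```r
library(dplyr)

# Invented release years, some inside and some outside the 2005-2020 window.
toy <- data.frame(release_year = c(1999L, 2005L, 2012L, 2020L, 2021L))

# between(x, left, right) is equivalent to x >= left & x <= right.
recent <- toy %>%
  filter(between(release_year, 2005L, 2020L))

print(recent)
```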

  5. What about the cast?

We can count how many movies each actor has in total, using the first cast member listed for each title.

In the top 5 shown below, 4 of the actors are listed under India.

netflix_cast= netflix_data %>%
  mutate(cast=gsub("^(.*?),.*", "\\1", cast)) %>% 
  filter(type=="Movie" & nchar(cast)!=0 ) %>%
  group_by(cast,country) %>% 
  count() %>%
  summarize(total_count=sum(n)) %>%
  arrange(desc(total_count)) %>%
  head(5)
## `summarise()` has grouped output by 'cast'. You can override using the `.groups` argument.
netflix_cast
## # A tibble: 5 x 3
## # Groups:   cast [5]
##   cast             country       total_count
##   <chr>            <chr>               <int>
## 1 Shah Rukh Khan   India                  22
## 2 Akshay Kumar     India                  19
## 3 Amitabh Bachchan India                  15
## 4 Adam Sandler     United States          13
## 5 Aamir Khan       India                  12

# Movie counts per actor and release year
netflix_cast_by_year = netflix_data %>% 
  mutate(cast=gsub("^(.*?),.*", "\\1", cast)) %>% 
  filter(type=="Movie" & nchar(cast)!=0 ) %>%
  group_by(cast,release_year) %>% count() %>%
  arrange(desc(cast), release_year)  # desc() takes a single vector; year stays ascending


netflix_cast_by_year2= netflix_cast_by_year %>% inner_join(netflix_cast,by="cast")
netflix_cast_by_year2
## # A tibble: 65 x 5
## # Groups:   cast, release_year [65]
##    cast           release_year     n country total_count
##    <chr>                 <int> <int> <chr>         <int>
##  1 Shah Rukh Khan         1992     1 India            22
##  2 Shah Rukh Khan         1994     1 India            22
##  3 Shah Rukh Khan         1995     2 India            22
##  4 Shah Rukh Khan         1996     1 India            22
##  5 Shah Rukh Khan         1997     1 India            22
##  6 Shah Rukh Khan         1998     1 India            22
##  7 Shah Rukh Khan         2000     1 India            22
##  8 Shah Rukh Khan         2001     2 India            22
##  9 Shah Rukh Khan         2003     1 India            22
## 10 Shah Rukh Khan         2004     2 India            22
## # ... with 55 more rows

Let’s see the graph.

ggplot(netflix_cast_by_year2,aes(x=release_year,y=n,color=cast)) + 
  geom_point() + 
  labs(title="Number of Movies by Actor over the Years",caption="Source: Netflix") +
  expand_limits(y=0)

To be continued…

Thanks to Yigit Erol’s Medium post.