The Netflix dataset consists of 6234 observations of 12 variables. The variables hold general information about each program, such as title, director, cast, and country.
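# dplyr provides %>% and the data verbs used below; ggplot2 draws the charts
library(dplyr)
library(ggplot2)
netflix_data=read.csv(file = 'https://github.com/ygterl/EDA-Netflix-2020-in-R/raw/master/netflix_titles.csv',
stringsAsFactors = FALSE, header = TRUE,sep = ",", encoding="UTF-8")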
str(netflix_data)
## 'data.frame': 6234 obs. of 12 variables:
## $ show_id : int 81145628 80117401 70234439 80058654 80125979 80163890 70304989 80164077 80117902 70304990 ...
## $ type : chr "Movie" "Movie" "TV Show" "TV Show" ...
## $ title : chr "Norm of the North: King Sized Adventure" "Jandino: Whatever it Takes" "Transformers Prime" "Transformers: Robots in Disguise" ...
## $ director : chr "Richard Finn, Tim Maltby" "" "" "" ...
## $ cast : chr "Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Duru"| __truncated__ "Jandino Asporaat" "Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton"| __truncated__ "Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen" ...
## $ country : chr "United States, India, South Korea, China" "United Kingdom" "United States" "United States" ...
## $ date_added : chr "September 9, 2019" "September 9, 2016" "September 8, 2018" "September 8, 2018" ...
## $ release_year: int 2019 2016 2013 2016 2017 2016 2014 2017 2017 2014 ...
## $ rating : chr "TV-PG" "TV-MA" "TV-Y7-FV" "TV-Y7" ...
## $ duration : chr "90 min" "94 min" "1 Season" "1 Season" ...
## $ listed_in : chr "Children & Family Movies, Comedies" "Stand-Up Comedy" "Kids' TV" "Kids' TV" ...
## $ description : chr "Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from"| __truncated__ "Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of"| __truncated__ "With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticon"| __truncated__ "When a prison ship crash unleashes hundreds of Decepticons on Earth, Bumblebee leads a new Autobot force to protect humankind." ...
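The `str()` output above shows empty strings in the `director` column, so missing values in this file appear as `""` rather than `NA`. A minimal sketch of how blanks could be counted per column follows. The `toy` data frame and its values are made up for illustration, and only base R is used so it runs without extra packages.

```r
# Toy data frame standing in for netflix_data (values are illustrative only).
toy <- data.frame(
  title    = c("A", "B", "C"),
  director = c("Richard Finn, Tim Maltby", "", ""),
  country  = c("United States", "", "India"),
  stringsAsFactors = FALSE
)

# Count empty strings per column; NA values are not counted as blank here.
blank_counts <- vapply(toy, function(col) sum(!is.na(col) & col == ""), integer(1))
print(blank_counts)
```

The same call on `netflix_data` should give a quick overview of which columns need a blank filter before grouping.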
The data contains two types of programs: Movie and TV Show.
netflix_country=netflix_data %>%
mutate(country=gsub("^(.*?),.*", "\\1", country)) %>%
filter(type=="Movie" & nchar(country)!=0 ) %>%
group_by(country) %>%
count() %>%
arrange(desc(n)) %>%
head(5)
# Standard pie chart
ggplot(netflix_country,aes(x='',y=n,fill=country))+
ggtitle("Top 5 Countries by Movie Count")+
geom_bar(stat = "identity",width = 2,color="white")+
coord_polar("y",start = 0) +
theme_void()
# Standard donut chart
ggplot(netflix_country, aes(x = 2, y = n, fill = country)) +
geom_bar(stat = "identity", color = "white") +
coord_polar(theta = "y", start = 0)+
theme_void()+
xlim(0.5, 2.5)
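Next, we examine the data by country. The country column can hold several comma-separated countries, and 476 titles have no country at all.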
netflix_data %>%
group_by(country) %>%
count()
## # A tibble: 555 x 2
## # Groups: country [555]
## country n
## <chr> <int>
## 1 "" 476
## 2 "Argentina" 38
## 3 "Argentina, Brazil, France, Poland, Germany, Denmark" 1
## 4 "Argentina, Chile" 1
## 5 "Argentina, Chile, Peru" 1
## 6 "Argentina, France" 1
## 7 "Argentina, France, Germany" 1
## 8 "Argentina, Italy" 2
## 9 "Argentina, Spain" 7
## 10 "Argentina, United States" 1
## # ... with 545 more rows
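Because one title can list several countries, the code in this post keeps only the first one by using `gsub("^(.*?),.*", "\\1", country)`. The sketch below applies that pattern to a few sample strings and compares it with a simpler `sub()` call. The sample vector is made up for illustration.

```r
# Sample values in the same shape as the country column (illustrative only).
countries <- c("Argentina, Chile, Peru", "Argentina", "", "United States, India")

# Same pattern as in the post: lazily capture everything before the first comma.
first_regex <- gsub("^(.*?),.*", "\\1", countries)

# An equivalent form: drop everything from the first comma onward.
first_simple <- sub(",.*$", "", countries)

print(first_regex)
```

Strings without a comma, including the empty string, pass through unchanged, which is why the later filters still need `nchar(country) != 0`.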
The top 3 movie genres, taking the first genre listed for each title, are “Dramas”, “Comedies” and “Documentaries”.
netflix_data %>%
filter(type=="Movie") %>%
mutate(kind=gsub("^(.*?),.*", "\\1", listed_in)) %>%
group_by(kind) %>%
count() %>%
arrange(desc(n))
## # A tibble: 18 x 2
## # Groups: kind [18]
## kind n
## <chr> <int>
## 1 Dramas 1077
## 2 Comedies 803
## 3 Documentaries 644
## 4 Action & Adventure 597
## 5 Children & Family Movies 358
## 6 Stand-Up Comedy 273
## 7 Horror Movies 205
## 8 International Movies 85
## 9 Classic Movies 62
## 10 Movies 56
## 11 Thrillers 40
## 12 Independent Movies 18
## 13 Anime Features 12
## 14 Music & Musicals 12
## 15 Cult Movies 10
## 16 Sci-Fi & Fantasy 10
## 17 Romantic Movies 2
## 18 Sports Movies 1
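The counts above use only the first genre in `listed_in`, so a title tagged "Children & Family Movies, Comedies" is never counted as a comedy. As a rough alternative, and not what this post does, every listed genre could be counted by splitting on commas. The sample vector below is made up for illustration.

```r
# Sample values in the same shape as the listed_in column (illustrative only).
listed_in <- c(
  "Children & Family Movies, Comedies",
  "Stand-Up Comedy",
  "Dramas, Comedies"
)

# Split on commas, trim whitespace, and flatten into one vector of labels.
all_kinds <- trimws(unlist(strsplit(listed_in, ",", fixed = TRUE)))

# Tabulate and sort so the most frequent label comes first.
kind_counts <- sort(table(all_kinds), decreasing = TRUE)
print(kind_counts)
```

With this approach a title contributes to every genre it lists, so the counts add up to more than the number of titles.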
During the analysis, I realized that 195 movies have no country information, so I excluded them.
netflix_country = netflix_data %>%
mutate(country=gsub("^(.*?),.*", "\\1", country)) %>%
filter(type=="Movie" & nchar(country)!=0 ) %>%
group_by(country) %>%
count() %>%
summarize(total_count=sum(n)) %>%
arrange(desc(total_count)) %>%
head(10)
netflix_country_by_year = netflix_data %>%
mutate(country=gsub("^(.*?),.*", "\\1", country)) %>%
filter(type=="Movie" & nchar(country)!=0 ) %>%
group_by(country,release_year) %>% count() %>%
arrange(desc(country), release_year)
netflix_country_by_year2= netflix_country_by_year %>% inner_join(netflix_country,by="country")
netflix_country_by_year2
## # A tibble: 274 x 4
## # Groups: country, release_year [274]
## country release_year n total_count
## <chr> <int> <int> <int>
## 1 United States 1942 2 1682
## 2 United States 1943 3 1682
## 3 United States 1944 3 1682
## 4 United States 1945 3 1682
## 5 United States 1946 2 1682
## 6 United States 1947 1 1682
## 7 United States 1954 1 1682
## 8 United States 1955 1 1682
## 9 United States 1956 1 1682
## 10 United States 1958 2 1682
## # ... with 264 more rows
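The inner join above does two things at once: it attaches `total_count` to every per-year row and it drops all countries that are not in the top-10 table. A small base R sketch of the same idea follows, using `merge()` instead of `inner_join()` so it runs without extra packages. The data frame and its values are hypothetical.

```r
# Hypothetical per-year counts (illustrative values only).
by_year <- data.frame(
  country      = c("United States", "United States", "India", "Peru"),
  release_year = c(2018L, 2019L, 2019L, 2019L),
  n            = c(3L, 2L, 4L, 1L),
  stringsAsFactors = FALSE
)

# Total per country, then keep the two largest totals.
totals <- aggregate(n ~ country, data = by_year, FUN = sum)
names(totals)[2] <- "total_count"
top <- head(totals[order(-totals$total_count), ], 2)

# merge() performs an inner join on "country" by default, so rows for
# countries outside `top` (here: Peru) are dropped.
joined <- merge(by_year, top, by = "country")
print(joined)
```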
Let's see the graph.
ggplot(netflix_country_by_year2,aes(x=release_year,y=country,color=country,size=n)) +
geom_line() +
labs(title="Total Number of Movies by Country over the Years",caption="Source: Netflix")
4. How many movies did the directors make?
This is the same analysis as above, but this time by director.
netflix_director = netflix_data %>%
mutate(director=gsub("^(.*?),.*", "\\1", director)) %>%
filter(type=="Movie" & nchar(director)!=0 ) %>%
group_by(director) %>%
count() %>%
summarize(total_count=sum(n)) %>%
arrange(desc(total_count)) %>%
head(5)
netflix_director_by_year = netflix_data %>%
mutate(director=gsub("^(.*?),.*", "\\1", director)) %>%
filter(type=="Movie" & nchar(director)!=0 ) %>%
group_by(director,release_year) %>% count() %>%
arrange(desc(director), release_year)
netflix_director_by_year2= netflix_director_by_year %>% inner_join(netflix_director,by="director")
netflix_director_by_year2
## # A tibble: 29 x 4
## # Groups: director, release_year [29]
## director release_year n total_count
## <chr> <int> <int> <int>
## 1 Raúl Campos 2016 3 18
## 2 Raúl Campos 2017 3 18
## 3 Raúl Campos 2018 12 18
## 4 Martin Scorsese 1967 1 9
## 5 Martin Scorsese 1973 1 9
## 6 Martin Scorsese 1974 1 9
## 7 Martin Scorsese 1976 1 9
## 8 Martin Scorsese 1980 1 9
## 9 Martin Scorsese 2002 1 9
## 10 Martin Scorsese 2006 1 9
## # ... with 19 more rows
Let's see the graph.
ggplot(netflix_director_by_year2,aes(x=release_year,y=n,color=director)) +
geom_line() +
labs(title="Number of Movies by Director over the Years",caption="Source: Netflix")
The graph above shows that the output of the most prolific directors has increased over the last 10 years. To focus on that period, I redraw the plot with only the data between 2005 and 2020.
netflix_director_by_year2=netflix_director_by_year2 %>% filter(release_year>=2005 & release_year<=2020)
ggplot(netflix_director_by_year2,aes(x=release_year,y=n,color=director)) +
geom_line() +
labs(title="Number of Movies by Director over the Years",caption="Source: Netflix") +
expand_limits(y=0)
We can also find out how many movies each actor has in total, counting only the first-listed cast member of each movie.
Four of the top 5 shown below are credited on Indian productions.
netflix_cast= netflix_data %>%
mutate(cast=gsub("^(.*?),.*", "\\1", cast)) %>%
filter(type=="Movie" & nchar(cast)!=0 ) %>%
group_by(cast,country) %>%
count() %>%
summarize(total_count=sum(n)) %>%
arrange(desc(total_count)) %>%
head(5)
## `summarise()` has grouped output by 'cast'. You can override using the `.groups` argument.
netflix_cast
## # A tibble: 5 x 3
## # Groups: cast [5]
## cast country total_count
## <chr> <chr> <int>
## 1 Shah Rukh Khan India 22
## 2 Akshay Kumar India 19
## 3 Amitabh Bachchan India 15
## 4 Adam Sandler United States 13
## 5 Aamir Khan India 12
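Note that the summary above groups by both `cast` and `country`, so `total_count` is the number of movies per actor and country pair, not per actor overall. The sketch below shows the difference on hypothetical data, using base R `aggregate()` so it runs without extra packages. The actor names and values are made up.

```r
# Hypothetical rows: one lead actor with movies credited to two countries.
movies <- data.frame(
  cast    = c("Actor A", "Actor A", "Actor A", "Actor B"),
  country = c("India", "India", "United States", "India"),
  stringsAsFactors = FALSE
)
movies$n <- 1L

# Totals per (cast, country) pair, as in the grouped summary above.
per_pair <- aggregate(n ~ cast + country, data = movies, FUN = sum)

# Totals per cast member only, ignoring country.
per_cast <- aggregate(n ~ cast, data = movies, FUN = sum)

print(per_pair)
print(per_cast)
```

Here "Actor A" has 3 movies overall but at most 2 in any single pair, so a ranking by pair can understate actors whose movies are spread across countries.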
# Number of movies per lead actor and release year
netflix_cast_by_year = netflix_data %>%
mutate(cast=gsub("^(.*?),.*", "\\1", cast)) %>%
filter(type=="Movie" & nchar(cast)!=0 ) %>%
group_by(cast,release_year) %>% count() %>%
arrange(desc(cast), release_year)
netflix_cast_by_year2= netflix_cast_by_year %>% inner_join(netflix_cast,by="cast")
netflix_cast_by_year2
## # A tibble: 65 x 5
## # Groups: cast, release_year [65]
## cast release_year n country total_count
## <chr> <int> <int> <chr> <int>
## 1 Shah Rukh Khan 1992 1 India 22
## 2 Shah Rukh Khan 1994 1 India 22
## 3 Shah Rukh Khan 1995 2 India 22
## 4 Shah Rukh Khan 1996 1 India 22
## 5 Shah Rukh Khan 1997 1 India 22
## 6 Shah Rukh Khan 1998 1 India 22
## 7 Shah Rukh Khan 2000 1 India 22
## 8 Shah Rukh Khan 2001 2 India 22
## 9 Shah Rukh Khan 2003 1 India 22
## 10 Shah Rukh Khan 2004 2 India 22
## # ... with 55 more rows
Let’s see the graph.
ggplot(netflix_cast_by_year2,aes(x=release_year,y=n,color=cast)) +
geom_point() +
labs(title="Number of Movies by Actor over the Years",caption="Source: Netflix") +
expand_limits(y=0)
To be continued…
Thanks to Yigit Erol’s Medium post.