Our dataset consists of five data frames covering the 2017 tennis season: competitions, match details, players, and tournament information. You can find this dataset at the given link.
library(tidyverse) # attaches dplyr and ggplot2, among others
How do countries rank by singles championships won?
single_winners <- tourney_df %>%
  inner_join(player_df, by = c("singles_winner_player_id" = "player_id")) %>%
  group_by(flag_code) %>%
  summarise(country_wins = n()) %>%
  arrange(desc(country_wins))
head(single_winners)
## # A tibble: 6 x 2
## flag_code country_wins
## <chr> <int>
## 1 ESP 11
## 2 USA 9
## 3 SUI 8
## 4 FRA 7
## 5 GER 7
## 6 BUL 4
Spain leads, of course. Can you guess why? We will dive into this later.
Next are the countries that did not win any singles championship, ranked by the total number of games their players won across all of their match wins.
# Players who won at least one singles title, with their country
won_countries <- tourney_df %>%
  inner_join(player_df, by = c("singles_winner_player_id" = "player_id"))
# Drop every country that has a title winner, then sum games won in match wins
couldnt <- player_df %>%
  anti_join(won_countries, by = "flag_code") %>%
  inner_join(score_df, by = c("player_id" = "winner_player_id")) %>%
  group_by(flag_code) %>%
  summarise(country_wins = sum(winner_games_won)) %>%
  arrange(desc(country_wins))
head(couldnt)
## # A tibble: 6 x 2
## flag_code country_wins
## <chr> <dbl>
## 1 AUS 1989
## 2 CZE 1209
## 3 CAN 1190
## 4 SVK 889
## 5 BRA 873
## 6 POR 621
Nice try Australia :(
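The key step in the query above is the `anti_join` on `flag_code`: it removes every player whose country has at least one title winner, not just the title winners themselves. Here is a minimal sketch of that behaviour on made-up data. The `players` and `title_winners` frames and their values are hypothetical, not taken from the real dataset.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
players <- data.frame(
  player_id = c("p1", "p2", "p3", "p4"),
  flag_code = c("ESP", "ESP", "AUS", "CZE"),
  stringsAsFactors = FALSE
)
title_winners <- data.frame(
  player_id = "p1",
  flag_code = "ESP",
  stringsAsFactors = FALSE
)

# anti_join keeps only the rows of `players` whose flag_code
# has no match in `title_winners`
no_title <- players %>% anti_join(title_winners, by = "flag_code")
no_title
```

Both Spanish players are dropped here, even though only `p1` won a title, because the join key is the country code rather than the player id.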
As you might expect, most match wins went to right-handed players: about 84% of them.
tab <- score_df %>%
  inner_join(player_df, by = c("winner_player_id" = "player_id"))
tab$tourney_id <- as.numeric(tab$tourney_id)
tab %>%
  group_by(handedness) %>%
  summarise(total_wins = n()) %>%
  arrange(desc(total_wins))
## # A tibble: 3 x 2
## handedness total_wins
## <chr> <int>
## 1 Right-Handed 3173
## 2 Left-Handed 591
## 3 <NA> 32
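The share of right-handed wins can be checked directly from the counts in the table. This is a small sketch that retypes those three counts by hand. The `handed_wins` frame is an assumption made for illustration, and the denominator here includes the wins with unknown handedness.

```r
library(dplyr)

# Counts copied from the table above; NA marks unknown handedness
handed_wins <- data.frame(
  handedness = c("Right-Handed", "Left-Handed", NA),
  total_wins = c(3173, 591, 32),
  stringsAsFactors = FALSE
)

# Share of all match wins for each handedness group
handed_share <- handed_wins %>%
  mutate(share = total_wins / sum(total_wins))
handed_share
```

Rounded to two decimals, the right-handed share comes out at 0.84.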
But there is one unexpected left-handed player who helps explain Spain's success in singles championships.
tab %>%
  group_by(player_slug, flag_code, handedness) %>%
  summarise(total_wins = n()) %>%
  arrange(desc(total_wins)) %>%
  head(5)
## # A tibble: 5 x 4
## # Groups: player_slug, flag_code [5]
## player_slug flag_code handedness total_wins
## <chr> <chr> <chr> <int>
## 1 rafael-nadal ESP Left-Handed 67
## 2 alexander-zverev GER Right-Handed 54
## 3 david-goffin BEL Right-Handed 53
## 4 roger-federer SUI Right-Handed 53
## 5 grigor-dimitrov BUL Right-Handed 49
Our player data frame includes the year each player turned professional. But did that really affect their wins?
experience_rate <- tab %>%
  group_by(turned_pro) %>%
  summarise(total_wins = n(),
            number_of_players = n_distinct(winner_player_id)) %>%
  arrange(turned_pro) %>%
  mutate(wins_per_player = total_wins / number_of_players)
head(experience_rate,7)
## # A tibble: 7 x 4
## turned_pro total_wins number_of_players wins_per_player
## <dbl> <int> <int> <dbl>
## 1 1996 14 2 7
## 2 1997 25 1 25
## 3 1998 62 4 15.5
## 4 1999 32 4 8
## 5 2000 87 7 12.4
## 6 2001 206 9 22.9
## 7 2002 184 13 14.2
For each year in which players turned professional, the table gives the total number of wins in 2017 and the number of distinct players with at least one win, so we can calculate the wins per player for that year.
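The wins-per-player column is simply total wins divided by the number of distinct winning players. Here is a quick sketch that redoes the division for the first three rows of the table. The `first_rows` frame is retyped by hand for illustration.

```r
library(dplyr)

# First three rows of the table above, retyped by hand
first_rows <- data.frame(
  turned_pro = c(1996, 1997, 1998),
  total_wins = c(14, 25, 62),
  number_of_players = c(2, 1, 4)
)

# Same calculation as in experience_rate
first_rows <- first_rows %>%
  mutate(wins_per_player = total_wins / number_of_players)
first_rows
```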
ggplot(experience_rate, aes(x = turned_pro, y = wins_per_player)) +
  # colour is a fixed value, so it is set outside aes(); size is mapped to total_wins
  geom_point(aes(size = total_wins), colour = "tomato2") +
  theme(legend.position = "none")
## Warning: Removed 1 rows containing missing values (geom_point).
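The warning most likely comes from one group whose `turned_pro` value is missing, which cannot be placed on the x axis. One way to silence it is to drop such rows before plotting. Here is a minimal sketch on made-up values; `toy_rate` and its numbers are hypothetical.

```r
library(dplyr)

# Made-up rows; the NA mimics players with no recorded start year
toy_rate <- data.frame(
  turned_pro = c(1997, 2005, NA),
  wins_per_player = c(25, 20, 8)
)

# Keep only rows that have a usable x value
plot_ready <- toy_rate %>% filter(!is.na(turned_pro))
plot_ready
```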
Most of the wins came from players who turned professional around 2005.
But one player started his career in 1997 and, in 2017, his 20th year as a professional, still won 25 matches.
tab %>% filter(turned_pro == 1997) %>% group_by(player_slug) %>% summarise(total_wins = n())
## # A tibble: 1 x 2
## player_slug total_wins
## <chr> <int>
## 1 feliciano-lopez 25