About Data

Our dataset consists of five data frames covering the 2017 tennis season: competitions, match details, and player and tournament information. You can find this dataset at the given link.

library(tidyverse)
library(dplyr)
library(ggplot2)

Initial Analysis

How do countries rank by number of singles titles?

single_winners <- tourney_df %>% 
  inner_join(player_df,by = c("singles_winner_player_id" = "player_id")) %>%
  group_by(flag_code) %>% summarise(country_wins = n()) %>% arrange(desc(country_wins))
head(single_winners)
## # A tibble: 6 x 2
##   flag_code country_wins
##   <chr>            <int>
## 1 ESP                 11
## 2 USA                  9
## 3 SUI                  8
## 4 FRA                  7
## 5 GER                  7
## 6 BUL                  4

Spain comes first, of course. Can you guess why? We will dive into this later.

Next come the countries that did not win any singles title, ranked by the total number of games their players won in the matches they won.

won_countries <- tourney_df %>% 
  inner_join(player_df,by = c("singles_winner_player_id" = "player_id"))
couldnt <- player_df %>% anti_join(won_countries, by = "flag_code") %>%
  inner_join(score_df,by=c("player_id"="winner_player_id")) %>%
  group_by(flag_code) %>% summarise(country_wins = sum(winner_games_won)) %>% arrange(desc(country_wins))
head(couldnt)
## # A tibble: 6 x 2
##   flag_code country_wins
##   <chr>            <dbl>
## 1 AUS               1989
## 2 CZE               1209
## 3 CAN               1190
## 4 SVK                889
## 5 BRA                873
## 6 POR                621

Nice try, Australia :(
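The filtering step above relies on `anti_join()` keyed on `flag_code`, which removes whole countries rather than individual players. A minimal sketch of that behaviour, using small made-up tibbles (`players` and `title_winners` are hypothetical stand-ins for `player_df` and `won_countries`, and their values are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data, not taken from the real dataset
players <- tibble(
  player_id = c("a1", "b2", "c3", "d4"),
  flag_code = c("ESP", "ESP", "AUS", "CZE")
)
title_winners <- tibble(player_id = "a1", flag_code = "ESP")

# anti_join() keeps only the rows of `players` whose flag_code never
# appears in `title_winners`. Player "b2" is dropped as well, because
# another player from the same country won a title.
no_title <- players %>% anti_join(title_winners, by = "flag_code")
no_title
```

Only the AUS and CZE players remain in this toy example, which is why the ranking above contains no country that appears in the title table.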

Handedness Analysis and Nadal

As you might expect, most match wins belong to right-handed players: 3173 of the 3796 wins, or about 84%.

tab <- score_df %>% 
  inner_join(player_df,by = c("winner_player_id"="player_id"))
tab$tourney_id <- as.numeric(tab$tourney_id)

tab %>% group_by(handedness) %>%
  summarise(total_wins = n()) %>% arrange(desc(total_wins))
## # A tibble: 3 x 2
##   handedness   total_wins
##   <chr>             <int>
## 1 Right-Handed       3173
## 2 Left-Handed         591
## 3 <NA>                 32
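The 84% figure can be checked directly from the counts in this table. The sketch below re-enters the three counts by hand (the `handedness_wins` tibble is an illustrative copy, not an object from the original analysis) and adds each group's share of all wins:

```r
library(dplyr)

# Win counts copied from the table above
handedness_wins <- tibble(
  handedness = c("Right-Handed", "Left-Handed", NA),
  total_wins = c(3173, 591, 32)
)

# Share of all wins per handedness group, rounded to three decimals
handedness_share <- handedness_wins %>%
  mutate(share = round(total_wins / sum(total_wins), 3))
handedness_share
```

On the real data, the same `mutate()` call could presumably be appended to the `summarise()` pipeline above instead of typing the counts in by hand.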

But one left-handed player stands out from this pattern, and he gives us an idea about Spain's success in singles titles.

tab %>% group_by(player_slug,flag_code,handedness) %>%
  summarise(total_wins = n()) %>% arrange(desc(total_wins)) %>% head(5)
## # A tibble: 5 x 4
## # Groups:   player_slug, flag_code [5]
##   player_slug      flag_code handedness   total_wins
##   <chr>            <chr>     <chr>             <int>
## 1 rafael-nadal     ESP       Left-Handed          67
## 2 alexander-zverev GER       Right-Handed         54
## 3 david-goffin     BEL       Right-Handed         53
## 4 roger-federer    SUI       Right-Handed         53
## 5 grigor-dimitrov  BUL       Right-Handed         49

Experience Analysis

Our player data frame includes the year in which each player turned professional. But does it really affect their wins?

experience_rate <- tab %>% group_by(turned_pro) %>% 
  summarise(total_wins =n(),number_of_players = n_distinct(winner_player_id)) %>% 
  arrange(turned_pro) %>% mutate(wins_per_player = total_wins / number_of_players)
head(experience_rate,7)
## # A tibble: 7 x 4
##   turned_pro total_wins number_of_players wins_per_player
##        <dbl>      <int>             <int>           <dbl>
## 1       1996         14                 2             7  
## 2       1997         25                 1            25  
## 3       1998         62                 4            15.5
## 4       1999         32                 4             8  
## 5       2000         87                 7            12.4
## 6       2001        206                 9            22.9
## 7       2002        184                13            14.2

For each year in which players turned professional, the table gives the total number of wins in 2017 and the number of players with at least one win, so we can calculate wins per player for that year.

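# A constant colour is set outside aes(); inside aes() it would be treated as a data value
ggplot(experience_rate, aes(x = turned_pro, y = wins_per_player)) +
  geom_point(aes(size = total_wins), col = "tomato2") + theme(legend.position = "none")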
## Warning: Removed 1 rows containing missing values (geom_point).
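A warning like this usually means one row has a missing x or y value. A plausible cause here, though the source does not confirm it, is a group of players with no recorded `turned_pro` year. A minimal sketch of dropping such a row explicitly before plotting, on a toy table (the first two rows match the table above; the `NA` row and its numbers are invented for illustration):

```r
library(dplyr)

# Toy version of experience_rate; the NA row is hypothetical
toy_rate <- tibble(
  turned_pro = c(1996, 1997, NA),
  total_wins = c(14, 25, 10),
  number_of_players = c(2, 1, 4)
) %>%
  mutate(wins_per_player = total_wins / number_of_players)

# Remove rows without a turned_pro year so the plot has nothing to drop
plot_data <- toy_rate %>% filter(!is.na(turned_pro))
plot_data
```

Filtering explicitly makes the exclusion visible in the code instead of leaving it to a plotting warning.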

Most of the wins were achieved by players who turned professional around 2005.

But there is one player who started his career in 1997 and, 20 years into his professional life in tennis, won 25 matches in 2017.

tab %>% filter(turned_pro == 1997) %>% group_by(player_slug) %>% summarise(total_wins = n())
## # A tibble: 1 x 2
##   player_slug     total_wins
##   <chr>                <int>
## 1 feliciano-lopez         25