This year we took the Business Analytics course to improve our data analysis skills. The whole course was built around one final project and on developing the skills needed to analyse the data for it. We chose the 2017-2018 Premier League match results data. The data consisted of two data frames: the first contains the betting companies and the bets offered for each match played in the 2017-2018 Premier League season, and the second contains the results of all matches played in that season. The data looks like the one below.

We then cleaned the data: we removed the NAs and regrouped the results as home wins, draws, and away wins. This chunk does not produce any results; it only removes empty columns, separates the scores so that they can be analysed, and groups the matches as 1, x, or 2 according to the score.

library(tidyr) # provides separate()

dat2 = dat2[,-c(1,7)]             # drop columns 1 and 7
dat2 = dat2[!is.na(dat2$score), ] # drop rows without a score
# split "home:away" scores into two columns
dat2 = separate(dat2, col = 'score', into = c('sc.home', 'sc.away'), sep = ":")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows
## [3311].
dat2$sc.away = as.numeric(dat2$sc.away)
dat2$sc.home = as.numeric(dat2$sc.home)
## Warning: NAs introduced by coercion
dat2 = na.omit(dat2) # match postponed, no score
dat2$res = ifelse(dat2$sc.home > dat2$sc.away, '1', 
                  ifelse(dat2$sc.home == dat2$sc.away, 'x','2' ))

results = table(dat2$res)
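To make the cleaning steps easier to follow, here is a minimal base-R sketch on toy data. The `toy` data frame and its scores are hypothetical, and `strsplit()` stands in for `separate()` so that the example needs no extra packages; it illustrates the idea and is not the exact code used above.

```r
# Toy data: hypothetical scores, including one postponed match without a score
toy <- data.frame(score = c("2:1", "0:0", "1:3", "postponed"),
                  stringsAsFactors = FALSE)

# Split "home:away" into two pieces; rows without ":" yield only one piece
parts <- strsplit(toy$score, ":", fixed = TRUE)
toy$sc.home <- suppressWarnings(as.numeric(sapply(parts, `[`, 1)))
toy$sc.away <- suppressWarnings(as.numeric(sapply(parts, `[`, 2)))

# Drop rows where a score could not be parsed
toy <- toy[!is.na(toy$sc.home) & !is.na(toy$sc.away), ]

# Classify: '1' home win, 'x' draw, '2' away win
toy$res <- ifelse(toy$sc.home > toy$sc.away, "1",
                  ifelse(toy$sc.home == toy$sc.away, "x", "2"))

table(toy$res)
```

The postponed match is dropped because neither part of its score can be converted to a number.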

Then we wrote a function that draws random matches from the result pool 10,000 times to check whether the home advantage is real. The result looked good to us because, as our graphs show, the home advantage is real. The output of our function is below.
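The resampling function itself is not shown in this report, so the following is only a sketch of how such a bootstrap could look. The result pool `res_pool`, its probabilities, the number of resamples `n_boot`, and the resample size `sample_size` are all assumptions made for illustration; only the names `boots_vec` and `catg` come from the output below.

```r
set.seed(42)  # for reproducibility

# Hypothetical pool of match results ('1' home win, 'x' draw, '2' away win);
# in the report this role would be played by dat2$res
res_pool <- sample(c("1", "x", "2"), size = 380, replace = TRUE,
                   prob = c(0.45, 0.25, 0.30))

n_boot      <- 1000  # number of resamples (assumed; the text mentions 10,000)
sample_size <- 100   # matches drawn per resample (assumed)
lvls        <- c("1", "2", "x")

# Each column holds the counts of '1', '2' and 'x' in one resample
boot_counts <- vapply(seq_len(n_boot), function(i) {
  draw <- sample(res_pool, size = sample_size, replace = TRUE)
  as.integer(table(factor(draw, levels = lvls)))
}, integer(3))

# Long format: all counts for '1', then all for '2', then all for 'x'
boots_vec <- as.vector(t(boot_counts))
catg      <- factor(rep(lvls, each = n_boot), levels = lvls)

# One-way ANOVA and Tukey's HSD on the bootstrap counts
fit <- aov(boots_vec ~ catg)
anova(fit)
TukeyHSD(fit)
```

Fixing the factor levels before tabulating keeps a category in the counts even when a resample happens to contain none of it.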

## Analysis of Variance Table
## 
## Response: boots_vec
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## catg          2 2352607 1176304   54201 < 2.2e-16 ***
## Residuals 29997  651011      22                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = boots_vec ~ catg)
## 
## $catg
##         diff        lwr        upr p adj
## 2-1 -16.0957 -16.250109 -15.941291     0
## x-1 -20.6410 -20.795409 -20.486591     0
## x-2  -4.5453  -4.699709  -4.390891     0

At the 95% family-wise confidence level, all three pairwise differences are significant. The closest pair is x-2 (draws versus away wins), with a difference of about 4.5. One interpretation is that away teams mostly play not to lose, while home teams play to win and take more risks. We then made a boxplot to see the home, away, and draw results more clearly and to relate them to the ANOVA table above.

boots_pl = data.frame(wins = boots_vec, catg = catg)
boots_gg = ggplot(boots_pl, aes('', wins))


ggplot(boots_pl, aes(x = catg, y = wins, fill = catg)) +
    geom_boxplot(color = 'black') +
    scale_fill_discrete(name = 'Category',labels = c('Home', 'Away', 'Draw')) +
    scale_x_discrete(breaks = NULL, labels = NULL) +
    labs(x = '', y = 'Frequency', title = 'Distribution of wins acc. category') +
    # The lines below add asterisks according to the ANOVA results;
    # comment them out to hide the asterisks

    annotate('segment', x = c(1,2.2,1.2), xend = c(1.8,3,2.8),
              y = c(65,45,55), yend=c(65,45,55),
              color = 'black',size = 0.6) +
    annotate('text', x = c(1.4,2.6,2),
              y = c(66,46,56),
              color = 'black',
              label = "***") +
    theme_gray() +
    theme(legend.position = c(0.5, 0.1),
          plot.title = element_text(face = 'bold', color = alpha('black', 0.8)))

As the boxplot shows, the home-win frequency is close to that of draws and away wins combined, which points to a clear home advantage for Premier League teams.

Then we made a histogram to see the frequencies better, using the code below.

ggplot(boots_pl, aes(x = wins, fill = catg)) +
    geom_histogram(color = 'black',bins = 25) +
    facet_wrap(~catg, ncol = 3, scales = 'fixed') +
    scale_fill_discrete(name = 'Category',labels = c('Home', 'Away', 'Draw')) +
    theme_gray() +
    theme(strip.background = element_blank(),
          strip.text.x = element_blank())

Then we needed to merge the two data sets for further analysis.

dat1 = readRDS("C:\\Users\\burka\\Desktop\\proce\\dat1.rds")
dat3 = readRDS("C:\\Users\\burka\\Desktop\\proce\\dat2.rds")
dat1$totalhandicap[is.na(dat1$totalhandicap)] = 0 #replacing NAs with zeros
merged = merge(dat1, dat3)
saveRDS(merged, 'C:\\Users\\burka\\Desktop\\proce\\merged.rds') #we will call it later for further analysis
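Called without a `by` argument, `merge()` joins on all columns that share a name in both data frames and keeps only the rows that match. The sketch below shows this on toy data; the tables `bets` and `matches` and the key `match_id` are hypothetical and do not reflect the real column names in `dat1` and `dat3`.

```r
# Hypothetical bets table: several rows per match
bets <- data.frame(match_id = c(1, 1, 2, 3),
                   odd      = c(1.8, 2.1, 1.5, 3.2))

# Hypothetical results table: one row per match
matches <- data.frame(match_id = c(1, 2, 4),
                      score    = c("2:1", "0:0", "1:3"))

# Inner join on an explicit key: bets for match 3 have no result and are dropped
merged_toy <- merge(bets, matches, by = "match_id")
nrow(merged_toy)

# Keep every bet, filling missing results with NA
merged_all <- merge(bets, matches, by = "match_id", all.x = TRUE)
nrow(merged_all)
```

Naming the key explicitly makes it easier to notice when rows disappear during the merge.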

After merging the two data sets and saving the result for further analysis, we now load it and analyse the odds. We start with the over/under bets by writing a function that decides whether a given bet would have won.

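betRes = function(row)
{
  ## Decide whether the user would have won a given bet.
  ## Only over/under ('ou') bets are implemented; other types return NA.
  ## `row` is one row of the merged data, as passed by apply(merged, 1, betRes).

  bettype  = row[[2]]
  oddtype  = row[[3]]
  sc.home  = as.numeric(row[[10]])
  sc.away  = as.numeric(row[[11]])
  handicap = as.numeric(row[[7]])  # goal line of the bet

  ## isTRUE() guards against a missing bet type
  if(!isTRUE(bettype == 'ou')) return(NA_character_)

  ## over/under bets are settled on the total number of goals
  goals = sc.home + sc.away
  if(oddtype == 'over')
  {
    ## an over bet wins when the total exceeds the line
    retval = ifelse(goals > handicap, 'win', 'lose')
  }else
  {
    ## an under bet wins when the total stays below the line
    retval = ifelse(goals < handicap, 'win', 'lose')
  }
  return(retval)
}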

This function decides whether the user would have won a given over/under bet. We then continued the analysis with it. First we needed to clean the data to be able to make graphs: we again separated the score column and converted the parts to numbers, which turns any non-numeric values into NAs.
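As a cross-check, the same decision can be computed for all rows at once instead of row by row. This is a sketch under the usual over/under convention (an over bet wins when the total goals exceed the line, an under bet when they stay below it); the `toy_bets` data frame and its values are hypothetical, with column names borrowed from the merged data.

```r
# Hypothetical bets with the goal line and the final score
toy_bets <- data.frame(
  oddtype       = c("over", "under", "over", "under"),
  totalhandicap = c(2.5, 2.5, 3.5, 2.5),
  sc.home       = c(2, 2, 1, 1),
  sc.away       = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Total goals per match
goals <- toy_bets$sc.home + toy_bets$sc.away

# TRUE when the bet wins under the over/under rule
won <- ifelse(toy_bets$oddtype == "over",
              goals > toy_bets$totalhandicap,
              goals < toy_bets$totalhandicap)

toy_bets$betres <- ifelse(won, "win", "lose")
toy_bets
```

Working on whole columns avoids `apply()`, which converts each row to a character vector and makes positional indexing fragile.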

merged = readRDS('C:\\Users\\burka\\Desktop\\proce\\merged.rds')

merged = separate(merged, col = 'score', into = c('sc.home', 'sc.away'), sep = ":")

merged$sc.home = as.numeric(merged$sc.home)
merged$sc.away = as.numeric(merged$sc.away)
merged$totalhandicap = as.numeric(merged$totalhandicap)
## Warning: NAs introduced by coercion
merged$res = ifelse(merged$sc.home > merged$sc.away, '1',
             ifelse(merged$sc.home == merged$sc.away, 'x','2' ))


# NOTE: the code below fails when knitted in R Markdown but works in a plain R script; run it there to see the last graph.
#betres = apply(merged, 1, betRes)
#merged$betres = betres ; rm(betres)

merged$goals = merged$sc.home + merged$sc.away

overs  = merged$oddtype == 'over'

Then we analysed the odds against the total goals for the over/under bets (the plot below uses the over odds). It can be seen that the data contains many observations from matches with 10 goals in the Premier League last season.

ggplot(data.frame(odd = merged[overs,]$odd,
                  goals = factor(merged[overs,]$goals )),
       aes(x=goals, y=odd, fill = goals)) +
    geom_boxplot() +
    scale_y_continuous(limits = c(0.8,20)) + #Axis limits to tweak
    scale_fill_discrete(name = 'Total Goals') +
    labs(x = 'Total Goals', y = 'Odd') + # Axis labels to tweak
    theme_gray() # Different themes can be used
## Warning: Removed 229 rows containing non-finite values (stat_boxplot).

# For a closer look at the lower odds, the y-axis limits above can be changed to c(0.8, 5).

Here you can see the odds ranges and the goal counts of the matches. It also hints at the ideal odds range for over and under bets.

ggplot(data.frame(odd = merged[overs,]$odd,
                  goals = merged[overs,]$goals),
       aes(y=goals, x=odd)) +
    geom_point() +
    geom_smooth(method = 'lm', se=F, color = 'maroon',linetype = 'dashed') 

We can see the ideal range for the odds better here. It looks strange for the odds to stretch from 0 into the 300s, but the highest odd in the data is indeed 380.05, as the summary below shows.

summary(merged$odd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.010   1.670   1.990   2.938   2.570 380.050
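A mean (2.938) well above the median (1.990) is what a few extreme odds would produce. The sketch below shows the effect on a hypothetical vector `odds` and one possible way to pick an upper axis limit from a quantile; the values and the 90% cutoff are assumptions for illustration, not taken from the merged data.

```r
# Hypothetical odds: mostly ordinary values plus one extreme outlier
odds <- c(1.01, 1.5, 1.67, 1.9, 1.99, 2.2, 2.57, 3.4, 5, 380.05)

median(odds)  # barely affected by the outlier
mean(odds)    # dragged upward by the outlier

# A candidate upper axis limit: the 90th percentile of the odds
upper <- quantile(odds, 0.9)

# Share of odds that would remain visible under that limit
mean(odds <= upper)
```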

This is our last graph, which shows the odds range for different betting companies. If you run the commented-out code in an R script, you can see it too.

# ggplot(data.frame(odd    = merged[over_wins,]$odd,
#                   booker = factor(merged[over_wins,]$bookmaker),
#                   handic = merged[over_wins,]$totalhandicap),
#        aes(x = handic, y = odd, color = booker)) +
#     geom_point(alpha = 0.45) +
#     scale_y_continuous(limits = c(0,10)) +
#     scale_x_continuous(limits = c(0,10))