The Data

The dataset I have chosen for my regression analysis is the League of Legends (LoL) Ranked Games dataset, which covers ranked LoL games played during season 9 in the EUW region. It was collected using the Riot Games API, which makes it easy to look up a user's ranked history and collect their games.

# Load packages
library(ggplot2)
library(rstanarm)
library(bayesplot)
library(bayesrules)
library(tidyverse)
library(tidybayes)
library(dplyr)
library(broom.mixed)
library(interactions)

To preface this analysis, one must first understand the game of League of Legends, or LoL for short. In LoL, two five-person teams fight to destroy each other's nexus. Each player chooses (and bans) a champion, each with a different set of skills and abilities, along with two summoner spells. Along the way, players must fight each other, destroy objectives such as towers and inhibitors, and strategize effectively to reach the opponent's nexus. There are also several monsters that teams can defeat for buffs, such as the Rift Herald, Baron Nashor, and an assorted set of Dragons.

We will use lol_data to build various models of League of Legends gameDuration. Throughout, we'll use weakly informative priors and the basic understanding that LoL games usually last 30 minutes but can range from 25 to 35 minutes. We will assess an array of different predictors and combinations to determine what provides the best fit for predicting the duration of a League of Legends game. As a baseline criterion, each game must have at least one tower kill, one inhibitor kill, and one champion kill. This mitigates the number of games that teams forfeited because a player left or trolled.
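To make that intuition concrete, we can translate the 25-35 minute expectation onto the seconds scale that gameDuration is recorded in (a small arithmetic sketch; the resulting 1800-second center is the value used later for the intercept prior):

```r
# gameDuration is recorded in seconds, so convert the 25-35 minute
# intuition to the response scale
typical_secs <- 30 * 60          # 1800 seconds
range_secs   <- c(25, 35) * 60   # 1500 to 2100 seconds

# A normal(1800, 150) prior puts roughly 95% of its mass within
# 1800 +/- 2 * 150 = 1500 to 2100 seconds, matching that range
c(typical_secs, range_secs)
```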

lol_data <- read.csv("games.csv",
                    sep=",",
                    na.strings=c(""," ","NA","N/A")
)
head(lol_data)

Setting Up the Data

Transforming the data to meet the baseline of at least one kill, one tower, and one inhibitor

lol_data <- lol_data[!(lol_data$firstBlood == 0 | lol_data$firstTower == 0 | lol_data$firstInhibitor == 0), ]
nrow(lol_data)
[1] 45214
colnames(lol_data)
 [1] "gameId"             "creationTime"      
 [3] "gameDuration"       "seasonId"          
 [5] "winner"             "firstBlood"        
 [7] "firstTower"         "firstInhibitor"    
 [9] "firstBaron"         "firstDragon"       
[11] "firstRiftHerald"    "t1_champ1id"       
[13] "t1_champ1_sum1"     "t1_champ1_sum2"    
[15] "t1_champ2id"        "t1_champ2_sum1"    
[17] "t1_champ2_sum2"     "t1_champ3id"       
[19] "t1_champ3_sum1"     "t1_champ3_sum2"    
[21] "t1_champ4id"        "t1_champ4_sum1"    
[23] "t1_champ4_sum2"     "t1_champ5id"       
[25] "t1_champ5_sum1"     "t1_champ5_sum2"    
[27] "t1_towerKills"      "t1_inhibitorKills" 
[29] "t1_baronKills"      "t1_dragonKills"    
[31] "t1_riftHeraldKills" "t1_ban1"           
[33] "t1_ban2"            "t1_ban3"           
[35] "t1_ban4"            "t1_ban5"           
[37] "t2_champ1id"        "t2_champ1_sum1"    
[39] "t2_champ1_sum2"     "t2_champ2id"       
[41] "t2_champ2_sum1"     "t2_champ2_sum2"    
[43] "t2_champ3id"        "t2_champ3_sum1"    
[45] "t2_champ3_sum2"     "t2_champ4id"       
[47] "t2_champ4_sum1"     "t2_champ4_sum2"    
[49] "t2_champ5id"        "t2_champ5_sum1"    
[51] "t2_champ5_sum2"     "t2_towerKills"     
[53] "t2_inhibitorKills"  "t2_baronKills"     
[55] "t2_dragonKills"     "t2_riftHeraldKills"
[57] "t2_ban1"            "t2_ban2"           
[59] "t2_ban3"            "t2_ban4"           
[61] "t2_ban5"           

We can clean up our data by omitting unnecessary columns

lol_data <- subset(lol_data, select = -c(
  t1_ban1, t1_ban2, t1_ban3, t1_ban4, t1_ban5,
  t2_ban1, t2_ban2, t2_ban3, t2_ban4, t2_ban5,
  t1_champ1_sum1, t1_champ2_sum1, t1_champ3_sum1, t1_champ4_sum1, t1_champ5_sum1,
  t2_champ1_sum1, t2_champ2_sum1, t2_champ3_sum1, t2_champ4_sum1, t2_champ5_sum1,
  t1_champ1_sum2, t1_champ2_sum2, t1_champ3_sum2, t1_champ4_sum2, t1_champ5_sum2,
  t2_champ1_sum2, t2_champ2_sum2, t2_champ3_sum2, t2_champ4_sum2, t2_champ5_sum2,
  t1_champ1id, t1_champ2id, t1_champ3id, t1_champ4id, t1_champ5id,
  t2_champ1id, t2_champ2id, t2_champ3id, t2_champ4id, t2_champ5id
))

colnames(lol_data)
 [1] "gameId"             "creationTime"      
 [3] "gameDuration"       "seasonId"          
 [5] "winner"             "firstBlood"        
 [7] "firstTower"         "firstInhibitor"    
 [9] "firstBaron"         "firstDragon"       
[11] "firstRiftHerald"    "t1_towerKills"     
[13] "t1_inhibitorKills"  "t1_baronKills"     
[15] "t1_dragonKills"     "t1_riftHeraldKills"
[17] "t2_towerKills"      "t2_inhibitorKills" 
[19] "t2_baronKills"      "t2_dragonKills"    
[21] "t2_riftHeraldKills"

We can also convert some of the columns to factors for team 1 and team 2 for ease of visualization and interpretation

lol_data$winner <- as.factor(lol_data$winner)
lol_data$firstBlood <- as.factor(lol_data$firstBlood)
lol_data$firstTower <- as.factor(lol_data$firstTower)
lol_data$firstInhibitor <- as.factor(lol_data$firstInhibitor)
lol_data$firstBaron <- as.factor(lol_data$firstBaron)
lol_data$firstDragon <- as.factor(lol_data$firstDragon)
lol_data$firstRiftHerald <- as.factor(lol_data$firstRiftHerald)
head(lol_data)

Summary Analysis

Here are some visualizations and statistics to help better understand our data and what we are working with.

summary(lol_data)
     gameId           creationTime        gameDuration 
 Min.   :3.215e+09   Min.   :1.497e+12   Min.   : 477  
 1st Qu.:3.292e+09   1st Qu.:1.502e+12   1st Qu.:1632  
 Median :3.320e+09   Median :1.504e+12   Median :1895  
 Mean   :3.306e+09   Mean   :1.503e+12   Mean   :1930  
 3rd Qu.:3.327e+09   3rd Qu.:1.504e+12   3rd Qu.:2196  
 Max.   :3.332e+09   Max.   :1.505e+12   Max.   :4728  
    seasonId winner    firstBlood firstTower firstInhibitor
 Min.   :9   1:22867   1:23151    1:23248    1:23054       
 1st Qu.:9   2:22347   2:22063    2:21966    2:22160       
 Median :9                                                 
 Mean   :9                                                 
 3rd Qu.:9                                                 
 Max.   :9                                                 
 firstBaron firstDragon firstRiftHerald t1_towerKills   
 0:14617    0:  434     0:21922         Min.   : 0.000  
 1:14469    1:22258     1:11933         1st Qu.: 3.000  
 2:16128    2:22522     2:11359         Median : 7.000  
                                        Mean   : 6.173  
                                        3rd Qu.:10.000  
                                        Max.   :11.000  
 t1_inhibitorKills t1_baronKills    t1_dragonKills 
 Min.   : 0.000    Min.   :0.0000   Min.   :0.000  
 1st Qu.: 0.000    1st Qu.:0.0000   1st Qu.:0.000  
 Median : 1.000    Median :0.0000   Median :1.000  
 Mean   : 1.159    Mean   :0.4173   Mean   :1.483  
 3rd Qu.: 2.000    3rd Qu.:1.0000   3rd Qu.:2.000  
 Max.   :10.000    Max.   :5.0000   Max.   :6.000  
 t1_riftHeraldKills t2_towerKills    t2_inhibitorKills
 Min.   :0.0000     Min.   : 0.000   Min.   : 0.000   
 1st Qu.:0.0000     1st Qu.: 2.000   1st Qu.: 0.000   
 Median :0.0000     Median : 7.000   Median : 1.000   
 Mean   :0.2639     Mean   : 6.014   Mean   : 1.122   
 3rd Qu.:1.0000     3rd Qu.:10.000   3rd Qu.: 2.000   
 Max.   :1.0000     Max.   :11.000   Max.   :10.000   
 t2_baronKills    t2_dragonKills  t2_riftHeraldKills
 Min.   :0.0000   Min.   :0.000   Min.   :0.0000    
 1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000    
 Median :0.0000   Median :1.000   Median :0.0000    
 Mean   :0.4641   Mean   :1.506   Mean   :0.2512    
 3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:1.0000    
 Max.   :4.0000   Max.   :6.000   Max.   :1.0000    

Some conclusions we can make here:

- Game durations average about 1930 seconds (roughly 32 minutes), with a median of 1895 seconds
- Team 1 wins slightly more often than team 2 (22867 vs 22347 games) and also takes first blood, first tower, and first inhibitor slightly more often
- Tower kills are high for both teams (medians of 7), while baron and rift herald kills are comparatively rare (means well below 1)

Visualizations

We can use ggplot to visualize some of our findings

Response Distribution

ggplot(lol_data, aes(x = gameDuration)) +
  geom_histogram(binwidth = 60, fill = "blue", color = "black") +
  labs(x = "Game Duration (seconds)", y = "Frequency", title = "Distribution of Game Duration") +
  theme_minimal()

Game duration seems to be a bit right-skewed
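We can back up that visual impression with a number. Below is a small helper for sample skewness (a sketch: the third central moment scaled by the cubed standard deviation), where a positive value indicates right skew:

```r
# Sample skewness: third central moment over the cubed standard deviation
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}
```

Applying it as `sample_skewness(lol_data$gameDuration)` should return a positive value, consistent with the skew seen in the histogram.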

First Comparisons

ggplot(lol_data, aes(x = winner, fill = winner)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(x = "Winner", y = "Count")

ggplot(lol_data, aes(x = firstBlood, fill = firstBlood)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(x = "First Blood", y = "Count")

ggplot(lol_data, aes(x = firstTower, fill = firstTower)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(x = "First Tower", y = "Count")

ggplot(lol_data, aes(x = firstInhibitor, fill = firstInhibitor)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(x = "First Inhibitor", y = "Count")

ggplot(lol_data, aes(x = firstDragon, fill = firstDragon)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(x = "First Dragon", y = "Count")

ggplot(lol_data, aes(x = firstRiftHerald, fill = firstRiftHerald)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(x = "First Rift Herald", y = "Count")

ggplot(lol_data, aes(x = firstBaron, fill = firstBaron)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(x = "First Baron", y = "Count")

These are all consistent with our previous summary findings about the first-objective categories

Total Objectives Comparisons

ggplot(lol_data, aes(x = t1_towerKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")

ggplot(lol_data, aes(x = t2_towerKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")

tower_kills <- data.frame(Team = c("T1", "T2"),
                          Kills = c(sum(lol_data$t1_towerKills, na.rm = TRUE), 
                                    sum(lol_data$t2_towerKills, na.rm = TRUE)))

ggplot(tower_kills, aes(x = Team, y = Kills, fill = Team)) +
  geom_bar(stat = "identity") +
  labs(x = "Team", y = "Total Tower Kills", title = "Comparison of Tower Kills between T1 and T2")

More often than not, both teams tend to kill >6 towers per game
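We can quantify "more often than not" directly with the share of games in which each team destroyed more than 6 towers (a sketch; `prop_over` is a hypothetical helper, not part of the dataset's code):

```r
# Share of games in which a team cleared more than `k` of an objective
prop_over <- function(x, k) mean(x > k, na.rm = TRUE)
```

With medians of 7 tower kills for both teams in the summary above, `prop_over(lol_data$t1_towerKills, 6)` and `prop_over(lol_data$t2_towerKills, 6)` should both land above 0.5.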

ggplot(lol_data, aes(x = t1_inhibitorKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")

ggplot(lol_data, aes(x = t2_inhibitorKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")

# Calculate the total number of inhibitor kills for each team
inhibitor_kills <- data.frame(Team = c("T1", "T2"),
                              Kills = c(sum(lol_data$t1_inhibitorKills, na.rm = TRUE), 
                                        sum(lol_data$t2_inhibitorKills, na.rm = TRUE)))

ggplot(inhibitor_kills, aes(x = Team, y = Kills, fill = Team)) +
  geom_bar(stat = "identity") +
  labs(x = "Team", y = "Total Inhibitor Kills", title = "Comparison of Inhibitor Kills between T1 and T2")

Games usually end with 0 to 2 inhibitors being destroyed, likely due to late-game forfeits or one-sided steamrolls, both of which are common

ggplot(lol_data, aes(x = t1_dragonKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")

ggplot(lol_data, aes(x = t2_dragonKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")

# Calculate the total number of dragon kills for each team
dragon_kills <- data.frame(Team = c("T1", "T2"),
                           Kills = c(sum(lol_data$t1_dragonKills, na.rm = TRUE), 
                                     sum(lol_data$t2_dragonKills, na.rm = TRUE)))

ggplot(dragon_kills, aes(x = Team, y = Kills, fill = Team)) +
  geom_bar(stat = "identity") +
  labs(x = "Team", y = "Total Dragon Kills", title = "Comparison of Dragon Kills between T1 and T2")

Dragon kills comfortably lie within the 0-2 range

ggplot(lol_data, aes(x = t1_riftHeraldKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")

ggplot(lol_data, aes(x = t2_riftHeraldKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")

# Calculate the total number of rift herald kills for each team
riftHerald_kills <- data.frame(Team = c("T1", "T2"),
                               Kills = c(sum(lol_data$t1_riftHeraldKills, na.rm = TRUE), 
                                         sum(lol_data$t2_riftHeraldKills, na.rm = TRUE)))

ggplot(riftHerald_kills, aes(x = Team, y = Kills, fill = Team)) +
  geom_bar(stat = "identity") +
  labs(x = "Team", y = "Total Rift Herald Kills", title = "Comparison of Rift Herald Kills between T1 and T2")

The Rift Herald only spawns between 9:50 and 19:45 of the game and is most often taken once, if at all

ggplot(lol_data, aes(x = t1_baronKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")

ggplot(lol_data, aes(x = t2_baronKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")

# Calculate the total number of baron kills for each team
baron_kills <- data.frame(Team = c("T1", "T2"),
                          Kills = c(sum(lol_data$t1_baronKills, na.rm = TRUE), 
                                    sum(lol_data$t2_baronKills, na.rm = TRUE)))

ggplot(baron_kills, aes(x = Team, y = Kills, fill = Team)) +
  geom_bar(stat = "identity") +
  labs(x = "Team", y = "Total Baron Kills", title = "Comparison of Baron Kills between T1 and T2")

Baron spawns after the Rift Herald and, if taken and used properly, usually ends the game; hence the count is usually 0 or 1, as some games end before it can even be taken

Base Model and Prior Modeling

Let's start off with a base model to see if we can get insight from just the winner of the match

base <- stan_glm(gameDuration ~ winner,
  data = lol_data, family = gaussian, 
  prior_intercept = normal(1800, 150, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 6.6e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.66 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.198 seconds (Warm-up)
Chain 1:                24.632 seconds (Sampling)
Chain 1:                24.83 seconds (Total)
Chain 1: 

prior_summary(base) 
Priors for model 'base' 
------
Intercept (after predictors centered)
  Specified prior:
    ~ normal(location = 1800, scale = 150)
  Adjusted prior:
    ~ normal(location = 1800, scale = 64093)

Coefficients
  Specified prior:
    ~ normal(location = 0, scale = 2.5)
  Adjusted prior:
    ~ normal(location = 0, scale = 2137)

Auxiliary (sigma)
  Specified prior:
    ~ exponential(rate = 1)
  Adjusted prior:
    ~ exponential(rate = 0.0023)
------
See help('prior_summary.stanreg') for more details
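Where do the adjusted scales come from? With `autoscale = TRUE`, rstanarm rescales the specified priors by the sample standard deviation of the response (and of each predictor). A quick arithmetic sketch, under the assumption that sd(x) is about 0.5 for the near-balanced binary winner indicator:

```r
# Implied sd of gameDuration, recovered from the adjusted intercept scale
sd_y <- 64093 / 150               # ~427 seconds
# Intercept: specified scale 150 multiplied by sd(y)
round(150 * sd_y)                 # 64093
# Coefficient on winner: 2.5 * sd(y) / sd(x), with sd(x) roughly 0.5
round(2.5 * sd_y / 0.5)           # close to the reported 2137
# Sigma: exponential rate rescaled to 1 / sd(y)
round(1 / sd_y, 4)                # 0.0023
```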

A prior predictive check can be used to see what the priors alone lead us to expect before the model sees the data, and whether the adjusted priors the stan model settled on are appropriate for our predictions

base_priors <- update(base, prior_PD = TRUE)

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 1.3e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.13 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 3.402 seconds (Warm-up)
Chain 1:                0.138 seconds (Sampling)
Chain 1:                3.54 seconds (Total)
Chain 1: 

# 200 posterior model lines
lol_data %>%
  add_epred_draws(base, ndraws = 200) %>%
  ggplot(aes(x = winner, y = gameDuration)) +
    geom_line(aes(y = .epred, group = .draw), alpha = 0.05)

# 4 posterior simulated datasets
set.seed(3)
lol_data %>%
  add_predicted_draws(base, ndraws = 4) %>%
  ggplot(aes(x = winner, y = gameDuration)) +
    geom_point(aes(y = .prediction, group = .draw)) + 
    facet_wrap(~ .draw)

The model expects mean game durations between roughly 1910 and 1945 seconds, as expected

Simulated game durations lie between 500-3500 seconds, which falls in line with our summary

The model expects games to last slightly longer when team 2 wins than when team 1 wins

mcmc_trace(base, size = .1)

The trace plots look fast-mixing and consistent

rhat(base)
(Intercept)     winner2       sigma 
  0.9998587   0.9999295   0.9999225 

Rhat values are close to 1 and well below 1.05, indicating the chains have converged

neff_ratio(base)
(Intercept)     winner2       sigma 
    0.92220     0.95970     0.95575 

The effective sample size ratios are well above the 0.1 warning threshold (close to 1, in fact), indicating efficient, nearly independent sampling

tidy(base, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

Overall, the model could work, but more investigation should be done to improve the fit

Interaction Investigation

Let's see if there are some interactions we can expect

Speaking from my own experience playing the game, I predict that dragon and baron kills will have a big impact on game duration, since the buffs they grant every member of the team are crucial to winning

int <- stan_glm(gameDuration ~ t1_dragonKills:t2_dragonKills,
  data = lol_data, family = gaussian, 
  prior_intercept = normal(1800, 150, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 2.5e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.25 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.243 seconds (Warm-up)
Chain 1:                23.706 seconds (Sampling)
Chain 1:                23.949 seconds (Total)
Chain 1: 

int2 <- stan_glm(gameDuration ~ t1_baronKills:t2_baronKills,
  data = lol_data, family = gaussian, 
  prior_intercept = normal(1800, 150, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)

(Stan sampling output omitted: 4 chains of 10,000 iterations each, with 5,000 warmup; each chain completed in roughly 17 seconds.)
summary(int)

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      gameDuration ~ t1_dragonKills:t2_dragonKills
 algorithm:    sampling
 sample:       20000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 45214
 predictors:   2

Estimates:
                                mean   sd     10%    50%    90% 
(Intercept)                   1728.5    2.0 1725.9 1728.5 1731.1
t1_dragonKills:t2_dragonKills  147.5    0.9  146.3  147.5  148.7
sigma                          340.5    1.1  339.0  340.5  341.9

Fit Diagnostics:
           mean   sd     10%    50%    90% 
mean_PPD 1929.8    2.3 1926.9 1929.8 1932.7

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
                              mcse Rhat n_eff
(Intercept)                   0.0  1.0  17400
t1_dragonKills:t2_dragonKills 0.0  1.0  20373
sigma                         0.0  1.0  18046
mean_PPD                      0.0  1.0  17889
log-posterior                 0.0  1.0   9182

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
summary(int2)

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      gameDuration ~ t1_baronKills:t2_baronKills
 algorithm:    sampling
 sample:       20000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 45214
 predictors:   2

Estimates:
                              mean   sd     10%    50%    90% 
(Intercept)                 1884.0    1.9 1881.6 1884.0 1886.4
t1_baronKills:t2_baronKills  466.7    4.8  460.4  466.6  472.8
sigma                        390.0    1.3  388.4  390.0  391.7

Fit Diagnostics:
           mean   sd     10%    50%    90% 
mean_PPD 1929.8    2.6 1926.5 1929.8 1933.1

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
                            mcse Rhat n_eff
(Intercept)                 0.0  1.0  17022
t1_baronKills:t2_baronKills 0.0  1.0  20014
sigma                       0.0  1.0  18691
mean_PPD                    0.0  1.0  18228
log-posterior               0.0  1.0   9914

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

The summary statistics look promising: neither interaction coefficient's credible interval contains 0, suggesting both interactions are significant and could be useful.
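As a quick numeric check, we can pull the 95% posterior credible intervals directly (a sketch assuming the fitted `int` and `int2` objects from above):

```r
# 95% posterior credible intervals for the interaction coefficients;
# if neither interval contains 0, the interactions look useful
posterior_interval(int,  prob = 0.95, pars = "t1_dragonKills:t2_dragonKills")
posterior_interval(int2, prob = 0.95, pars = "t1_baronKills:t2_baronKills")
```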

# Extract the posterior samples
posterior_samples <- as.matrix(int)
posterior_samples2 <- as.matrix(int2)

# Plot the posterior distribution of the interaction term
mcmc_hist(posterior_samples, pars = c("t1_dragonKills:t2_dragonKills"))

mcmc_hist(posterior_samples2, pars = c("t1_baronKills:t2_baronKills"))

The posterior distributions of the interaction coefficients don't look skewed either, which is reassuring.
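For a numeric companion to the histograms, we could compare each coefficient's posterior mean and median; for a roughly symmetric posterior they should nearly coincide (a sketch assuming the `posterior_samples` matrix from above):

```r
# Mean vs. median of the interaction draws; a symmetric
# posterior has mean approximately equal to median
draws <- posterior_samples[, "t1_dragonKills:t2_dragonKills"]
c(mean = mean(draws), median = median(draws))
```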

We can conclude that there may be an interaction between the two teams' dragon kills, and likewise between their Baron kills.

We can use this information later when building a better model.

Main Model

Let’s assess how much getting a jump start in the game matters for determining its length.

main <- stan_glm(gameDuration ~ winner + firstBlood + firstTower + firstDragon + firstBaron,
  data = lol_data, family = gaussian, 
  prior_intercept = normal(1800, 150, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)

(Stan sampling output omitted: 4 chains of 10,000 iterations each, with 5,000 warmup; each chain completed in roughly 17-21 seconds.)
prior_summary(main) 
Priors for model 'main' 
------
Intercept (after predictors centered)
  Specified prior:
    ~ normal(location = 1800, scale = 150)
  Adjusted prior:
    ~ normal(location = 1800, scale = 64093)

Coefficients
  Specified prior:
    ~ normal(location = [0,0,0,...], scale = [2.5,2.5,2.5,...])
  Adjusted prior:
    ~ normal(location = [0,0,0,...], scale = [2136.54,2137.02,2137.26,...])

Auxiliary (sigma)
  Specified prior:
    ~ exponential(rate = 1)
  Adjusted prior:
    ~ exponential(rate = 0.0023)
------
See help('prior_summary.stanreg') for more details
mcmc_trace(main, size = .1)

The trace plots look normal and the chains appear well mixed.

rhat(main)
 (Intercept)      winner2  firstBlood2  firstTower2 firstDragon1 
   1.0001821    0.9999929    0.9999398    0.9998713    1.0001601 
firstDragon2  firstBaron1  firstBaron2        sigma 
   1.0001982    0.9998699    0.9999778    0.9999692 
neff_ratio(main)
 (Intercept)      winner2  firstBlood2  firstTower2 firstDragon1 
     0.51960      1.05495      1.28565      1.17810      0.50175 
firstDragon2  firstBaron1  firstBaron2        sigma 
     0.49995      0.98030      0.97875      1.05095 

Rhat looks good. Some neff ratios are above 1, which simply means the effective sample size exceeds the number of draws; that is acceptable for now.

tidy(main, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

From the tidy output, winner, firstBlood, and firstTower lose their significance once the other predictors are added, since their 95% credible intervals include 0. We’ll use this information later when refining the model.
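One way to flag those candidates programmatically is to filter the tidy output for coefficients whose interval straddles 0 (a sketch assuming the fitted `main` model from above):

```r
# Coefficients whose 95% credible interval contains 0
# are candidates for removal in a refined model
tidy(main, effects = "fixed", conf.int = TRUE, conf.level = 0.95) %>%
  filter(conf.low < 0, conf.high > 0)
```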

newdata <- data.frame(winner = factor(1, levels = levels(lol_data$winner)),
                      firstBlood = factor(2, levels = levels(lol_data$firstBlood)),
                      firstTower = factor(2, levels = levels(lol_data$firstTower)),
                      firstDragon = factor(2, levels = levels(lol_data$firstDragon)),
                      firstBaron = factor(2, levels = levels(lol_data$firstBaron)))

main_predict <- posterior_predict(
  main, 
  newdata = newdata)
mcmc_areas(main_predict) +  xlab("Game Duration") +
  ggtitle('Predictive distribution of a League of Legends game where first blood, tower, dragon, baron was team 2 and team 1 won') +
  theme(plot.title = element_text(size = 7))

Here’s an example of where the game duration may lie if team 1 wins despite falling behind from the start. The predicted game runs a little longer than normal, which is to be expected since the team needs time to mount the comeback. However, I believe this undershoots the true duration we could expect in this scenario. We’ll verify whether that is the case after refining our model.
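We can also summarize the predictive draws numerically, converting seconds to minutes for readability (a sketch assuming the `main_predict` draws from above):

```r
# Median and 95% predictive interval for this scenario, in minutes
quantile(main_predict[, 1] / 60, c(0.025, 0.5, 0.975))
```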

More Predictors

We can add numerical variables counting how many of each objective a team has taken to see how they affect the game duration.

ext_main <- stan_glm(gameDuration ~ winner + firstBlood + firstTower + firstDragon + firstBaron + t1_towerKills + t1_inhibitorKills + t1_baronKills + t1_dragonKills + t2_towerKills + t2_inhibitorKills + t2_baronKills + t2_dragonKills,
  data = lol_data, family = gaussian, 
  prior_intercept = normal(1800, 150, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)

(Stan sampling output omitted: 4 chains of 10,000 iterations each, with 5,000 warmup; each chain completed in roughly 24-25 seconds.)
prior_summary(ext_main) 
Priors for model 'ext_main' 
------
Intercept (after predictors centered)
  Specified prior:
    ~ normal(location = 1800, scale = 150)
  Adjusted prior:
    ~ normal(location = 1800, scale = 64093)

Coefficients
  Specified prior:
    ~ normal(location = [0,0,0,...], scale = [2.5,2.5,2.5,...])
  Adjusted prior:
    ~ normal(location = [0,0,0,...], scale = [2136.54,2137.02,2137.26,...])

Auxiliary (sigma)
  Specified prior:
    ~ exponential(rate = 1)
  Adjusted prior:
    ~ exponential(rate = 0.0023)
------
See help('prior_summary.stanreg') for more details
mcmc_trace(ext_main, size = .1)

Traces look good

rhat(ext_main)
      (Intercept)           winner2       firstBlood2 
        1.0002256         1.0002342         0.9998792 
      firstTower2      firstDragon1      firstDragon2 
        0.9999455         1.0003390         1.0003823 
      firstBaron1       firstBaron2     t1_towerKills 
        1.0002441         0.9999394         0.9999148 
t1_inhibitorKills     t1_baronKills    t1_dragonKills 
        1.0000205         1.0005001         1.0000165 
    t2_towerKills t2_inhibitorKills     t2_baronKills 
        1.0000909         0.9998952         0.9999024 
   t2_dragonKills             sigma 
        0.9999282         1.0000341 
neff_ratio(ext_main)
      (Intercept)           winner2       firstBlood2 
          0.59525           0.91430           2.09050 
      firstTower2      firstDragon1      firstDragon2 
          1.52335           0.49105           0.49450 
      firstBaron1       firstBaron2     t1_towerKills 
          0.70805           0.74320           0.77175 
t1_inhibitorKills     t1_baronKills    t1_dragonKills 
          1.02380           0.68870           0.96135 
    t2_towerKills t2_inhibitorKills     t2_baronKills 
          0.80965           1.10315           0.75605 
   t2_dragonKills             sigma 
          0.99900           1.19210 

Rhat looks good. Some neff ratios (e.g. firstBlood, firstTower) are now well above 1; a ratio above 1 is not itself a convergence problem, but those same predictors are also not significant, so they are candidates for removal later.

tidy(ext_main, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

Here, the predictors winner, firstBlood, and firstTower are not significant, as their 95% credible intervals include 0.

# dataframe with the specified values
newdata_ext <- data.frame(
  winner = factor(1, levels = levels(lol_data$winner)),
  firstBlood = factor(2, levels = levels(lol_data$firstBlood)),
  firstTower = factor(2, levels = levels(lol_data$firstTower)),
  firstDragon = factor(2, levels = levels(lol_data$firstDragon)),
  firstBaron = factor(2, levels = levels(lol_data$firstBaron)),
  t1_towerKills = 7,
  t1_inhibitorKills = 2,
  t1_baronKills = 2,
  t1_dragonKills = 2,
  t2_towerKills = 3,
  t2_inhibitorKills = 2,
  t2_baronKills = 1,
  t2_dragonKills = 3
)

ext_main_predict <- posterior_predict(ext_main, newdata = newdata_ext)

mcmc_areas(ext_main_predict) +  
  xlab("Game Duration") +
  ggtitle('Predictive distribution of a League of Legends game where first blood, tower, dragon, baron was team 2 and team 1 won, with additional predictors') +
  theme(plot.title = element_text(size = 8))

After adding a few more variables, we can see the predicted game gets longer (expectedly so). Team 1 comes from behind but ends up taking 2 Barons and 7 towers, suggesting the game would have to run for a while for the comeback to truly be complete.

Refined Model

Let’s now remove the predictors flagged in the extended model (those with neff ratios above 1, which were also not significant) and include the interaction terms we deemed useful beforehand.

ext_main_int <- stan_glm(gameDuration ~ firstDragon + firstBaron + t1_baronKills + t1_dragonKills + t2_baronKills + t2_dragonKills + t1_baronKills:t2_baronKills + t1_dragonKills:t2_dragonKills,
  data = lol_data, family = gaussian, 
  prior_intercept = normal(1800, 150, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)

(Stan sampling output omitted: 4 chains of 10,000 iterations each, with 5,000 warmup; each chain completed in roughly 26-27 seconds.)
prior_summary(ext_main_int)
Priors for model 'ext_main_int' 
------
Intercept (after predictors centered)
  Specified prior:
    ~ normal(location = 1800, scale = 150)
  Adjusted prior:
    ~ normal(location = 1800, scale = 64093)

Coefficients
  Specified prior:
    ~ normal(location = [0,0,0,...], scale = [2.5,2.5,2.5,...])
  Adjusted prior:
    ~ normal(location = [0,0,0,...], scale = [2136.66,2136.42,2289.92,...])

Auxiliary (sigma)
  Specified prior:
    ~ exponential(rate = 1)
  Adjusted prior:
    ~ exponential(rate = 0.0023)
------
See help('prior_summary.stanreg') for more details
mcmc_trace(ext_main_int, size = .1)

So far so good

rhat(ext_main_int)
                  (Intercept)                  firstDragon1 
                    1.0001372                     1.0002078 
                 firstDragon2                   firstBaron1 
                    1.0001533                     1.0003050 
                  firstBaron2                 t1_baronKills 
                    1.0002689                     1.0002779 
               t1_dragonKills                 t2_baronKills 
                    1.0000343                     1.0004493 
               t2_dragonKills   t1_baronKills:t2_baronKills 
                    1.0002614                     1.0004149 
t1_dragonKills:t2_dragonKills                         sigma 
                    0.9999161                     0.9999662 

Rhat looks satisfactory for all predictors

neff_ratio(ext_main_int)
                  (Intercept)                  firstDragon1 
                      0.59695                       0.51615 
                 firstDragon2                   firstBaron1 
                      0.51000                       0.56870 
                  firstBaron2                 t1_baronKills 
                      0.56995                       0.55110 
               t1_dragonKills                 t2_baronKills 
                      0.72620                       0.54695 
               t2_dragonKills   t1_baronKills:t2_baronKills 
                      0.71520                       0.68670 
t1_dragonKills:t2_dragonKills                         sigma 
                      0.87645                       1.14675 

Neff ratios are much more reasonable now, a big improvement over the earlier models, with no predictor's ratio above 0.9 (only sigma exceeds 1).

tidy(ext_main_int, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

Every coefficient looks significant, as none of the 95% credible intervals include 0 (even the interaction terms!).

# dataframe with the specified values
newdata_ext_int <- data.frame(
  firstDragon = factor(2, levels = levels(lol_data$firstDragon)),
  firstBaron = factor(2, levels = levels(lol_data$firstBaron)),
  t1_baronKills = 2,
  t1_dragonKills = 2,
  t2_baronKills = 1,
  t2_dragonKills = 3
)

ext_main_int_predict <- posterior_predict(ext_main_int, newdata = newdata_ext_int)

mcmc_areas(ext_main_int_predict) +  
  xlab("Game Duration") +
  ggtitle('Predictive distribution of a League of Legends game where first dragon, baron was team 2 and team 1 had more baron kills with interaction') +
  theme(plot.title = element_text(size = 8))

The games seem to drag on even longer under this new model. This is expected: dragon and Baron respawn timers are quite long, so, as noted before, the comeback would realistically take longer than average.

Let's do some model comparisons to determine which model is best.

Model Comparisons

PP Check

pp_check(main, nreps = 50) + xlab("Game Duration") +
  ggtitle('Main effects model')

The main model covers a decent amount of the area, but it drifts too far right and undershoots at the peak.

pp_check(ext_main, nreps = 50) + xlab("Game Duration") +
  ggtitle('Extended Main effects model')

The extended model is clearly better than the main model, but it still undershoots the area at the peak.

pp_check(ext_main_int, nreps = 50) + xlab("Game Duration") +
  ggtitle('Refined Interaction effects model')

The interaction model is very similar to the extended model, and both are better than the main model.

The pp_checks show most of the area being covered by the last two models, with the interaction model doing slightly better around the peak.

10 Fold Cross-Validations

test_sample <- lol_data %>% head(10000)
nrow(test_sample)
[1] 10000
set.seed(84735)

p_main <- prediction_summary(model = main, data = test_sample)
p_ext_main <- prediction_summary(model = ext_main, data = test_sample)
p_ext_main_int <- prediction_summary(model = ext_main_int, data = test_sample)
rbind(p_main, p_ext_main, p_ext_main_int)

These are the raw MAE values based on a sample of 10,000 observations from the data. We can use them alongside the cross-validation results below to assess the behavior and bias of each model.

set.seed(84735)

cv_main <- prediction_summary_cv(
  model = main, data = lol_data, k = 10)

cv_extend <- prediction_summary_cv(
  model = ext_main, data = lol_data, k = 10)

cv_interact <- prediction_summary_cv(
  model = ext_main_int, data = lol_data, k = 10)
rbind(cv_main$cv, cv_extend$cv, cv_interact$cv)

Loo Diagnostics

The extended model fares better than the interaction model when comparing ELPD and MAE: the interaction model scores about 4,337.6 points "worse" on ELPD and is around 16 points higher in MAE. However, something to consider here is overfitting, and how additional predictors can affect the model overall and skew these metrics.

Despite the interaction model covering more area, its MAE and ELPD are both worse. These metrics may be flattered by the larger number of predictors in the extended model. Furthermore, the extended model contains several predictors that were not significant, along with unusually high neff ratios, which raises concerns about that model.

Despite having fewer predictors, the interaction model garners more area in its posterior predictive distribution while having a similar MAE and a comparable ELPD (−314,197.3 vs. −309,859.7), a minor (~1%) difference given the size of these values. However, the interaction terms could also be having an adverse effect on the model, leading to these somewhat greater errors.
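For reference, the ELPD values discussed above can be obtained with leave-one-out cross-validation (a sketch assuming the fitted models from above):

```r
# Approximate leave-one-out ELPD for each model; loo_compare()
# reports ELPD differences relative to the best model
loo_ext <- loo(ext_main)
loo_int <- loo(ext_main_int)
loo_compare(loo_ext, loo_int)
```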

Overfitting

Based on the small differences between the raw MAE values and the cross-validation MAE values, overfitting does not seem to pose a serious threat to these models; however, we could only truly know this by testing the models on entirely new data. The extended model does show a larger (though mostly negligible) gap between its raw and cross-validation MAE compared to the other models.

What is undoubtedly clear is that adding numerical predictors to the model does help the predictions overall and is shown with both the extended and interaction model fairing better than the main model in every aspect.

Regression Inference

Let’s see how the 2 best models fair when conducting a quick hypothesis test to see the Posterior probability of a game around 40 minutes long

We’ll use the predictive posterior models we created and visualized earlier to conduct the tests

\[ H_0: \pi \geq 2500 \]

\[ H_a: \pi < 2500 \]

Though the two models seemingly predict the actual distribution of the data well, the 2 models have very different outcomes when conducting the test.

The extended model favors the alternate hypothesis while the interaction model clearly favors the null hypothesis. This both coincides with what we saw before from the visualizations and tells us that the extended model favors games that are shorter while the interaction model expects games like this to take longer.

This is also something one should consider when choosing the model. Whether or not the predictions themselves seem realistic given the scenario/circumstances.

To back up what I’m saying, Baron Nashor spawns at 20 minutes and respawns every 6 minutes. Since 3 Barons are killed in both scenarios, the game would have to be at least 38 minutes and only if the teams kill the Baron IMMEDIATELY (which is usually never the case). Therefore, one would expect majority of the area to lie within the ~40 minute range at the very least for when the game would end.

Conclusion

In conclusion, it is a choice between whether one wants to take the chance of playing with a model that may be susceptible to overfitting for the sake of potentially less error or a slightly higher error model that has interactions but would be less likely to be susceptible to overfitting.

Personally, I say that the refined interaction model is the best model to use for predicting the game duration of a LOL match given our data due to its comparable MAE along with better posterior predictive abilities and lack of evidence for overfitting. Furthermore, based on my experience and game rules, the predictive values it comes up with end up being much more realistic in the grand scheme of things and with other factors such as respawn time considered.

References

---
title: "League of Legends Bayesian Regression Analysis"
output: html_notebook
---

## The Data

The dataset I have chosen for my regression analysis is the [(LoL) League of Legends Ranked Games dataset](https://www.kaggle.com/datasets/datasnaek/league-of-legends/?select=games.csv). This data covers ranked League of Legends games played during season 9 in the EUW region. The dataset was collected using the Riot Games API, which makes it easy to look up a user's ranked history and collect their games.

```{r setup, results=F, message=FALSE, error=FALSE, warning=FALSE}
# Load packages
library(ggplot2)
library(rstanarm)
library(bayesplot)
library(bayesrules)
library(tidyverse)
library(tidybayes)
library(dplyr)
library(broom.mixed)
library(interactions)
```

To preface this analysis, one must first understand the game of League of Legends, or LOL for short. In LOL, two 5-person teams fight to destroy each other's nexus; to do so, each player picks (and bans) a champion, each with a distinct set of skills and abilities, along with 2 summoner spells. Along the way, the players must fight each other, destroy objectives such as towers and inhibitors, and strategize effectively to reach the opponent's nexus. There are also several monsters that teams can defeat for buffs, such as Rift Herald, Baron Nashor, and an assorted set of Dragons.

We will use `lol_data` to build various models of League of Legends `gameDuration`. Throughout, we'll utilize weakly informative priors and a basic understanding that LOL games usually last [30 minutes](https://www.leagueofgraphs.com/stats/game-durations) but can range from 25-35 minutes. We will assess an array of different predictors and combinations to determine which provide the best fit for predicting the game duration of a League of Legends game. A baseline criterion for these games is that each must include at least one tower kill, one inhibitor kill, and one champion kill. This is to mitigate the number of games that teams may have forfeited due to a player leaving or trolling.

```{r}
lol_data <- read.csv("games.csv",
                    sep=",",
                    na.strings=c(""," ","NA","N/A")
)
head(lol_data)
```

## Setting Up the Data

Transforming the data to fit the baseline of at least one kill, tower, and inhibitor:

```{r}
lol_data <- lol_data[!(lol_data$firstBlood == 0 | lol_data$firstTower == 0 | lol_data$firstInhibitor == 0), ]
```

```{r}
nrow(lol_data)
```

```{r}
colnames(lol_data)
```

We can clean up our data by omitting unnecessary columns

```{r}
lol_data <- subset(lol_data, select = -c(
  t1_ban1, t1_ban2, t1_ban3, t1_ban4, t1_ban5,
  t2_ban1, t2_ban2, t2_ban3, t2_ban4, t2_ban5,
  t1_champ1_sum1, t1_champ2_sum1, t1_champ3_sum1, t1_champ4_sum1, t1_champ5_sum1,
  t2_champ1_sum1, t2_champ2_sum1, t2_champ3_sum1, t2_champ4_sum1, t2_champ5_sum1,
  t1_champ1_sum2, t1_champ2_sum2, t1_champ3_sum2, t1_champ4_sum2, t1_champ5_sum2,
  t2_champ1_sum2, t2_champ2_sum2, t2_champ3_sum2, t2_champ4_sum2, t2_champ5_sum2,
  t1_champ1id, t1_champ2id, t1_champ3id, t1_champ4id, t1_champ5id,
  t2_champ1id, t2_champ2id, t2_champ3id, t2_champ4id, t2_champ5id))

colnames(lol_data)
```

We can also convert some of the columns to factors for team 1 and team 2 for ease of visualization and interpretation:

```{r}
lol_data$winner <- as.factor(lol_data$winner)
lol_data$firstBlood <- as.factor(lol_data$firstBlood)
lol_data$firstTower <- as.factor(lol_data$firstTower)
lol_data$firstInhibitor <- as.factor(lol_data$firstInhibitor)
lol_data$firstBaron <- as.factor(lol_data$firstBaron)
lol_data$firstDragon <- as.factor(lol_data$firstDragon)
lol_data$firstRiftHerald <- as.factor(lol_data$firstRiftHerald)
```

```{r}
head(lol_data)
```

## Summary Analysis

Here are some visualizations and statistics to help better understand our data and what we are working with.

```{r}
summary(lol_data)
```

Some conclusions we can make here:

-   The mean game time is around 1930 seconds (roughly 32 minutes)

-   Game duration lies mostly between 1632 and 2196

-   Team 1 has won more games, has more first bloods, towers, inhibitors, and rift heralds

-   Team 2 has more first barons and dragons.

-   Team 1's tower, inhibitor, and herald kills are higher than Team 2's which is reflected from them getting the first of each more often than not.

-   Team 2's dragon and baron kills are higher than Team 1's which is reflected from them getting the first of each more often than not.

## Visualizations

We can use ggplot to visualize some of our findings.

### Response Distribution

```{r}
ggplot(lol_data, aes(x = gameDuration)) +
  geom_histogram(binwidth = 60, fill = "blue", color = "black") +
  labs(x = "Game Duration (seconds)", y = "Frequency", title = "Distribution of Game Duration") +
  theme_minimal()
```

Game duration seems to be a bit right-skewed
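
One quick numeric check of that skew, as a sketch with toy durations (illustrative values, not drawn from `lol_data`): a mean above the median indicates a longer right tail.

```r
# Toy game durations in seconds; the 3200 outlier mimics a dragged-out game
durations <- c(1500, 1700, 1800, 1850, 1900, 1950, 2000, 2100, 2400, 3200)

mean_dur   <- mean(durations)    # pulled up by the long right tail
median_dur <- median(durations)  # robust to the outlier

right_skewed <- mean_dur > median_dur
```

The same comparison can be run directly on `lol_data$gameDuration` to confirm what the histogram shows.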

### First Comparisons

```{r}
ggplot(lol_data, aes(x = winner, fill = winner)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(y = "Winners")
```

```{r}
ggplot(lol_data, aes(x = firstBlood, fill = firstBlood)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(y = "First Bloods")
```

```{r}
ggplot(lol_data, aes(x = firstTower, fill = firstTower)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(y = "First Towers")
```

```{r}
ggplot(lol_data, aes(x = firstInhibitor, fill = firstInhibitor)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(y = "First Inhibitors")
```

```{r}
ggplot(lol_data, aes(x = firstDragon, fill = firstDragon)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(y = "First Dragons")
```

```{r}
ggplot(lol_data, aes(x = firstRiftHerald, fill = firstRiftHerald)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(y = "First Rift Heralds")
```

```{r}
ggplot(lol_data, aes(x = firstBaron, fill = firstBaron)) + 
      geom_bar() + 
      theme(text = element_text(size=9)) +
      labs(y = "First Barons")
```

These all are consistent with our previous summary findings about first categories

### Total Objectives Comparisons

```{r}
ggplot(lol_data, aes(x = t1_towerKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")
```

```{r}
ggplot(lol_data, aes(x = t2_towerKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")
```

```{r}
tower_kills <- data.frame(Team = c("T1", "T2"),
                          Kills = c(sum(lol_data$t1_towerKills, na.rm = TRUE), 
                                    sum(lol_data$t2_towerKills, na.rm = TRUE)))

ggplot(tower_kills, aes(x = Team, y = Kills, fill = Team)) +
  geom_bar(stat = "identity") +
  labs(x = "Team", y = "Total Tower Kills", title = "Comparison of Tower Kills between T1 and T2")
```

More often than not, both teams tend to kill \>6 towers per game

```{r}
ggplot(lol_data, aes(x = t1_inhibitorKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")
```

```{r}
ggplot(lol_data, aes(x = t2_inhibitorKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")
```

```{r}
# Calculate the total number of inhibitor kills for each team
inhibitor_kills <- data.frame(Team = c("T1", "T2"),
                              Kills = c(sum(lol_data$t1_inhibitorKills, na.rm = TRUE), 
                                        sum(lol_data$t2_inhibitorKills, na.rm = TRUE)))

ggplot(inhibitor_kills, aes(x = Team, y = Kills, fill = Team)) +
  geom_bar(stat = "identity") +
  labs(x = "Team", y = "Total Inhibitor Kills", title = "Comparison of Inhibitor Kills between T1 and T2")
```

Games usually end with 0 to 2 inhibitors being destroyed, likely due to late-game forfeits or steamrolls (which are common in many games)

```{r}
ggplot(lol_data, aes(x = t1_dragonKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")
```

```{r}
ggplot(lol_data, aes(x = t2_dragonKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")
```

```{r}
# Calculate the total number of dragon kills for each team
dragon_kills <- data.frame(Team = c("T1", "T2"),
                           Kills = c(sum(lol_data$t1_dragonKills, na.rm = TRUE), 
                                     sum(lol_data$t2_dragonKills, na.rm = TRUE)))

ggplot(dragon_kills, aes(x = Team, y = Kills, fill = Team)) +
  geom_bar(stat = "identity") +
  labs(x = "Team", y = "Total Dragon Kills", title = "Comparison of Dragon Kills between T1 and T2")
```

Dragon kills comfortably lie within the 0-2 range

```{r}
ggplot(lol_data, aes(x = t1_riftHeraldKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")
```

```{r}
ggplot(lol_data, aes(x = t2_riftHeraldKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")
```

```{r}
# Calculate the total number of rift herald kills for each team
riftHerald_kills <- data.frame(Team = c("T1", "T2"),
                               Kills = c(sum(lol_data$t1_riftHeraldKills, na.rm = TRUE), 
                                         sum(lol_data$t2_riftHeraldKills, na.rm = TRUE)))

ggplot(riftHerald_kills, aes(x = Team, y = Kills, fill = Team)) +
  geom_bar(stat = "identity") +
  labs(x = "Team", y = "Total Rift Herald Kills", title = "Comparison of Rift Herald Kills between T1 and T2")
```

Rift herald only spawns from 9:50 - 19:45 during the game and is most often only taken once if at all

```{r}
ggplot(lol_data, aes(x = t1_baronKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")
```

```{r}
ggplot(lol_data, aes(x = t2_baronKills)) + 
      geom_bar() + 
      theme(text = element_text(size=10)) +
      labs(y = "Count")
```

```{r}
# Calculate the total number of baron kills for each team
baron_kills <- data.frame(Team = c("T1", "T2"),
                          Kills = c(sum(lol_data$t1_baronKills, na.rm = TRUE), 
                                    sum(lol_data$t2_baronKills, na.rm = TRUE)))

ggplot(baron_kills, aes(x = Team, y = Kills, fill = Team)) +
  geom_bar(stat = "identity") +
  labs(x = "Team", y = "Total Baron Kills", title = "Comparison of Baron Kills between T1 and T2")
```

Baron spawns after the Rift Herald and, if taken and used properly, usually ends the game; hence the number taken is usually 0 or 1, as some games end before it can be taken at all

## Base Model and Prior Modeling

Let's start off with a base model to see if we can get insight from just the winner of the match.

```{r}
base <- stan_glm(gameDuration ~ winner,
  data = lol_data, family = gaussian, 
  prior_intercept = normal(1800, 150, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)
```

```{r}
prior_summary(base) 
```

A prior predictive check can be used to see what we can potentially expect from the data alone and what the stan model viewed as appropriate for our predictions

```{r}
base_priors <- update(base, prior_PD = TRUE)
# 200 prior model lines
lol_data %>%
  add_fitted_draws(base_priors, n = 200) %>%
  ggplot(aes(x = winner, y = gameDuration)) +
    geom_line(aes(y = .value, group = .draw), alpha = 0.05)

# 4 prior simulated datasets
set.seed(3)
lol_data %>%
  add_predicted_draws(base_priors, n = 4) %>%
  ggplot(aes(x = winner, y = gameDuration)) +
    geom_point(aes(y = .prediction, group = .draw)) + 
    facet_wrap(~ .draw)
```

Prior distribution expects the games to range from 1910 - 1945 seconds as expected

Game duration lies between 500-3500 seconds which falls in line with our summary

Prior distribution expects games to last longer if team 2 wins as compared to team 1

```{r}
mcmc_trace(base, size = .1)
```

Trace plots look fast-mixing and consistent

```{r}
rhat(base)
```

Rhat close to 1 and not \>1.05, good

```{r}
neff_ratio(base)
```

The Neff ratios are a little high but still acceptable; all are \>.10, which is good
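
The cutoffs used throughout this analysis can be wrapped in a small helper; this is an illustrative sketch (the 1.05 Rhat cutoff and the 0.1-1 Neff ratio band are the rules of thumb used here, not package defaults):

```r
# Hypothetical helper flagging the MCMC diagnostics used in this analysis
check_diagnostics <- function(rhat, neff_ratio) {
  list(
    rhat_ok         = all(rhat < 1.05),       # Rhat near 1 suggests convergence
    neff_ok         = all(neff_ratio > 0.1),  # enough effective samples per iteration
    neff_suspicious = any(neff_ratio > 1)     # ratios above 1 warrant a closer look
  )
}

diag <- check_diagnostics(rhat = c(1.000, 1.001), neff_ratio = c(0.45, 0.80))
```

The real values from `rhat(base)` and `neff_ratio(base)` can be passed straight in.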

```{r}
tidy(base, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)
```

Overall the model could work, but more investigation should be done to improve what we have

## Interaction Investigation

Let's see if there are some interactions we can expect.

Speaking from experience playing the game myself, I predict that dragon and baron kills will have a big impact on game duration, since the buffs they give each member of the team are crucial to winning.

```{r}
int <- stan_glm(gameDuration ~ t1_dragonKills:t2_dragonKills,
  data = lol_data, family = gaussian, 
  prior_intercept = normal(1800, 150, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)
```

```{r}
int2 <- stan_glm(gameDuration ~ t1_baronKills:t2_baronKills,
  data = lol_data, family = gaussian, 
  prior_intercept = normal(1800, 150, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)
```

```{r}
summary(int)
```

```{r}
summary(int2)
```

The summary stats look promising: neither coefficient's 95% CI contains 0, showing both are significant and can be useful.
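
That significance check reduces to whether the middle 95% of the posterior draws excludes 0; a minimal, self-contained sketch with simulated draws standing in for the real MCMC samples:

```r
set.seed(84735)
# Simulated draws standing in for an interaction coefficient's posterior samples
draws <- rnorm(10000, mean = 25, sd = 5)

ci <- quantile(draws, probs = c(0.025, 0.975))
excludes_zero <- unname(ci[1] > 0 || ci[2] < 0)  # TRUE -> "significant" in the CI sense
```

For the fitted models, the same check can be applied to each column of `as.matrix(int)`.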

```{r}
# Extract the posterior samples
posterior_samples <- as.matrix(int)
posterior_samples2 <- as.matrix(int2)

# Plot the posterior distribution of the interaction term
mcmc_hist(posterior_samples, pars = c("t1_dragonKills:t2_dragonKills"))

mcmc_hist(posterior_samples2, pars = c("t1_baronKills:t2_baronKills"))
```

The posteriors of the interaction terms don't look skewed either, which is good.

We can conclude that there may be an interaction between the two teams' dragon kills, and likewise between their baron kills.

We can utilize this info later on upon building a better model

## Main Model

Let's assess how much getting a jump start in the game does for determining the length of the game

```{r}
main <- stan_glm(gameDuration ~ winner + firstBlood + firstTower + firstDragon + firstBaron,
  data = lol_data, family = gaussian, 
  prior_intercept = normal(1800, 150, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)
```

```{r}
prior_summary(main) 
```

```{r}
mcmc_trace(main, size = .1)
```

Chains look normal

```{r}
rhat(main)
```

```{r}
neff_ratio(main)
```

Rhat looks good, some of the neff ratio values are still high but acceptable for now

```{r}
tidy(main, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)
```

From the tidy output, it looks like winner, firstBlood, and firstTower lose their significance when adding other predictors since their 95% CI range includes 0. We'll use this info later for refining our model

```{r}
newdata <- data.frame(winner = factor(1, levels = levels(lol_data$winner)),
                      firstBlood = factor(2, levels = levels(lol_data$firstBlood)),
                      firstTower = factor(2, levels = levels(lol_data$firstTower)),
                      firstDragon = factor(2, levels = levels(lol_data$firstDragon)),
                      firstBaron = factor(2, levels = levels(lol_data$firstBaron)))

main_predict <- posterior_predict(
  main, 
  newdata = newdata)
mcmc_areas(main_predict) +  xlab("Game Duration") +
  ggtitle('Predictive distribution of a League of Legends game where first blood, tower, dragon, baron was team 2 and team 1 won') +
  theme(plot.title = element_text(size = 7))
```

Here's an example of where a game's duration may lie if team 1 wins despite being behind from the start. The game runs a little longer than normal, but that's to be expected, as the team would need time to make the comeback in the first place. However, I believe this is undershooting the true time we could expect from this scenario. We'll verify whether this is true after refining our model.

## More Predictors

We can add some numerical predictors, namely the number of each objective a team has taken, to see how they affect the game duration.

```{r}
ext_main <- stan_glm(gameDuration ~ winner + firstBlood + firstTower + firstDragon + firstBaron + t1_towerKills + t1_inhibitorKills + t1_baronKills + t1_dragonKills + t2_towerKills + t2_inhibitorKills + t2_baronKills + t2_dragonKills,
  data = lol_data, family = gaussian, 
  prior_intercept = normal(1800, 150, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)
```

```{r}
prior_summary(ext_main) 
```

```{r}
mcmc_trace(ext_main, size = .1)
```

Traces look good

```{r}
rhat(ext_main)
```

```{r}
neff_ratio(ext_main)
```

Rhat looks good, some neff ratios (i.e. firstBlood, firstTower) are now too high to keep and will need to be handled later

```{r}
tidy(ext_main, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)
```

Here, the predictors winner, firstBlood, and firstTower are not significant due to their 95% CI including 0

```{r}
# dataframe with the specified values
newdata_ext <- data.frame(
  winner = factor(1, levels = levels(lol_data$winner)),
  firstBlood = factor(2, levels = levels(lol_data$firstBlood)),
  firstTower = factor(2, levels = levels(lol_data$firstTower)),
  firstDragon = factor(2, levels = levels(lol_data$firstDragon)),
  firstBaron = factor(2, levels = levels(lol_data$firstBaron)),
  t1_towerKills = 7,
  t1_inhibitorKills = 2,
  t1_baronKills = 2,
  t1_dragonKills = 2,
  t2_towerKills = 3,
  t2_inhibitorKills = 2,
  t2_baronKills = 1,
  t2_dragonKills = 3
)

ext_main_predict <- posterior_predict(ext_main, newdata = newdata_ext)

mcmc_areas(ext_main_predict) +  
  xlab("Game Duration") +
  ggtitle('Predictive distribution of a League of Legends game where first blood, tower, dragon, baron was team 2 and team 1 won, with additional predictors') +
  theme(plot.title = element_text(size = 8))
```

After adding a few more variables into play we can see the game gets longer (and expectedly so). Team 1 comes from behind but ends up getting 2 barons and 7 towers making us believe this game will have to go on for a while for the comeback to truly be complete

## Refined Model

Let's now remove the high neff ratio terms from the extended model (i.e. \> 1) and include the interaction terms we deemed useful beforehand

```{r}
ext_main_int <- stan_glm(gameDuration ~ firstDragon + firstBaron + t1_baronKills + t1_dragonKills + t2_baronKills + t2_dragonKills + t1_baronKills:t2_baronKills + t1_dragonKills:t2_dragonKills,
  data = lol_data, family = gaussian, 
  prior_intercept = normal(1800, 150, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(1, autoscale = TRUE),
  chains = 4, iter = 5000*2, seed = 84735)
```

```{r}
prior_summary(ext_main_int)
```

```{r}
mcmc_trace(ext_main_int, size = .1)
```

So far so good

```{r}
rhat(ext_main_int)
```

Rhat looks satisfactory for all predictors

```{r}
neff_ratio(ext_main_int)
```

Neff ratios are much more reasonable now and a big improvement from prior models with no value being \> .90

```{r}
tidy(ext_main_int, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)
```

Every coefficient looks significant from the 95% CIs (even the interaction terms!) since none of them include 0.

```{r}
# dataframe with the specified values
newdata_ext_int <- data.frame(
  firstDragon = factor(2, levels = levels(lol_data$firstDragon)),
  firstBaron = factor(2, levels = levels(lol_data$firstBaron)),
  t1_baronKills = 2,
  t1_dragonKills = 2,
  t2_baronKills = 1,
  t2_dragonKills = 3
)

ext_main_int_predict <- posterior_predict(ext_main_int, newdata = newdata_ext_int)

mcmc_areas(ext_main_int_predict) +  
  xlab("Game Duration") +
  ggtitle('Predictive distribution of a League of Legends game where first dragon, baron was team 2 and team 1 had more baron kills with interaction') +
  theme(plot.title = element_text(size = 8))
```

The games seem to drag on even longer with this new model. This is expected since the dragon and baron respawn times are quite long so, as said before, the comeback would realistically take longer than average.

Let's do some model comparisons to verify which model is best.

## Model Comparisons

### PP Check

```{r}
pp_check(main, nreps = 50) + xlab("Game Duration") +
  ggtitle('Main effects model')
```

The main model gets a decent amount of area, yet it starts moving too far right and undershoots at the peak

```{r}
pp_check(ext_main, nreps = 50) + xlab("Game Duration") +
  ggtitle('Extended Main effects model')
```

The extended model is clearly better than the main model, but it still undershoots area at the peak

```{r}
pp_check(ext_main_int, nreps = 50) + xlab("Game Duration") +
  ggtitle('Refined Interaction effects model')
```

The interaction model is very similar to the extended model which is better than the main model.

The pp checks show most of the area being covered by the last 2 models, with the interaction model doing better around the peak by a small margin

### 10 Fold Cross-Validations

```{r}
test_sample <- lol_data %>% head(10000)
nrow(test_sample)
```

```{r}
set.seed(84735)

p_main <- prediction_summary(model = main, data = test_sample)
p_ext_main <- prediction_summary(model = ext_main, data = test_sample)
p_ext_main_int <- prediction_summary(model = ext_main_int, data = test_sample)
```

```{r}
rbind(p_main, p_ext_main, p_ext_main_int)
```

These are the raw MAE values based on a sample of 10000 observations from the data. We can use these to assess the behavior and bias of each model against their cross-validation results.
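
For context, the `mae` reported by `bayesrules::prediction_summary()` is, to my understanding, the median absolute distance between each observed value and its posterior predictive median; a self-contained sketch of that calculation with toy numbers:

```r
set.seed(84735)
# Toy stand-ins: 5 observed durations and 1000 posterior predictive draws each
y    <- c(1700, 1850, 1900, 2100, 2500)
pred <- sapply(y, function(m) rnorm(1000, mean = m + 50, sd = 100))

post_median <- apply(pred, 2, median)  # one predictive median per observation
mae <- median(abs(y - post_median))    # typical prediction error, in seconds
```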

```{r}
set.seed(84735)

cv_main <- prediction_summary_cv(
  model = main, data = lol_data, k = 10)

cv_extend <- prediction_summary_cv(
  model = ext_main, data = lol_data, k = 10)

cv_interact <- prediction_summary_cv(
  model = ext_main_int, data = lol_data, k = 10)
```

```{r}
rbind(cv_main$cv, cv_extend$cv, cv_interact$cv)
```

### Loo Diagnostics

```{r}
set.seed(34521)
main_elpd <- loo(main)
ext_main_elpd <- loo(ext_main)
ext_main_interact_elpd <- loo(ext_main_int)

main_elpd$estimates
ext_main_elpd$estimates
ext_main_interact_elpd$estimates
```

```{r}
c(main_elpd$estimates[1], ext_main_elpd$estimates[1], ext_main_interact_elpd$estimates[1])

loo_compare(main_elpd, ext_main_elpd, ext_main_interact_elpd)
```

The extended model fares better than the interaction model when comparing their ELPD and MAE. The interaction model does 4337.6 points "worse" for ELPD and is around 16 points higher in its MAE. However, something to consider here is the idea of [overfitting](https://statisticsbyjim.com/regression/overfitting-regression-models/) and how more predictors can affect the model overall and skew our metrics.

Despite the interaction model covering more area, its [MAE](https://stephenallwright.com/good-mae-score/) and ELPD are both worse. This may be because these values are inflated by the larger number of predictors in the extended model. Furthermore, the extended model contains numerous values with higher-than-normal Neff ratios, which is a [cause for concern](https://stats.stackexchange.com/questions/296059/effective-sample-size-greater-than-actual-sample-size) about the validity of the model, along with predictors that were not significant.

Despite having fewer predictors, the interaction model garners more area in its posterior predictive distribution while having a similar MAE and a decent ELPD score (-314197.3 vs. -309859.7), a minor (\~1%) difference given the size of these values. However, these interaction terms could also be having an adverse effect on the model, leading to the greater errors.
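
The size of that gap can be sanity-checked directly from the reported ELPD estimates (values copied from the `loo` output above):

```r
elpd_extended    <- -309859.7  # extended main effects model
elpd_interaction <- -314197.3  # refined interaction model

gap          <- elpd_extended - elpd_interaction  # points in the extended model's favor
relative_gap <- gap / abs(elpd_interaction)       # roughly 1.4% of the score's magnitude
```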

### Overfitting

```{r}
rbind(p_main, p_ext_main, p_ext_main_int) 
rbind(cv_main$cv, cv_extend$cv, cv_interact$cv)
```

Based on the difference between the raw MAE values and the cross-validation MAE values, overfitting does not seem to pose a threat to these models; however, we can only truly know this if we are given entirely new data to test the models on. Furthermore, the extended model does have a larger (though mostly negligible, given how small it is) gap between its raw and cross-validation MAE compared to the other models.
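
That comparison amounts to checking how much each model's MAE grows when moving from in-sample predictions to cross-validated ones; a sketch with hypothetical MAE values (not the numbers printed above):

```r
# Hypothetical in-sample vs. 10-fold CV MAE values, in seconds
mae_in_sample <- c(main = 210, extended = 150, interaction = 165)
mae_cv        <- c(main = 212, extended = 156, interaction = 168)

relative_growth <- (mae_cv - mae_in_sample) / mae_in_sample
overfit_flag    <- relative_growth > 0.10  # e.g., flag a >10% jump under CV
```

Small relative growth, as in this toy case, is what suggests a fit generalizes rather than memorizes.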

What is undoubtedly clear is that adding numerical predictors to the model does help the predictions overall, with both the extended and interaction models faring better than the main model in every aspect.

## Regression Inference

Let's see how the 2 best models fare when conducting a quick hypothesis test on the posterior probability of a game lasting around 40 minutes (2500 seconds).

We'll use the predictive posterior models we created and visualized earlier to conduct the tests

$$
H_0: \pi \geq 2500
$$

$$
H_a: \pi < 2500
$$

```{r}
# Extract the posterior samples
posterior_samples <- ext_main_int_predict

p_H0 <- mean(posterior_samples >= 2500)

p_Ha <- mean(posterior_samples < 2500)

# Print the results
cat("Posterior probability of H0 (π ≥ 2500):", p_H0)
cat("Posterior probability of Ha (π < 2500):", p_Ha)
```

```{r}
# Extract the posterior samples
posterior_samples <- ext_main_predict

p_H0 <- mean(posterior_samples >= 2500)

p_Ha <- mean(posterior_samples < 2500)

# Print the results
cat("Posterior probability of H0 (π ≥ 2500):", p_H0)
cat("Posterior probability of Ha (π < 2500):", p_Ha)
```

Though the two models seemingly predict the actual distribution of the data well, the 2 models have very different outcomes when conducting the test.

The extended model favors the alternative hypothesis while the interaction model clearly favors the null hypothesis. This coincides with what we saw before from the visualizations and tells us that the extended model expects such games to be shorter while the interaction model expects them to take longer.

This is also something one should consider when choosing a model: whether the predictions themselves seem realistic given the scenario and circumstances.

To back up what I'm saying, Baron Nashor spawns at 20 minutes and respawns every 6 minutes. Since 3 Barons are killed in both scenarios, the game would have to run at least \~38 minutes, and only if the teams kill each Baron IMMEDIATELY (which is rarely the case). Therefore, one would expect the majority of the predictive area to lie at or beyond the \~40 minute mark for when the game would end.
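
Under the timings stated above (first spawn at 20 minutes, 6-minute respawn), the earliest the third Baron can fall is easy to compute; a quick sanity check showing why three-Baron games must run well past the 32-minute mark, consistent with the \~38-40 minute expectation:

```r
first_spawn_min <- 20  # Baron Nashor's first spawn
respawn_min     <- 6   # respawn timer used in this analysis
barons_killed   <- 3

# Earliest minute the third Baron can die, assuming instant kills each window
earliest_third_baron_min <- first_spawn_min + (barons_killed - 1) * respawn_min
earliest_third_baron_sec <- earliest_third_baron_min * 60
```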

## Conclusion

In conclusion, the choice is between a model that may be susceptible to overfitting in exchange for potentially less error, and a slightly higher-error model with interactions that is less likely to be susceptible to overfitting.

Personally, I say the refined interaction model is the best model for predicting the game duration of a LOL match given our data, due to its comparable MAE, better posterior predictive abilities, and lack of evidence of overfitting. Furthermore, based on my experience and the game's rules, the values it predicts end up being much more realistic once factors such as respawn timers are considered.

## References

-   <https://www.bayesrulesbook.com/>

-   <https://stats.stackexchange.com/questions/296059/effective-sample-size-greater-than-actual-sample-size>

-   <https://stephenallwright.com/good-mae-score/>

-   <https://discourse.mc-stan.org/t/understanding-looic/13409/6>

-   <https://discourse.mc-stan.org/t/projpred-elpd-goes-down-and-rmse-goes-up-after-x-variables/13153>

-   <https://stats.stackexchange.com/questions/313564/how-does-bayesian-analysis-make-accurate-predictions-using-subjectively-chosen-p>

-   <https://medium.com/@ooemma83/interpretation-of-evaluation-metrics-for-regression-analysis-mae-mse-rmse-mape-r-squared-and-5693b61a9833>

-   <https://statisticsbyjim.com/regression/overfitting-regression-models/>

-   <https://stats.stackexchange.com/questions/9053/how-does-cross-validation-overcome-the-overfitting-problem>