The Data
The dataset I have chosen for my regression analysis is the League of Legends (LoL) Ranked Games dataset. It covers ranked League of Legends games played during season 9 in the EUW region. The dataset was collected using the Riot Games API, which makes it easy to look up a user's ranked history and collect their games.
# Load packages
library(ggplot2)
library(rstanarm)
library(bayesplot)
library(bayesrules)
library(tidyverse)
library(tidybayes)
library(dplyr)
library(broom.mixed)
library(interactions)
To preface this analysis, one must first understand the game of League of Legends, or LoL for short. In LoL, two five-person teams fight to destroy each other's nexus. To do so, each player picks (and helps ban) a champion, each with a different set of skills and abilities, along with two summoner spells. Along the way, the players must fight each other, destroy objectives such as towers and inhibitors, and strategize effectively to reach the opponent's nexus. There are also several neutral monsters that teams can defeat for buffs, such as the Rift Herald, Baron Nashor, and an assorted set of dragons.
We will use lol_data to build various models of League of Legends gameDuration. Throughout, we'll use weakly informative priors and a basic understanding that LoL games usually last around 30 minutes but can range from 25-35 minutes. We will assess an array of different predictors and combinations to determine what provides the best fit for predicting the duration of a League of Legends game. A baseline criterion for these games is that each needs at least one tower kill, one inhibitor kill, and one champion kill. This mitigates the number of games that teams may have forfeited due to a player leaving or trolling.
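As a quick arithmetic check (my own, simply translating that belief into the seconds scale the data uses), the prior used later lines up with the 25-35 minute range:
# 30 minutes centers the prior at 1800 seconds; normal(1800, 150) then puts
# roughly 95% of its mass within 1800 +/- 2*150 = [1500, 2100] seconds,
# i.e. the 25-35 minute range described above
c(center_sec = 30 * 60, lower_sec = 25 * 60, upper_sec = 35 * 60)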
lol_data <- read.csv("games.csv",
sep=",",
na.strings=c(""," ","NA","N/A")
)
head(lol_data)
Setting Up the Data
Filtering the data to the baseline of at least one kill, one tower, and one inhibitor
lol_data <- lol_data[!(lol_data$firstBlood == 0 | lol_data$firstTower == 0 | lol_data$firstInhibitor == 0), ]
nrow(lol_data)
[1] 45214
colnames(lol_data)
[1] "gameId" "creationTime" "gameDuration"
[4] "seasonId" "winner" "firstBlood"
[7] "firstTower" "firstInhibitor" "firstBaron"
[10] "firstDragon" "firstRiftHerald" "t1_champ1id"
[13] "t1_champ1_sum1" "t1_champ1_sum2" "t1_champ2id"
[16] "t1_champ2_sum1" "t1_champ2_sum2" "t1_champ3id"
[19] "t1_champ3_sum1" "t1_champ3_sum2" "t1_champ4id"
[22] "t1_champ4_sum1" "t1_champ4_sum2" "t1_champ5id"
[25] "t1_champ5_sum1" "t1_champ5_sum2" "t1_towerKills"
[28] "t1_inhibitorKills" "t1_baronKills" "t1_dragonKills"
[31] "t1_riftHeraldKills" "t1_ban1" "t1_ban2"
[34] "t1_ban3" "t1_ban4" "t1_ban5"
[37] "t2_champ1id" "t2_champ1_sum1" "t2_champ1_sum2"
[40] "t2_champ2id" "t2_champ2_sum1" "t2_champ2_sum2"
[43] "t2_champ3id" "t2_champ3_sum1" "t2_champ3_sum2"
[46] "t2_champ4id" "t2_champ4_sum1" "t2_champ4_sum2"
[49] "t2_champ5id" "t2_champ5_sum1" "t2_champ5_sum2"
[52] "t2_towerKills" "t2_inhibitorKills" "t2_baronKills"
[55] "t2_dragonKills" "t2_riftHeraldKills" "t2_ban1"
[58] "t2_ban2" "t2_ban3" "t2_ban4"
[61] "t2_ban5"
We can clean up our data by omitting unnecessary columns
lol_data <- subset(lol_data, select = -c(
  t1_ban1, t1_ban2, t1_ban3, t1_ban4, t1_ban5,
  t2_ban1, t2_ban2, t2_ban3, t2_ban4, t2_ban5,
  t1_champ1id, t1_champ2id, t1_champ3id, t1_champ4id, t1_champ5id,
  t2_champ1id, t2_champ2id, t2_champ3id, t2_champ4id, t2_champ5id,
  t1_champ1_sum1, t1_champ2_sum1, t1_champ3_sum1, t1_champ4_sum1, t1_champ5_sum1,
  t2_champ1_sum1, t2_champ2_sum1, t2_champ3_sum1, t2_champ4_sum1, t2_champ5_sum1,
  t1_champ1_sum2, t1_champ2_sum2, t1_champ3_sum2, t1_champ4_sum2, t1_champ5_sum2,
  t2_champ1_sum2, t2_champ2_sum2, t2_champ3_sum2, t2_champ4_sum2, t2_champ5_sum2))
colnames(lol_data)
[1] "gameId" "creationTime" "gameDuration"
[4] "seasonId" "winner" "firstBlood"
[7] "firstTower" "firstInhibitor" "firstBaron"
[10] "firstDragon" "firstRiftHerald" "t1_towerKills"
[13] "t1_inhibitorKills" "t1_baronKills" "t1_dragonKills"
[16] "t1_riftHeraldKills" "t2_towerKills" "t2_inhibitorKills"
[19] "t2_baronKills" "t2_dragonKills" "t2_riftHeraldKills"
We can also convert some of the columns (coded 1 for team 1 and 2 for team 2) to factors for ease of visualization and interpretation
lol_data$winner <- as.factor(lol_data$winner)
lol_data$firstBlood <- as.factor(lol_data$firstBlood)
lol_data$firstTower <- as.factor(lol_data$firstTower)
lol_data$firstInhibitor <- as.factor(lol_data$firstInhibitor)
lol_data$firstBaron <- as.factor(lol_data$firstBaron)
lol_data$firstDragon <- as.factor(lol_data$firstDragon)
lol_data$firstRiftHerald <- as.factor(lol_data$firstRiftHerald)
head(lol_data)
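As an aside, the same conversions can be written more compactly with dplyr; this sketch is equivalent to the explicit assignments above:
# Convert winner and every first* flag to a factor in one call
lol_data <- lol_data %>%
  mutate(across(c(winner, starts_with("first")), as.factor))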
Summary Analysis
Here are some visualizations and statistics to help us better understand our data and what we are working with.
summary(lol_data)
gameId creationTime gameDuration seasonId
Min. :3.215e+09 Min. :1.497e+12 Min. : 477 Min. :9
1st Qu.:3.292e+09 1st Qu.:1.502e+12 1st Qu.:1632 1st Qu.:9
Median :3.320e+09 Median :1.504e+12 Median :1895 Median :9
Mean :3.306e+09 Mean :1.503e+12 Mean :1930 Mean :9
3rd Qu.:3.327e+09 3rd Qu.:1.504e+12 3rd Qu.:2196 3rd Qu.:9
Max. :3.332e+09 Max. :1.505e+12 Max. :4728 Max. :9
winner firstBlood firstTower firstInhibitor firstBaron firstDragon
1:22867 1:23151 1:23248 1:23054 0:14617 0: 434
2:22347 2:22063 2:21966 2:22160 1:14469 1:22258
2:16128 2:22522
firstRiftHerald t1_towerKills t1_inhibitorKills t1_baronKills
0:21922 Min. : 0.000 Min. : 0.000 Min. :0.0000
1:11933 1st Qu.: 3.000 1st Qu.: 0.000 1st Qu.:0.0000
2:11359 Median : 7.000 Median : 1.000 Median :0.0000
Mean : 6.173 Mean : 1.159 Mean :0.4173
3rd Qu.:10.000 3rd Qu.: 2.000 3rd Qu.:1.0000
Max. :11.000 Max. :10.000 Max. :5.0000
t1_dragonKills t1_riftHeraldKills t2_towerKills t2_inhibitorKills
Min. :0.000 Min. :0.0000 Min. : 0.000 Min. : 0.000
1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 2.000 1st Qu.: 0.000
Median :1.000 Median :0.0000 Median : 7.000 Median : 1.000
Mean :1.483 Mean :0.2639 Mean : 6.014 Mean : 1.122
3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:10.000 3rd Qu.: 2.000
Max. :6.000 Max. :1.0000 Max. :11.000 Max. :10.000
t2_baronKills t2_dragonKills t2_riftHeraldKills
Min. :0.0000 Min. :0.000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
Median :0.0000 Median :1.000 Median :0.0000
Mean :0.4641 Mean :1.506 Mean :0.2512
3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:1.0000
Max. :4.0000 Max. :6.000 Max. :1.0000
Some conclusions we can make here:
The mean game time is around 1930 seconds (roughly 32 minutes)
Game duration mostly lies between 1632 and 2196 seconds (the interquartile range)
Team 1 has won more games and has more first bloods, first towers, first inhibitors, and first rift heralds
Team 2 has more first barons and first dragons
Team 1's tower, inhibitor, and herald kill totals are higher than Team 2's, which reflects it getting the first of each more often than not
Team 2's dragon and baron kill totals are higher than Team 1's, which reflects it getting the first of each more often than not
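To back up the last two points, this compact sketch of mine tallies every objective total at once (the plots that follow break these down per objective):
# One-shot check of total objective kills per team
sapply(c("towerKills", "inhibitorKills", "baronKills", "dragonKills", "riftHeraldKills"),
       function(s) c(t1 = sum(lol_data[[paste0("t1_", s)]], na.rm = TRUE),
                     t2 = sum(lol_data[[paste0("t2_", s)]], na.rm = TRUE)))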
Visualizations
We can use ggplot to visualize some of our findings
Response Distribution
ggplot(lol_data, aes(x = gameDuration)) +
geom_histogram(binwidth = 60, fill = "blue", color = "black") +
labs(x = "Game Duration (seconds)", y = "Frequency", title = "Distribution of Game Duration") +
theme_minimal()

Game duration seems to be a bit right-skewed
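A quick numeric check of that skew (a mean above the median indicates the longer right tail):
# Right skew pulls the mean above the median
mean(lol_data$gameDuration) - median(lol_data$gameDuration)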
First Comparisons
ggplot(lol_data, aes(x = winner, fill = winner)) +
geom_bar() +
theme(text = element_text(size=9)) +
labs(y = "Winners")

ggplot(lol_data, aes(x = firstBlood, fill = firstBlood)) +
geom_bar() +
theme(text = element_text(size=9)) +
labs(y = "First Bloods")

ggplot(lol_data, aes(x = firstTower, fill = firstTower)) +
geom_bar() +
theme(text = element_text(size=9)) +
labs(y = "First Towers")

ggplot(lol_data, aes(x = firstInhibitor, fill = firstInhibitor)) +
geom_bar() +
theme(text = element_text(size=9)) +
labs(y = "First Inhibitors")

ggplot(lol_data, aes(x = firstDragon, fill = firstDragon)) +
geom_bar() +
theme(text = element_text(size=9)) +
labs(y = "First Dragons")

ggplot(lol_data, aes(x = firstRiftHerald, fill = firstRiftHerald)) +
geom_bar() +
theme(text = element_text(size=9)) +
labs(y = "First Rift Heralds")

ggplot(lol_data, aes(x = firstBaron, fill = firstBaron)) +
geom_bar() +
theme(text = element_text(size=9)) +
labs(y = "First Barons")

These are all consistent with our earlier summary findings about the "first objective" categories
Total Objectives Comparisons
ggplot(lol_data, aes(x = t1_towerKills)) +
geom_bar() +
theme(text = element_text(size=10)) +
labs(y = "Count")

ggplot(lol_data, aes(x = t2_towerKills)) +
geom_bar() +
theme(text = element_text(size=10)) +
labs(y = "Count")

tower_kills <- data.frame(Team = c("T1", "T2"),
Kills = c(sum(lol_data$t1_towerKills, na.rm = TRUE),
sum(lol_data$t2_towerKills, na.rm = TRUE)))
ggplot(tower_kills, aes(x = Team, y = Kills, fill = Team)) +
geom_bar(stat = "identity") +
labs(x = "Team", y = "Total Tower Kills", title = "Comparison of Tower Kills between T1 and T2")

More often than not, both teams tend to kill >6 towers per
game
ggplot(lol_data, aes(x = t1_inhibitorKills)) +
geom_bar() +
theme(text = element_text(size=10)) +
labs(y = "Count")

ggplot(lol_data, aes(x = t2_inhibitorKills)) +
geom_bar() +
theme(text = element_text(size=10)) +
labs(y = "Count")

# Calculate the total number of inhibitor kills for each team
inhibitor_kills <- data.frame(Team = c("T1", "T2"),
Kills = c(sum(lol_data$t1_inhibitorKills, na.rm = TRUE),
sum(lol_data$t2_inhibitorKills, na.rm = TRUE)))
ggplot(inhibitor_kills, aes(x = Team, y = Kills, fill = Team)) +
geom_bar(stat = "identity") +
labs(x = "Team", y = "Total Inhibitor Kills", title = "Comparison of Inhibitor Kills between T1 and T2")

Games usually end with 0 to 2 inhibitors destroyed, likely due to late-game forfeits or steamrolls (which is often the case in many games)
ggplot(lol_data, aes(x = t1_dragonKills)) +
geom_bar() +
theme(text = element_text(size=10)) +
labs(y = "Count")

ggplot(lol_data, aes(x = t2_dragonKills)) +
geom_bar() +
theme(text = element_text(size=10)) +
labs(y = "Count")

# Calculate the total number of dragon kills for each team
dragon_kills <- data.frame(Team = c("T1", "T2"),
Kills = c(sum(lol_data$t1_dragonKills, na.rm = TRUE),
sum(lol_data$t2_dragonKills, na.rm = TRUE)))
ggplot(dragon_kills, aes(x = Team, y = Kills, fill = Team)) +
geom_bar(stat = "identity") +
labs(x = "Team", y = "Total Dragon Kills", title = "Comparison of Dragon Kills between T1 and T2")

Dragon kills comfortably lie within the 0-2 range
ggplot(lol_data, aes(x = t1_riftHeraldKills)) +
geom_bar() +
theme(text = element_text(size=10)) +
labs(y = "Count")

ggplot(lol_data, aes(x = t2_riftHeraldKills)) +
geom_bar() +
theme(text = element_text(size=10)) +
labs(y = "Count")

# Calculate the total number of rift herald kills for each team
riftHerald_kills <- data.frame(Team = c("T1", "T2"),
Kills = c(sum(lol_data$t1_riftHeraldKills, na.rm = TRUE),
sum(lol_data$t2_riftHeraldKills, na.rm = TRUE)))
ggplot(riftHerald_kills, aes(x = Team, y = Kills, fill = Team)) +
geom_bar(stat = "identity") +
labs(x = "Team", y = "Total Rift Herald Kills", title = "Comparison of Rift Herald Kills between T1 and T2")

The Rift Herald only spawns from 9:50 to 19:45 during the game and is most often taken once, if at all
ggplot(lol_data, aes(x = t1_baronKills)) +
geom_bar() +
theme(text = element_text(size=10)) +
labs(y = "Count")

ggplot(lol_data, aes(x = t2_baronKills)) +
geom_bar() +
theme(text = element_text(size=10)) +
labs(y = "Count")

# Calculate the total number of baron kills for each team
baron_kills <- data.frame(Team = c("T1", "T2"),
Kills = c(sum(lol_data$t1_baronKills, na.rm = TRUE),
sum(lol_data$t2_baronKills, na.rm = TRUE)))
ggplot(baron_kills, aes(x = Team, y = Kills, fill = Team)) +
geom_bar(stat = "identity") +
labs(x = "Team", y = "Total Baron Kills", title = "Comparison of Baron Kills between T1 and T2")

Baron spawns after the Rift Herald, and taking it (and using the buff properly) usually seals the game, hence the number taken is usually 0 or 1; some games end before it can be taken at all
Base Model and Prior Modeling
Let's start off with a base model to see how much insight we can get from just the winner of the match
base <- stan_glm(gameDuration ~ winner,
data = lol_data, family = gaussian,
prior_intercept = normal(1800, 150, autoscale = TRUE),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(1, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 84735)
prior_summary(base)
Priors for model 'base'
------
Intercept (after predictors centered)
Specified prior:
~ normal(location = 1800, scale = 150)
Adjusted prior:
~ normal(location = 1800, scale = 64093)
Coefficients
Specified prior:
~ normal(location = 0, scale = 2.5)
Adjusted prior:
~ normal(location = 0, scale = 2137)
Auxiliary (sigma)
Specified prior:
~ exponential(rate = 1)
Adjusted prior:
~ exponential(rate = 0.0023)
------
See help('prior_summary.stanreg') for more details
A prior predictive check can be used to see what the priors alone imply about game durations before the model sees the data, i.e. what the Stan model viewed as plausible for our predictions
base_priors <- update(base, prior_PD = TRUE)
# 200 prior model lines (drawn from the prior-only fit, base_priors)
lol_data %>%
  add_epred_draws(base_priors, ndraws = 200) %>%
  ggplot(aes(x = winner, y = gameDuration)) +
  geom_line(aes(y = .epred, group = .draw), alpha = 0.05)

# 4 prior simulated datasets (again from the prior-only fit)
set.seed(3)
lol_data %>%
  add_predicted_draws(base_priors, ndraws = 4) %>%
  ggplot(aes(x = winner, y = gameDuration)) +
  geom_point(aes(y = .prediction, group = .draw)) +
  facet_wrap(~ .draw)

The prior distribution centers game durations around 1910-1945 seconds, as expected
Game duration lies between 500-3500 seconds, which falls in line with our summary
The prior distribution expects games to last longer when team 2 wins than when team 1 wins
mcmc_trace(base, size = .1)

The trace plots look fast-mixing and consistent
rhat(base)
(Intercept) winner2 sigma
0.9998587 0.9999295 0.9999225
Rhat is close to 1 and well below 1.05, which is good
neff_ratio(base)
(Intercept) winner2 sigma
0.92220 0.95970 0.95575
The n_eff ratios are a little high but still acceptable; all are well above the 0.1 rule-of-thumb cutoff, which is good
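Since we will repeat these two checks for every model, here is a small helper sketch (the function name is my own) that flags the usual trouble signs:
# Flag parameters with Rhat above 1.05 or an effective-sample-size
# ratio below the 0.1 rule of thumb
check_diagnostics <- function(fit) {
  r <- rhat(fit)
  n <- neff_ratio(fit)
  list(high_rhat = r[r > 1.05], low_neff = n[n < 0.1])
}
check_diagnostics(base)  # empty results mean no red flags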
tidy(base, effects = c("fixed", "aux"),
conf.int = TRUE, conf.level = 0.95)
Overall the model could work, but more investigation is needed to improve what we have
Interaction Investigation
Let's see if there are some interactions we can expect.
Speaking from my own experience playing the game, I predict that dragon and baron kills will have a big impact on the game duration, because the buffs they give each member of the team are crucial to winning
int <- stan_glm(gameDuration ~ t1_dragonKills:t2_dragonKills,
data = lol_data, family = gaussian,
prior_intercept = normal(1800, 150, autoscale = TRUE),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(1, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 84735)
int2 <- stan_glm(gameDuration ~ t1_baronKills:t2_baronKills,
data = lol_data, family = gaussian,
prior_intercept = normal(1800, 150, autoscale = TRUE),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(1, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 84735)
summary(int)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: gameDuration ~ t1_dragonKills:t2_dragonKills
algorithm: sampling
sample: 20000 (posterior sample size)
priors: see help('prior_summary')
observations: 45214
predictors: 2
Estimates:
mean sd 10% 50% 90%
(Intercept) 1728.5 2.0 1725.9 1728.5 1731.1
t1_dragonKills:t2_dragonKills 147.5 0.9 146.3 147.5 148.6
sigma 340.5 1.1 339.0 340.5 341.9
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 1929.8 2.3 1926.9 1929.8 1932.7
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 18279
t1_dragonKills:t2_dragonKills 0.0 1.0 21144
sigma 0.0 1.0 18662
mean_PPD 0.0 1.0 18790
log-posterior 0.0 1.0 9402
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
summary(int2)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: gameDuration ~ t1_baronKills:t2_baronKills
algorithm: sampling
sample: 20000 (posterior sample size)
priors: see help('prior_summary')
observations: 45214
predictors: 2
Estimates:
mean sd 10% 50% 90%
(Intercept) 1884.0 1.9 1881.6 1884.0 1886.4
t1_baronKills:t2_baronKills 466.7 4.8 460.4 466.6 472.8
sigma 390.0 1.3 388.4 390.0 391.7
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 1929.8 2.6 1926.5 1929.8 1933.1
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 17022
t1_baronKills:t2_baronKills 0.0 1.0 20014
sigma 0.0 1.0 18691
mean_PPD 0.0 1.0 18228
log-posterior 0.0 1.0 9914
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
The summary stats look promising: neither interaction coefficient's 95% credible interval contains 0, showing they are significant and can be useful
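The summaries above report 10%/50%/90% quantiles; to see the 95% credible intervals themselves, we can ask rstanarm directly:
# 95% posterior credible intervals for each interaction model
posterior_interval(int, prob = 0.95)
posterior_interval(int2, prob = 0.95)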
# Extract the posterior samples
posterior_samples <- as.matrix(int)
posterior_samples2 <- as.matrix(int2)
# Plot the posterior distribution of the interaction term
mcmc_hist(posterior_samples, pars = c("t1_dragonKills:t2_dragonKills"))

mcmc_hist(posterior_samples2, pars = c("t1_baronKills:t2_baronKills"))

The posterior distributions of the interaction coefficients don't look skewed either, which looks good.
We can conclude that there may be a meaningful interaction between the two teams' dragon kills, and likewise between their baron kills.
We can utilize this info later on when building a better model
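Here is a sketch of what the dragon interaction implies on the outcome scale (the grid values are illustrative choices of mine, not from the data):
# Expected game duration as t1 dragon kills vary, at a few fixed t2 values
int_grid <- expand.grid(t1_dragonKills = 0:6, t2_dragonKills = c(0, 3, 6))
int_grid %>%
  add_epred_draws(int, ndraws = 100) %>%
  ggplot(aes(x = t1_dragonKills, y = .epred,
             group = interaction(t2_dragonKills, .draw),
             color = factor(t2_dragonKills))) +
  geom_line(alpha = 0.15) +
  labs(y = "Expected game duration (s)", color = "t2 dragon kills")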
Main Model
Let's assess how much getting a head start in the game matters for determining the length of the game
main <- stan_glm(gameDuration ~ winner + firstBlood + firstTower + firstDragon + firstBaron,
data = lol_data, family = gaussian,
prior_intercept = normal(1800, 150, autoscale = TRUE),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(1, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 84735)
prior_summary(main)
Priors for model 'main'
------
Intercept (after predictors centered)
Specified prior:
~ normal(location = 1800, scale = 150)
Adjusted prior:
~ normal(location = 1800, scale = 64093)
Coefficients
Specified prior:
~ normal(location = [0,0,0,...], scale = [2.5,2.5,2.5,...])
Adjusted prior:
~ normal(location = [0,0,0,...], scale = [2136.54,2137.02,2137.26,...])
Auxiliary (sigma)
Specified prior:
~ exponential(rate = 1)
Adjusted prior:
~ exponential(rate = 0.0023)
------
See help('prior_summary.stanreg') for more details
mcmc_trace(main, size = .1)

Chains look normal
rhat(main)
(Intercept) winner2 firstBlood2 firstTower2 firstDragon1
1.0001252 1.0000866 0.9999479 0.9999554 1.0000860
firstDragon2 firstBaron1 firstBaron2 sigma
1.0001157 1.0000559 0.9999221 1.0000304
neff_ratio(main)
(Intercept) winner2 firstBlood2 firstTower2 firstDragon1
0.56465 1.06070 1.40485 1.23750 0.54275
firstDragon2 firstBaron1 firstBaron2 sigma
0.54175 1.05630 1.05235 1.12255
Rhat looks good; some of the n_eff ratio values are still high, but acceptable for now
tidy(main, effects = c("fixed", "aux"),
conf.int = TRUE, conf.level = 0.95)
From the tidy output, it looks like winner, firstBlood, and firstTower lose their significance when the other predictors are added, since their 95% credible intervals include 0. We'll use this info later for refining our model
newdata <- data.frame(winner = factor(1, levels = levels(lol_data$winner)),
firstBlood = factor(2, levels = levels(lol_data$firstBlood)),
firstTower = factor(2, levels = levels(lol_data$firstTower)),
firstDragon = factor(2, levels = levels(lol_data$firstDragon)),
firstBaron = factor(2, levels = levels(lol_data$firstBaron)))
main_predict <- posterior_predict(
main,
newdata = newdata)
mcmc_areas(main_predict) + xlab("Game Duration") +
ggtitle('Predictive distribution of a League of Legends game where first blood, tower, dragon, baron was team 2 and team 1 won') +
theme(plot.title = element_text(size = 7))

Here's an example of where a game's duration may lie if team 1 wins despite being behind from the start. The game runs a little longer than normal, which is to be expected, as the team would need time to mount the comeback in the first place. However, I believe this undershoots the true time we could expect from this scenario. We'll verify whether that is true after refining our model
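For a numeric read on the same scenario, we can summarize the predictive draws directly:
# Median and 95% predictive interval for the comeback scenario above
quantile(main_predict, probs = c(0.025, 0.5, 0.975))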
More Predictors
We can add some numerical variables to see how the number of each objective a team has taken affects the game duration
ext_main <- stan_glm(gameDuration ~ winner + firstBlood + firstTower + firstDragon + firstBaron + t1_towerKills + t1_inhibitorKills + t1_baronKills + t1_dragonKills + t2_towerKills + t2_inhibitorKills + t2_baronKills + t2_dragonKills,
data = lol_data, family = gaussian,
prior_intercept = normal(1800, 150, autoscale = TRUE),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(1, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 84735)
prior_summary(ext_main)
Priors for model 'ext_main'
------
Intercept (after predictors centered)
Specified prior:
~ normal(location = 1800, scale = 150)
Adjusted prior:
~ normal(location = 1800, scale = 64093)
Coefficients
Specified prior:
~ normal(location = [0,0,0,...], scale = [2.5,2.5,2.5,...])
Adjusted prior:
~ normal(location = [0,0,0,...], scale = [2136.54,2137.02,2137.26,...])
Auxiliary (sigma)
Specified prior:
~ exponential(rate = 1)
Adjusted prior:
~ exponential(rate = 0.0023)
------
See help('prior_summary.stanreg') for more details
mcmc_trace(ext_main, size = .1)

Traces look good
rhat(ext_main)
(Intercept) winner2 firstBlood2 firstTower2
0.9999612 0.9999395 0.9999895 0.9998515
firstDragon1 firstDragon2 firstBaron1 firstBaron2
1.0000056 1.0000145 1.0001244 1.0001247
t1_towerKills t1_inhibitorKills t1_baronKills t1_dragonKills
0.9999698 1.0000633 1.0001761 0.9998990
t2_towerKills t2_inhibitorKills t2_baronKills t2_dragonKills
0.9998515 0.9999585 1.0003134 0.9999253
sigma
0.9998820
neff_ratio(ext_main)
(Intercept) winner2 firstBlood2 firstTower2
0.58290 0.79920 2.05680 1.44595
firstDragon1 firstDragon2 firstBaron1 firstBaron2
0.48310 0.48375 0.66920 0.65460
t1_towerKills t1_inhibitorKills t1_baronKills t1_dragonKills
0.71785 1.02960 0.65910 0.94240
t2_towerKills t2_inhibitorKills t2_baronKills t2_dragonKills
0.74235 1.01930 0.65770 0.91395
sigma
1.16840
Rhat looks good, but some n_eff ratios (e.g. firstBlood, firstTower) are now too high to keep and will need to be handled later
tidy(ext_main, effects = c("fixed", "aux"),
conf.int = TRUE, conf.level = 0.95)
Here, the predictors winner, firstBlood, and firstTower are not significant, since their 95% credible intervals include 0
# dataframe with the specified values
newdata_ext <- data.frame(
winner = factor(1, levels = levels(lol_data$winner)),
firstBlood = factor(2, levels = levels(lol_data$firstBlood)),
firstTower = factor(2, levels = levels(lol_data$firstTower)),
firstDragon = factor(2, levels = levels(lol_data$firstDragon)),
firstBaron = factor(2, levels = levels(lol_data$firstBaron)),
t1_towerKills = 7,
t1_inhibitorKills = 2,
t1_baronKills = 2,
t1_dragonKills = 2,
t2_towerKills = 3,
t2_inhibitorKills = 2,
t2_baronKills = 1,
t2_dragonKills = 3
)
ext_main_predict <- posterior_predict(ext_main, newdata = newdata_ext)
mcmc_areas(ext_main_predict) +
xlab("Game Duration") +
ggtitle('Predictive distribution of a League of Legends game where first blood, tower, dragon, baron was team 2 and team 1 won, with additional predictors') +
theme(plot.title = element_text(size = 8))

After adding a few more variables into play, we can see the game gets longer (and expectedly so). Team 1 comes from behind but ends up taking 2 barons and 7 towers, leading us to believe this game would have to go on for a while for the comeback to truly be completed
Refined Model
Let's now remove the high n_eff ratio terms (i.e. ratios > 1) from the extended model and include the interaction terms we deemed useful beforehand
ext_main_int <- stan_glm(gameDuration ~ firstDragon + firstBaron + t1_baronKills + t1_dragonKills + t2_baronKills + t2_dragonKills + t1_baronKills:t2_baronKills + t1_dragonKills:t2_dragonKills,
data = lol_data, family = gaussian,
prior_intercept = normal(1800, 150, autoscale = TRUE),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(1, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 84735)
prior_summary(ext_main_int)
Priors for model 'ext_main_int'
------
Intercept (after predictors centered)
Specified prior:
~ normal(location = 1800, scale = 150)
Adjusted prior:
~ normal(location = 1800, scale = 64093)
Coefficients
Specified prior:
~ normal(location = [0,0,0,...], scale = [2.5,2.5,2.5,...])
Adjusted prior:
~ normal(location = [0,0,0,...], scale = [2136.66,2136.42,2289.92,...])
Auxiliary (sigma)
Specified prior:
~ exponential(rate = 1)
Adjusted prior:
~ exponential(rate = 0.0023)
------
See help('prior_summary.stanreg') for more details
mcmc_trace(ext_main_int, size = .1)

So far so good
rhat(ext_main_int)
(Intercept) firstDragon1
1.0003440 1.0003428
firstDragon2 firstBaron1
1.0003775 1.0000760
firstBaron2 t1_baronKills
1.0002465 1.0000761
t1_dragonKills t2_baronKills
0.9998796 1.0002153
t2_dragonKills t1_baronKills:t2_baronKills
0.9999051 1.0002525
t1_dragonKills:t2_dragonKills sigma
1.0000077 1.0000297
Rhat looks satisfactory for all predictors
neff_ratio(ext_main_int)
(Intercept) firstDragon1
0.58795 0.50540
firstDragon2 firstBaron1
0.50105 0.56190
firstBaron2 t1_baronKills
0.53910 0.53865
t1_dragonKills t2_baronKills
0.67185 0.53005
t2_dragonKills t1_baronKills:t2_baronKills
0.67705 0.66965
t1_dragonKills:t2_dragonKills sigma
0.82210 1.12835
The n_eff ratios are much more reasonable now and a big improvement over the prior models, with no coefficient's ratio above 0.83 (only sigma remains above 1)
tidy(ext_main_int, effects = c("fixed", "aux"),
conf.int = TRUE, conf.level = 0.95)
Every coefficient looks significant from its 95% credible interval (even the interaction terms!), since none of the intervals include 0.
# dataframe with the specified values
newdata_ext_int <- data.frame(
firstDragon = factor(2, levels = levels(lol_data$firstDragon)),
firstBaron = factor(2, levels = levels(lol_data$firstBaron)),
t1_baronKills = 2,
t1_dragonKills = 2,
t2_baronKills = 1,
t2_dragonKills = 3
)
ext_main_int_predict <- posterior_predict(ext_main_int, newdata = newdata_ext_int)
mcmc_areas(ext_main_int_predict) +
xlab("Game Duration") +
ggtitle('Predictive distribution of a League of Legends game where first dragon, baron was team 2 and team 1 had more baron kills with interaction') +
theme(plot.title = element_text(size = 8))

The games seem to drag on even longer with this new model. This is expected, since the dragon and baron respawn times are quite long, so, as said before, the comeback would realistically take longer than average.
Let's do some model comparisons to verify which model is best
Model Comparisons
PP Check
pp_check(main, nreps = 50) + xlab("Game Duration") +
ggtitle('Main effects model')

The main model covers a decent amount of the area, yet it drifts too far right and undershoots at the peak
pp_check(ext_main, nreps = 50) + xlab("Game Duration") +
ggtitle('Extended Main effects model')

The extended model is objectively better than the main model, but still undershoots the area at the peak
pp_check(ext_main_int, nreps = 50) + xlab("Game Duration") +
ggtitle('Refined Interaction effects model')

The interaction model looks very similar to the extended model, and both are better than the main model.
The posterior predictive checks show most of the area being covered by the last two models, with the interaction model doing slightly better around the peak
10-Fold Cross-Validation
test_sample <- lol_data %>% head(10000)
nrow(test_sample)
[1] 10000
set.seed(84735)
p_main <- prediction_summary(model = main, data = test_sample)
p_ext_main <- prediction_summary(model = ext_main, data = test_sample)
p_ext_main_int <- prediction_summary(model = ext_main_int, data = test_sample)
rbind(p_main, p_ext_main, p_ext_main_int)
These are the raw MAE values based on a sample of 10000 observations from the data. We can use them to gauge the behavior and bias of each model alongside their cross-validation results
set.seed(84735)
cv_main <- prediction_summary_cv(
  model = main, data = lol_data, k = 10)
cv_extend <- prediction_summary_cv(
  model = ext_main, data = lol_data, k = 10)
cv_interact <- prediction_summary_cv(
  model = ext_main_int, data = lol_data, k = 10)
rbind(cv_main$cv, cv_extend$cv, cv_interact$cv)
Loo Diagnostics
set.seed(34521)
main_elpd <- loo(main)
ext_main_elpd <- loo(ext_main)
ext_main_interact_elpd <- loo(ext_main_int)
main_elpd$estimates
Estimate SE
elpd_loo -3.323804e+05 174.6839030
p_loo 9.093789e+00 0.1196568
looic 6.647608e+05 349.3678059
ext_main_elpd$estimates
Estimate SE
elpd_loo -309859.5959 173.9985196
p_loo 19.2336 0.2843918
looic 619719.1918 347.9970392
ext_main_interact_elpd$estimates
Estimate SE
elpd_loo -314197.32595 179.0596124
p_loo 13.56236 0.2282117
looic 628394.65191 358.1192249
c(main_elpd$estimates[1], ext_main_elpd$estimates[1], ext_main_interact_elpd$estimates[1])
[1] -332380.4 -309859.6 -314197.3
loo_compare(main_elpd, ext_main_elpd, ext_main_interact_elpd)
elpd_diff se_diff
ext_main 0.0 0.0
ext_main_int -4337.7 96.8
main -22520.8 178.3
The extended model fares better than the interaction model when comparing their ELPD and MAE. The interaction model scores about 4337.7 points "worse" on ELPD and is around 16 points higher in its MAE. However, something to consider here is overfitting and how adding more predictors can affect the model overall and skew our metrics.
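A useful rule of thumb when reading loo_compare is to scale each elpd_diff by its standard error; a gap beyond roughly two SEs is usually treated as meaningful. A quick sketch (the reference model's row is 0/0 and comes out NaN):
# elpd_diff in standard-error units; the top (reference) row yields NaN
comp <- loo_compare(main_elpd, ext_main_elpd, ext_main_interact_elpd)
comp[, "elpd_diff"] / comp[, "se_diff"]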
Despite the interaction model covering more area, its MAE and ELPD are both worse. This may be because those metrics are inflated by the larger number of predictors in the extended model. Furthermore, the extended model contains numerous parameters with higher-than-normal Neff ratios, which is a cause for concern about the validity of the model, along with predictors that were not significant.
Despite having fewer predictors, the interaction model garners more area in its posterior predictive distribution while keeping a similar MAE and a good ELPD score too (-309859.6 vs. -314197.3), a minor (~1%) difference given the magnitude of these values. However, these interaction terms could also be having an adverse effect on the model, leading to the greater errors.
Overfitting
rbind(p_main, p_ext_main, p_ext_main_int)
rbind(cv_main$cv, cv_extend$cv, cv_interact$cv)
Based on the difference between the raw MAE values and the cross-validation MAE values, overfitting does not seem to pose a threat to these models; however, we could only truly confirm this by testing the models on entirely new data. Furthermore, the extended model does show a larger (though mostly negligible, given how small it is) gap between its raw and cross-validation MAE compared to the other models.
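To make the raw-versus-CV comparison concrete, here is a sketch tabulating the MAE gap per model (assuming prediction_summary and prediction_summary_cv both expose an mae column, as in bayesrules):
# Gap between cross-validated and raw MAE; a large positive gap would hint at overfitting
data.frame(model   = c("main", "ext_main", "ext_main_int"),
           raw_mae = c(p_main$mae, p_ext_main$mae, p_ext_main_int$mae),
           cv_mae  = c(cv_main$cv$mae, cv_extend$cv$mae, cv_interact$cv$mae)) %>%
  mutate(gap = cv_mae - raw_mae)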
What is undoubtedly clear is that adding numerical predictors to the model does help the predictions overall, as shown by both the extended and interaction models faring better than the main model in every aspect.
Regression Inference
Let's see how the two best models fare in a quick hypothesis test of the posterior probability that a game lasts around 40 minutes (2500 seconds or more).
We'll use the posterior predictive distributions we created and visualized earlier to conduct the tests
\[
H_0: \pi \geq 2500
\]
\[
H_a: \pi < 2500
\]
# Extract the posterior samples
posterior_samples <- ext_main_int_predict
p_H0 <- mean(posterior_samples >= 2500)
p_Ha <- mean(posterior_samples < 2500)
# Print the results
cat("Posterior probability of H0 (π ≥ 2500):", p_H0)
Posterior probability of H0 (π ≥ 2500): 0.9364
cat("Posterior probability of Ha (π < 2500):", p_Ha)
Posterior probability of Ha (π < 2500): 0.0636
# Extract the posterior samples
posterior_samples <- ext_main_predict
p_H0 <- mean(posterior_samples >= 2500)
p_Ha <- mean(posterior_samples < 2500)
# Print the results
cat("Posterior probability of H0 (π ≥ 2500):", p_H0)
Posterior probability of H0 (π ≥ 2500): 0.42085
cat("Posterior probability of Ha (π < 2500):", p_Ha)
Posterior probability of Ha (π < 2500): 0.57915
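The two tests condense nicely into posterior odds in favor of H0; a small sketch reusing the predictive draws from above:
# Posterior odds that the game lasts at least 2500 seconds, under each model
p_int <- mean(ext_main_int_predict >= 2500)
p_ext <- mean(ext_main_predict >= 2500)
c(interaction = p_int / (1 - p_int), # about 14.7 to 1 in favor of H0
  extended    = p_ext / (1 - p_ext)) # about 0.73 to 1, leaning toward Ha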
Though the two models seemingly capture the actual distribution of the data well, they produce very different outcomes in this test.
The extended model favors the alternative hypothesis while the interaction model clearly favors the null hypothesis. This coincides with what we saw before in the visualizations and tells us that the extended model expects such games to be shorter while the interaction model expects games like this to take longer.
This is also something one should consider when choosing a model: whether the predictions themselves seem realistic given the scenario and circumstances.
To back up this point: Baron Nashor spawns at 20 minutes and respawns 6 minutes after being killed. Since 3 Barons are killed in both scenarios, the third kill cannot happen before roughly minute 32 (20 + 6 + 6), and only if the teams kill each Baron IMMEDIATELY (which is rarely the case). Therefore, one would expect the majority of the area to lie within the ~40 minute range at the very least for when the game would end.
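The timing argument in code form, as a toy calculation that assumes a 20-minute initial spawn, a 6-minute respawn, and instant kills:
# Earliest possible minute of the k-th Baron kill under instant-kill assumptions
baron_spawn   <- 20
baron_respawn <- 6
earliest_baron <- function(k) baron_spawn + (k - 1) * baron_respawn
earliest_baron(3) # 32 minutes, before any time needed to actually close out the game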
Conclusion
In conclusion, the choice comes down to whether one wants to risk a model that may be susceptible to overfitting for the sake of potentially lower error, or accept a slightly higher-error model with interactions that is less likely to overfit.
Personally, I say that the refined interaction model is the best model to use for predicting the game duration of a LOL match given our data, due to its comparable MAE along with its better posterior predictive abilities and the lack of evidence for overfitting. Furthermore, based on my experience and the game's rules, the values it predicts end up being much more realistic in the grand scheme of things once factors such as respawn timers are considered.
References
https://www.bayesrulesbook.com/
https://stats.stackexchange.com/questions/296059/effective-sample-size-greater-than-actual-sample-size
https://stephenallwright.com/good-mae-score/
https://discourse.mc-stan.org/t/understanding-looic/13409/6
https://discourse.mc-stan.org/t/projpred-elpd-goes-down-and-rmse-goes-up-after-x-variables/13153
https://stats.stackexchange.com/questions/313564/how-does-bayesian-analysis-make-accurate-predictions-using-subjectively-chosen-p
https://medium.com/@ooemma83/interpretation-of-evaluation-metrics-for-regression-analysis-mae-mse-rmse-mape-r-squared-and-5693b61a9833
https://statisticsbyjim.com/regression/overfitting-regression-models/
https://stats.stackexchange.com/questions/9053/how-does-cross-validation-overcome-the-overfitting-problem
Honor Pledge
On my honor, I have neither received nor given any unauthorized assistance on this project.
Signed: Thomas Christo (tjc260)