Introduction to International Football Analytics
International football presents unique analytical challenges. Unlike club football, where teams train and play together week after week, national teams convene infrequently and have limited preparation time. This creates distinct analytical considerations for squad selection, tactical planning, and tournament prediction.
The Unique Nature of Tournament Football
Major tournaments like the World Cup or European Championship feature compressed schedules, high-stakes single matches, and the convergence of players from different club systems. Analytics must adapt to these conditions while accounting for the limited sample sizes of international matches.
# Loading International Football Data
import pandas as pd
from statsbombpy import sb
# Get available competitions from StatsBomb
competitions = sb.competitions()
# Filter for World Cup data
wc_competitions = competitions[
    competitions["competition_name"].str.contains("World Cup", case=False, na=False)
]
print("Available World Cup Data:")
print(wc_competitions[["competition_name", "season_name", "competition_id"]])
# Load World Cup 2022 matches
wc_matches = sb.matches(competition_id=43, season_id=106)
print(f"\nWorld Cup 2022: {len(wc_matches)} matches")
print(wc_matches[["match_date", "home_team", "away_team", "home_score", "away_score"]].head(10))
# Load event data for analysis
all_events = []
for match_id in wc_matches["match_id"][:10]:  # First 10 matches
    events = sb.events(match_id=match_id)
    events["match_id"] = match_id
    all_events.append(events)
wc_events = pd.concat(all_events, ignore_index=True)
print(f"\nTotal events loaded: {len(wc_events)}")
# Historical World Cup results (TidyTuesday dataset, 2022-11-29)
try:
    historical_url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-11-29/worldcups.csv"
    historical_wc = pd.read_csv(historical_url)
    print("\nHistorical World Cup Winners:")
    print(historical_wc[["year", "winner", "host"]].tail(10))
except Exception as exc:  # e.g. no network access or the file has moved
    print(f"Could not load historical data: {exc}")
# Loading International Football Data
library(tidyverse)
library(worldfootballR)
# Get World Cup 2022 match results
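# International competitions are requested through non_dom_league_url
# rather than a country/tier pair
wc_2022_matches <- fb_match_results(
  country = "",
  gender = "M",
  season_end_year = 2022,
  tier = "",
  non_dom_league_url = "https://fbref.com/en/comps/1/history/World-Cup-Seasons"
)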
# Get historical World Cup data
historical_wc <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-11-29/worldcups.csv")
print("World Cup 2022 Matches:")
print(head(wc_2022_matches))
# StatsBomb free World Cup data
library(StatsBombR)
# Get World Cup competitions
comps <- FreeCompetitions()
wc_comps <- comps %>%
filter(str_detect(competition_name, "World Cup"))
print("Available World Cup Data:")
print(wc_comps %>% select(competition_name, season_name))
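# Load 2022 World Cup matches (competition_id 43, season_id 106)
# FreeMatches() expects rows from FreeCompetitions(), not a bare id
wc_matches <- FreeMatches(wc_comps %>% filter(competition_id == 43, season_id == 106))
# Get all events (get.matchFree() handles one match at a time)
wc_events <- free_allevents(MatchesDF = wc_matches)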
cat("Total World Cup 2022 events:", nrow(wc_events), "\n")
Tournament Prediction Models
Predicting tournament outcomes requires modeling not just team strength, but also bracket dynamics, match-by-match probabilities, and the inherent randomness of knockout football.
Key Components of Tournament Prediction
- Team Ratings: Elo ratings, FIFA rankings, or custom strength models
- Match Simulation: Poisson goal models and Monte Carlo methods
- Bracket Paths: knockout draw implications and path difficulty
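To make the "Poisson goals" component concrete, the sketch below converts two expected-goal rates into win, draw, and loss probabilities without simulation. It assumes the two scores are independent Poisson variables, which is a common simplification rather than something this chapter establishes; the function names and the example rates (1.6 and 1.0) are illustrative only.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def match_outcome_probs(lam_a: float, lam_b: float, max_goals: int = 10):
    """Return (P(A wins), P(draw), P(B wins)) for independent Poisson scores.

    The score grid is truncated at max_goals per team, so the three
    probabilities sum to very slightly less than 1.
    """
    win_a = draw = win_b = 0.0
    for i in range(max_goals + 1):
        p_i = poisson_pmf(i, lam_a)
        for j in range(max_goals + 1):
            p = p_i * poisson_pmf(j, lam_b)
            if i > j:
                win_a += p
            elif i == j:
                draw += p
            else:
                win_b += p
    return win_a, draw, win_b


# Illustrative rates only: team A expects 1.6 goals, team B expects 1.0
win_a, draw, win_b = match_outcome_probs(1.6, 1.0)
print(f"A wins: {win_a:.3f}, draw: {draw:.3f}, B wins: {win_b:.3f}")
```

In a Monte Carlo simulator these same rates would be sampled instead of summed; the closed-form grid is useful as a sanity check on the simulated frequencies.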
# Tournament Prediction Model
import numpy as np
import pandas as pd
from scipy.stats import poisson
class EloRatingSystem:
    """Elo rating system for international football"""

    def __init__(self, k=30, home_advantage=100, initial_rating=1500):
        self.k = k
        self.home_advantage = home_advantage
        self.initial_rating = initial_rating
        self.ratings = {}

    def get_rating(self, team):
        return self.ratings.get(team, self.initial_rating)

    def expected_score(self, rating_a, rating_b, neutral=True):
        """Calculate expected score for team A (treated as the home side if not neutral)"""
        diff = rating_b - rating_a
        if not neutral:
            diff -= self.home_advantage
        return 1 / (1 + 10 ** (diff / 400))

    def update_ratings(self, home_team, away_team, home_score, away_score):
        """Update ratings after a match"""
        r_home = self.get_rating(home_team)
        r_away = self.get_rating(away_team)
        e_home = self.expected_score(r_home, r_away, neutral=False)
        e_away = 1 - e_home
        # Actual results (1 = win, 0.5 = draw, 0 = loss)
        if home_score > away_score:
            s_home, s_away = 1, 0
        elif home_score < away_score:
            s_home, s_away = 0, 1
        else:
            s_home, s_away = 0.5, 0.5
        # Update
        self.ratings[home_team] = r_home + self.k * (s_home - e_home)
        self.ratings[away_team] = r_away + self.k * (s_away - e_away)
class TournamentSimulator:
    """Monte Carlo tournament simulation"""

    def __init__(self, elo_system):
        self.elo = elo_system

    def simulate_match(self, team_a, team_b, neutral=True):
        """Simulate a single match"""
        rating_a = self.elo.get_rating(team_a)
        rating_b = self.elo.get_rating(team_b)
        prob_a = self.elo.expected_score(rating_a, rating_b, neutral)
        # Poisson goal model (heuristic mapping from win probability to goal rates)
        lambda_a = 1.3 * (prob_a + 0.2)
        lambda_b = 1.3 * (1 - prob_a + 0.2)
        goals_a = poisson.rvs(lambda_a)
        goals_b = poisson.rvs(lambda_b)
        # Handle draws in knockout (penalties)
        if goals_a == goals_b:
            winner = team_a if np.random.random() < prob_a else team_b
        else:
            winner = team_a if goals_a > goals_b else team_b
        return {"goals_a": goals_a, "goals_b": goals_b, "winner": winner}

    def simulate_tournament(self, teams, n_simulations=10000):
        """Simulate a single-elimination bracket many times.

        The number of teams must be a power of two.
        """
        results = {team: {"wins": 0, "finals": 0, "semis": 0} for team in teams}
        for _ in range(n_simulations):
            remaining = list(teams)
            np.random.shuffle(remaining)  # Random draw for each simulated bracket
            while len(remaining) > 1:
                next_round = []
                for i in range(0, len(remaining), 2):
                    team_a = remaining[i]
                    team_b = remaining[i + 1]
                    # Track semifinals: only when exactly four teams remain,
                    # otherwise finalists would be counted twice
                    if len(remaining) == 4:
                        results[team_a]["semis"] += 1
                        results[team_b]["semis"] += 1
                    # Track finals
                    if len(remaining) == 2:
                        results[team_a]["finals"] += 1
                        results[team_b]["finals"] += 1
                    match = self.simulate_match(team_a, team_b)
                    next_round.append(match["winner"])
                remaining = next_round
            results[remaining[0]]["wins"] += 1
        # Convert to DataFrame
        df = pd.DataFrame(results).T
        df["win_pct"] = df["wins"] / n_simulations * 100
        df["final_pct"] = df["finals"] / n_simulations * 100
        df["semi_pct"] = df["semis"] / n_simulations * 100
        return df.sort_values("win_pct", ascending=False)
# Usage example
elo = EloRatingSystem()
# Sample ratings (approximate 2022 World Cup)
teams = {
"Brazil": 2166, "Argentina": 2143, "France": 2090,
"England": 2040, "Spain": 2045, "Germany": 1980,
"Netherlands": 2032, "Portugal": 2006
}
for team, rating in teams.items():
    elo.ratings[team] = rating
simulator = TournamentSimulator(elo)
results = simulator.simulate_tournament(list(teams.keys()), n_simulations=10000)
print("Tournament Win Probabilities:")
print(results[["win_pct", "final_pct", "semi_pct"]])
# Tournament Prediction Model
library(tidyverse)
# Elo Rating System for International Football
calculate_elo <- function(matches, k = 30, home_advantage = 100) {
# Initialize ratings (1500 baseline)
ratings <- list()
for (i in 1:nrow(matches)) {
home <- matches$home_team[i]
away <- matches$away_team[i]
# Get current ratings
r_home <- ifelse(home %in% names(ratings), ratings[[home]], 1500)
r_away <- ifelse(away %in% names(ratings), ratings[[away]], 1500)
# Expected scores
e_home <- 1 / (1 + 10^((r_away - r_home - home_advantage) / 400))
e_away <- 1 - e_home
# Actual scores (1 = win, 0.5 = draw, 0 = loss)
if (matches$home_score[i] > matches$away_score[i]) {
s_home <- 1; s_away <- 0
} else if (matches$home_score[i] < matches$away_score[i]) {
s_home <- 0; s_away <- 1
} else {
s_home <- 0.5; s_away <- 0.5
}
# Update ratings
ratings[[home]] <- r_home + k * (s_home - e_home)
ratings[[away]] <- r_away + k * (s_away - e_away)
}
return(ratings)
}
# Monte Carlo Tournament Simulation
simulate_match <- function(elo_a, elo_b, neutral = TRUE) {
# Calculate win probability
if (neutral) {
prob_a <- 1 / (1 + 10^((elo_b - elo_a) / 400))
} else {
prob_a <- 1 / (1 + 10^((elo_b - elo_a - 100) / 400))
}
# Generate goals using Poisson
lambda_a <- 1.3 * (prob_a + 0.2)
lambda_b <- 1.3 * (1 - prob_a + 0.2)
goals_a <- rpois(1, lambda_a)
goals_b <- rpois(1, lambda_b)
# Handle draws in knockout (penalties)
if (goals_a == goals_b) {
# Simplified: use probability for penalty winner
if (runif(1) < prob_a) {
return(c(goals_a, goals_b, "A"))
} else {
return(c(goals_a, goals_b, "B"))
}
}
winner <- ifelse(goals_a > goals_b, "A", "B")
return(c(goals_a, goals_b, winner))
}
# Full tournament simulation
simulate_tournament <- function(teams, elo_ratings, n_sims = 10000) {
results <- tibble(
team = teams,
wins = 0,
finals = 0,
semis = 0
)
for (sim in 1:n_sims) {
# Simulate bracket (simplified 8-team knockout)
remaining <- teams
round <- 1
while (length(remaining) > 1) {
winners <- c()
for (i in seq(1, length(remaining), by = 2)) {
team_a <- remaining[i]
team_b <- remaining[i + 1]
result <- simulate_match(elo_ratings[[team_a]], elo_ratings[[team_b]])
winner <- ifelse(result[3] == "A", team_a, team_b)
winners <- c(winners, winner)
# Track progress
if (length(remaining) == 2) {
results$finals[results$team == team_a] <- results$finals[results$team == team_a] + 1
results$finals[results$team == team_b] <- results$finals[results$team == team_b] + 1
}
# Count semifinal appearances only when four teams remain,
# otherwise finalists would be counted twice
if (length(remaining) == 4) {
results$semis[results$team == team_a] <- results$semis[results$team == team_a] + 1
results$semis[results$team == team_b] <- results$semis[results$team == team_b] + 1
}
}
remaining <- winners
round <- round + 1
}
results$wins[results$team == remaining[1]] <- results$wins[results$team == remaining[1]] + 1
}
results %>%
mutate(
win_pct = wins / n_sims * 100,
final_pct = finals / n_sims * 100,
semi_pct = semis / n_sims * 100
) %>%
arrange(desc(win_pct))
}
Squad Selection Analytics
International managers face unique challenges in squad selection. With limited roster spots (typically 23-26 players), managers must balance positional coverage, form versus experience, and the chemistry of players from different club systems.
Squad Selection Challenges
- Limited preparation time between call-up and matches
- Players arriving in varying fitness states from club seasons
- Need for positional flexibility with small rosters
- Balancing current form against tournament experience
- Managing workload for players with heavy club schedules
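The trade-offs listed above can be framed as a constrained selection problem: maximize a total score subject to a squad size and positional minimums. The sketch below solves a deliberately tiny instance by exhaustive search so the optimum is unambiguous. The player names, scores, and minimums are hypothetical values invented for illustration, and `best_squad` is not part of any library.

```python
from itertools import combinations

# Hypothetical mini-pool: (name, position, score). All values are made up.
POOL = [
    ("GK_A", "GK", 80), ("GK_B", "GK", 74),
    ("DF_A", "DF", 85), ("DF_B", "DF", 79), ("DF_C", "DF", 77),
    ("MF_A", "MF", 88), ("MF_B", "MF", 83),
    ("FW_A", "FW", 90), ("FW_B", "FW", 81),
]
# Assumed positional minimums for a six-player squad
MINIMUMS = {"GK": 1, "DF": 2, "MF": 1, "FW": 1}


def best_squad(pool, squad_size, minimums):
    """Try every squad of the given size and return the highest-scoring
    one that satisfies every positional minimum."""
    best, best_score = None, float("-inf")
    for squad in combinations(pool, squad_size):
        counts = {}
        for _, pos, _ in squad:
            counts[pos] = counts.get(pos, 0) + 1
        # Skip squads that miss any positional minimum
        if any(counts.get(pos, 0) < need for pos, need in minimums.items()):
            continue
        score = sum(s for _, _, s in squad)
        if score > best_score:
            best, best_score = squad, score
    return best, best_score


squad, total = best_squad(POOL, squad_size=6, minimums=MINIMUMS)
print([name for name, _, _ in squad], total)
```

Exhaustive search is only practical for toy pools like this one. For a realistic pool the same objective and constraints are usually handed to an integer programming solver, while a greedy pass is a quick approximation that can miss the optimum.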
# Squad Selection Optimization
import pandas as pd
import numpy as np
def create_player_pool():
    """Create sample player pool"""
    data = {
        "player": ["GK1", "GK2", "GK3", "CB1", "CB2", "CB3", "CB4",
                   "LB1", "LB2", "RB1", "RB2", "CM1", "CM2", "CM3",
                   "DM1", "DM2", "AM1", "AM2", "LW1", "LW2", "RW1",
                   "RW2", "ST1", "ST2", "ST3"],
        "position": ["GK", "GK", "GK", "CB", "CB", "CB", "CB",
                     "LB", "LB", "RB", "RB", "CM", "CM", "CM",
                     "DM", "DM", "AM", "AM", "LW", "LW", "RW",
                     "RW", "ST", "ST", "ST"],
        "quality": [90, 82, 78, 88, 86, 84, 80, 85, 78, 84, 79,
                    92, 88, 85, 86, 82, 90, 84, 88, 82, 87, 80,
                    91, 86, 82],
        "form": [85, 88, 90, 85, 90, 82, 88, 87, 92, 86, 85,
                 88, 90, 85, 84, 88, 92, 86, 90, 85, 88, 92,
                 87, 92, 80],
        "experience": [95, 60, 30, 90, 75, 80, 40, 85, 30, 80, 45,
                       95, 70, 80, 85, 50, 75, 60, 70, 40, 75, 30,
                       90, 55, 70],
        "versatility": [1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1,
                        3, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1,
                        2, 1, 1],
        "age": [32, 26, 23, 30, 28, 29, 24, 29, 22, 28, 24,
                31, 27, 29, 30, 25, 26, 27, 27, 23, 28, 22,
                29, 25, 30]
    }
    return pd.DataFrame(data)
def calculate_player_score(df, w_quality=0.4, w_form=0.3,
                           w_experience=0.2, w_versatility=0.1):
    """Calculate overall player score"""
    df = df.copy()
    df["overall_score"] = (
        df["quality"] * w_quality +
        df["form"] * w_form +
        df["experience"] * w_experience +
        df["versatility"] * 10 * w_versatility
    )
    return df
def optimize_squad_simple(players, squad_size=23):
    """Simple greedy approach with position constraints"""
    players = players.copy()
    # Minimum players per position. These sum to 18, leaving 5 flexible
    # spots in a 23-player squad drawn from the 25-player sample pool.
    # (With the pool's full 25 as minimums, nothing would be left to choose.)
    requirements = {
        "GK": 3,
        "CB": 3,
        "LB": 2,
        "RB": 2,
        "DM": 1,
        "CM": 2,
        "AM": 1,
        "LW": 1,
        "RW": 1,
        "ST": 2
    }
    selected = []
    # First pass: fulfill minimum requirements
    for pos, count in requirements.items():
        pos_players = players[players["position"] == pos].nlargest(count, "overall_score")
        selected.extend(pos_players.index.tolist())
    # Remove selected from pool
    remaining = players.drop(selected)
    # Fill remaining spots with best available
    spots_left = squad_size - len(selected)
    if spots_left > 0:
        best_remaining = remaining.nlargest(spots_left, "overall_score")
        selected.extend(best_remaining.index.tolist())
    return players.loc[selected].sort_values(["position", "overall_score"],
                                             ascending=[True, False])
# Run optimization
pool = create_player_pool()
pool = calculate_player_score(pool)
squad = optimize_squad_simple(pool)
print(f"Optimized {len(squad)}-Player Squad:")
print(squad[["player", "position", "quality", "form", "experience", "overall_score"]])
# Squad composition analysis
print("\nSquad Composition:")
print(squad.groupby("position").size())
print(f"\nAverage Age: {squad['age'].mean():.1f}")
print(f"Average Quality: {squad['quality'].mean():.1f}")
# Squad Selection Optimization
library(tidyverse)
library(lpSolve)
# Create player pool with attributes
create_player_pool <- function() {
tribble(
~player, ~position, ~quality, ~form, ~experience, ~versatility, ~age,
# Goalkeepers
"GK1", "GK", 90, 85, 95, 1, 32,
"GK2", "GK", 82, 88, 60, 1, 26,
"GK3", "GK", 78, 90, 30, 1, 23,
# Defenders
"CB1", "CB", 88, 85, 90, 2, 30,
"CB2", "CB", 86, 90, 75, 1, 28,
"CB3", "CB", 84, 82, 80, 2, 29,
"CB4", "CB", 80, 88, 40, 1, 24,
"LB1", "LB", 85, 87, 85, 2, 29,
"LB2", "LB", 78, 92, 30, 1, 22,
"RB1", "RB", 84, 86, 80, 2, 28,
"RB2", "RB", 79, 85, 45, 1, 24,
# Midfielders
"CM1", "CM", 92, 88, 95, 3, 31,
"CM2", "CM", 88, 90, 70, 2, 27,
"CM3", "CM", 85, 85, 80, 2, 29,
"DM1", "DM", 86, 84, 85, 2, 30,
"DM2", "DM", 82, 88, 50, 1, 25,
"AM1", "AM", 90, 92, 75, 2, 26,
"AM2", "AM", 84, 86, 60, 2, 27,
# Wingers
"LW1", "LW", 88, 90, 70, 2, 27,
"LW2", "LW", 82, 85, 40, 1, 23,
"RW1", "RW", 87, 88, 75, 2, 28,
"RW2", "RW", 80, 92, 30, 1, 22,
# Forwards
"ST1", "ST", 91, 87, 90, 2, 29,
"ST2", "ST", 86, 92, 55, 1, 25,
"ST3", "ST", 82, 80, 70, 1, 30
)
}
# Calculate overall player score
calculate_player_score <- function(players,
w_quality = 0.4,
w_form = 0.3,
w_experience = 0.2,
w_versatility = 0.1) {
players %>%
mutate(
overall_score = quality * w_quality +
form * w_form +
experience * w_experience +
versatility * 10 * w_versatility
)
}
# Optimize squad selection using integer linear programming
# The sample pool has 25 players, so the default squad size is 23; the
# minimums (3 + 7 + 6 + 5 = 21) leave two flexible spots
optimize_squad <- function(players, squad_size = 23,
                           min_gk = 3, min_def = 7, min_mid = 6, min_fwd = 5) {
  n <- nrow(players)
  # Objective: maximize total score
  obj <- players$overall_score
  # Constraints matrix
  constraints <- matrix(0, nrow = 5, ncol = n)
  # Squad size constraint
  constraints[1, ] <- 1
  # Position constraints
  constraints[2, players$position == "GK"] <- 1
  constraints[3, players$position %in% c("CB", "LB", "RB")] <- 1
  constraints[4, players$position %in% c("CM", "DM", "AM")] <- 1
  # Wingers count as forwards here; the pool contains only three strikers
  constraints[5, players$position %in% c("LW", "RW", "ST")] <- 1
  directions <- c("==", ">=", ">=", ">=", ">=")
  rhs <- c(squad_size, min_gk, min_def, min_mid, min_fwd)
  # Solve
  solution <- lp("max", obj, constraints, directions, rhs, all.bin = TRUE)
  if (solution$status != 0) {
    stop("No feasible squad for these constraints")
  }
  selected <- players[solution$solution > 0.5, ]
  return(selected)
}
# Run optimization
pool <- create_player_pool()
pool <- calculate_player_score(pool)
squad <- optimize_squad(pool)
print(paste0("Optimized ", nrow(squad), "-Player Squad:"))
print(squad %>%
arrange(position, desc(overall_score)) %>%
select(player, position, quality, form, experience, overall_score))
Knockout Stage Analytics
Knockout football is fundamentally different from league play. Single-match elimination creates extreme variance, and analytics must account for the heightened importance of individual moments, set pieces, and penalty shootouts.
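A quick calculation shows how sharply single elimination compounds uncertainty. The sketch below assumes a team advances from each tie independently and with the same probability, which is a simplification since real opponents vary by round; `title_probability` is a hypothetical helper written for this illustration.

```python
def title_probability(p_advance: float, rounds: int = 4) -> float:
    """Probability of winning `rounds` consecutive knockout ties when the
    team advances from each tie independently with probability p_advance."""
    return p_advance ** rounds


# Four ties: round of 16, quarter-final, semi-final, final
for p in (0.5, 0.6, 0.7, 0.8):
    print(f"advance {p:.0%} per tie -> title {title_probability(p):.1%}")
```

Under these assumptions, even a side that is a 70% favourite in every tie lifts the trophy in fewer than one of four tournaments, which is why knockout predictions are best reported as probabilities rather than picks.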
# Knockout Match Analysis
from statsbombpy import sb
import pandas as pd
import numpy as np
# Load World Cup data
wc_matches = sb.matches(competition_id=43, season_id=106)
# Identify knockout matches
knockout_rounds = ["Round of 16", "Quarter-finals", "Semi-finals", "Final"]
knockout_matches = wc_matches[wc_matches["competition_stage"].isin(knockout_rounds)]
print(f"Analyzing {len(knockout_matches)} knockout matches")
# Load events for knockout matches
all_events = []
for match_id in knockout_matches["match_id"]:
    events = sb.events(match_id=match_id)
    events["match_id"] = match_id
    all_events.append(events)
knockout_events = pd.concat(all_events, ignore_index=True)
# Team-level analysis for each match
def analyze_team_match(events, team):
    """Analyze team performance in a match"""
    team_events = events[events["team"] == team]
    # Exclude shootout kicks (StatsBomb period 5) so they do not inflate
    # shots, xG and goals
    team_events = team_events[team_events["period"] < 5]
    shots = team_events[team_events["type"] == "Shot"]
    passes = team_events[team_events["type"] == "Pass"]
    return {
        "team": team,
        "shots": len(shots),
        "xG": shots["shot_statsbomb_xg"].sum(),
        "goals": (shots["shot_outcome"] == "Goal").sum(),
        "passes": len(passes),
        # pass_outcome is empty for completed passes
        "pass_completion": passes["pass_outcome"].isna().mean() if len(passes) else 0.0,
        "pressures": (team_events["type"] == "Pressure").sum()
    }
# Analyze each match
match_stats = []
for match_id in knockout_matches["match_id"]:
    match_events = knockout_events[knockout_events["match_id"] == match_id]
    teams = match_events["team"].unique()
    match_info = knockout_matches[knockout_matches["match_id"] == match_id].iloc[0]
    for team in teams:
        stats = analyze_team_match(match_events, team)
        stats["match_id"] = match_id
        # Determine result
        if team == match_info["home_team"]:
            stats["goals_for"] = match_info["home_score"]
            stats["goals_against"] = match_info["away_score"]
        else:
            stats["goals_for"] = match_info["away_score"]
            stats["goals_against"] = match_info["home_score"]
        stats["result"] = "Win" if stats["goals_for"] > stats["goals_against"] else (
            "Loss" if stats["goals_for"] < stats["goals_against"] else "Draw")
        match_stats.append(stats)
stats_df = pd.DataFrame(match_stats)
# Compare winners vs losers
winners = stats_df[stats_df["result"] == "Win"]
losers = stats_df[stats_df["result"] == "Loss"]
print("\nWinner Statistics (Knockout Matches):")
print(f" Average xG: {winners['xG'].mean():.2f}")
print(f" Average Shots: {winners['shots'].mean():.1f}")
print(f" xG Conversion: {(winners['goals'] / winners['xG']).mean():.2f}")
print("\nLoser Statistics (Knockout Matches):")
print(f" Average xG: {losers['xG'].mean():.2f}")
print(f" Average Shots: {losers['shots'].mean():.1f}")
print(f" xG Conversion: {(losers['goals'] / losers['xG']).mean():.2f}")
# Knockout Match Analysis
library(tidyverse)
library(StatsBombR)
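# Load World Cup 2022 data (competition_id 43, season_id 106)
# FreeMatches() expects rows from FreeCompetitions(), and free_allevents()
# loads events for every match in the data frame
wc_comp <- FreeCompetitions() %>%
  filter(competition_id == 43, season_id == 106)
wc_matches <- FreeMatches(wc_comp)
wc_events <- free_allevents(MatchesDF = wc_matches)
# Identify knockout matches (Round of 16 onwards, excluding the third-place match)
knockout_matches <- wc_matches %>%
  filter(competition_stage.name %in%
           c("Round of 16", "Quarter-finals", "Semi-finals", "Final"))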
# Analysis: What wins knockout games?
knockout_analysis <- wc_events %>%
filter(match_id %in% knockout_matches$match_id) %>%
group_by(match_id, team.name) %>%
summarise(
# Shot metrics
shots = sum(type.name == "Shot"),
shots_on_target = sum(type.name == "Shot" &
shot.outcome.name %in% c("Goal", "Saved")),
xG = sum(shot.statsbomb_xg, na.rm = TRUE),
goals = sum(type.name == "Shot" & shot.outcome.name == "Goal"),
# Possession proxies
passes = sum(type.name == "Pass"),
pass_completion = mean(is.na(pass.outcome.name[type.name == "Pass"])),
# Set pieces
corners = sum(type.name == "Pass" & pass.type.name == "Corner"),
free_kicks = sum(type.name == "Shot" & shot.type.name == "Free Kick"),
# Defensive
pressures = sum(type.name == "Pressure"),
interceptions = sum(type.name == "Interception"),
.groups = "drop"
)
# Join with match outcomes
match_outcomes <- wc_matches %>%
select(match_id, home_team.home_team_name, away_team.away_team_name,
home_score, away_score)
# Determine winners
knockout_analysis <- knockout_analysis %>%
left_join(match_outcomes, by = "match_id") %>%
mutate(
is_home = team.name == home_team.home_team_name,
team_goals = ifelse(is_home, home_score, away_score),
opp_goals = ifelse(is_home, away_score, home_score),
result = case_when(
team_goals > opp_goals ~ "Win",
team_goals < opp_goals ~ "Loss",
TRUE ~ "Draw"
)
)
# What correlates with knockout success?
winner_stats <- knockout_analysis %>%
filter(result == "Win") %>%
summarise(
avg_xG = mean(xG),
avg_shots = mean(shots),
avg_pass_pct = mean(pass_completion) * 100,
xG_conversion = mean(goals / xG)
)
loser_stats <- knockout_analysis %>%
filter(result == "Loss") %>%
summarise(
avg_xG = mean(xG),
avg_shots = mean(shots),
avg_pass_pct = mean(pass_completion) * 100,
xG_conversion = mean(goals / xG)
)
cat("Winner vs Loser Statistics in Knockout Matches:\n")
cat("Winners - xG:", round(winner_stats$avg_xG, 2),
"Shots:", round(winner_stats$avg_shots, 1), "\n")
cat("Losers - xG:", round(loser_stats$avg_xG, 2),
"Shots:", round(loser_stats$avg_shots, 1), "\n")
Penalty Shootout Analysis
# Penalty Shootout Analytics
import pandas as pd
import numpy as np
# Historical penalty shootout data
penalty_data = pd.DataFrame({
    "tournament": ["World Cup"]*6 + ["Euro"]*2,
    "year": [2022]*6 + [2020]*2,
    "team": ["Argentina", "France", "Croatia", "Brazil", "Morocco", "Spain",
             "Italy", "England"],
    "opponent": ["France", "Argentina", "Brazil", "Croatia", "Spain", "Morocco",
                 "England", "Italy"],
    "round": ["Final", "Final", "QF", "QF", "R16", "R16", "Final", "Final"],
    "scored": [4, 2, 4, 2, 3, 0, 3, 2],
    # Kicks taken: France took 4 in the 2022 final, Morocco took 4 against Spain
    "total": [4, 4, 4, 4, 4, 3, 5, 5],
    "won": [True, False, True, False, True, False, True, False]
})
# Calculate metrics
penalty_data["conversion_rate"] = penalty_data["scored"] / penalty_data["total"]
# Analysis by outcome
winners = penalty_data[penalty_data["won"] == True]
losers = penalty_data[penalty_data["won"] == False]
print("Penalty Shootout Analysis:")
print(f"\nWinners avg conversion: {winners['conversion_rate'].mean():.1%}")
print(f"Losers avg conversion: {losers['conversion_rate'].mean():.1%}")
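# Shot order analysis (first team to shoot)
# Row order says nothing about kicking order, so name the first kickers explicitly
first_shooters = ["France", "Croatia", "Morocco", "Italy"]
penalty_data["shot_first"] = penalty_data["team"].isin(first_shooters)
first_shooters_won = penalty_data[penalty_data["shot_first"]]["won"].mean()
print(f"\nFirst shooter win rate: {first_shooters_won:.1%}")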
# Individual penalty analysis class
class PenaltyAnalyzer:
    """Analyze individual penalty patterns"""

    def __init__(self):
        # Historical placement data
        self.placement_zones = {
            "top_left": {"success": 0.91, "freq": 0.12},
            "top_right": {"success": 0.89, "freq": 0.11},
            "mid_left": {"success": 0.78, "freq": 0.22},
            "mid_right": {"success": 0.76, "freq": 0.23},
            "bottom_left": {"success": 0.72, "freq": 0.15},
            "bottom_right": {"success": 0.70, "freq": 0.14},
            "center": {"success": 0.65, "freq": 0.03}
        }

    def optimal_strategy(self):
        """Rank placement zones by success rate"""
        # Expected value = success_rate * 1 (if goal)
        expected_values = {
            zone: data["success"]
            for zone, data in self.placement_zones.items()
        }
        return sorted(expected_values.items(), key=lambda x: -x[1])

    def pressure_adjustment(self, shootout_position):
        """Adjust success rate based on shootout position"""
        # Penalty 1-3: normal, 4-5: high pressure
        base_conversion = 0.75
        if shootout_position <= 3:
            return base_conversion
        elif shootout_position == 4:
            return base_conversion * 0.92
        else:  # 5th penalty (decisive)
            return base_conversion * 0.85

# Analysis
analyzer = PenaltyAnalyzer()
print("\nOptimal Penalty Placement (by success rate):")
for zone, ev in analyzer.optimal_strategy()[:3]:
    print(f"  {zone}: {ev:.1%}")
print("\nPressure Effect on Conversion:")
for pos in range(1, 6):
    print(f"  Penalty {pos}: {analyzer.pressure_adjustment(pos):.1%}")
# Penalty Shootout Analytics
library(tidyverse)
# Historical penalty shootout data (example)
penalty_data <- tribble(
  ~tournament, ~year, ~team, ~opponent, ~round, ~scored, ~total, ~won,
  "World Cup", 2022, "Argentina", "France", "Final", 4, 4, TRUE,
  "World Cup", 2022, "France", "Argentina", "Final", 2, 4, FALSE,
  "World Cup", 2022, "Croatia", "Brazil", "QF", 4, 4, TRUE,
  "World Cup", 2022, "Brazil", "Croatia", "QF", 2, 4, FALSE,
  "World Cup", 2022, "Morocco", "Spain", "R16", 3, 4, TRUE,
  "World Cup", 2022, "Spain", "Morocco", "R16", 0, 3, FALSE,
  "Euro", 2020, "Italy", "England", "Final", 3, 5, TRUE,
  "Euro", 2020, "England", "Italy", "Final", 2, 5, FALSE
)
# Shootout success factors
shootout_analysis <- penalty_data %>%
mutate(
conversion_rate = scored / total,
shot_first = team %in% c("France", "Croatia", "Morocco", "Italy") # Teams that kicked first in these shootouts
) %>%
group_by(won) %>%
summarise(
avg_conversion = mean(conversion_rate),
shot_first_pct = mean(shot_first) * 100,
n = n()
)
print("Penalty Shootout Analysis:")
print(shootout_analysis)
# Individual penalty analysis (from StatsBomb events cleaned with allclean())
analyze_penalties <- function(events) {
  events %>%
    # period 5 is the shootout; drop this condition to include in-game penalties
    filter(shot.type.name == "Penalty", period == 5) %>%
    mutate(
      is_goal = shot.outcome.name == "Goal",
      shot_placement = shot.end_location.y # Goal spans y = 36 to 44 (centre at 40)
    ) %>%
    summarise(
      total = n(),
      scored = sum(is_goal),
      conversion = mean(is_goal) * 100,
      # Placement by thirds of the goal width, so the three zones do not overlap
      went_left = sum(shot_placement < 38.67, na.rm = TRUE),
      went_center = sum(shot_placement >= 38.67 & shot_placement <= 41.33, na.rm = TRUE),
      went_right = sum(shot_placement > 41.33, na.rm = TRUE)
    )
}
# Optimal penalty strategy
cat("\nKey Penalty Insights:\n")
cat("- First kicker advantage: some studies report teams shooting first win ~60% of shootouts; later analyses find a smaller or no effect\n")
cat("- Goalkeeper dive direction: Most keepers dive to their right\n")
cat("- Pressure effect: Later penalties (4th, 5th) have lower conversion\n")
cat("- Historic conversion: World Cup penalty conversion ~75%\n")
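One way to probe the first-kicker claim is to simulate shootouts directly. The sketch below assumes every kick is an independent trial with a fixed conversion rate per team; the rates used (0.75 and 0.70) are illustrative inputs, not measured values, and the function names are hypothetical.

```python
import random


def first_kicker_wins(p_first: float, p_second: float, rng: random.Random) -> bool:
    """Simulate one shootout; return True if the team kicking first wins.

    All five regulation kicks are simulated for both teams. Real shootouts
    stop early once the result is decided, but that does not change the winner.
    Rates must lie strictly between 0 and 1 or sudden death never ends.
    """
    goals_first = sum(rng.random() < p_first for _ in range(5))
    goals_second = sum(rng.random() < p_second for _ in range(5))
    # Sudden death: one kick each until the scores differ
    while goals_first == goals_second:
        goals_first += rng.random() < p_first
        goals_second += rng.random() < p_second
    return goals_first > goals_second


def first_kicker_win_rate(p_first: float, p_second: float,
                          n_sims: int = 20000, seed: int = 0) -> float:
    """Estimate how often the first-kicking team wins over n_sims shootouts."""
    rng = random.Random(seed)
    wins = sum(first_kicker_wins(p_first, p_second, rng) for _ in range(n_sims))
    return wins / n_sims


print(f"Equal conversion (75% vs 75%): {first_kicker_win_rate(0.75, 0.75):.1%}")
print(f"Second team under pressure (75% vs 70%): {first_kicker_win_rate(0.75, 0.70):.1%}")
```

With equal conversion rates the kicking order alone gives no edge in this model, so any first-kicker advantage has to enter through a lower conversion rate for the team kicking second.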
Group Stage Analysis
Group stage dynamics create unique strategic considerations. Teams must weigh risk management against goal-difference implications, and final-matchday incentives can invite manipulation, such as mutually convenient draws or managing the scoreline to influence bracket position.
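Before the final matchday, the remaining results can simply be enumerated rather than simulated. The sketch below classifies every win/draw/loss combination for one team using points only, flagging cases that would go to tie-breakers such as goal difference. The standings, team labels, and the `enumerate_scenarios` helper are hypothetical.

```python
from itertools import product


def enumerate_scenarios(points, remaining, team, slots=2):
    """Classify each combination of remaining results for `team`.

    Returns counts of scenarios where the team is certain to finish in the
    top `slots` on points, where it is level on points for a qualifying
    slot (decided by tie-breakers), and where it is eliminated.
    """
    counts = {"through": 0, "tiebreak": 0, "out": 0}
    for outcome in product(("home", "draw", "away"), repeat=len(remaining)):
        pts = dict(points)
        for (home, away), res in zip(remaining, outcome):
            if res == "home":
                pts[home] += 3
            elif res == "away":
                pts[away] += 3
            else:
                pts[home] += 1
                pts[away] += 1
        above = sum(1 for t, p in pts.items() if t != team and p > pts[team])
        level = sum(1 for t, p in pts.items() if t != team and p == pts[team])
        if above >= slots:
            counts["out"] += 1
        elif above + level < slots:
            counts["through"] += 1
        else:
            counts["tiebreak"] += 1
    return counts


# Hypothetical standings after two rounds, with the final fixtures still to play
points = {"A": 6, "B": 3, "C": 3, "D": 0}
remaining = [("A", "D"), ("B", "C")]
print(enumerate_scenarios(points, remaining, team="B"))
```

Weighting each combination by match-outcome probabilities would turn these counts into qualification probabilities, and the "tiebreak" bucket shows where goal difference starts to drive incentives.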
# Group Stage Analysis and Scenarios
import pandas as pd
import numpy as np
from itertools import combinations
from scipy.stats import poisson
class GroupStageSimulator:
    """Simulate group stage scenarios"""

    def __init__(self, teams, ratings):
        self.teams = teams
        self.ratings = ratings
        self.matches = list(combinations(teams, 2))

    def simulate_match(self, team_a, team_b):
        """Simulate single group stage match"""
        # Poisson goals based on each team's own rating (opponent strength is ignored)
        lambda_a = 1.3 * self.ratings[team_a] / 1500
        lambda_b = 1.3 * self.ratings[team_b] / 1500
        goals_a = poisson.rvs(lambda_a)
        goals_b = poisson.rvs(lambda_b)
        return goals_a, goals_b

    def calculate_standings(self, results):
        """Calculate group standings from results"""
        standings = {team: {"P": 0, "W": 0, "D": 0, "L": 0,
                            "GF": 0, "GA": 0, "GD": 0, "Pts": 0}
                     for team in self.teams}
        for (team_a, team_b), (goals_a, goals_b) in results.items():
            standings[team_a]["P"] += 1
            standings[team_b]["P"] += 1
            standings[team_a]["GF"] += goals_a
            standings[team_a]["GA"] += goals_b
            standings[team_b]["GF"] += goals_b
            standings[team_b]["GA"] += goals_a
            if goals_a > goals_b:
                standings[team_a]["W"] += 1
                standings[team_a]["Pts"] += 3
                standings[team_b]["L"] += 1
            elif goals_a < goals_b:
                standings[team_b]["W"] += 1
                standings[team_b]["Pts"] += 3
                standings[team_a]["L"] += 1
            else:
                standings[team_a]["D"] += 1
                standings[team_b]["D"] += 1
                standings[team_a]["Pts"] += 1
                standings[team_b]["Pts"] += 1
        for team in standings:
            standings[team]["GD"] = standings[team]["GF"] - standings[team]["GA"]
        # Sort by points, then GD, then GF
        sorted_teams = sorted(
            standings.keys(),
            key=lambda t: (standings[t]["Pts"], standings[t]["GD"], standings[t]["GF"]),
            reverse=True
        )
        return sorted_teams, standings

    def simulate_group(self, n_simulations=10000):
        """Run full group stage simulation"""
        qualifications = {team: 0 for team in self.teams}
        group_wins = {team: 0 for team in self.teams}
        for _ in range(n_simulations):
            results = {}
            for team_a, team_b in self.matches:
                results[(team_a, team_b)] = self.simulate_match(team_a, team_b)
            sorted_teams, _ = self.calculate_standings(results)
            # Top 2 qualify
            qualifications[sorted_teams[0]] += 1
            qualifications[sorted_teams[1]] += 1
            group_wins[sorted_teams[0]] += 1
        # Convert to percentages
        results_df = pd.DataFrame({
            "Team": self.teams,
            "Qualify %": [qualifications[t] / n_simulations * 100 for t in self.teams],
            "Win Group %": [group_wins[t] / n_simulations * 100 for t in self.teams]
        })
        return results_df.sort_values("Qualify %", ascending=False)
# Example simulation
teams = ["Spain", "Germany", "Japan", "Costa Rica"]
ratings = {"Spain": 1950, "Germany": 1850, "Japan": 1600, "Costa Rica": 1450}
simulator = GroupStageSimulator(teams, ratings)
results = simulator.simulate_group(n_simulations=10000)
print("Group Stage Simulation Results:")
print(results.to_string(index=False))
# Group Stage Analysis and Scenarios
library(tidyverse)
# Group stage standings calculator
calculate_standings <- function(results) {
results %>%
pivot_longer(c(team_a, team_b), names_to = "home_away", values_to = "team") %>%
mutate(
goals_for = ifelse(home_away == "team_a", score_a, score_b),
goals_against = ifelse(home_away == "team_a", score_b, score_a),
points = case_when(
goals_for > goals_against ~ 3,
goals_for == goals_against ~ 1,
TRUE ~ 0
)
) %>%
group_by(team) %>%
summarise(
played = n(),
won = sum(points == 3),
drawn = sum(points == 1),
lost = sum(points == 0),
gf = sum(goals_for),
ga = sum(goals_against),
gd = gf - ga,
points = sum(points),
.groups = "drop"
) %>%
arrange(desc(points), desc(gd), desc(gf))
}
# Monte Carlo group stage simulation
simulate_group <- function(teams, ratings, n_sims = 10000) {
# Generate all matches
matches <- combn(teams, 2, simplify = FALSE) %>%
map_dfr(~tibble(team_a = .x[1], team_b = .x[2]))
qualification_count <- setNames(rep(0, length(teams)), teams)
winning_group_count <- setNames(rep(0, length(teams)), teams)
for (sim in 1:n_sims) {
# Simulate all group matches
sim_results <- matches %>%
rowwise() %>%
mutate(
# Simple Poisson model
lambda_a = 1.3 * ratings[team_a] / 1500,
lambda_b = 1.3 * ratings[team_b] / 1500,
score_a = rpois(1, lambda_a),
score_b = rpois(1, lambda_b)
) %>%
ungroup() # Drop row-wise grouping before reshaping into standings
standings <- calculate_standings(sim_results)
# Top 2 qualify
qualifiers <- standings$team[1:2]
qualification_count[qualifiers] <- qualification_count[qualifiers] + 1
winning_group_count[standings$team[1]] <- winning_group_count[standings$team[1]] + 1
}
tibble(
team = teams,
qualify_pct = qualification_count / n_sims * 100,
win_group_pct = winning_group_count / n_sims * 100
) %>%
arrange(desc(qualify_pct))
}
# Example: Group simulation
teams <- c("Spain", "Germany", "Japan", "Costa Rica")
ratings <- c(Spain = 1950, Germany = 1850, Japan = 1600, "Costa Rica" = 1450)
group_sim <- simulate_group(teams, ratings)
print("Group Stage Simulation Results:")
print(group_sim)
Tactical Adaptation in Tournaments
International managers must quickly adapt tactics between matches with limited training time. Analytics can help identify opponent vulnerabilities and optimize game plans for specific matchups.
# Opponent Scouting Report Generator
import pandas as pd
import numpy as np
class ScoutingReport:
"""Generate tactical scouting reports for tournament opponents"""
def __init__(self, events_df, team_name):
self.team = team_name
self.team_events = events_df[events_df["team"] == team_name]
self.opp_events = events_df[events_df["team"] != team_name]
def analyze_attacking(self):
"""Analyze attacking patterns"""
events = self.team_events
# Extract locations
events = events.copy()
events["x"] = events["location"].apply(lambda x: x[0] if isinstance(x, list) else None)
events["y"] = events["location"].apply(lambda x: x[1] if isinstance(x, list) else None)
passes = events[events["type"] == "Pass"]
shots = events[events["type"] == "Shot"]
return {
"total_shots": len(shots),
"xG": shots["shot_statsbomb_xg"].sum(),
"shots_in_box": len(shots[shots["x"] > 102]),
"passes": len(passes),
"avg_pass_length": passes["pass_length"].mean() if "pass_length" in passes.columns else None
}
def analyze_defensive(self):
"""Analyze defensive patterns"""
events = self.team_events.copy()
events["x"] = events["location"].apply(lambda x: x[0] if isinstance(x, list) else None)
pressures = events[events["type"] == "Pressure"]
return {
"total_pressures": len(pressures),
"high_press": len(pressures[pressures["x"] > 80]) if len(pressures) > 0 else 0,
"mid_press": len(pressures[(pressures["x"] > 40) & (pressures["x"] <= 80)]) if len(pressures) > 0 else 0,
"press_pct_high": len(pressures[pressures["x"] > 80]) / max(1, len(pressures)) * 100
}
def identify_key_players(self, top_n=5):
"""Identify most influential players"""
events = self.team_events
player_stats = events.groupby("player").agg({
"id": "count",
"shot_statsbomb_xg": "sum"
}).reset_index()
player_stats.columns = ["player", "actions", "xG"]
player_stats["influence"] = player_stats["actions"] * 0.1 + player_stats["xG"] * 10
return player_stats.nlargest(top_n, "influence")
def generate_recommendations(self, attacking, defensive):
"""Generate tactical recommendations"""
recommendations = []
# High press counter
if defensive["press_pct_high"] > 40:
recommendations.append(
"High pressing team - play direct balls behind press"
)
# Overall attacking threat (total xG, not set pieces specifically)
if attacking["xG"] > 2.0:
recommendations.append(
"Strong attacking threat - prioritize defensive organization"
)
# Low block vulnerability
if defensive["high_press"] < 10:
recommendations.append(
"Deep defending team - patient build-up, create overloads"
)
return recommendations
def compile_report(self):
"""Compile full scouting report"""
attacking = self.analyze_attacking()
defensive = self.analyze_defensive()
key_players = self.identify_key_players()
recommendations = self.generate_recommendations(attacking, defensive)
return {
"team": self.team,
"attacking": attacking,
"defensive": defensive,
"key_players": key_players,
"recommendations": recommendations
}
# Example usage
# scout = ScoutingReport(match_events, "Argentina")
# report = scout.compile_report()
print("Scouting Report Generator initialized")
print("Methods: analyze_attacking(), analyze_defensive(), identify_key_players()")
# Opponent Scouting Report Generator
library(tidyverse)
library(StatsBombR)
# Generate tactical scouting report
generate_scout_report <- function(events, team_name) {
team_events <- events %>%
filter(team.name == team_name)
opp_events <- events %>%
filter(team.name != team_name)
# Attacking patterns
attacking <- team_events %>%
summarise(
# Build-up
buildup_passes = sum(type.name == "Pass" & location.x < 40),
direct_play = mean(pass.length[type.name == "Pass" & location.x < 60], na.rm = TRUE),
# Wing preference
left_attacks = sum(type.name == "Pass" & pass.end_location.y < 27 &
pass.end_location.x > 80, na.rm = TRUE),
right_attacks = sum(type.name == "Pass" & pass.end_location.y > 53 &
pass.end_location.x > 80, na.rm = TRUE),
wing_bias = (right_attacks - left_attacks) / (right_attacks + left_attacks + 1),
# Finishing
shots = sum(type.name == "Shot"),
xG = sum(shot.statsbomb_xg, na.rm = TRUE),
shots_in_box = sum(type.name == "Shot" & location.x > 102, na.rm = TRUE),
# Set pieces
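corners = sum(type.name == "Pass" & pass.type.name == "Corner", na.rm = TRUE),
# xG lives on shot rows, so match shots by play pattern rather than by pass type
corner_xG = sum(shot.statsbomb_xg[play_pattern.name == "From Corner"], na.rm = TRUE)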
)
# Defensive patterns
defensive <- team_events %>%
summarise(
# Press
high_press = sum(type.name == "Pressure" & location.x > 80),
mid_press = sum(type.name == "Pressure" & location.x > 40 & location.x <= 80),
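# pressure_success is assumed to be a precomputed logical column; it is not in the raw StatsBomb events
press_success = mean(pressure_success == TRUE, na.rm = TRUE),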
# Line height
def_line_avg = mean(location.x[type.name == "Ball Recovery" &
position.name %in% c("Center Back", "Left Back", "Right Back")],
na.rm = TRUE),
# Weaknesses
fouls_conceded = sum(type.name == "Foul Committed"),
cards = sum(foul_committed.card.name %in% c("Yellow Card", "Red Card"), na.rm = TRUE)
)
# Key players
key_players <- team_events %>%
group_by(player.name) %>%
summarise(
passes = sum(type.name == "Pass"),
pass_completion = mean(is.na(pass.outcome.name[type.name == "Pass"])),
shots = sum(type.name == "Shot"),
xG = sum(shot.statsbomb_xg, na.rm = TRUE),
influence = passes * 0.1 + shots * 2 + xG * 10,
.groups = "drop"
) %>%
arrange(desc(influence)) %>%
head(5)
# Compile report
list(
team = team_name,
attacking = attacking,
defensive = defensive,
key_players = key_players,
recommendations = generate_recommendations(attacking, defensive)
)
}
generate_recommendations <- function(attacking, defensive) {
recs <- c()
# Wing recommendations
if (attacking$wing_bias > 0.2) {
recs <- c(recs, "Opponent favors right side - overload left to exploit space")
} else if (attacking$wing_bias < -0.2) {
recs <- c(recs, "Opponent favors left side - overload right to exploit space")
}
# Press recommendations
if (defensive$high_press > 20) {
recs <- c(recs, "High pressing team - play long balls behind press")
}
# Set piece recommendations
if (attacking$corners > 10) {
recs <- c(recs, "Dangerous from corners - prioritize first contact")
}
return(recs)
}
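The R report computes a `wing_bias` figure that the Python `ScoutingReport` class does not. A minimal Python sketch of the same calculation follows. It assumes StatsBomb's 120 x 80 pitch coordinates and a `pass_end_location` column holding `[x, y]` lists, as statsbombpy returns. The function name `wing_bias` and the tiny demo frame are illustrative only, not part of the original report generator.

```python
import pandas as pd

def wing_bias(events: pd.DataFrame, team: str) -> float:
    """Return a value in (-1, 1); positive means more final-third passes end on the right."""
    passes = events[(events["team"] == team) & (events["type"] == "Pass")]
    # Keep only usable [x, y] end locations (skips None / NaN entries)
    locs = [
        loc for loc in passes["pass_end_location"]
        if isinstance(loc, (list, tuple)) and len(loc) >= 2
    ]
    # Same thresholds as the R code: x > 80 is the final third,
    # y < 27 is the left channel, y > 53 is the right channel
    left = sum(1 for loc in locs if loc[0] > 80 and loc[1] < 27)
    right = sum(1 for loc in locs if loc[0] > 80 and loc[1] > 53)
    # The +1 mirrors the R version and avoids division by zero
    return (right - left) / (right + left + 1)

# Invented demo rows, for illustration only
demo = pd.DataFrame({
    "team": ["A", "A", "A", "A", "B"],
    "type": ["Pass", "Pass", "Pass", "Shot", "Pass"],
    "pass_end_location": [[90, 60], [95, 70], [85, 10], None, [100, 5]],
})
bias = wing_bias(demo, "A")
print(round(bias, 2))
```

With only a handful of matches per team, a bias near the ±0.2 threshold used in the recommendations is weak evidence, so it is worth checking the raw left and right counts before acting on it.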
Historical Tournament Analysis
Analyzing historical tournament data provides insights into patterns of success, home advantage effects, and the evolution of international football over time.
# Historical World Cup Analysis
import pandas as pd
import matplotlib.pyplot as plt
# Historical World Cup data
world_cups = pd.DataFrame({
"year": [2022, 2018, 2014, 2010, 2006, 2002, 1998, 1994, 1990, 1986],
"host": ["Qatar", "Russia", "Brazil", "South Africa", "Germany",
"Korea/Japan", "France", "USA", "Italy", "Mexico"],
"winner": ["Argentina", "France", "Germany", "Spain", "Italy",
"Brazil", "France", "Brazil", "Germany", "Argentina"],
"runner_up": ["France", "Croatia", "Argentina", "Netherlands", "France",
"Germany", "Brazil", "Italy", "Argentina", "Germany"],
"goals": [172, 169, 171, 145, 147, 161, 171, 141, 115, 132],
"matches": [64, 64, 64, 64, 64, 64, 64, 52, 52, 52],
"teams": [32, 32, 32, 32, 32, 32, 32, 24, 24, 24]
})
# Calculate metrics
world_cups["goals_per_match"] = world_cups["goals"] / world_cups["matches"]
world_cups["host_won"] = world_cups["host"] == world_cups["winner"]
world_cups["host_finalist"] = (
(world_cups["host"] == world_cups["winner"]) |
(world_cups["host"] == world_cups["runner_up"])
)
# Goals per match trend
plt.figure(figsize=(10, 6))
plt.plot(world_cups["year"], world_cups["goals_per_match"],
marker="o", linewidth=2, color="#1B5E20")
plt.xlabel("Year")
plt.ylabel("Goals Per Match")
plt.title("World Cup Goals Per Match Over Time")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Host nation advantage
host_win_rate = world_cups["host_won"].mean() * 100
host_finalist_rate = world_cups["host_finalist"].mean() * 100
print(f"Host Nation Advantage:")
print(f" Win rate: {host_win_rate:.1f}%")
print(f" Finalist rate: {host_finalist_rate:.1f}%")
# Most successful nations
champions = world_cups["winner"].value_counts()
print(f"\nWorld Cup Winners (1986-2022):")
print(champions)
# Regional analysis
def get_region(country):
south_america = ["Brazil", "Argentina"]
europe = ["Germany", "France", "Italy", "Spain"]
if country in south_america:
return "South America"
elif country in europe:
return "Europe"
return "Other"
world_cups["winner_region"] = world_cups["winner"].apply(get_region)
regional = world_cups["winner_region"].value_counts()
print(f"\nRegional Dominance:")
print(regional)
# Trend analysis
recent = world_cups[world_cups["year"] >= 2010]
older = world_cups[world_cups["year"] < 2010]
print(f"\nGoals Per Match:")
print(f" 2010-2022: {recent['goals_per_match'].mean():.2f}")
print(f" 1986-2006: {older['goals_per_match'].mean():.2f}")
# Historical World Cup Analysis
library(tidyverse)
# Historical World Cup data
world_cups <- tribble(
~year, ~host, ~winner, ~runner_up, ~goals, ~matches, ~teams,
2022, "Qatar", "Argentina", "France", 172, 64, 32,
2018, "Russia", "France", "Croatia", 169, 64, 32,
2014, "Brazil", "Germany", "Argentina", 171, 64, 32,
2010, "South Africa", "Spain", "Netherlands", 145, 64, 32,
2006, "Germany", "Italy", "France", 147, 64, 32,
2002, "Korea/Japan", "Brazil", "Germany", 161, 64, 32,
1998, "France", "France", "Brazil", 171, 64, 32,
1994, "USA", "Brazil", "Italy", 141, 52, 24,
1990, "Italy", "Germany", "Argentina", 115, 52, 24,
1986, "Mexico", "Argentina", "Germany", 132, 52, 24
)
# Analysis
world_cups <- world_cups %>%
mutate(
goals_per_match = goals / matches,
host_won = host == winner | str_detect(winner, host),
host_finalist = host == winner | host == runner_up |
str_detect(winner, host) | str_detect(runner_up, host)
)
# Goals per match trend
goals_trend <- world_cups %>%
ggplot(aes(x = year, y = goals_per_match)) +
geom_line(color = "#1B5E20", linewidth = 1.5) +
geom_point(size = 3, color = "#1B5E20") +
labs(
title = "World Cup Goals Per Match Over Time",
x = "Year",
y = "Goals Per Match"
) +
theme_minimal()
# Host nation advantage
host_advantage <- world_cups %>%
summarise(
host_win_pct = mean(host_won) * 100,
host_final_pct = mean(host_finalist) * 100
)
cat("Host Nation Advantage:\n")
cat("Win rate:", host_advantage$host_win_pct, "%\n")
cat("Finalist rate:", host_advantage$host_final_pct, "%\n")
# Most successful nations
champions <- world_cups %>%
count(winner, sort = TRUE) %>%
rename(titles = n)
cat("\nWorld Cup Winners (1986-2022):\n")
print(champions)
# European vs South American dominance
regional_analysis <- world_cups %>%
mutate(
winner_region = case_when(
winner %in% c("Brazil", "Argentina") ~ "South America",
winner %in% c("Germany", "France", "Italy", "Spain") ~ "Europe",
TRUE ~ "Other"
)
) %>%
count(winner_region)
cat("\nRegional Dominance:\n")
print(regional_analysis)
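Ten tournaments is a very small sample, so the host win rate above deserves an uncertainty estimate. The sketch below uses a Wilson score interval, which is one reasonable choice for small-sample proportions and is not something the original analysis includes. The helper name `wilson_interval` is hypothetical.

```python
import math
from typing import Tuple

def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# One host champion (France 1998) in the ten tournaments tabulated above
low, high = wilson_interval(successes=1, n=10)
print(f"Host win rate: 10.0% (95% interval roughly {low:.1%} to {high:.1%})")
```

The interval is wide enough that this table alone cannot separate a strong host effect from none. Pooling more tournaments, or working with match-level results, would be needed to narrow it.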
Qualifying Campaign Analytics
World Cup and continental championship qualification campaigns present their own analytical challenges. Understanding qualifying dynamics helps predict which teams will reach major tournaments and how their qualifying performance translates to tournament success.
# Python: Qualifying campaign analysis
import pandas as pd
import numpy as np
from typing import Dict, List
class QualifyingAnalyzer:
"""Analyze World Cup/Euro qualifying campaigns."""
def __init__(self, qualifying_data: pd.DataFrame):
self.data = qualifying_data
def calculate_team_stats(self) -> pd.DataFrame:
"""Calculate comprehensive qualifying statistics."""
stats = self.data.groupby("team").agg({
"match_id": "count",
"result": [
lambda x: (x == "W").sum(),
lambda x: (x == "D").sum(),
lambda x: (x == "L").sum()
],
"goals_for": "sum",
"goals_against": "sum"
}).reset_index()
stats.columns = ["team", "matches", "wins", "draws", "losses",
"goals_for", "goals_against"]
# Calculate derived metrics
stats["points"] = stats["wins"] * 3 + stats["draws"]
stats["goal_difference"] = stats["goals_for"] - stats["goals_against"]
stats["goals_per_game"] = stats["goals_for"] / stats["matches"]
stats["conceded_per_game"] = stats["goals_against"] / stats["matches"]
stats["win_rate"] = stats["wins"] / stats["matches"]
stats["efficiency"] = stats["points"] / (stats["matches"] * 3)
# Performance tier
stats["tier"] = pd.cut(
stats["efficiency"],
bins=[0, 0.35, 0.50, 0.70, 0.85, 1.0],
labels=["Weak", "Struggling", "Competitive", "Strong", "Dominant"],
include_lowest=True # otherwise teams with zero points fall outside the first bin
)
return stats.sort_values("points", ascending=False)
def correlate_with_tournament(self, tournament_results: pd.DataFrame) -> Dict:
"""Analyze correlation between qualifying and tournament performance."""
qualifying_stats = self.calculate_team_stats()
combined = qualifying_stats.merge(
tournament_results[["team", "tourn_points", "stage"]],
on="team", how="inner"
)
# Correlations
correlations = {
"points_correlation": combined["points"].corr(combined["tourn_points"]),
"goals_correlation": combined["goals_for"].corr(combined["tourn_points"]),
"efficiency_correlation": combined["efficiency"].corr(combined["tourn_points"])
}
# Performance by qualifying tier
tier_analysis = combined.groupby("tier").agg({
"tourn_points": "mean",
"stage": lambda x: (x.isin(["QF", "SF", "Final"])).mean()
}).reset_index()
return {
"correlations": correlations,
"tier_analysis": tier_analysis
}
def identify_group_of_death(self, teams_by_group: pd.DataFrame,
ratings: pd.DataFrame) -> pd.DataFrame:
"""Identify the group of death."""
combined = teams_by_group.merge(ratings[["team", "elo"]], on="team")
group_stats = combined.groupby("group").agg({
"elo": ["mean", "min", "max", "std"]
}).reset_index()
group_stats.columns = ["group", "avg_elo", "min_elo", "max_elo", "std_elo"]
# Competitiveness: high average + low std = group of death
group_stats["competitiveness"] = 1 - (group_stats["std_elo"] / group_stats["avg_elo"])
group_stats["combined_score"] = group_stats["avg_elo"] * group_stats["competitiveness"]
group_stats["is_group_of_death"] = (
group_stats["combined_score"] > group_stats["combined_score"].quantile(0.75)
)
return group_stats.sort_values("combined_score", ascending=False)
class ConfederationComparison:
"""Compare qualifying formats across confederations."""
FORMATS = {
"UEFA": {
"groups": 10,
"teams_per_group": 5,
"auto_qualify": 10,
"playoff_spots": 3,
"total_slots": 13
},
"CONMEBOL": {
"format": "single_league",
"teams": 10,
"auto_qualify": 4,
"playoff_spots": 1,
"total_slots": 4.5
},
"CONCACAF": {
"format": "octagonal",
"teams": 8,
"auto_qualify": 3,
"playoff_spots": 1,
"total_slots": 3.5
},
"CAF": {
"groups": 10,
"playoff_round": True,
"auto_qualify": 0,
"playoff_spots": 5,
"total_slots": 5
},
"AFC": {
"rounds": 3,
"final_groups": 2,
"auto_qualify": 4,
"playoff_spots": 1,
"total_slots": 4.5
}
}
def compare_difficulty(self, confederation: str, team_count: int) -> Dict:
"""Calculate qualifying difficulty for a confederation."""
format_info = self.FORMATS.get(confederation, {})
# Probability of qualification (simplified)
if "auto_qualify" in format_info:
auto_prob = format_info["auto_qualify"] / team_count
playoff_prob = format_info.get("playoff_spots", 0) / team_count * 0.5
total_prob = auto_prob + playoff_prob
return {
"confederation": confederation,
"format": format_info,
"qualification_probability": total_prob,
"difficulty_rating": 1 - total_prob
}
return {"confederation": confederation, "format": format_info}
# Example usage
print("Qualifying Campaign Analyzer initialized")
comparator = ConfederationComparison()
for conf in ["UEFA", "CONMEBOL", "CONCACAF"]:
result = comparator.compare_difficulty(conf, 50 if conf == "UEFA" else 10)
print(f"{conf}: Qualification probability {result.get('qualification_probability', 0):.1%}")
# R: Qualifying campaign analysis
library(tidyverse)
# Analyze European World Cup qualifying
analyze_qualifying <- function(qualifying_data) {
qualifying_data %>%
group_by(team) %>%
summarise(
matches = n(),
wins = sum(result == "W"),
draws = sum(result == "D"),
losses = sum(result == "L"),
goals_for = sum(goals_for),
goals_against = sum(goals_against),
points = wins * 3 + draws,
# Advanced metrics
goals_per_game = goals_for / matches,
conceded_per_game = goals_against / matches,
win_rate = wins / matches,
# Home vs away split (filter before summing)
home_points = sum(((result == "W") * 3 + (result == "D") * 1) * (venue == "Home")),
away_points = sum(((result == "W") * 3 + (result == "D") * 1) * (venue == "Away")),
.groups = "drop"
) %>%
mutate(
# Points efficiency
max_possible = matches * 3,
efficiency = points / max_possible,
# Classify performance
performance_tier = case_when(
efficiency > 0.85 ~ "Dominant",
efficiency > 0.70 ~ "Strong",
efficiency > 0.50 ~ "Competitive",
efficiency > 0.35 ~ "Struggling",
TRUE ~ "Weak"
)
) %>%
arrange(desc(points))
}
# Does qualifying performance predict tournament success?
correlate_qualifying_tournament <- function(qualifying_results, tournament_results) {
combined <- qualifying_results %>%
select(team, qual_points = points, qual_goals = goals_for,
qual_conceded = goals_against) %>%
inner_join(
tournament_results %>%
select(team, tourn_points = points, tourn_stage = furthest_stage),
by = "team"
)
# Correlation analysis
correlations <- combined %>%
summarise(
points_cor = cor(qual_points, tourn_points, use = "complete.obs"),
goals_cor = cor(qual_goals, tourn_points, use = "complete.obs")
)
# Stage achievement by qualifying tier
stage_by_tier <- combined %>%
mutate(
qual_tier = ntile(qual_points, 4)
) %>%
group_by(qual_tier) %>%
summarise(
avg_tourn_points = mean(tourn_points),
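# Use explicit stage labels; >= on character values only compares alphabetically
reached_knockout = mean(tourn_stage %in% c("Round of 16", "Quarter-final", "Semi-final", "Final")),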
.groups = "drop"
)
list(
correlations = correlations,
stage_analysis = stage_by_tier
)
}
# Group of death detection
identify_group_difficulty <- function(teams_by_group, ratings) {
teams_by_group %>%
left_join(ratings, by = "team") %>%
group_by(group) %>%
summarise(
avg_rating = mean(elo_rating),
min_rating = min(elo_rating),
max_rating = max(elo_rating),
range = max_rating - min_rating,
# Competitiveness (smaller range = more competitive)
competitiveness = 1 - (range / max_rating),
.groups = "drop"
) %>%
mutate(
is_group_of_death = avg_rating > quantile(avg_rating, 0.75) &
competitiveness > 0.5
) %>%
arrange(desc(avg_rating))
}
# Playoff scenarios analysis
analyze_playoff_paths <- function(standings, format = "uefa") {
if (format == "uefa") {
# UEFA (2022 World Cup cycle): group winners qualify, runners-up enter the playoffs
# Assumes standings are already sorted by position within each group
standings %>%
group_by(group) %>%
mutate(
position = row_number(),
status = case_when(
position == 1 ~ "Qualified",
position == 2 ~ "Playoff",
TRUE ~ "Eliminated"
)
) %>%
ungroup()
} else if (format == "conmebol") {
# CONMEBOL (2022 World Cup cycle): single league, top 4 qualify, 5th enters the inter-confederation playoff
standings %>%
mutate(
status = case_when(
row_number() <= 4 ~ "Qualified",
row_number() == 5 ~ "Playoff",
TRUE ~ "Eliminated"
)
)
}
}
print("Qualifying campaign analyzer ready!")
Key Qualifying Insights
- Qualifying ≠ Tournament: Dominant qualifiers don't always succeed in tournaments
- Away form matters: Teams that struggle away in qualifying often struggle in neutral venues
- Goal difference: High-scoring qualifiers tend to be more attacking but not necessarily more successful
- Playoff experience: Teams that qualify through playoffs have shown resilience under pressure
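The away-form point relies on splitting qualifying results by venue, which the R function does and the Python `QualifyingAnalyzer` does not. A sketch of that split follows. It assumes the same one-row-per-team-per-match layout with `team`, `result` (W/D/L) and a `venue` column holding "Home" or "Away", as in the R code. The function name `venue_split` and the demo rows are made up for illustration.

```python
import pandas as pd

def venue_split(qualifying: pd.DataFrame) -> pd.DataFrame:
    """Points per game by team and venue, plus the home-minus-away gap."""
    points = qualifying["result"].map({"W": 3, "D": 1, "L": 0})
    table = (
        qualifying.assign(points=points)
        .groupby(["team", "venue"])["points"]
        .agg(matches="count", points="sum")
        .reset_index()
    )
    table["ppg"] = table["points"] / table["matches"]
    # One row per team, one column per venue
    wide = table.pivot(index="team", columns="venue", values="ppg")
    wide["home_minus_away"] = wide["Home"] - wide["Away"]
    return wide.reset_index()

# Invented results for a placeholder team
demo = pd.DataFrame({
    "team": ["X", "X", "X", "X"],
    "venue": ["Home", "Home", "Away", "Away"],
    "result": ["W", "D", "L", "W"],
})
split = venue_split(demo)
print(split)
```

A large positive `home_minus_away` flags a team whose qualifying record leans on home matches, which is the profile the away-form insight warns about.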
Continental Differences
Football varies significantly across confederations, with different playing styles, physical attributes, and tactical approaches. Understanding these differences is crucial for international tournament analytics.
# Python: Continental playing style analysis
import pandas as pd
import numpy as np
from typing import Dict, List
class ContinentalAnalyzer:
"""Analyze football differences across confederations."""
CONFEDERATIONS = {
"UEFA": ["Spain", "Germany", "France", "England", "Italy", "Netherlands",
"Portugal", "Belgium", "Croatia", "Denmark"],
"CONMEBOL": ["Brazil", "Argentina", "Uruguay", "Colombia", "Chile",
"Peru", "Ecuador", "Paraguay"],
"CONCACAF": ["Mexico", "USA", "Canada", "Costa Rica"],
"CAF": ["Morocco", "Senegal", "Nigeria", "Cameroon", "Ghana", "Egypt"],
"AFC": ["Japan", "South Korea", "Australia", "Iran", "Saudi Arabia"],
"OFC": ["New Zealand"]
}
def __init__(self, match_data: pd.DataFrame):
self.data = match_data
self.team_conf = self._build_team_confederation_map()
def _build_team_confederation_map(self) -> Dict[str, str]:
"""Create team to confederation mapping."""
mapping = {}
for conf, teams in self.CONFEDERATIONS.items():
for team in teams:
mapping[team] = conf
return mapping
def analyze_styles(self) -> pd.DataFrame:
"""Analyze playing styles by confederation."""
self.data["confederation"] = self.data["team"].map(self.team_conf)
conf_data = self.data[self.data["confederation"].notna()]
style_analysis = conf_data.groupby("confederation").agg({
"passes": "mean",
"pass_completion": "mean",
"xG": "mean",
"shots": "mean",
"pressures": "mean",
"possession": "mean"
}).reset_index()
# Classify styles
style_analysis["dominant_style"] = style_analysis.apply(
lambda x: self._classify_style(x), axis=1
)
return style_analysis
def _classify_style(self, row: pd.Series) -> str:
"""Classify playing style based on metrics."""
if row["possession"] > 55 and row["passes"] > 500:
return "Possession-based"
elif row["pressures"] > 150:
return "High-pressing"
elif row["xG"] > 1.5:
return "Attack-focused"
else:
return "Balanced"
def head_to_head_analysis(self) -> pd.DataFrame:
"""Analyze inter-confederation matchups."""
# Assuming data has home/away team info
h2h_results = []
# This would need actual match data with home/away structure
# Simplified example
return pd.DataFrame(h2h_results)
def tournament_success(self, tournament_data: pd.DataFrame) -> pd.DataFrame:
"""Analyze World Cup success by confederation."""
tournament_data["confederation"] = tournament_data["team"].map(self.team_conf)
# Named aggregation gives one numeric column per outcome instead of a dict per cell
success = tournament_data.groupby("confederation").agg(
teams=("team", "nunique"),
points=("points", "sum"),
group_exits=("stage", lambda x: (x == "Group").sum()),
knockout_exits=("stage", lambda x: x.isin(["R16", "QF"]).sum()),
semi_plus=("stage", lambda x: x.isin(["SF", "Final", "Winner"]).sum())
).reset_index()
return success
class TravelImpactAnalyzer:
"""Analyze impact of travel and time zones on performance."""
def __init__(self):
# Approximate time zones for major football nations
self.time_zones = {
# Europe (base: UTC+1)
"Spain": 1, "Germany": 1, "France": 1, "England": 0, "Italy": 1,
# South America (UTC-3 to -5)
"Brazil": -3, "Argentina": -3, "Uruguay": -3, "Colombia": -5,
# Asia (UTC+3 to +9)
"Japan": 9, "South Korea": 9, "Iran": 3.5, "Saudi Arabia": 3,
# Africa (UTC+0 to +3)
"Morocco": 1, "Senegal": 0, "Nigeria": 1,
# North America
"USA": -5, "Mexico": -6
}
def calculate_impact(self, team: str, host_tz: float) -> Dict:
"""Calculate travel impact for a team."""
team_tz = self.time_zones.get(team, 0)
tz_diff = abs(team_tz - host_tz)
impact_level = (
"Severe" if tz_diff > 6 else
"Moderate" if tz_diff > 3 else
"Minimal"
)
# Performance adjustment factor
adjustment = 1.0 - (tz_diff * 0.02) # 2% reduction per hour difference
return {
"team": team,
"tz_difference": tz_diff,
"impact_level": impact_level,
"performance_adjustment": max(0.85, adjustment)
}
def tournament_analysis(self, teams: List[str], host_tz: float) -> pd.DataFrame:
"""Analyze travel impact for all tournament teams."""
impacts = [self.calculate_impact(team, host_tz) for team in teams]
return pd.DataFrame(impacts).sort_values("tz_difference", ascending=False)
# Example usage
print("Continental and travel impact analyzers initialized")
# World Cup in Qatar (UTC+3)
travel_analyzer = TravelImpactAnalyzer()
wc_teams = ["Argentina", "France", "Brazil", "England", "Japan", "Morocco"]
impacts = travel_analyzer.tournament_analysis(wc_teams, host_tz=3)
print("\nTravel Impact for Qatar 2022:")
print(impacts.to_string(index=False))
# R: Continental playing style analysis
library(tidyverse)
# Analyze playing styles by confederation
analyze_continental_styles <- function(match_events, team_confederations) {
match_events %>%
left_join(team_confederations, by = "team") %>%
filter(!is.na(confederation)) %>%
group_by(confederation) %>%
summarise(
# Passing style
avg_passes = mean(passes_per_match),
pass_completion = mean(pass_completion),
long_ball_pct = mean(long_balls / passes_per_match) * 100,
# Attacking
avg_xG = mean(xG_per_match),
shots_per_match = mean(shots_per_match),
shot_accuracy = mean(shots_on_target / shots_per_match),
# Defending
avg_xGA = mean(xGA_per_match),
pressing_intensity = mean(pressures_per_match),
tackle_success = mean(tackle_win_pct),
# Physical
avg_distance = mean(total_distance),
sprint_distance = mean(sprint_distance),
.groups = "drop"
)
}
# Head-to-head analysis between confederations
analyze_h2h_confederations <- function(match_data, team_confederations) {
match_data %>%
left_join(team_confederations, by = c("home_team" = "team")) %>%
rename(home_conf = confederation) %>%
left_join(team_confederations, by = c("away_team" = "team")) %>%
rename(away_conf = confederation) %>%
filter(!is.na(home_conf), !is.na(away_conf)) %>%
filter(home_conf != away_conf) %>% # Inter-confederation matches only
mutate(
home_result = case_when(
home_score > away_score ~ "Win",
home_score < away_score ~ "Loss",
TRUE ~ "Draw"
)
) %>%
group_by(home_conf, away_conf) %>%
summarise(
matches = n(),
home_wins = sum(home_result == "Win"),
draws = sum(home_result == "Draw"),
away_wins = sum(home_result == "Loss"),
home_win_rate = home_wins / matches,
home_goals = mean(home_score),
away_goals = mean(away_score),
.groups = "drop"
)
}
# World Cup performance by confederation
wc_confederation_analysis <- function(tournament_results) {
tournament_results %>%
group_by(confederation) %>%
summarise(
teams = n_distinct(team),
total_matches = n(),
total_points = sum(points),
# Stage progression
group_exit = sum(stage == "Group"),
r16_exit = sum(stage == "Round of 16"),
qf_exit = sum(stage == "Quarter-final"),
sf_exit = sum(stage == "Semi-final"),
final_exit = sum(stage == "Final"),
# Success rates
knockout_rate = 1 - (group_exit / teams),
qf_rate = sum(stage %in% c("Quarter-final", "Semi-final", "Final")) / teams,
semi_rate = sum(stage %in% c("Semi-final", "Final")) / teams,
.groups = "drop"
) %>%
arrange(desc(semi_rate))
}
# Time zone and travel impact
analyze_travel_impact <- function(match_data, team_locations) {
match_data %>%
left_join(team_locations, by = "team") %>%
mutate(
# Time zone difference from tournament host
tz_difference = abs(team_timezone - host_timezone),
# Distance traveled
distance_km = calculate_distance(team_lat, team_lon,
host_lat, host_lon),
# Travel impact categories
travel_impact = case_when(
tz_difference > 6 ~ "Severe",
tz_difference > 3 ~ "Moderate",
distance_km > 5000 ~ "Long distance",
TRUE ~ "Minimal"
)
) %>%
group_by(travel_impact) %>%
summarise(
teams = n(),
avg_points = mean(points),
avg_goals = mean(goals),
knockout_rate = mean(reached_knockout),
.groups = "drop"
)
}
print("Continental analysis framework ready!")
| Confederation | Typical Style | WC Slots (2022) | Historical Success |
|---|---|---|---|
| UEFA | Technical, tactical variety | 13 | 12 World Cup wins |
| CONMEBOL | Technical, physical, flair | 4.5 | 10 World Cup wins |
| CAF | Athletic, improving tactically | 5 | 0 wins; best finish semi-final (Morocco, 2022) |
| AFC | Organized, disciplined | 4.5 | 0 wins; best finish semi-final (South Korea, 2002) |
| CONCACAF | Physical, direct | 3.5 | 0 wins; best finish semi-final (USA, 1930) |
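The Python `head_to_head_analysis` method earlier in this section returns an empty frame, while the R version does the actual grouping. Below is a minimal Python sketch of that R logic. It assumes match-level rows with `home_team`, `away_team`, `home_score` and `away_score` columns and a plain team-to-confederation dict. The function name `confederation_h2h`, the placeholder teams and the scores are all invented for illustration.

```python
import pandas as pd

def confederation_h2h(match_df: pd.DataFrame, team_conf: dict) -> pd.DataFrame:
    """Summarise results of matches between teams from different confederations."""
    df = match_df.assign(
        home_conf=match_df["home_team"].map(team_conf),
        away_conf=match_df["away_team"].map(team_conf),
    ).dropna(subset=["home_conf", "away_conf"])
    # Inter-confederation matches only
    df = df[df["home_conf"] != df["away_conf"]]
    df = df.assign(
        home_win=df["home_score"] > df["away_score"],
        draw=df["home_score"] == df["away_score"],
    )
    return (
        df.groupby(["home_conf", "away_conf"])
        .agg(
            matches=("home_win", "size"),
            home_wins=("home_win", "sum"),
            draws=("draw", "sum"),
        )
        .reset_index()
    )

# Placeholder teams and invented scores
team_conf = {"Team A": "UEFA", "Team B": "UEFA", "Team C": "CONMEBOL"}
demo = pd.DataFrame({
    "home_team": ["Team A", "Team B", "Team C", "Team A"],
    "away_team": ["Team C", "Team C", "Team A", "Team B"],
    "home_score": [2, 0, 1, 3],
    "away_score": [1, 0, 0, 0],
})
h2h = confederation_h2h(demo, team_conf)
print(h2h)
```

At a World Cup the "home" side is usually just the first-listed team at a neutral venue, so for tournament data it may make more sense to fold both directions of each pairing together before comparing confederations.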
Practice Exercises
Exercise 45.1: Build a Tournament Simulator
Create a complete World Cup simulation model that predicts group stage outcomes and knockout results. Include Elo ratings, Monte Carlo simulation, and probability outputs for each team advancing through rounds.
- Start with current FIFA rankings or Elo ratings
- Implement proper tiebreakers for group stage standings
- Account for the actual tournament bracket structure
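For the tiebreaker hint, one possible starting point is to sort on the first three World Cup group-stage criteria: points, then goal difference, then goals scored. The sketch below stops there. Head-to-head records, fair-play points and the drawing of lots are deliberately left out and would need to be added for a faithful simulator. `GroupRow` and `rank_group` are hypothetical names, and the numbers are invented.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GroupRow:
    team: str
    points: int
    goals_for: int
    goals_against: int

    @property
    def goal_difference(self) -> int:
        return self.goals_for - self.goals_against

def rank_group(rows: List[GroupRow]) -> List[GroupRow]:
    """Order by points, then goal difference, then goals scored (all descending)."""
    return sorted(
        rows,
        key=lambda r: (r.points, r.goal_difference, r.goals_for),
        reverse=True,
    )

# Invented numbers for four placeholder teams
table = [
    GroupRow("Team A", points=6, goals_for=4, goals_against=3),
    GroupRow("Team B", points=6, goals_for=5, goals_against=4),  # same GD as A, more goals
    GroupRow("Team C", points=4, goals_for=3, goals_against=3),
    GroupRow("Team D", points=1, goals_for=2, goals_against=4),
]
ranked = rank_group(table)
print([row.team for row in ranked])
```

Inside a Monte Carlo loop, the same function can be applied to each simulated group table before picking the top two qualifiers.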
Exercise 45.2: Knockout Match Win Probability
Analyze StatsBomb World Cup data to identify what match statistics best predict knockout match outcomes. Build a logistic regression model and evaluate its performance.
- Consider xG, possession, pressing metrics
- Account for extra time and penalties
- Test for overfitting with cross-validation
Exercise 45.3: Squad Optimization
Create an optimization model that selects a 26-player World Cup squad from a pool of 50 players. Balance quality, form, experience, and positional coverage constraints.
- Use linear programming or integer optimization
- Define minimum players per position
- Weight different attributes based on tournament needs
Exercise 45.4: Penalty Shootout Strategy
Using historical penalty data, develop an optimal penalty shootout strategy that considers shot placement, goalkeeper tendencies, and psychological factors.
- Analyze placement success rates by zone
- Consider the pressure effect of later penalties
- Model goalkeeper dive patterns
Summary
Key Takeaways
- Tournament Prediction: Effective models combine team ratings, match simulation, and bracket analysis while accounting for the inherent randomness of knockout football
- Squad Selection: Analytics can optimize player selection by balancing quality, form, experience, and positional coverage within roster constraints
- Knockout Dynamics: Single-elimination matches require different analytical approaches than league play, with emphasis on key moments and set pieces
- Tactical Adaptation: Quick scouting and tactical analysis are essential given the limited preparation time between international matches
- Historical Context: Understanding tournament history provides valuable baselines for expectations and identifies persistent patterns
International football analytics combines many of the techniques covered throughout this book while adding unique challenges around limited sample sizes, squad constraints, and the high-stakes nature of tournament play. Success requires balancing sophisticated modeling with practical understanding of how tournaments unfold.