Chapter 60

Capstone - Complete Analytics System


Introduction to International Football Analytics

International football presents unique analytical challenges. Unlike club football, where teams play together regularly, national teams convene infrequently and with limited preparation time. This creates distinct analytical considerations for squad selection, tactical planning, and tournament prediction.

The Unique Nature of Tournament Football

Major tournaments like the World Cup or European Championship feature compressed schedules, high-stakes single matches, and the convergence of players from different club systems. Analytics must adapt to these conditions while accounting for the limited sample sizes of international matches.
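
To make the sample-size point concrete, the sketch below compares the uncertainty around the same observed win rate over 10 matches and over 50. It uses the Wilson score interval, which is a choice made here for illustration and not something the chapter prescribes; the function name and the match counts are hypothetical.

```python
import math


def wilson_interval(wins, matches, z=1.96):
    """Approximate 95% Wilson score interval for a win rate."""
    if matches <= 0:
        raise ValueError("matches must be positive")
    p_hat = wins / matches
    denom = 1 + z ** 2 / matches
    centre = (p_hat + z ** 2 / (2 * matches)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / matches + z ** 2 / (4 * matches ** 2)
    )
    return centre - half_width, centre + half_width


# Same 60% win rate, two hypothetical sample sizes
for wins, matches in [(6, 10), (30, 50)]:
    low, high = wilson_interval(wins, matches)
    print(f"{wins}/{matches} wins: {low:.2f} to {high:.2f}")
```

The interval for the 10-match sample is much wider, which is why ratings built from a handful of internationals should be treated as rough estimates.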

Loading international football and World Cup data

# Loading International Football Data
import pandas as pd
from statsbombpy import sb

# Get available competitions from StatsBomb
competitions = sb.competitions()

# Filter for World Cup data
wc_competitions = competitions[
    competitions["competition_name"].str.contains("World Cup", case=False, na=False)
]

print("Available World Cup Data:")
print(wc_competitions[["competition_name", "season_name", "competition_id"]])

# Load World Cup 2022 matches
wc_matches = sb.matches(competition_id=43, season_id=106)

print(f"\nWorld Cup 2022: {len(wc_matches)} matches")
print(wc_matches[["match_date", "home_team", "away_team", "home_score", "away_score"]].head(10))

# Load event data for analysis
all_events = []
for match_id in wc_matches["match_id"][:10]:  # First 10 matches
    events = sb.events(match_id=match_id)
    events["match_id"] = match_id
    all_events.append(events)

wc_events = pd.concat(all_events, ignore_index=True)
print(f"\nTotal events loaded: {len(wc_events)}")

# Historical World Cup results (TidyTuesday worldcups.csv)
try:
    historical_url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-11-29/worldcups.csv"
    historical_wc = pd.read_csv(historical_url)
    print("\nHistorical World Cup Winners:")
    print(historical_wc[["year", "winner", "host"]].tail(10))
except Exception as exc:  # network or parsing failure
    print(f"Could not load historical data: {exc}")

# Loading International Football Data
library(tidyverse)
library(worldfootballR)

# Get World Cup 2022 match results
wc_2022_matches <- fb_match_results(
  country = "",
  gender = "M",
  season_end_year = 2022,
  tier = ""
) %>%
  filter(Competition_Name == "World Cup")

# Get historical World Cup data
historical_wc <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-11-29/worldcups.csv")

print("World Cup 2022 Matches:")
print(head(wc_2022_matches))

# StatsBomb free World Cup data
library(StatsBombR)

# Get World Cup competitions
comps <- FreeCompetitions()
wc_comps <- comps %>%
  filter(str_detect(competition_name, "World Cup"))

print("Available World Cup Data:")
print(wc_comps %>% select(competition_name, season_name))

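# Load 2022 World Cup matches (FreeMatches expects a competitions data frame)
wc_2022 <- wc_comps %>%
  filter(competition_id == 43, season_id == 106)
wc_matches <- FreeMatches(wc_2022)

# Get all events (get.matchFree loads one match; free_allevents loads them all)
wc_events <- free_allevents(MatchesDF = wc_matches)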

cat("Total World Cup 2022 events:", nrow(wc_events), "\n")

Tournament Prediction Models

Predicting tournament outcomes requires modeling not just team strength, but also bracket dynamics, match-by-match probabilities, and the inherent randomness of knockout football.

Key Components of Tournament Prediction

Team Ratings

Elo ratings, FIFA rankings, or custom strength models

Match Simulation

Poisson goals, Monte Carlo methods

Bracket Paths

Knockout draw implications and path difficulty
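
The three components can be tied together in a few lines. The following sketch turns two Elo ratings into an expected score, maps that to Poisson goal rates with the same heuristic the chapter's simulator uses (1.3 * (p + 0.2)), and then sums the Poisson probabilities to get exact win, draw, and loss probabilities without simulation. The function names and the truncation at 15 goals are assumptions made for this example.

```python
import math


def elo_expected(rating_a, rating_b):
    """Elo expected score for team A on neutral ground."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def poisson_pmf(k, lam):
    """Probability of exactly k goals under a Poisson rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(lam_a, lam_b, max_goals=15):
    """Win/draw/loss probabilities for independent Poisson scores."""
    win_a = draw = win_b = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, lam_a) * poisson_pmf(b, lam_b)
            if a > b:
                win_a += p
            elif a == b:
                draw += p
            else:
                win_b += p
    return win_a, draw, win_b


# Sample ratings taken from the chapter's example (Brazil vs Germany)
p_a = elo_expected(2166, 1980)
lam_a = 1.3 * (p_a + 0.2)       # heuristic goal rate for team A
lam_b = 1.3 * (1 - p_a + 0.2)   # heuristic goal rate for team B
win_a, draw, win_b = outcome_probabilities(lam_a, lam_b)
print(f"Elo expected score: {p_a:.3f}")
print(f"Win {win_a:.3f}  Draw {draw:.3f}  Loss {win_b:.3f}")
```

Computing the probabilities directly is a useful cross-check on a Monte Carlo run: the simulated match frequencies should converge to these values.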

Building a tournament prediction system with Elo ratings and Monte Carlo simulation

# Tournament Prediction Model
import numpy as np
import pandas as pd
from scipy.stats import poisson

class EloRatingSystem:
    """Elo rating system for international football"""

    def __init__(self, k=30, home_advantage=100, initial_rating=1500):
        self.k = k
        self.home_advantage = home_advantage
        self.initial_rating = initial_rating
        self.ratings = {}

    def get_rating(self, team):
        return self.ratings.get(team, self.initial_rating)

    def expected_score(self, rating_a, rating_b, neutral=True):
        """Calculate expected score for team A"""
        diff = rating_b - rating_a
        if not neutral:
            diff -= self.home_advantage
        return 1 / (1 + 10 ** (diff / 400))

    def update_ratings(self, home_team, away_team, home_score, away_score):
        """Update ratings after a match"""
        r_home = self.get_rating(home_team)
        r_away = self.get_rating(away_team)

        e_home = self.expected_score(r_home, r_away, neutral=False)
        e_away = 1 - e_home

        # Actual results
        if home_score > away_score:
            s_home, s_away = 1, 0
        elif home_score < away_score:
            s_home, s_away = 0, 1
        else:
            s_home, s_away = 0.5, 0.5

        # Update
        self.ratings[home_team] = r_home + self.k * (s_home - e_home)
        self.ratings[away_team] = r_away + self.k * (s_away - e_away)


class TournamentSimulator:
    """Monte Carlo tournament simulation"""

    def __init__(self, elo_system):
        self.elo = elo_system

    def simulate_match(self, team_a, team_b, neutral=True):
        """Simulate a single match"""
        rating_a = self.elo.get_rating(team_a)
        rating_b = self.elo.get_rating(team_b)

        prob_a = self.elo.expected_score(rating_a, rating_b, neutral)

        # Poisson goal model
        lambda_a = 1.3 * (prob_a + 0.2)
        lambda_b = 1.3 * (1 - prob_a + 0.2)

        goals_a = poisson.rvs(lambda_a)
        goals_b = poisson.rvs(lambda_b)

        # Handle draws in knockout (penalties)
        if goals_a == goals_b:
            winner = team_a if np.random.random() < prob_a else team_b
        else:
            winner = team_a if goals_a > goals_b else team_b

        return {"goals_a": goals_a, "goals_b": goals_b, "winner": winner}

    def simulate_tournament(self, teams, n_simulations=10000):
        """Simulate full tournament multiple times"""
        results = {team: {"wins": 0, "finals": 0, "semis": 0} for team in teams}

        for _ in range(n_simulations):
            remaining = list(teams)
            np.random.shuffle(remaining)

            while len(remaining) > 1:
                next_round = []
                for i in range(0, len(remaining), 2):
                    team_a = remaining[i]
                    team_b = remaining[i + 1]

                    # Track semifinals (only the round with four teams left,
                    # so finalists are not counted a second time)
                    if len(remaining) == 4:
                        results[team_a]["semis"] += 1
                        results[team_b]["semis"] += 1

                    # Track finals
                    if len(remaining) == 2:
                        results[team_a]["finals"] += 1
                        results[team_b]["finals"] += 1

                    match = self.simulate_match(team_a, team_b)
                    next_round.append(match["winner"])

                remaining = next_round

            results[remaining[0]]["wins"] += 1

        # Convert to DataFrame
        df = pd.DataFrame(results).T
        df["win_pct"] = df["wins"] / n_simulations * 100
        df["final_pct"] = df["finals"] / n_simulations * 100
        df["semi_pct"] = df["semis"] / n_simulations * 100

        return df.sort_values("win_pct", ascending=False)


# Usage example
elo = EloRatingSystem()

# Sample ratings (approximate 2022 World Cup)
teams = {
    "Brazil": 2166, "Argentina": 2143, "France": 2090,
    "England": 2040, "Spain": 2045, "Germany": 1980,
    "Netherlands": 2032, "Portugal": 2006
}

for team, rating in teams.items():
    elo.ratings[team] = rating

simulator = TournamentSimulator(elo)
results = simulator.simulate_tournament(list(teams.keys()), n_simulations=10000)
print("Tournament Win Probabilities:")
print(results[["win_pct", "final_pct", "semi_pct"]])

# Tournament Prediction Model
library(tidyverse)

# Elo Rating System for International Football
calculate_elo <- function(matches, k = 30, home_advantage = 100) {

  # Initialize ratings (1500 baseline)
  ratings <- list()

  for (i in 1:nrow(matches)) {
    home <- matches$home_team[i]
    away <- matches$away_team[i]

    # Get current ratings
    r_home <- ifelse(home %in% names(ratings), ratings[[home]], 1500)
    r_away <- ifelse(away %in% names(ratings), ratings[[away]], 1500)

    # Expected scores
    e_home <- 1 / (1 + 10^((r_away - r_home - home_advantage) / 400))
    e_away <- 1 - e_home

    # Actual scores (1 = win, 0.5 = draw, 0 = loss)
    if (matches$home_score[i] > matches$away_score[i]) {
      s_home <- 1; s_away <- 0
    } else if (matches$home_score[i] < matches$away_score[i]) {
      s_home <- 0; s_away <- 1
    } else {
      s_home <- 0.5; s_away <- 0.5
    }

    # Update ratings
    ratings[[home]] <- r_home + k * (s_home - e_home)
    ratings[[away]] <- r_away + k * (s_away - e_away)
  }

  return(ratings)
}

# Monte Carlo Tournament Simulation
simulate_match <- function(elo_a, elo_b, neutral = TRUE) {
  # Calculate win probability
  if (neutral) {
    prob_a <- 1 / (1 + 10^((elo_b - elo_a) / 400))
  } else {
    prob_a <- 1 / (1 + 10^((elo_b - elo_a - 100) / 400))
  }

  # Generate goals using Poisson
  lambda_a <- 1.3 * (prob_a + 0.2)
  lambda_b <- 1.3 * (1 - prob_a + 0.2)

  goals_a <- rpois(1, lambda_a)
  goals_b <- rpois(1, lambda_b)

  # Handle draws in knockout (penalties)
  if (goals_a == goals_b) {
    # Simplified: use probability for penalty winner
    if (runif(1) < prob_a) {
      return(c(goals_a, goals_b, "A"))
    } else {
      return(c(goals_a, goals_b, "B"))
    }
  }

  winner <- ifelse(goals_a > goals_b, "A", "B")
  return(c(goals_a, goals_b, winner))
}

# Full tournament simulation
simulate_tournament <- function(teams, elo_ratings, n_sims = 10000) {

  results <- tibble(
    team = teams,
    wins = 0,
    finals = 0,
    semis = 0
  )

  for (sim in 1:n_sims) {
    # Simulate bracket (simplified 8-team knockout)
    remaining <- teams
    round <- 1

    while (length(remaining) > 1) {
      winners <- c()
      for (i in seq(1, length(remaining), by = 2)) {
        team_a <- remaining[i]
        team_b <- remaining[i + 1]
        result <- simulate_match(elo_ratings[[team_a]], elo_ratings[[team_b]])
        winner <- ifelse(result[3] == "A", team_a, team_b)
        winners <- c(winners, winner)

        # Track progress
        if (length(remaining) == 2) {
          results$finals[results$team == team_a] <- results$finals[results$team == team_a] + 1
          results$finals[results$team == team_b] <- results$finals[results$team == team_b] + 1
        }
        if (length(remaining) == 4) {  # semifinal round only; avoids double-counting finalists
          results$semis[results$team == team_a] <- results$semis[results$team == team_a] + 1
          results$semis[results$team == team_b] <- results$semis[results$team == team_b] + 1
        }
      }
      remaining <- winners
      round <- round + 1
    }

    results$wins[results$team == remaining[1]] <- results$wins[results$team == remaining[1]] + 1
  }

  results %>%
    mutate(
      win_pct = wins / n_sims * 100,
      final_pct = finals / n_sims * 100,
      semi_pct = semis / n_sims * 100
    ) %>%
    arrange(desc(win_pct))
}

Squad Selection Analytics

International managers face unique challenges in squad selection. With limited roster spots (typically 23-26 players), managers must balance positional coverage, form versus experience, and the chemistry of players from different club systems.

Squad Selection Challenges
  • Limited preparation time between call-up and matches
  • Players arriving in varying fitness states from club seasons
  • Need for positional flexibility with small rosters
  • Balancing current form against tournament experience
  • Managing workload for players with heavy club schedules
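
The positional-flexibility point in the list above can be checked mechanically. This minimal sketch counts how many squad members can play each required position and flags positions with thin cover. The squad, the position lists, and the two-player threshold are all hypothetical values chosen for illustration.

```python
from collections import Counter


def position_coverage(squad):
    """Count how many players can play each position."""
    coverage = Counter()
    for positions in squad.values():
        coverage.update(set(positions))
    return coverage


def thin_positions(squad, required, min_cover=2):
    """Return required positions covered by fewer than min_cover players."""
    coverage = position_coverage(squad)
    return sorted(pos for pos in required if coverage[pos] < min_cover)


# Hypothetical mini-squad: player -> positions he can play
squad = {
    "CB1": ["CB"],
    "CB2": ["CB", "RB"],
    "LB1": ["LB"],
    "RB1": ["RB"],
    "CM1": ["CM", "DM", "AM"],
}
required = ["CB", "LB", "RB", "DM"]

print("Coverage:", dict(position_coverage(squad)))
print("Thin cover:", thin_positions(squad, required))
```

A check like this complements a score-based selection: a squad can maximize total quality and still leave one position with a single specialist.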

Optimizing squad selection for international tournaments

# Squad Selection Optimization
import pandas as pd
import numpy as np

def create_player_pool():
    """Create sample player pool"""
    data = {
        "player": ["GK1", "GK2", "GK3", "CB1", "CB2", "CB3", "CB4",
                   "LB1", "LB2", "RB1", "RB2", "CM1", "CM2", "CM3",
                   "DM1", "DM2", "AM1", "AM2", "LW1", "LW2", "RW1",
                   "RW2", "ST1", "ST2", "ST3"],
        "position": ["GK", "GK", "GK", "CB", "CB", "CB", "CB",
                     "LB", "LB", "RB", "RB", "CM", "CM", "CM",
                     "DM", "DM", "AM", "AM", "LW", "LW", "RW",
                     "RW", "ST", "ST", "ST"],
        "quality": [90, 82, 78, 88, 86, 84, 80, 85, 78, 84, 79,
                    92, 88, 85, 86, 82, 90, 84, 88, 82, 87, 80,
                    91, 86, 82],
        "form": [85, 88, 90, 85, 90, 82, 88, 87, 92, 86, 85,
                 88, 90, 85, 84, 88, 92, 86, 90, 85, 88, 92,
                 87, 92, 80],
        "experience": [95, 60, 30, 90, 75, 80, 40, 85, 30, 80, 45,
                       95, 70, 80, 85, 50, 75, 60, 70, 40, 75, 30,
                       90, 55, 70],
        "versatility": [1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1,
                        3, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1,
                        2, 1, 1],
        "age": [32, 26, 23, 30, 28, 29, 24, 29, 22, 28, 24,
                31, 27, 29, 30, 25, 26, 27, 27, 23, 28, 22,
                29, 25, 30]
    }
    return pd.DataFrame(data)


def calculate_player_score(df, w_quality=0.4, w_form=0.3,
                           w_experience=0.2, w_versatility=0.1):
    """Calculate overall player score"""
    df = df.copy()
    df["overall_score"] = (
        df["quality"] * w_quality +
        df["form"] * w_form +
        df["experience"] * w_experience +
        df["versatility"] * 10 * w_versatility
    )
    return df


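def optimize_squad_simple(players, squad_size=23):
    """Simple greedy approach with position constraints"""
    players = players.copy()

    # Minimum players per position (18 in total). The sample pool has only
    # 25 players, so a 23-man squad leaves a real choice: the remaining
    # squad_size - 18 spots go to the best players still available.
    requirements = {
        "GK": 3,
        "CB": 3,
        "LB": 2,
        "RB": 2,
        "DM": 1,
        "CM": 2,
        "AM": 1,
        "LW": 1,
        "RW": 1,
        "ST": 2
    }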

    selected = []

    # First pass: fulfill minimum requirements
    for pos, count in requirements.items():
        pos_players = players[players["position"] == pos].nlargest(count, "overall_score")
        selected.extend(pos_players.index.tolist())

    # Remove selected from pool
    remaining = players.drop(selected)

    # Fill remaining spots with best available
    spots_left = squad_size - len(selected)
    if spots_left > 0:
        best_remaining = remaining.nlargest(spots_left, "overall_score")
        selected.extend(best_remaining.index.tolist())

    return players.loc[selected].sort_values(["position", "overall_score"],
                                              ascending=[True, False])


# Run optimization
pool = create_player_pool()
pool = calculate_player_score(pool)
squad = optimize_squad_simple(pool)

print(f"Optimized {len(squad)}-Player Squad:")
print(squad[["player", "position", "quality", "form", "experience", "overall_score"]])

# Squad composition analysis
print("\nSquad Composition:")
print(squad.groupby("position").size())
print(f"\nAverage Age: {squad['age'].mean():.1f}")
print(f"Average Quality: {squad['quality'].mean():.1f}")

# Squad Selection Optimization
library(tidyverse)
library(lpSolve)

# Create player pool with attributes
create_player_pool <- function() {
  tribble(
    ~player, ~position, ~quality, ~form, ~experience, ~versatility, ~age,
    # Goalkeepers
    "GK1", "GK", 90, 85, 95, 1, 32,
    "GK2", "GK", 82, 88, 60, 1, 26,
    "GK3", "GK", 78, 90, 30, 1, 23,

    # Defenders
    "CB1", "CB", 88, 85, 90, 2, 30,
    "CB2", "CB", 86, 90, 75, 1, 28,
    "CB3", "CB", 84, 82, 80, 2, 29,
    "CB4", "CB", 80, 88, 40, 1, 24,
    "LB1", "LB", 85, 87, 85, 2, 29,
    "LB2", "LB", 78, 92, 30, 1, 22,
    "RB1", "RB", 84, 86, 80, 2, 28,
    "RB2", "RB", 79, 85, 45, 1, 24,

    # Midfielders
    "CM1", "CM", 92, 88, 95, 3, 31,
    "CM2", "CM", 88, 90, 70, 2, 27,
    "CM3", "CM", 85, 85, 80, 2, 29,
    "DM1", "DM", 86, 84, 85, 2, 30,
    "DM2", "DM", 82, 88, 50, 1, 25,
    "AM1", "AM", 90, 92, 75, 2, 26,
    "AM2", "AM", 84, 86, 60, 2, 27,

    # Wingers
    "LW1", "LW", 88, 90, 70, 2, 27,
    "LW2", "LW", 82, 85, 40, 1, 23,
    "RW1", "RW", 87, 88, 75, 2, 28,
    "RW2", "RW", 80, 92, 30, 1, 22,

    # Forwards
    "ST1", "ST", 91, 87, 90, 2, 29,
    "ST2", "ST", 86, 92, 55, 1, 25,
    "ST3", "ST", 82, 80, 70, 1, 30
  )
}

# Calculate overall player score
calculate_player_score <- function(players,
                                    w_quality = 0.4,
                                    w_form = 0.3,
                                    w_experience = 0.2,
                                    w_versatility = 0.1) {
  players %>%
    mutate(
      overall_score = quality * w_quality +
                      form * w_form +
                      experience * w_experience +
                      versatility * 10 * w_versatility
    )
}

# Optimize squad selection using linear programming
# Defaults are sized for the 25-player sample pool, which has only 3 strikers
optimize_squad <- function(players, squad_size = 23,
                           min_gk = 3, min_def = 8, min_mid = 8, min_fwd = 3) {

  n <- nrow(players)

  # Objective: maximize total score
  obj <- players$overall_score

  # Constraints matrix
  constraints <- matrix(0, nrow = 5, ncol = n)

  # Squad size constraint
  constraints[1, ] <- 1

  # Position constraints
  constraints[2, players$position == "GK"] <- 1
  constraints[3, players$position %in% c("CB", "LB", "RB")] <- 1
  constraints[4, players$position %in% c("CM", "DM", "AM", "LW", "RW")] <- 1
  constraints[5, players$position == "ST"] <- 1

  directions <- c("==", ">=", ">=", ">=", ">=")
  rhs <- c(squad_size, min_gk, min_def, min_mid, min_fwd)

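  # Solve
  solution <- lp("max", obj, constraints, directions, rhs, all.bin = TRUE)

  # lpSolve reports status 0 on success; anything else means no valid squad
  if (solution$status != 0) {
    stop("No feasible squad for the given size and position minimums")
  }

  selected <- players[solution$solution > 0.5, ]
  return(selected)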
}

# Run optimization
pool <- create_player_pool()
pool <- calculate_player_score(pool)
squad <- optimize_squad(pool)

print(paste0("Optimized ", nrow(squad), "-Player Squad:"))
print(squad %>%
        arrange(position, desc(overall_score)) %>%
        select(player, position, quality, form, experience, overall_score))

Knockout Stage Analytics

Knockout football is fundamentally different from league play. Single-match elimination creates extreme variance, and analytics must account for the heightened importance of individual moments, set pieces, and penalty shootouts.
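
A short calculation shows how quickly that variance compounds. The sketch below assumes a team has the same probability of advancing in every round and that rounds are independent, which is a deliberate simplification; the function name and the sample probabilities are hypothetical.

```python
def title_probability(p_advance, rounds=4):
    """Chance of winning every round from the Round of 16 to the Final."""
    if not 0 <= p_advance <= 1:
        raise ValueError("p_advance must be between 0 and 1")
    return p_advance ** rounds


# Even a strong per-round favourite is unlikely to win four ties in a row
for p in (0.55, 0.65, 0.75):
    print(f"Advance {p:.0%} per round -> title {title_probability(p):.1%}")
```

Under these assumptions a team that advances 65% of the time in each round wins the tournament less than one time in five, so single-tournament results say little about underlying strength.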

Analyzing what determines success in knockout matches

# Knockout Match Analysis
from statsbombpy import sb
import pandas as pd
import numpy as np

# Load World Cup data
wc_matches = sb.matches(competition_id=43, season_id=106)

# Identify knockout matches
knockout_rounds = ["Round of 16", "Quarter-finals", "Semi-finals", "Final"]
knockout_matches = wc_matches[wc_matches["competition_stage"].isin(knockout_rounds)]

print(f"Analyzing {len(knockout_matches)} knockout matches")

# Load events for knockout matches
all_events = []
for match_id in knockout_matches["match_id"]:
    events = sb.events(match_id=match_id)
    events["match_id"] = match_id
    all_events.append(events)

knockout_events = pd.concat(all_events, ignore_index=True)

# Team-level analysis for each match
def analyze_team_match(events, team):
    """Analyze team performance in a match (shootout kicks excluded)"""
    # Period 5 is the penalty shootout; keep regular and extra time only
    team_events = events[(events["team"] == team) & (events["period"] <= 4)]
    shots = team_events[team_events["type"] == "Shot"]
    passes = team_events[team_events["type"] == "Pass"]

    return {
        "team": team,
        "shots": len(shots),
        "xG": shots["shot_statsbomb_xg"].sum(),
        "goals": (shots["shot_outcome"] == "Goal").sum(),
        "passes": len(passes),
        # StatsBomb leaves pass_outcome empty for completed passes
        "pass_completion": passes["pass_outcome"].isna().mean() if len(passes) else 0.0,
        "pressures": (team_events["type"] == "Pressure").sum()
    }

# Analyze each match
match_stats = []
for match_id in knockout_matches["match_id"]:
    match_events = knockout_events[knockout_events["match_id"] == match_id]
    teams = match_events["team"].unique()

    match_info = knockout_matches[knockout_matches["match_id"] == match_id].iloc[0]

    for team in teams:
        stats = analyze_team_match(match_events, team)
        stats["match_id"] = match_id

        # Determine result
        if team == match_info["home_team"]:
            stats["goals_for"] = match_info["home_score"]
            stats["goals_against"] = match_info["away_score"]
        else:
            stats["goals_for"] = match_info["away_score"]
            stats["goals_against"] = match_info["home_score"]

        stats["result"] = "Win" if stats["goals_for"] > stats["goals_against"] else (
            "Loss" if stats["goals_for"] < stats["goals_against"] else "Draw")

        match_stats.append(stats)

stats_df = pd.DataFrame(match_stats)

# Compare winners vs losers
winners = stats_df[stats_df["result"] == "Win"]
losers = stats_df[stats_df["result"] == "Loss"]

print("\nWinner Statistics (Knockout Matches):")
print(f"  Average xG: {winners['xG'].mean():.2f}")
print(f"  Average Shots: {winners['shots'].mean():.1f}")
print(f"  xG Conversion: {(winners['goals'] / winners['xG']).mean():.2f}")

print("\nLoser Statistics (Knockout Matches):")
print(f"  Average xG: {losers['xG'].mean():.2f}")
print(f"  Average Shots: {losers['shots'].mean():.1f}")
print(f"  xG Conversion: {(losers['goals'] / losers['xG']).mean():.2f}")

# Knockout Match Analysis
library(tidyverse)
library(StatsBombR)

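# Load World Cup 2022 data (competition_id 43, season_id 106)
wc_comp <- FreeCompetitions() %>%
  filter(competition_id == 43, season_id == 106)
wc_matches <- FreeMatches(wc_comp)
wc_events <- free_allevents(MatchesDF = wc_matches)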

# Identify knockout matches (Round of 16 onwards)
knockout_matches <- wc_matches %>%
  filter(match_week >= 4)  # Knockout rounds

# Analysis: What wins knockout games?
knockout_analysis <- wc_events %>%
  filter(match_id %in% knockout_matches$match_id) %>%
  group_by(match_id, team.name) %>%
  summarise(
    # Shot metrics
    shots = sum(type.name == "Shot"),
    shots_on_target = sum(type.name == "Shot" &
                          shot.outcome.name %in% c("Goal", "Saved")),
    xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    goals = sum(type.name == "Shot" & shot.outcome.name == "Goal"),

    # Possession proxies
    passes = sum(type.name == "Pass"),
    pass_completion = mean(is.na(pass.outcome.name[type.name == "Pass"])),

    # Set pieces
    corners = sum(type.name == "Pass" & pass.type.name == "Corner"),
    free_kicks = sum(type.name == "Shot" & shot.type.name == "Free Kick"),

    # Defensive
    pressures = sum(type.name == "Pressure"),
    interceptions = sum(type.name == "Interception"),

    .groups = "drop"
  )

# Join with match outcomes
match_outcomes <- wc_matches %>%
  select(match_id, home_team.home_team_name, away_team.away_team_name,
         home_score, away_score)

# Determine winners
knockout_analysis <- knockout_analysis %>%
  left_join(match_outcomes, by = "match_id") %>%
  mutate(
    is_home = team.name == home_team.home_team_name,
    team_goals = ifelse(is_home, home_score, away_score),
    opp_goals = ifelse(is_home, away_score, home_score),
    result = case_when(
      team_goals > opp_goals ~ "Win",
      team_goals < opp_goals ~ "Loss",
      TRUE ~ "Draw"
    )
  )

# What correlates with knockout success?
winner_stats <- knockout_analysis %>%
  filter(result == "Win") %>%
  summarise(
    avg_xG = mean(xG),
    avg_shots = mean(shots),
    avg_pass_pct = mean(pass_completion) * 100,
    xG_conversion = mean(goals / xG)
  )

loser_stats <- knockout_analysis %>%
  filter(result == "Loss") %>%
  summarise(
    avg_xG = mean(xG),
    avg_shots = mean(shots),
    avg_pass_pct = mean(pass_completion) * 100,
    xG_conversion = mean(goals / xG)
  )

cat("Winner vs Loser Statistics in Knockout Matches:\n")
cat("Winners - xG:", round(winner_stats$avg_xG, 2),
    "Shots:", round(winner_stats$avg_shots, 1), "\n")
cat("Losers  - xG:", round(loser_stats$avg_xG, 2),
    "Shots:", round(loser_stats$avg_shots, 1), "\n")

Penalty Shootout Analysis
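
Before looking at historical shootouts, it helps to know what conversion rates alone imply. The sketch below computes the exact probability that a team wins a shootout, assuming every kick is independent and each team has a fixed conversion rate. It ignores kicking order and pressure effects, and the function names are hypothetical. Early termination is not modelled because it never changes the winner.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k successes in n independent kicks."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def shootout_win_probability(p_a, p_b, kicks=5):
    """Probability that team A wins: best of `kicks`, then sudden death."""
    win = tie = 0.0
    for a in range(kicks + 1):
        for b in range(kicks + 1):
            prob = binom_pmf(a, kicks, p_a) * binom_pmf(b, kicks, p_b)
            if a > b:
                win += prob
            elif a == b:
                tie += prob

    # Sudden death ends on the first pair where exactly one team scores
    a_only = p_a * (1 - p_b)
    b_only = p_b * (1 - p_a)
    if a_only + b_only == 0:
        sudden = 0.5  # degenerate case (both always score or always miss)
    else:
        sudden = a_only / (a_only + b_only)
    return win + tie * sudden


print(f"75% vs 75%: {shootout_win_probability(0.75, 0.75):.3f}")
print(f"80% vs 70%: {shootout_win_probability(0.80, 0.70):.3f}")
```

With equal conversion rates the result is exactly a coin flip, so any first-kicker or pressure effect found in data has to come from factors outside this model.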

Analyzing penalty shootouts and optimizing strategy

# Penalty Shootout Analytics
import pandas as pd
import numpy as np

# Historical penalty shootout data (sample of recent finals and knockout ties)
# "total" is the number of kicks each team actually took
penalty_data = pd.DataFrame({
    "tournament": ["World Cup"]*6 + ["Euro"]*2,
    "year": [2022]*6 + [2020]*2,
    "team": ["Argentina", "France", "Croatia", "Brazil", "Morocco", "Spain",
             "Italy", "England"],
    "opponent": ["France", "Argentina", "Brazil", "Croatia", "Spain", "Morocco",
                 "England", "Italy"],
    "round": ["Final", "Final", "QF", "QF", "R16", "R16", "Final", "Final"],
    "scored": [4, 2, 4, 2, 3, 0, 3, 2],
    "total": [4, 4, 4, 4, 4, 3, 5, 5],
    "won": [True, False, True, False, True, False, True, False]
})

# Calculate metrics
penalty_data["conversion_rate"] = penalty_data["scored"] / penalty_data["total"]

# Analysis by outcome
winners = penalty_data[penalty_data["won"]]
losers = penalty_data[~penalty_data["won"]]

print("Penalty Shootout Analysis:")
print(f"\nWinners avg conversion: {winners['conversion_rate'].mean():.1%}")
print(f"Losers avg conversion: {losers['conversion_rate'].mean():.1%}")

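# Shot order analysis (first team to shoot)
# Row position does not encode who kicked first, so record it explicitly
first_shooters = ["France", "Croatia", "Morocco", "Italy"]
penalty_data["shot_first"] = penalty_data["team"].isin(first_shooters)
first_shooters_won = penalty_data[penalty_data["shot_first"]]["won"].mean()
print(f"\nFirst shooter win rate: {first_shooters_won:.1%}")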

# Individual penalty analysis class
class PenaltyAnalyzer:
    """Analyze individual penalty patterns"""

    def __init__(self):
        # Historical placement data
        self.placement_zones = {
            "top_left": {"success": 0.91, "freq": 0.12},
            "top_right": {"success": 0.89, "freq": 0.11},
            "mid_left": {"success": 0.78, "freq": 0.22},
            "mid_right": {"success": 0.76, "freq": 0.23},
            "bottom_left": {"success": 0.72, "freq": 0.15},
            "bottom_right": {"success": 0.70, "freq": 0.14},
            "center": {"success": 0.65, "freq": 0.03}
        }

    def optimal_strategy(self):
        """Calculate optimal placement strategy"""
        # Expected value = success_rate * 1 (if goal)
        expected_values = {
            zone: data["success"]
            for zone, data in self.placement_zones.items()
        }
        return sorted(expected_values.items(), key=lambda x: -x[1])

    def pressure_adjustment(self, shootout_position):
        """Adjust success rate based on shootout position"""
        # Penalty 1-3: normal, 4-5: high pressure
        base_conversion = 0.75
        if shootout_position <= 3:
            return base_conversion
        elif shootout_position == 4:
            return base_conversion * 0.92
        else:  # 5th penalty (decisive)
            return base_conversion * 0.85

# Analysis
analyzer = PenaltyAnalyzer()

print("\nOptimal Penalty Placement (by success rate):")
for zone, ev in analyzer.optimal_strategy()[:3]:
    print(f"  {zone}: {ev:.1%}")

print("\nPressure Effect on Conversion:")
for pos in range(1, 6):
    print(f"  Penalty {pos}: {analyzer.pressure_adjustment(pos):.1%}")

# Penalty Shootout Analytics
library(tidyverse)

# Historical penalty shootout data (example)
# total = number of kicks each team actually took
penalty_data <- tribble(
  ~tournament, ~year, ~team, ~opponent, ~round, ~scored, ~total, ~won,
  "World Cup", 2022, "Argentina", "France", "Final", 4, 4, TRUE,
  "World Cup", 2022, "France", "Argentina", "Final", 2, 4, FALSE,
  "World Cup", 2022, "Croatia", "Brazil", "QF", 4, 4, TRUE,
  "World Cup", 2022, "Brazil", "Croatia", "QF", 2, 4, FALSE,
  "World Cup", 2022, "Morocco", "Spain", "R16", 3, 4, TRUE,
  "World Cup", 2022, "Spain", "Morocco", "R16", 0, 3, FALSE,
  "Euro", 2020, "Italy", "England", "Final", 3, 5, TRUE,
  "Euro", 2020, "England", "Italy", "Final", 2, 5, FALSE
)

# Shootout success factors
shootout_analysis <- penalty_data %>%
  mutate(
    conversion_rate = scored / total,
    # Row order does not encode who kicked first, so flag it explicitly
    shot_first = team %in% c("France", "Croatia", "Morocco", "Italy")
  ) %>%
  group_by(won) %>%
  summarise(
    avg_conversion = mean(conversion_rate),
    shot_first_pct = mean(shot_first) * 100,
    n = n()
  )

print("Penalty Shootout Analysis:")
print(shootout_analysis)

# Individual penalty analysis (from StatsBomb)
# Assumes events were cleaned (e.g. with allclean()) so that
# shot.end_location.y exists as a numeric column
analyze_penalties <- function(events) {
  events %>%
    filter(shot.type.name == "Penalty") %>%
    mutate(
      is_goal = shot.outcome.name == "Goal",
      shot_placement = shot.end_location.y  # Left/Right
    ) %>%
    summarise(
      total = n(),
      scored = sum(is_goal),
      conversion = mean(is_goal) * 100,

      # Placement analysis: the goal spans y = 36 to 44 (centre at 40),
      # split here into three non-overlapping thirds
      went_left = sum(shot_placement < 38.67, na.rm = TRUE),
      went_center = sum(shot_placement >= 38.67 & shot_placement <= 41.33, na.rm = TRUE),
      went_right = sum(shot_placement > 41.33, na.rm = TRUE)
    )
}

# Optimal penalty strategy
cat("\nKey Penalty Insights:\n")
cat("- First kicker advantage: an often-cited estimate is that teams shooting first win ~60% of shootouts, though the evidence is mixed\n")
cat("- Goalkeeper dive direction: Most keepers dive to their right\n")
cat("- Pressure effect: Later penalties (4th, 5th) have lower conversion\n")
cat("- Historic conversion: World Cup penalty conversion ~75%\n")

Group Stage Analysis

Group stage dynamics create unique strategic considerations. Teams must balance risk management, goal difference implications, and the possibility of manipulation (intentional draws, goal scoring to influence bracket positioning).
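
One way to reason about these trade-offs is to enumerate every possible set of results. The sketch below walks through all 729 win/draw/loss combinations in a four-team group and classifies a team's position on points alone: safely in the top two, out, or dependent on tiebreakers. The team labels are hypothetical and every outcome is counted once, so the tallies show what is possible, not how likely it is.

```python
from collections import Counter
from itertools import combinations, product

TEAMS = ["A", "B", "C", "D"]              # hypothetical group
FIXTURES = list(combinations(TEAMS, 2))   # 6 matches


def points_table(outcomes):
    """Points per team; outcome codes: '1' first team wins, 'X' draw, '2' second."""
    pts = dict.fromkeys(TEAMS, 0)
    for (first, second), outcome in zip(FIXTURES, outcomes):
        if outcome == "1":
            pts[first] += 3
        elif outcome == "2":
            pts[second] += 3
        else:
            pts[first] += 1
            pts[second] += 1
    return pts


def status(pts, team):
    """Classify a team's top-two chances using points only."""
    others = [p for t, p in pts.items() if t != team]
    level_or_above = sum(p >= pts[team] for p in others)
    strictly_above = sum(p > pts[team] for p in others)
    if level_or_above <= 1:
        return "safe"
    if strictly_above >= 2:
        return "out"
    return "tiebreak"


summary = Counter()
for outcomes in product("1X2", repeat=len(FIXTURES)):
    pts = points_table(outcomes)
    summary[(pts["A"], status(pts, "A"))] += 1

for (points, state), count in sorted(summary.items()):
    print(f"{points} pts  {state:<8}  {count} scenarios")
```

The "tiebreak" rows are where goal difference and goals scored decide qualification, which is exactly where final-round incentives for risk management or manipulation arise.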

Simulating group stage scenarios and qualification probabilities

# Group Stage Analysis and Scenarios
import pandas as pd
import numpy as np
from itertools import combinations
from scipy.stats import poisson

class GroupStageSimulator:
    """Simulate group stage scenarios"""

    def __init__(self, teams, ratings):
        self.teams = teams
        self.ratings = ratings
        self.matches = list(combinations(teams, 2))

    def simulate_match(self, team_a, team_b):
        """Simulate single group stage match"""
        # Poisson goals based on ratings
        lambda_a = 1.3 * self.ratings[team_a] / 1500
        lambda_b = 1.3 * self.ratings[team_b] / 1500

        goals_a = poisson.rvs(lambda_a)
        goals_b = poisson.rvs(lambda_b)

        return goals_a, goals_b

    def calculate_standings(self, results):
        """Calculate group standings from results"""
        standings = {team: {"P": 0, "W": 0, "D": 0, "L": 0,
                           "GF": 0, "GA": 0, "GD": 0, "Pts": 0}
                     for team in self.teams}

        for (team_a, team_b), (goals_a, goals_b) in results.items():
            standings[team_a]["P"] += 1
            standings[team_b]["P"] += 1
            standings[team_a]["GF"] += goals_a
            standings[team_a]["GA"] += goals_b
            standings[team_b]["GF"] += goals_b
            standings[team_b]["GA"] += goals_a

            if goals_a > goals_b:
                standings[team_a]["W"] += 1
                standings[team_a]["Pts"] += 3
                standings[team_b]["L"] += 1
            elif goals_a < goals_b:
                standings[team_b]["W"] += 1
                standings[team_b]["Pts"] += 3
                standings[team_a]["L"] += 1
            else:
                standings[team_a]["D"] += 1
                standings[team_b]["D"] += 1
                standings[team_a]["Pts"] += 1
                standings[team_b]["Pts"] += 1

        for team in standings:
            standings[team]["GD"] = standings[team]["GF"] - standings[team]["GA"]

        # Sort by points, then GD, then GF
        sorted_teams = sorted(
            standings.keys(),
            key=lambda t: (standings[t]["Pts"], standings[t]["GD"], standings[t]["GF"]),
            reverse=True
        )

        return sorted_teams, standings

    def simulate_group(self, n_simulations=10000):
        """Run full group stage simulation"""
        qualifications = {team: 0 for team in self.teams}
        group_wins = {team: 0 for team in self.teams}

        for _ in range(n_simulations):
            results = {}
            for team_a, team_b in self.matches:
                results[(team_a, team_b)] = self.simulate_match(team_a, team_b)

            sorted_teams, _ = self.calculate_standings(results)

            # Top 2 qualify
            qualifications[sorted_teams[0]] += 1
            qualifications[sorted_teams[1]] += 1
            group_wins[sorted_teams[0]] += 1

        # Convert to percentages
        results_df = pd.DataFrame({
            "Team": self.teams,
            "Qualify %": [qualifications[t] / n_simulations * 100 for t in self.teams],
            "Win Group %": [group_wins[t] / n_simulations * 100 for t in self.teams]
        })

        return results_df.sort_values("Qualify %", ascending=False)


# Example simulation
teams = ["Spain", "Germany", "Japan", "Costa Rica"]
ratings = {"Spain": 1950, "Germany": 1850, "Japan": 1600, "Costa Rica": 1450}

simulator = GroupStageSimulator(teams, ratings)
results = simulator.simulate_group(n_simulations=10000)

print("Group Stage Simulation Results:")
print(results.to_string(index=False))

# Group Stage Analysis and Scenarios
library(tidyverse)

# Group stage standings calculator
calculate_standings <- function(results) {
  results %>%
    pivot_longer(c(team_a, team_b), names_to = "home_away", values_to = "team") %>%
    mutate(
      goals_for = ifelse(home_away == "team_a", score_a, score_b),
      goals_against = ifelse(home_away == "team_a", score_b, score_a),
      points = case_when(
        goals_for > goals_against ~ 3,
        goals_for == goals_against ~ 1,
        TRUE ~ 0
      )
    ) %>%
    group_by(team) %>%
    summarise(
      played = n(),
      won = sum(points == 3),
      drawn = sum(points == 1),
      lost = sum(points == 0),
      gf = sum(goals_for),
      ga = sum(goals_against),
      gd = gf - ga,
      points = sum(points),
      .groups = "drop"
    ) %>%
    arrange(desc(points), desc(gd), desc(gf))
}

# Monte Carlo group stage simulation
simulate_group <- function(teams, ratings, n_sims = 10000) {

  # Generate all matches
  matches <- combn(teams, 2, simplify = FALSE) %>%
    map_dfr(~tibble(team_a = .x[1], team_b = .x[2]))

  qualification_count <- setNames(rep(0, length(teams)), teams)
  winning_group_count <- setNames(rep(0, length(teams)), teams)

  for (sim in 1:n_sims) {
    # Simulate all group matches
    sim_results <- matches %>%
      rowwise() %>%
      mutate(
        # Simple Poisson model
        lambda_a = 1.3 * ratings[team_a] / 1500,
        lambda_b = 1.3 * ratings[team_b] / 1500,
        score_a = rpois(1, lambda_a),
        score_b = rpois(1, lambda_b)
      )

    standings <- calculate_standings(sim_results)

    # Top 2 qualify
    qualifiers <- standings$team[1:2]
    qualification_count[qualifiers] <- qualification_count[qualifiers] + 1
    winning_group_count[standings$team[1]] <- winning_group_count[standings$team[1]] + 1
  }

  tibble(
    team = teams,
    qualify_pct = qualification_count / n_sims * 100,
    win_group_pct = winning_group_count / n_sims * 100
  ) %>%
    arrange(desc(qualify_pct))
}

# Example: Group simulation
teams <- c("Spain", "Germany", "Japan", "Costa Rica")
ratings <- c(Spain = 1950, Germany = 1850, Japan = 1600, `Costa Rica` = 1450)

group_sim <- simulate_group(teams, ratings)
print("Group Stage Simulation Results:")
print(group_sim)
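
The Poisson scoring model used in both simulators also gives single-match win, draw, and loss probabilities in closed form, which is a handy sanity check on the Monte Carlo output. The sketch below is a minimal illustration using only the standard library. The function names `expected_goals`, `poisson_pmf`, and `match_probabilities`, and the `max_goals` truncation, are assumptions introduced here rather than part of the simulators above.

```python
import math


def expected_goals(rating, base_rate=1.3, reference=1500):
    """Same scaling as the simulators above: lambda = 1.3 * rating / 1500."""
    return base_rate * rating / reference


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def match_probabilities(rating_a, rating_b, max_goals=15):
    """Win/draw/loss probabilities for team A, summing scorelines up to max_goals."""
    lam_a = expected_goals(rating_a)
    lam_b = expected_goals(rating_b)
    win = draw = loss = 0.0
    for goals_a in range(max_goals + 1):
        for goals_b in range(max_goals + 1):
            p = poisson_pmf(goals_a, lam_a) * poisson_pmf(goals_b, lam_b)
            if goals_a > goals_b:
                win += p
            elif goals_a == goals_b:
                draw += p
            else:
                loss += p
    return {"win": win, "draw": draw, "loss": loss}


# Ratings for Spain and Costa Rica taken from the example group above
probs = match_probabilities(1950, 1450)
print({outcome: round(p, 3) for outcome, p in probs.items()})
```

Because goals are drawn independently for each side, the model ignores any correlation between the two teams' scoring; comparing these figures with simulated frequencies mainly confirms that the simulation code is wired correctly.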

Tactical Adaptation in Tournaments

International managers must quickly adapt tactics between matches with limited training time. Analytics can help identify opponent vulnerabilities and optimize game plans for specific matchups.

Generating tactical scouting reports for tournament opponents

# Opponent Scouting Report Generator
import pandas as pd
import numpy as np

class ScoutingReport:
    """Generate tactical scouting reports for tournament opponents"""

    def __init__(self, events_df, team_name):
        self.team = team_name
        self.team_events = events_df[events_df["team"] == team_name]
        self.opp_events = events_df[events_df["team"] != team_name]

    def analyze_attacking(self):
        """Analyze attacking patterns"""
        events = self.team_events

        # Extract locations
        events = events.copy()
        events["x"] = events["location"].apply(lambda x: x[0] if isinstance(x, list) else None)
        events["y"] = events["location"].apply(lambda x: x[1] if isinstance(x, list) else None)

        passes = events[events["type"] == "Pass"]
        shots = events[events["type"] == "Shot"]

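        return {
            "total_shots": len(shots),
            "xG": shots["shot_statsbomb_xg"].sum(),
            # Penalty area on the 120x80 StatsBomb pitch: x > 102 and 18 <= y <= 62
            "shots_in_box": len(shots[(shots["x"] > 102) & shots["y"].between(18, 62)]),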
            "passes": len(passes),
            "avg_pass_length": passes["pass_length"].mean() if "pass_length" in passes.columns else None
        }

    def analyze_defensive(self):
        """Analyze defensive patterns"""
        events = self.team_events.copy()
        events["x"] = events["location"].apply(lambda x: x[0] if isinstance(x, list) else None)

        pressures = events[events["type"] == "Pressure"]

        return {
            "total_pressures": len(pressures),
            "high_press": len(pressures[pressures["x"] > 80]) if len(pressures) > 0 else 0,
            "mid_press": len(pressures[(pressures["x"] > 40) & (pressures["x"] <= 80)]) if len(pressures) > 0 else 0,
            "press_pct_high": len(pressures[pressures["x"] > 80]) / max(1, len(pressures)) * 100
        }

    def identify_key_players(self, top_n=5):
        """Identify most influential players"""
        events = self.team_events

        player_stats = events.groupby("player").agg({
            "id": "count",
            "shot_statsbomb_xg": "sum"
        }).reset_index()

        player_stats.columns = ["player", "actions", "xG"]
        player_stats["influence"] = player_stats["actions"] * 0.1 + player_stats["xG"] * 10

        return player_stats.nlargest(top_n, "influence")

    def generate_recommendations(self, attacking, defensive):
        """Generate tactical recommendations"""
        recommendations = []

        # High press counter
        if defensive["press_pct_high"] > 40:
            recommendations.append(
                "High pressing team - play direct balls behind press"
            )

        # Overall attacking threat
        if attacking["xG"] > 2.0:
            recommendations.append(
                "Strong attacking threat - prioritize defensive organization"
            )

        # Low block vulnerability
        if defensive["high_press"] < 10:
            recommendations.append(
                "Deep defending team - patient build-up, create overloads"
            )

        return recommendations

    def compile_report(self):
        """Compile full scouting report"""
        attacking = self.analyze_attacking()
        defensive = self.analyze_defensive()
        key_players = self.identify_key_players()
        recommendations = self.generate_recommendations(attacking, defensive)

        return {
            "team": self.team,
            "attacking": attacking,
            "defensive": defensive,
            "key_players": key_players,
            "recommendations": recommendations
        }


# Example usage
# scout = ScoutingReport(match_events, "Argentina")
# report = scout.compile_report()

print("Scouting Report Generator initialized")
print("Methods: analyze_attacking(), analyze_defensive(), identify_key_players()")

# Opponent Scouting Report Generator
library(tidyverse)
library(StatsBombR)

# Generate tactical scouting report
generate_scout_report <- function(events, team_name) {

  team_events <- events %>%
    filter(team.name == team_name)

  opp_events <- events %>%
    filter(team.name != team_name)

  # Attacking patterns
  attacking <- team_events %>%
    summarise(
      # Build-up
      buildup_passes = sum(type.name == "Pass" & location.x < 40),
      direct_play = mean(pass.length[type.name == "Pass" & location.x < 60], na.rm = TRUE),

      # Wing preference
      left_attacks = sum(type.name == "Pass" & pass.end_location.y < 27 &
                         pass.end_location.x > 80, na.rm = TRUE),
      right_attacks = sum(type.name == "Pass" & pass.end_location.y > 53 &
                          pass.end_location.x > 80, na.rm = TRUE),
      wing_bias = (right_attacks - left_attacks) / (right_attacks + left_attacks + 1),

      # Finishing
      shots = sum(type.name == "Shot"),
      xG = sum(shot.statsbomb_xg, na.rm = TRUE),
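      # Penalty area on the 120x80 StatsBomb pitch: x > 102 and 18 <= y <= 62
      shots_in_box = sum(type.name == "Shot" & location.x > 102 &
                         location.y >= 18 & location.y <= 62, na.rm = TRUE),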

      # Set pieces
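      # pass.type.name is NA for ordinary passes, so na.rm is required here
      corners = sum(type.name == "Pass" & pass.type.name == "Corner", na.rm = TRUE),
      # xG lives on shot rows, so select shots from corner sequences via play pattern
      corner_xG = sum(shot.statsbomb_xg[type.name == "Shot" &
                                        play_pattern.name == "From Corner"], na.rm = TRUE)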
    )

  # Defensive patterns
  defensive <- team_events %>%
    summarise(
      # Press
      high_press = sum(type.name == "Pressure" & location.x > 80),
      mid_press = sum(type.name == "Pressure" & location.x > 40 & location.x <= 80),
      press_success = mean(pressure_success == TRUE, na.rm = TRUE),

      # Line height
      def_line_avg = mean(location.x[type.name == "Ball Recovery" &
                                      position.name %in% c("Center Back", "Left Back", "Right Back")],
                          na.rm = TRUE),

      # Weaknesses
      fouls_conceded = sum(type.name == "Foul Committed"),
      cards = sum(foul_committed.card.name %in% c("Yellow Card", "Red Card"), na.rm = TRUE)
    )

  # Key players
  key_players <- team_events %>%
    group_by(player.name) %>%
    summarise(
      passes = sum(type.name == "Pass"),
      pass_completion = mean(is.na(pass.outcome.name[type.name == "Pass"])),
      shots = sum(type.name == "Shot"),
      xG = sum(shot.statsbomb_xg, na.rm = TRUE),
      influence = passes * 0.1 + shots * 2 + xG * 10,
      .groups = "drop"
    ) %>%
    arrange(desc(influence)) %>%
    head(5)

  # Compile report
  list(
    team = team_name,
    attacking = attacking,
    defensive = defensive,
    key_players = key_players,
    recommendations = generate_recommendations(attacking, defensive)
  )
}

generate_recommendations <- function(attacking, defensive) {
  recs <- c()

  # Wing recommendations
  if (attacking$wing_bias > 0.2) {
    recs <- c(recs, "Opponent favors right side - overload left to exploit space")
  } else if (attacking$wing_bias < -0.2) {
    recs <- c(recs, "Opponent favors left side - overload right to exploit space")
  }

  # Press recommendations
  if (defensive$high_press > 20) {
    recs <- c(recs, "High pressing team - play long balls behind press")
  }

  # Set piece recommendations
  if (attacking$corners > 10) {
    recs <- c(recs, "Dangerous from corners - prioritize first contact")
  }

  return(recs)
}
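
The R report computes a `wing_bias` figure that the Python `ScoutingReport` class does not. The sketch below shows one way the same calculation could look in plain Python, using the same thresholds as the R code (end location beyond x = 80, left channel y < 27, right channel y > 53). The function name `wing_bias` and its parameters are assumptions introduced here, and the input is assumed to be a sequence of `[x, y]` pass end locations such as the `pass_end_location` values in the event data.

```python
def wing_bias(pass_end_locations, final_third_x=80, left_max_y=27, right_min_y=53):
    """Count final-third passes ending in each wide channel and compare them.

    Returns a bias in (-1, 1): positive means more right-sided attacks.
    """
    left = right = 0
    for loc in pass_end_locations:
        # Skip missing locations (non-pass events have no end location)
        if loc is None:
            continue
        x, y = loc[0], loc[1]
        if x <= final_third_x:
            continue
        if y < left_max_y:
            left += 1
        elif y > right_min_y:
            right += 1
    # The +1 in the denominator mirrors the R version and avoids division by zero
    bias = (right - left) / (right + left + 1)
    return {"left_attacks": left, "right_attacks": right, "wing_bias": bias}


# Made-up end locations for illustration only
sample_ends = [[90, 10], [95, 70], [100, 60], [85, 75], [50, 5], None]
result = wing_bias(sample_ends)
print(result)
```

With the sample above, one pass ends in the left channel and three in the right, so the bias is 2 / 5 = 0.4, which would trigger the "favors right side" recommendation at the 0.2 threshold used in the R code.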

Historical Tournament Analysis

Analyzing historical tournament data provides insights into patterns of success, home advantage effects, and the evolution of international football over time.

Analyzing historical World Cup trends and patterns

# Historical World Cup Analysis
import pandas as pd
import matplotlib.pyplot as plt

# Historical World Cup data
world_cups = pd.DataFrame({
    "year": [2022, 2018, 2014, 2010, 2006, 2002, 1998, 1994, 1990, 1986],
    "host": ["Qatar", "Russia", "Brazil", "South Africa", "Germany",
             "Korea/Japan", "France", "USA", "Italy", "Mexico"],
    "winner": ["Argentina", "France", "Germany", "Spain", "Italy",
               "Brazil", "France", "Brazil", "Germany", "Argentina"],
    "runner_up": ["France", "Croatia", "Argentina", "Netherlands", "France",
                  "Germany", "Brazil", "Italy", "Argentina", "Germany"],
    "goals": [172, 169, 171, 145, 147, 161, 171, 141, 115, 132],
    "matches": [64, 64, 64, 64, 64, 64, 64, 52, 52, 52],
    "teams": [32, 32, 32, 32, 32, 32, 32, 24, 24, 24]
})

# Calculate metrics
world_cups["goals_per_match"] = world_cups["goals"] / world_cups["matches"]
world_cups["host_won"] = world_cups["host"] == world_cups["winner"]
world_cups["host_finalist"] = (
    (world_cups["host"] == world_cups["winner"]) |
    (world_cups["host"] == world_cups["runner_up"])
)

# Goals per match trend
plt.figure(figsize=(10, 6))
plt.plot(world_cups["year"], world_cups["goals_per_match"],
         marker="o", linewidth=2, color="#1B5E20")
plt.xlabel("Year")
plt.ylabel("Goals Per Match")
plt.title("World Cup Goals Per Match Over Time")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Host nation advantage
host_win_rate = world_cups["host_won"].mean() * 100
host_finalist_rate = world_cups["host_finalist"].mean() * 100

print(f"Host Nation Advantage:")
print(f"  Win rate: {host_win_rate:.1f}%")
print(f"  Finalist rate: {host_finalist_rate:.1f}%")

# Most successful nations
champions = world_cups["winner"].value_counts()
print(f"\nWorld Cup Winners (1986-2022):")
print(champions)

# Regional analysis
def get_region(country):
    south_america = ["Brazil", "Argentina"]
    europe = ["Germany", "France", "Italy", "Spain"]
    if country in south_america:
        return "South America"
    elif country in europe:
        return "Europe"
    return "Other"

world_cups["winner_region"] = world_cups["winner"].apply(get_region)
regional = world_cups["winner_region"].value_counts()
print(f"\nRegional Dominance:")
print(regional)

# Trend analysis
recent = world_cups[world_cups["year"] >= 2010]
older = world_cups[world_cups["year"] < 2010]

print(f"\nGoals Per Match:")
print(f"  2010-2022: {recent['goals_per_match'].mean():.2f}")
print(f"  1986-2006: {older['goals_per_match'].mean():.2f}")

# Historical World Cup Analysis
library(tidyverse)

# Historical World Cup data
world_cups <- tribble(
  ~year, ~host, ~winner, ~runner_up, ~goals, ~matches, ~teams,
  2022, "Qatar", "Argentina", "France", 172, 64, 32,
  2018, "Russia", "France", "Croatia", 169, 64, 32,
  2014, "Brazil", "Germany", "Argentina", 171, 64, 32,
  2010, "South Africa", "Spain", "Netherlands", 145, 64, 32,
  2006, "Germany", "Italy", "France", 147, 64, 32,
  2002, "Korea/Japan", "Brazil", "Germany", 161, 64, 32,
  1998, "France", "France", "Brazil", 171, 64, 32,
  1994, "USA", "Brazil", "Italy", 141, 52, 24,
  1990, "Italy", "Germany", "Argentina", 115, 52, 24,
  1986, "Mexico", "Argentina", "Germany", 132, 52, 24
)

# Analysis
world_cups <- world_cups %>%
  mutate(
    goals_per_match = goals / matches,
    host_won = host == winner | str_detect(winner, host),
    host_finalist = host == winner | host == runner_up |
                    str_detect(winner, host) | str_detect(runner_up, host)
  )

# Goals per match trend
goals_trend <- world_cups %>%
  ggplot(aes(x = year, y = goals_per_match)) +
  geom_line(color = "#1B5E20", linewidth = 1.5) +
  geom_point(size = 3, color = "#1B5E20") +
  labs(
    title = "World Cup Goals Per Match Over Time",
    x = "Year",
    y = "Goals Per Match"
  ) +
  theme_minimal()

print(goals_trend)

# Host nation advantage
host_advantage <- world_cups %>%
  summarise(
    host_win_pct = mean(host_won) * 100,
    host_final_pct = mean(host_finalist) * 100
  )

cat("Host Nation Advantage:\n")
cat("Win rate:", host_advantage$host_win_pct, "%\n")
cat("Finalist rate:", host_advantage$host_final_pct, "%\n")

# Most successful nations
champions <- world_cups %>%
  count(winner, sort = TRUE) %>%
  rename(titles = n)

cat("\nWorld Cup Winners (1986-2022):\n")
print(champions)

# European vs South American dominance
regional_analysis <- world_cups %>%
  mutate(
    winner_region = case_when(
      winner %in% c("Brazil", "Argentina") ~ "South America",
      winner %in% c("Germany", "France", "Italy", "Spain") ~ "Europe",
      TRUE ~ "Other"
    )
  ) %>%
  count(winner_region)

cat("\nRegional Dominance:\n")
print(regional_analysis)
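
In the table above only one host (France in 1998) won the tournament, so the host win rate rests on ten observations. The sketch below attaches a rough uncertainty range to that rate using a Wilson score interval. This is an illustrative addition, not part of the original analysis; the function name `wilson_interval` and the choice of a 95% level (`z=1.96`) are assumptions made here.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (default roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# One host winner in the ten tournaments listed above (1986-2022)
low, high = wilson_interval(successes=1, trials=10)
print(f"Host win rate: {1 / 10:.1%} (interval {low:.1%} to {high:.1%})")
```

The interval is wide, which is the point: ten tournaments are too few to pin down the size of any host advantage from win counts alone.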

Qualifying Campaign Analytics

World Cup and continental championship qualification campaigns present their own analytical challenges. Understanding qualifying dynamics helps predict which teams will reach major tournaments and how their qualifying performance translates to tournament success.

Analyzing qualifying campaigns and confederation formats
# Python: Qualifying campaign analysis
import pandas as pd
import numpy as np
from typing import Dict, List

class QualifyingAnalyzer:
    """Analyze World Cup/Euro qualifying campaigns."""

    def __init__(self, qualifying_data: pd.DataFrame):
        self.data = qualifying_data

    def calculate_team_stats(self) -> pd.DataFrame:
        """Calculate comprehensive qualifying statistics."""

        stats = self.data.groupby("team").agg({
            "match_id": "count",
            "result": [
                lambda x: (x == "W").sum(),
                lambda x: (x == "D").sum(),
                lambda x: (x == "L").sum()
            ],
            "goals_for": "sum",
            "goals_against": "sum"
        }).reset_index()

        stats.columns = ["team", "matches", "wins", "draws", "losses",
                        "goals_for", "goals_against"]

        # Calculate derived metrics
        stats["points"] = stats["wins"] * 3 + stats["draws"]
        stats["goal_difference"] = stats["goals_for"] - stats["goals_against"]
        stats["goals_per_game"] = stats["goals_for"] / stats["matches"]
        stats["conceded_per_game"] = stats["goals_against"] / stats["matches"]
        stats["win_rate"] = stats["wins"] / stats["matches"]
        stats["efficiency"] = stats["points"] / (stats["matches"] * 3)

        # Performance tier
        stats["tier"] = pd.cut(
            stats["efficiency"],
            bins=[0, 0.35, 0.50, 0.70, 0.85, 1.0],
            labels=["Weak", "Struggling", "Competitive", "Strong", "Dominant"],
            include_lowest=True  # keep teams with zero points in the "Weak" tier
        )

        return stats.sort_values("points", ascending=False)

    def correlate_with_tournament(self, tournament_results: pd.DataFrame) -> Dict:
        """Analyze correlation between qualifying and tournament performance."""

        qualifying_stats = self.calculate_team_stats()

        combined = qualifying_stats.merge(
            tournament_results[["team", "tourn_points", "stage"]],
            on="team", how="inner"
        )

        # Correlations
        correlations = {
            "points_correlation": combined["points"].corr(combined["tourn_points"]),
            "goals_correlation": combined["goals_for"].corr(combined["tourn_points"]),
            "efficiency_correlation": combined["efficiency"].corr(combined["tourn_points"])
        }

        # Performance by qualifying tier
        tier_analysis = combined.groupby("tier").agg({
            "tourn_points": "mean",
            "stage": lambda x: (x.isin(["QF", "SF", "Final"])).mean()
        }).reset_index()

        return {
            "correlations": correlations,
            "tier_analysis": tier_analysis
        }

    def identify_group_of_death(self, teams_by_group: pd.DataFrame,
                               ratings: pd.DataFrame) -> pd.DataFrame:
        """Identify the group of death."""

        combined = teams_by_group.merge(ratings[["team", "elo"]], on="team")

        group_stats = combined.groupby("group").agg({
            "elo": ["mean", "min", "max", "std"]
        }).reset_index()

        group_stats.columns = ["group", "avg_elo", "min_elo", "max_elo", "std_elo"]

        # Competitiveness: high average + low std = group of death
        group_stats["competitiveness"] = 1 - (group_stats["std_elo"] / group_stats["avg_elo"])
        group_stats["combined_score"] = group_stats["avg_elo"] * group_stats["competitiveness"]

        group_stats["is_group_of_death"] = (
            group_stats["combined_score"] > group_stats["combined_score"].quantile(0.75)
        )

        return group_stats.sort_values("combined_score", ascending=False)


class ConfederationComparison:
    """Compare qualifying formats across confederations."""

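    # Illustrative allocations only: the entries mix qualifying cycles
    # (CONMEBOL's six direct places apply from the 2026 cycle), so verify
    # against the current FIFA slot allocation before relying on them
    FORMATS = {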
        "UEFA": {
            "groups": 10,
            "teams_per_group": 5,
            "auto_qualify": 10,
            "playoff_spots": 3,
            "total_slots": 13
        },
        "CONMEBOL": {
            "format": "single_league",
            "teams": 10,
            "auto_qualify": 6,
            "playoff_spots": 1,
            "total_slots": 6.5
        },
        "CONCACAF": {
            "format": "octagonal",
            "teams": 8,
            "auto_qualify": 3,
            "playoff_spots": 1,
            "total_slots": 3.5
        },
        "CAF": {
            "groups": 9,
            "playoff_round": True,
            "auto_qualify": 0,
            "playoff_spots": 5,
            "total_slots": 5
        },
        "AFC": {
            "rounds": 3,
            "final_groups": 2,
            "auto_qualify": 4,
            "playoff_spots": 1,
            "total_slots": 4.5
        }
    }

    def compare_difficulty(self, confederation: str, team_count: int) -> Dict:
        """Calculate qualifying difficulty for a confederation."""

        format_info = self.FORMATS.get(confederation, {})

        # Probability of qualification (simplified)
        if "auto_qualify" in format_info:
            auto_prob = format_info["auto_qualify"] / team_count
            playoff_prob = format_info.get("playoff_spots", 0) / team_count * 0.5
            total_prob = auto_prob + playoff_prob

            return {
                "confederation": confederation,
                "format": format_info,
                "qualification_probability": total_prob,
                "difficulty_rating": 1 - total_prob
            }

        return {"confederation": confederation, "format": format_info}

# Example usage
print("Qualifying Campaign Analyzer initialized")

comparator = ConfederationComparison()
for conf in ["UEFA", "CONMEBOL", "CONCACAF"]:
    result = comparator.compare_difficulty(conf, 50 if conf == "UEFA" else 10)
    print(f"{conf}: Qualification probability {result.get('qualification_probability', 0):.1%}")

# R: Qualifying campaign analysis
library(tidyverse)

# Analyze European World Cup qualifying
analyze_qualifying <- function(qualifying_data) {

    qualifying_data %>%
        group_by(team) %>%
        summarise(
            matches = n(),
            wins = sum(result == "W"),
            draws = sum(result == "D"),
            losses = sum(result == "L"),
            goals_for = sum(goals_for),
            goals_against = sum(goals_against),
            points = wins * 3 + draws,

            # Advanced metrics
            goals_per_game = goals_for / matches,
            conceded_per_game = goals_against / matches,
            win_rate = wins / matches,

            # Home vs away split (points per match masked by venue, then summed)
            home_points = sum(((result == "W") * 3 + (result == "D") * 1) * (venue == "Home")),
            away_points = sum(((result == "W") * 3 + (result == "D") * 1) * (venue == "Away")),

            .groups = "drop"
        ) %>%
        mutate(
            # Points efficiency
            max_possible = matches * 3,
            efficiency = points / max_possible,

            # Classify performance
            performance_tier = case_when(
                efficiency > 0.85 ~ "Dominant",
                efficiency > 0.70 ~ "Strong",
                efficiency > 0.50 ~ "Competitive",
                efficiency > 0.35 ~ "Struggling",
                TRUE ~ "Weak"
            )
        ) %>%
        arrange(desc(points))
}

# Does qualifying performance predict tournament success?
correlate_qualifying_tournament <- function(qualifying_results, tournament_results) {

    combined <- qualifying_results %>%
        select(team, qual_points = points, qual_goals = goals_for,
               qual_conceded = goals_against) %>%
        inner_join(
            tournament_results %>%
                select(team, tourn_points = points, tourn_stage = furthest_stage),
            by = "team"
        )

    # Correlation analysis
    correlations <- combined %>%
        summarise(
            points_cor = cor(qual_points, tourn_points, use = "complete.obs"),
            goals_cor = cor(qual_goals, tourn_points, use = "complete.obs")
        )

    # Stage achievement by qualifying tier
    stage_by_tier <- combined %>%
        mutate(
            qual_tier = ntile(qual_points, 4)
        ) %>%
        group_by(qual_tier) %>%
        summarise(
            avg_tourn_points = mean(tourn_points),
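            # `>=` on character values compares alphabetically, not by round, so
            # test membership instead (stage labels are assumed; match your data)
            reached_knockout = mean(tourn_stage %in% c("Round of 16", "Quarter-finals",
                                                        "Semi-finals", "Final")),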
            .groups = "drop"
        )

    list(
        correlations = correlations,
        stage_analysis = stage_by_tier
    )
}

# Group of death detection
identify_group_difficulty <- function(teams_by_group, ratings) {

    teams_by_group %>%
        left_join(ratings, by = "team") %>%
        group_by(group) %>%
        summarise(
            avg_rating = mean(elo_rating),
            min_rating = min(elo_rating),
            max_rating = max(elo_rating),
            range = max_rating - min_rating,

            # Competitiveness (smaller range = more competitive)
            competitiveness = 1 - (range / max_rating),

            .groups = "drop"
        ) %>%
        mutate(
            is_group_of_death = avg_rating > quantile(avg_rating, 0.75) &
                                 competitiveness > 0.5
        ) %>%
        arrange(desc(avg_rating))
}

# Playoff scenarios analysis
analyze_playoff_paths <- function(standings, format = "uefa") {

    if (format == "uefa") {
        # UEFA: Top 2 per group qualify, playoff for 3rd place
        standings %>%
            group_by(group) %>%
            mutate(
                position = row_number(),
                status = case_when(
                    position <= 2 ~ "Qualified",
                    position == 3 ~ "Playoff",
                    TRUE ~ "Eliminated"
                )
            ) %>%
            ungroup()
    } else if (format == "conmebol") {
        # CONMEBOL: Single group, top 6 qualify
        standings %>%
            mutate(
                status = case_when(
                    row_number() <= 6 ~ "Qualified",
                    row_number() == 7 ~ "Playoff",
                    TRUE ~ "Eliminated"
                )
            )
    }
}

print("Qualifying campaign analyzer ready!")
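
The group-of-death score in the Python `identify_group_of_death` method multiplies the average Elo by `1 - std / avg`, which simplifies algebraically to the average minus the standard deviation. The sketch below reproduces that calculation with the standard library so the arithmetic is easy to follow. The function name `group_difficulty` is an assumption introduced here; group A reuses the ratings from the earlier simulation example, while groups B and C are hypothetical.

```python
from statistics import mean, stdev


def group_difficulty(groups):
    """Rank groups by avg_elo * (1 - std_elo / avg_elo), highest first."""
    rows = []
    for name, elos in groups.items():
        avg = mean(elos)
        spread = stdev(elos)  # sample standard deviation, like pandas' default
        competitiveness = 1 - spread / avg
        rows.append({
            "group": name,
            "avg_elo": avg,
            "std_elo": spread,
            "competitiveness": competitiveness,
            "combined_score": avg * competitiveness,
        })
    return sorted(rows, key=lambda r: r["combined_score"], reverse=True)


example_groups = {
    "A": [1950, 1850, 1600, 1450],  # ratings from the simulation example
    "B": [1900, 1880, 1860, 1840],  # hypothetical: strong and evenly matched
    "C": [2000, 1500, 1450, 1400],  # hypothetical: one favourite, three outsiders
}

ranked = group_difficulty(example_groups)
for row in ranked:
    print(f"Group {row['group']}: avg {row['avg_elo']:.0f}, "
          f"std {row['std_elo']:.0f}, score {row['combined_score']:.0f}")
```

Group B ranks first because it pairs a high average with a small spread, which matches the intuition behind the metric: every match in the group is between strong, closely rated sides.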

Continental Differences

Football varies significantly across confederations, with different playing styles, physical attributes, and tactical approaches. Understanding these differences is crucial for international tournament analytics.

Analyzing playing styles and travel impact across confederations
# Python: Continental playing style analysis
import pandas as pd
import numpy as np
from typing import Dict, List

class ContinentalAnalyzer:
    """Analyze football differences across confederations."""

    CONFEDERATIONS = {
        "UEFA": ["Spain", "Germany", "France", "England", "Italy", "Netherlands",
                 "Portugal", "Belgium", "Croatia", "Denmark"],
        "CONMEBOL": ["Brazil", "Argentina", "Uruguay", "Colombia", "Chile",
                    "Peru", "Ecuador", "Paraguay"],
        "CONCACAF": ["Mexico", "USA", "Canada", "Costa Rica"],
        "CAF": ["Morocco", "Senegal", "Nigeria", "Cameroon", "Ghana", "Egypt"],
        "AFC": ["Japan", "South Korea", "Australia", "Iran", "Saudi Arabia"],
        "OFC": ["New Zealand"]
    }

    def __init__(self, match_data: pd.DataFrame):
        self.data = match_data
        self.team_conf = self._build_team_confederation_map()

    def _build_team_confederation_map(self) -> Dict[str, str]:
        """Create team to confederation mapping."""
        mapping = {}
        for conf, teams in self.CONFEDERATIONS.items():
            for team in teams:
                mapping[team] = conf
        return mapping

    def analyze_styles(self) -> pd.DataFrame:
        """Analyze playing styles by confederation."""

        self.data["confederation"] = self.data["team"].map(self.team_conf)
        conf_data = self.data[self.data["confederation"].notna()]

        style_analysis = conf_data.groupby("confederation").agg({
            "passes": "mean",
            "pass_completion": "mean",
            "xG": "mean",
            "shots": "mean",
            "pressures": "mean",
            "possession": "mean"
        }).reset_index()

        # Classify styles
        style_analysis["dominant_style"] = style_analysis.apply(
            lambda x: self._classify_style(x), axis=1
        )

        return style_analysis

    def _classify_style(self, row: pd.Series) -> str:
        """Classify playing style based on metrics."""

        if row["possession"] > 55 and row["passes"] > 500:
            return "Possession-based"
        elif row["pressures"] > 150:
            return "High-pressing"
        elif row["xG"] > 1.5:
            return "Attack-focused"
        else:
            return "Balanced"

    def head_to_head_analysis(self) -> pd.DataFrame:
        """Analyze inter-confederation matchups."""

        # Assuming data has home/away team info
        h2h_results = []

        # This would need actual match data with home/away structure
        # Simplified example
        return pd.DataFrame(h2h_results)

    def tournament_success(self, tournament_data: pd.DataFrame) -> pd.DataFrame:
        """Analyze World Cup success by confederation."""

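        # Work on a copy so the caller's DataFrame is not modified
        tournament_data = tournament_data.copy()
        tournament_data["confederation"] = tournament_data["team"].map(self.team_conf)

        # One column per outcome bucket instead of a dict packed into a single cell
        success = tournament_data.groupby("confederation").agg(
            teams=("team", "nunique"),
            points=("points", "sum"),
            group_exits=("stage", lambda x: (x == "Group").sum()),
            knockout_exits=("stage", lambda x: x.isin(["R16", "QF"]).sum()),
            semi_plus=("stage", lambda x: x.isin(["SF", "Final", "Winner"]).sum()),
        ).reset_index()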

        return success


class TravelImpactAnalyzer:
    """Analyze impact of travel and time zones on performance."""

    def __init__(self):
        # Approximate time zones for major football nations
        self.time_zones = {
            # Europe (base: UTC+1)
            "Spain": 1, "Germany": 1, "France": 1, "England": 0, "Italy": 1,
            # South America (UTC-3 to -5)
            "Brazil": -3, "Argentina": -3, "Uruguay": -3, "Colombia": -5,
            # Asia (UTC+3 to +9)
            "Japan": 9, "South Korea": 9, "Iran": 3.5, "Saudi Arabia": 3,
            # Africa (UTC+0 to +3)
            "Morocco": 1, "Senegal": 0, "Nigeria": 1,
            # North America
            "USA": -5, "Mexico": -6
        }

    def calculate_impact(self, team: str, host_tz: float) -> Dict:
        """Calculate travel impact for a team."""

        team_tz = self.time_zones.get(team, 0)
        tz_diff = abs(team_tz - host_tz)

        impact_level = (
            "Severe" if tz_diff > 6 else
            "Moderate" if tz_diff > 3 else
            "Minimal"
        )

        # Performance adjustment factor
        adjustment = 1.0 - (tz_diff * 0.02)  # 2% reduction per hour difference

        return {
            "team": team,
            "tz_difference": tz_diff,
            "impact_level": impact_level,
            "performance_adjustment": max(0.85, adjustment)
        }

    def tournament_analysis(self, teams: List[str], host_tz: float) -> pd.DataFrame:
        """Analyze travel impact for all tournament teams."""

        impacts = [self.calculate_impact(team, host_tz) for team in teams]
        return pd.DataFrame(impacts).sort_values("tz_difference", ascending=False)


# Example usage
print("Continental and travel impact analyzers initialized")

# World Cup in Qatar (UTC+3)
travel_analyzer = TravelImpactAnalyzer()
wc_teams = ["Argentina", "France", "Brazil", "England", "Japan", "Morocco"]
impacts = travel_analyzer.tournament_analysis(wc_teams, host_tz=3)
print("\nTravel Impact for Qatar 2022:")
print(impacts.to_string(index=False))
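
The `head_to_head_analysis` method above is left as a stub. The sketch below shows one way the inter-confederation tally could be built in plain Python, following the same steps as the R version of this analysis: map both teams to a confederation, drop same-confederation matches, and count results per pairing. The function name `confederation_head_to_head`, the input field names (`home_team`, `away_team`, `home_score`, `away_score`), and the scorelines are all assumptions for illustration.

```python
from collections import defaultdict


def confederation_head_to_head(matches, team_conf):
    """Tally results for each (home confederation, away confederation) pairing."""
    table = defaultdict(
        lambda: {"matches": 0, "home_wins": 0, "draws": 0, "away_wins": 0}
    )
    for match in matches:
        home_conf = team_conf.get(match["home_team"])
        away_conf = team_conf.get(match["away_team"])
        # Skip unmapped teams and matches within a single confederation
        if home_conf is None or away_conf is None or home_conf == away_conf:
            continue
        row = table[(home_conf, away_conf)]
        row["matches"] += 1
        if match["home_score"] > match["away_score"]:
            row["home_wins"] += 1
        elif match["home_score"] < match["away_score"]:
            row["away_wins"] += 1
        else:
            row["draws"] += 1
    return {
        pair: {**row, "home_win_rate": row["home_wins"] / row["matches"]}
        for pair, row in table.items()
    }


team_conf = {"Spain": "UEFA", "Germany": "UEFA", "Japan": "AFC", "Brazil": "CONMEBOL"}

# Made-up scorelines for illustration only, not real results
sample_matches = [
    {"home_team": "Spain", "away_team": "Japan", "home_score": 2, "away_score": 1},
    {"home_team": "Germany", "away_team": "Japan", "home_score": 1, "away_score": 1},
    {"home_team": "Brazil", "away_team": "Spain", "home_score": 0, "away_score": 1},
    {"home_team": "Spain", "away_team": "Germany", "home_score": 1, "away_score": 0},
]

h2h = confederation_head_to_head(sample_matches, team_conf)
for pair, row in h2h.items():
    print(pair, row)
```

At a tournament the "home" side is usually only a fixture-list designation, so it may make sense to merge each pairing with its mirror image before interpreting the win rates.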

# R: Continental playing style analysis
library(tidyverse)

# Analyze playing styles by confederation
analyze_continental_styles <- function(match_events, team_confederations) {

    match_events %>%
        left_join(team_confederations, by = "team") %>%
        filter(!is.na(confederation)) %>%
        group_by(confederation) %>%
        summarise(
            # Passing style
            avg_passes = mean(passes_per_match),
            pass_completion = mean(pass_completion),
            long_ball_pct = mean(long_balls / passes_per_match) * 100,

            # Attacking
            avg_xG = mean(xG_per_match),
            shots_per_match = mean(shots_per_match),
            shot_accuracy = mean(shots_on_target / shots_per_match),

            # Defending
            avg_xGA = mean(xGA_per_match),
            pressing_intensity = mean(pressures_per_match),
            tackle_success = mean(tackle_win_pct),

            # Physical
            avg_distance = mean(total_distance),
            sprint_distance = mean(sprint_distance),

            .groups = "drop"
        )
}

# Head-to-head analysis between confederations
analyze_h2h_confederations <- function(match_data, team_confederations) {

    match_data %>%
        left_join(team_confederations, by = c("home_team" = "team")) %>%
        rename(home_conf = confederation) %>%
        left_join(team_confederations, by = c("away_team" = "team")) %>%
        rename(away_conf = confederation) %>%
        filter(!is.na(home_conf), !is.na(away_conf)) %>%
        filter(home_conf != away_conf) %>%  # Inter-confederation matches only
        mutate(
            home_result = case_when(
                home_score > away_score ~ "Win",
                home_score < away_score ~ "Loss",
                TRUE ~ "Draw"
            )
        ) %>%
        group_by(home_conf, away_conf) %>%
        summarise(
            matches = n(),
            home_wins = sum(home_result == "Win"),
            draws = sum(home_result == "Draw"),
            away_wins = sum(home_result == "Loss"),

            home_win_rate = home_wins / matches,
            home_goals = mean(home_score),
            away_goals = mean(away_score),

            .groups = "drop"
        )
}

# World Cup performance by confederation
wc_confederation_analysis <- function(tournament_results) {

    tournament_results %>%
        group_by(confederation) %>%
        summarise(
            teams = n_distinct(team),
            total_matches = n(),
            total_points = sum(points),

            # Stage progression
            group_exit = sum(stage == "Group"),
            r16_exit = sum(stage == "Round of 16"),
            qf_exit = sum(stage == "Quarter-final"),
            sf_exit = sum(stage == "Semi-final"),
            final_exit = sum(stage == "Final"),

            # Success rates
            knockout_rate = 1 - (group_exit / teams),
            qf_rate = sum(stage %in% c("Quarter-final", "Semi-final", "Final")) / teams,
            semi_rate = sum(stage %in% c("Semi-final", "Final")) / teams,

            .groups = "drop"
        ) %>%
        arrange(desc(semi_rate))
}

# Time zone and travel impact
analyze_travel_impact <- function(match_data, team_locations) {

    match_data %>%
        left_join(team_locations, by = "team") %>%
        mutate(
            # Time zone difference from tournament host
            tz_difference = abs(team_timezone - host_timezone),

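            # Distance traveled; calculate_distance() is not defined in this
            # chapter and is assumed to be a user-supplied great-circle helper
            distance_km = calculate_distance(team_lat, team_lon,
                                             host_lat, host_lon),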

            # Travel impact categories
            travel_impact = case_when(
                tz_difference > 6 ~ "Severe",
                tz_difference > 3 ~ "Moderate",
                distance_km > 5000 ~ "Long distance",
                TRUE ~ "Minimal"
            )
        ) %>%
        group_by(travel_impact) %>%
        summarise(
            teams = n(),
            avg_points = mean(points),
            avg_goals = mean(goals),
            knockout_rate = mean(reached_knockout),
            .groups = "drop"
        )
}

print("Continental analysis framework ready!")

Confederation  Typical Style                   WC Slots (32-team format)  Historical Success
UEFA           Technical, tactical variety     13                         12 World Cup wins
CONMEBOL       Technical, physical, flair      4.5                        10 World Cup wins
CAF            Athletic, improving tactically  5                          0 wins; best finish: semi-final (Morocco, 2022)
AFC            Organized, disciplined          4.5                        0 wins; best finish: semi-final (South Korea, 2002)
CONCACAF       Physical, direct                3.5                        0 wins; best finish: semi-final (USA, 1930)

Practice Exercises

Exercise 45.1: Build a Tournament Simulator

Create a complete World Cup simulation model that predicts group stage outcomes and knockout results. Include Elo ratings, Monte Carlo simulation, and probability outputs for each team advancing through rounds.

Hints:
  • Start with current FIFA rankings or Elo ratings
  • Implement proper tiebreakers for group stage standings
  • Account for the actual tournament bracket structure
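
As a possible starting point for this exercise, the sketch below simulates a single four-team group with Elo-based match probabilities and Monte Carlo sampling. The team names, ratings, flat draw rate, and random tiebreaker are all illustrative assumptions, not values from this chapter; a full solution would replace the tiebreaker with goal difference and head-to-head rules and add the knockout bracket.

```python
import random
from collections import Counter
from itertools import combinations


def elo_win_prob(r_a, r_b):
    """Standard Elo expected score for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def simulate_match(r_a, r_b, rng, draw_rate=0.25):
    """Return (points_a, points_b); draw_rate is a flat, assumed value."""
    p_a = elo_win_prob(r_a, r_b)
    u = rng.random()
    if u < draw_rate:
        return 1, 1
    # Split the non-draw probability in proportion to the Elo expectation
    if u < draw_rate + (1 - draw_rate) * p_a:
        return 3, 0
    return 0, 3


def simulate_group(ratings, n_sims=10000, seed=42):
    """Estimate each team's probability of finishing in the top two."""
    rng = random.Random(seed)
    teams = list(ratings)
    advanced = Counter()
    for _ in range(n_sims):
        points = dict.fromkeys(teams, 0)
        for a, b in combinations(teams, 2):  # single round robin
            pts_a, pts_b = simulate_match(ratings[a], ratings[b], rng)
            points[a] += pts_a
            points[b] += pts_b
        # Placeholder tiebreaker: teams level on points are ordered at random
        ranked = sorted(teams, key=lambda t: (points[t], rng.random()), reverse=True)
        for team in ranked[:2]:
            advanced[team] += 1
    return {team: advanced[team] / n_sims for team in teams}


# Hypothetical ratings for illustration only
ratings = {"Team A": 1900, "Team B": 1750, "Team C": 1650, "Team D": 1500}
probs = simulate_group(ratings)
for team, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{team}: {p:.1%} chance of advancing")
```

Seeding the generator keeps runs reproducible, which makes it easier to check that later changes (tiebreakers, bracket logic) alter the probabilities for the intended reason.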

Exercise 45.2: Knockout Match Win Probability

Analyze StatsBomb World Cup data to identify what match statistics best predict knockout match outcomes. Build a logistic regression model and evaluate its performance.

Hints:
  • Consider xG, possession, pressing metrics
  • Account for extra time and penalties
  • Test for overfitting with cross-validation
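
The modelling loop for this exercise might look like the sketch below: a small logistic regression fitted by gradient descent and scored with k-fold cross-validation. It uses only the standard library, and the data is synthetic (an xG-difference feature that drives the outcome and a possession-difference feature that does not), so the feature names and coefficients are assumptions for illustration. In practice the features would be built from StatsBomb events, and a library such as scikit-learn would be the more usual choice.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Batch gradient descent on the mean log-loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, xi):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0


def cross_val_accuracy(X, y, k=5, seed=0):
    """Mean held-out accuracy over k shuffled folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores = []
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        w, b = fit_logistic([X[i] for i in train], [y[i] for i in train])
        scores.append(sum(predict(w, b, X[i]) == y[i] for i in fold) / len(fold))
    return sum(scores) / k


# Synthetic matches: only xg_diff influences the simulated outcome
rng = random.Random(1)
X, y = [], []
for _ in range(200):
    xg_diff, poss_diff = rng.gauss(0, 1), rng.gauss(0, 1)
    X.append([xg_diff, poss_diff])
    y.append(1 if rng.random() < sigmoid(1.5 * xg_diff) else 0)

weights, bias = fit_logistic(X, y)
cv_accuracy = cross_val_accuracy(X, y)
print(f"weights={weights}, cross-validated accuracy={cv_accuracy:.2f}")
```

Comparing the cross-validated accuracy with the accuracy on the training data is one simple way to spot overfitting, which matters here because knockout matches are few.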

Exercise 45.3: Squad Optimization

Create an optimization model that selects a 26-player World Cup squad from a pool of 50 players. Balance quality, form, experience, and positional coverage constraints.

Hints:
  • Use linear programming or integer optimization
  • Define minimum players per position
  • Weight different attributes based on tournament needs
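
Before reaching for an integer programming solver, it can help to have a simple baseline. The sketch below is a greedy selector, not the linear programming approach the hints suggest: it fills each position's minimum with the best-scoring players, then tops up the squad with the best players left over. The `Player` fields, the composite `score`, the positional minimums, and the randomly generated 50-player pool are hypothetical. When every player has exactly one position and the only constraints are positional minimums, this greedy choice should already maximize the total score; constraints such as club limits or multi-position players would call for a proper solver.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Player:
    name: str
    position: str  # "GK", "DF", "MF" or "FW"
    score: float   # hypothetical blend of quality, form and experience


def select_squad(players, squad_size=26, min_per_position=None):
    """Greedy squad selection under minimum-per-position constraints."""
    if min_per_position is None:
        min_per_position = {"GK": 3, "DF": 8, "MF": 7, "FW": 5}  # assumed minimums
    if len(players) < squad_size or sum(min_per_position.values()) > squad_size:
        raise ValueError("constraints cannot be satisfied")

    ranked = sorted(players, key=lambda p: p.score, reverse=True)
    squad = []
    # Step 1: cover each position's minimum with its best players
    for position, minimum in min_per_position.items():
        candidates = [p for p in ranked if p.position == position]
        if len(candidates) < minimum:
            raise ValueError(f"not enough players at {position}")
        squad.extend(candidates[:minimum])

    # Step 2: fill the remaining places with the best players not yet chosen
    chosen = set(squad)
    for player in ranked:
        if len(squad) == squad_size:
            break
        if player not in chosen:
            squad.append(player)
            chosen.add(player)
    return squad


# Synthetic pool of 50 players, for illustration only
rng = random.Random(7)
pool = [
    Player(f"{pos}{i:02d}", pos, round(rng.uniform(60, 90), 1))
    for pos, n in {"GK": 6, "DF": 16, "MF": 16, "FW": 12}.items()
    for i in range(n)
]
squad = select_squad(pool)
position_counts = Counter(p.position for p in squad)
print(dict(position_counts), f"total score: {sum(p.score for p in squad):.1f}")
```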

Exercise 45.4: Penalty Shootout Strategy

Using historical penalty data, develop an optimal penalty shootout strategy that considers shot placement, goalkeeper tendencies, and psychological factors.

Hints:
  • Analyze placement success rates by zone
  • Consider the pressure effect of later penalties
  • Model goalkeeper dive patterns
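
The first hint, success rates by placement zone, can be sketched as follows. The input is assumed to be a list of `(zone, scored)` pairs, and the zone labels and records here are made up for illustration rather than taken from real shootouts. Because per-zone samples are usually small, the sketch reports a Wilson score interval alongside each raw rate.

```python
import math
from collections import defaultdict


def zone_success_rates(penalties, z=1.96):
    """Conversion rate per zone with a Wilson score interval (z=1.96 for ~95%)."""
    counts = defaultdict(lambda: [0, 0])  # zone -> [goals, attempts]
    for zone, scored in penalties:
        counts[zone][0] += int(scored)
        counts[zone][1] += 1

    summary = {}
    for zone, (goals, n) in counts.items():
        p = goals / n
        denom = 1 + z * z / n
        centre = (p + z * z / (2 * n)) / denom
        half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
        summary[zone] = {
            "attempts": n,
            "rate": p,
            "low": centre - half_width,
            "high": centre + half_width,
        }
    return summary


# Illustrative records only; not real penalty data
records = (
    [("top_left", True)] * 8 + [("top_left", False)] * 2
    + [("centre", True)] * 3 + [("centre", False)] * 3
)
rates = zone_success_rates(records)
for zone, stats in rates.items():
    print(f"{zone}: {stats['rate']:.0%} of {stats['attempts']} "
          f"(interval {stats['low']:.2f} to {stats['high']:.2f})")
```

Wide, overlapping intervals are a reminder that apparent differences between zones may not be reliable until many more penalties are included.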

Summary

Key Takeaways
  • Tournament Prediction: Effective models combine team ratings, match simulation, and bracket analysis while accounting for the inherent randomness of knockout football
  • Squad Selection: Analytics can optimize player selection by balancing quality, form, experience, and positional coverage within roster constraints
  • Knockout Dynamics: Single-elimination matches require different analytical approaches than league play, with emphasis on key moments and set pieces
  • Tactical Adaptation: Quick scouting and tactical analysis are essential given the limited preparation time between international matches
  • Historical Context: Understanding tournament history provides valuable baselines for expectations and identifies persistent patterns

International football analytics combines many of the techniques covered throughout this book while adding unique challenges around limited sample sizes, squad constraints, and the high-stakes nature of tournament play. Success requires balancing sophisticated modeling with practical understanding of how tournaments unfold.