Chapter 60

Capstone - Complete Analytics System


Learning from Other Sports

Football analytics can accelerate its development by learning from analytics revolutions in other sports. Baseball, basketball, American football, and hockey have all faced similar challenges and developed innovative solutions that translate to the beautiful game.

The Evolution of Sports Analytics

import pandas as pd
from tabulate import tabulate

# Sports Analytics Timeline and Maturity Assessment
sports_analytics_evolution = pd.DataFrame({
    "sport": ["Baseball (MLB)", "Basketball (NBA)", "American Football (NFL)",
              "Hockey (NHL)", "Football (Soccer)"],
    "analytics_start": [1970, 2002, 2010, 2012, 2012],
    "major_breakthrough": ["Sabermetrics/Moneyball", "SportVU Tracking",
                           "NFL Next Gen Stats", "Expected Goals", "Opta/StatsBomb"],
    "current_maturity": ["Very High", "High", "High", "Medium-High", "Medium"],
    "key_metrics": ["WAR, OPS+, FIP", "RAPTOR, EPM, PIE", "EPA, CPOE, Win Rate",
                    "xG, Corsi, GSAA", "xG, xA, PPDA"],
    "tracking_adoption": ["2015 (Statcast)", "2013 (SportVU)", "2016 (Zebra)",
                          "2020 (Puck/Player)", "2019 (Limited)"],
    "open_data": ["High (Baseball Reference)", "Medium (NBA API)",
                  "Low (Limited)", "Medium (NHL API)", "Medium (StatsBomb)"]
})

print("Sports Analytics Maturity Comparison")
print("=" * 80)
print(tabulate(sports_analytics_evolution, headers="keys", tablefmt="grid", showindex=False))

# Summary statistics
earliest = sports_analytics_evolution.loc[
    sports_analytics_evolution["analytics_start"].idxmin()
]
print("\n\nKey Insights:")
print(f"- Earliest analytics adoption: {earliest['sport']} ({earliest['analytics_start']})")
print("- Most recent tracking adoption: Football (2019) and Hockey (2020)")
print("- Football has medium maturity but arguably the most room to grow")

library(tidyverse)
library(gt)

# Sports Analytics Timeline and Maturity Assessment
sports_analytics_evolution <- tibble(
  sport = c("Baseball (MLB)", "Basketball (NBA)", "American Football (NFL)",
            "Hockey (NHL)", "Football (Soccer)"),
  analytics_start = c(1970, 2002, 2010, 2012, 2012),
  major_breakthrough = c("Sabermetrics/Moneyball", "SportVU Tracking",
                         "NFL Next Gen Stats", "Expected Goals", "Opta/StatsBomb"),
  current_maturity = c("Very High", "High", "High", "Medium-High", "Medium"),
  key_metrics = c("WAR, OPS+, FIP", "RAPTOR, EPM, PIE", "EPA, CPOE, Win Rate",
                  "xG, Corsi, GSAA", "xG, xA, PPDA"),
  tracking_adoption = c("2015 (Statcast)", "2013 (SportVU)", "2016 (Zebra)",
                        "2020 (Puck/Player)", "2019 (Limited)"),
  open_data = c("High (Baseball Reference)", "Medium (NBA API)",
                "Low (Limited)", "Medium (NHL API)", "Medium (StatsBomb)")
)

# Create comparison table
sports_analytics_evolution %>%
  gt() %>%
  tab_header(
    title = "Sports Analytics Maturity Comparison",
    subtitle = "Evolution and current state across major sports"
  ) %>%
  cols_label(
    sport = "Sport",
    analytics_start = "Analytics Era Start",
    major_breakthrough = "Key Breakthrough",
    current_maturity = "Maturity Level",
    key_metrics = "Signature Metrics",
    tracking_adoption = "Tracking Data",
    open_data = "Data Accessibility"
  ) %>%
  tab_style(
    style = cell_fill(color = "#E8F5E9"),
    locations = cells_body(rows = sport == "Football (Soccer)")
  )

Lessons from Baseball Analytics

Baseball's sabermetrics revolution offers the longest track record of analytics adoption in professional sports. Key lessons include the importance of isolating individual contribution, the power of market inefficiency exploitation, and the value of open data for ecosystem growth.

Baseball Concepts and Football Equivalents
  • WAR (Wins Above Replacement) → xG Added (Value Above Average)
  • OBP (On-Base Percentage) → xG per Shot (Shot Quality)
  • FIP (Fielding Independent Pitching) → PSxG (Post-Shot Expected Goals)
  • BABIP (Batting Average on Balls in Play) → Conversion Rate (Finishing Variance)
  • wOBA (Weighted On-Base Average) → Non-Penalty xG (Core Attacking Value)

import pandas as pd
import numpy as np

# Concept Translation: Baseball to Football
# WAR (Wins Above Replacement) -> Goals Above Replacement (GAR)

def calculate_football_gar(player_data: pd.DataFrame) -> pd.DataFrame:
    """
    Football adaptation of baseball's WAR concept.
    Calculates Goals Above Replacement for football players.
    """

    # Define replacement level by position
    replacement_levels = {
        "Forward": 0.15,
        "Midfielder": 0.08,
        "Defender": 0.02,
        "Goalkeeper": -0.05
    }

    results = player_data.copy()

    # Get replacement level for each player (0.05 for unlisted positions)
    results["replacement_level"] = (
        results["position"].map(replacement_levels).fillna(0.05)
    )

    # Calculate 90s played
    results["nineties"] = results["minutes"] / 90

    # Offensive contribution (xG + xA above replacement)
    results["offensive_gar"] = (
        (results["npxg_per_90"] + results["xa_per_90"] - results["replacement_level"])
        * results["nineties"]
    )

    # Defensive contribution
    results["defensive_gar"] = (
        (results["tackles_won_per_90"] * 0.05 +
         results["interceptions_per_90"] * 0.04 +
         results["blocks_per_90"] * 0.03)
        * results["nineties"]
    )

    # Possession contribution
    results["possession_gar"] = (
        (results["progressive_passes_per_90"] * 0.02 +
         results["progressive_carries_per_90"] * 0.015)
        * results["nineties"]
    )

    # Total Goals Above Replacement
    results["total_gar"] = (
        results["offensive_gar"] +
        results["defensive_gar"] +
        results["possession_gar"]
    )

    # Convert to Wins (roughly 2.5 goals per win)
    results["war_equivalent"] = results["total_gar"] / 2.5

    return results


# Example player data
example_players = pd.DataFrame({
    "player": ["Elite Forward", "Good Midfielder", "Solid Defender", "Average GK"],
    "position": ["Forward", "Midfielder", "Defender", "Goalkeeper"],
    "minutes": [2800, 3000, 2500, 3200],
    "npxg_per_90": [0.65, 0.15, 0.05, 0.00],
    "xa_per_90": [0.25, 0.20, 0.08, 0.02],
    "tackles_won_per_90": [0.8, 2.1, 3.5, 0.1],
    "interceptions_per_90": [0.5, 1.8, 2.8, 0.2],
    "blocks_per_90": [0.3, 0.8, 1.5, 0.0],
    "progressive_passes_per_90": [2.5, 5.8, 4.2, 3.5],
    "progressive_carries_per_90": [4.2, 3.5, 1.8, 0.1]
})

# Calculate GAR
gar_results = calculate_football_gar(example_players)

print("Goals Above Replacement (GAR) Analysis")
print("=" * 60)
print(gar_results[["player", "position", "total_gar", "war_equivalent"]]
      .sort_values("total_gar", ascending=False)
      .to_string(index=False))

# Breakdown by component
print("\n\nGAR Component Breakdown:")
print(gar_results[["player", "offensive_gar", "defensive_gar", "possession_gar"]]
      .to_string(index=False))

library(tidyverse)

# Concept Translation: Baseball to Football
# WAR (Wins Above Replacement) -> Goals Above Replacement (GAR)

calculate_football_gar <- function(player_data) {
  # Football adaptation of the WAR concept.
  # Expects a `position` column plus the per-90 columns used below.

  player_data %>%
    mutate(
      # Replacement level by position (goals added per 90 for a replacement player)
      replacement_level = case_when(
        position == "Forward" ~ 0.15,
        position == "Midfielder" ~ 0.08,
        position == "Defender" ~ 0.02,
        position == "Goalkeeper" ~ -0.05,
        TRUE ~ 0.05
      ),

      # 90-minute units played
      nineties = minutes / 90,

      # Offensive contribution (xG + xA above replacement)
      offensive_gar = (npxg_per_90 + xa_per_90 - replacement_level) * nineties,

      # Defensive contribution (defensive actions value)
      defensive_gar = (tackles_won_per_90 * 0.05 +
                       interceptions_per_90 * 0.04 +
                       blocks_per_90 * 0.03) * nineties,

      # Possession contribution
      possession_gar = (progressive_passes_per_90 * 0.02 +
                        progressive_carries_per_90 * 0.015) * nineties,

      # Total Goals Above Replacement
      total_gar = offensive_gar + defensive_gar + possession_gar,

      # Convert to Wins (roughly 2.5 goals per win)
      war_equivalent = total_gar / 2.5
    )
}

# Example player data
example_players <- tibble(
  player = c("Elite Forward", "Good Midfielder", "Solid Defender", "Average GK"),
  position = c("Forward", "Midfielder", "Defender", "Goalkeeper"),
  minutes = c(2800, 3000, 2500, 3200),
  npxg_per_90 = c(0.65, 0.15, 0.05, 0.00),
  xa_per_90 = c(0.25, 0.20, 0.08, 0.02),
  tackles_won_per_90 = c(0.8, 2.1, 3.5, 0.1),
  interceptions_per_90 = c(0.5, 1.8, 2.8, 0.2),
  blocks_per_90 = c(0.3, 0.8, 1.5, 0.0),
  progressive_passes_per_90 = c(2.5, 5.8, 4.2, 3.5),
  progressive_carries_per_90 = c(4.2, 3.5, 1.8, 0.1)
)

# Calculate GAR for each player (case_when is vectorized, so no rowwise() needed)
gar_results <- calculate_football_gar(example_players)

print("Goals Above Replacement (GAR) Analysis:")
print(gar_results %>%
        select(player, position, total_gar, war_equivalent) %>%
        arrange(desc(total_gar)))

Key Baseball Lesson: Market Inefficiencies

Billy Beane's Oakland A's found value in on-base percentage while other teams overvalued batting average. In football, similar inefficiencies exist:

  • Players from smaller leagues are often undervalued
  • Defensive contributions are harder to measure, creating value opportunities
  • Age curves differ from perception (peak years vary by position)
  • Set-piece specialists add value not captured in market prices
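
One way to look for inefficiencies like these is to compare a performance estimate with market price. The sketch below is a minimal illustration, not an established valuation method: it assumes a hypothetical table with a `war_equivalent` column (such as the output of the GAR calculation) and an invented `market_value_m` column (price in millions), fits a log-linear value curve, and ranks players by how far their price sits below it. The function name, the column names, and all numbers are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def find_undervalued(players: pd.DataFrame) -> pd.DataFrame:
    """Rank players by the gap between a fitted value curve and market price."""
    out = players.copy()

    # Least-squares fit of log(market value) as a linear function of performance
    slope, intercept = np.polyfit(
        out["war_equivalent"], np.log(out["market_value_m"]), deg=1
    )

    # Price the fitted curve would assign to each player's performance level
    out["expected_value_m"] = np.exp(intercept + slope * out["war_equivalent"])

    # Positive gap: the market pays less than the curve suggests
    out["value_gap_m"] = out["expected_value_m"] - out["market_value_m"]

    return out.sort_values("value_gap_m", ascending=False).reset_index(drop=True)


# Illustrative numbers only (not real players or prices)
players = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "war_equivalent": [1.0, 2.0, 3.0, 4.0, 2.0],
    "market_value_m": [5.0, 10.0, 20.0, 40.0, 4.0],
})

ranked = find_undervalued(players)
print(ranked[["player", "war_equivalent", "market_value_m", "value_gap_m"]]
      .to_string(index=False))
```

In this toy data, Player E matches Player B's output at less than half the price, so Player E tops the ranking. A real analysis would need to control for age, contract length, and league strength before reading a gap as an inefficiency.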

Lessons from Basketball Analytics

Basketball's spatial revolution transformed how teams evaluate players and tactics. The NBA's adoption of tracking data (SportVU, then Second Spectrum) created entirely new analytical frameworks that football is now beginning to adapt.

import pandas as pd
import numpy as np

# Basketball Spatial Analytics Concepts Applied to Football
# Shot Charts -> Shot Maps with xG

def create_football_shot_chart(shots_data: pd.DataFrame) -> pd.DataFrame:
    """
    Apply basketball-style spatial analysis to football shots.
    Creates zones and calculates value above average.
    """

    result = shots_data.copy()

    # Zone classification (inspired by basketball court zones)
    def classify_zone(row):
        if row["distance"] <= 6:
            return "Six-Yard Box"
        elif row["distance"] <= 18 and abs(row["angle"]) < 30:
            return "Central Penalty Area"
        elif row["distance"] <= 18:
            return "Wide Penalty Area"
        elif row["distance"] <= 25 and abs(row["angle"]) < 25:
            return "Central Edge"
        elif abs(row["angle"]) < 25:
            # Outside the box, split by angle so the labels match the geometry
            return "Long Range Central"
        else:
            return "Long Range Wide"

    result["zone"] = result.apply(classify_zone, axis=1)

    # Zone average xG (league benchmarks)
    zone_avg_xg = {
        "Six-Yard Box": 0.45,
        "Central Penalty Area": 0.22,
        "Wide Penalty Area": 0.08,
        "Central Edge": 0.06,
        "Long Range Central": 0.04,
        "Long Range Wide": 0.02
    }

    result["zone_xg_avg"] = result["zone"].map(zone_avg_xg)
    result["xg_above_average"] = result["xg"] - result["zone_xg_avg"]

    return result


# Simulate shot data
np.random.seed(42)
n_shots = 200

shot_data = pd.DataFrame({
    "shot_id": range(1, n_shots + 1),
    "player": np.random.choice(["Player A", "Player B", "Player C"], n_shots),
    "distance": np.random.uniform(3, 35, n_shots),
    "angle": np.random.uniform(-45, 45, n_shots)
})

# Generate xG based on distance
shot_data["xg"] = shot_data["distance"].apply(
    lambda d: np.random.uniform(0.35, 0.65) if d <= 6
    else (np.random.uniform(0.05, 0.35) if d <= 18
          else np.random.uniform(0.01, 0.08))
)

# Simulate goals
shot_data["goal"] = np.random.binomial(1, shot_data["xg"].clip(upper=0.8))

# Apply zone analysis
shot_analysis = create_football_shot_chart(shot_data)

# Summarize by zone (like basketball shot chart analysis)
zone_summary = shot_analysis.groupby("zone").agg(
    shots=("shot_id", "count"),
    goals=("goal", "sum"),
    total_xg=("xg", "sum"),
    conversion_rate=("goal", "mean"),
    avg_xg=("xg", "mean")
).reset_index()

zone_summary["xg_outperformance"] = zone_summary["conversion_rate"] - zone_summary["avg_xg"]
zone_summary = zone_summary.sort_values("avg_xg", ascending=False)

print("Shot Zone Analysis (Basketball-Style)")
print("=" * 70)
print(zone_summary.to_string(index=False))

# Player shot selection quality
player_shot_quality = shot_analysis.groupby("player").agg(
    shots=("shot_id", "count"),
    avg_shot_xg=("xg", "mean")
).reset_index()

player_shot_quality["shot_quality_percentile"] = (
    player_shot_quality["avg_shot_xg"].rank(pct=True) * 100
)

print("\n\nPlayer Shot Selection Quality:")
print(player_shot_quality.to_string(index=False))

library(tidyverse)

# Basketball Spatial Analytics Concepts Applied to Football
# Shot Charts -> Shot Maps with xG

create_football_shot_chart <- function(shots_data) {
  # Basketball pioneered spatial shot analysis
  # Football adaptation with expected goals context

  shots_data %>%
    mutate(
      # Zone classification (inspired by basketball court zones)
      zone = case_when(
        distance <= 6 ~ "Six-Yard Box",
        distance <= 18 & abs(angle) < 30 ~ "Central Penalty Area",
        distance <= 18 ~ "Wide Penalty Area",
        distance <= 25 & abs(angle) < 25 ~ "Central Edge",
        # Outside the box, split by angle so the labels match the geometry
        abs(angle) < 25 ~ "Long Range Central",
        TRUE ~ "Long Range Wide"
      ),

      # Value added vs league average (like basketball eFG% vs average)
      zone_xg_avg = case_when(
        zone == "Six-Yard Box" ~ 0.45,
        zone == "Central Penalty Area" ~ 0.22,
        zone == "Wide Penalty Area" ~ 0.08,
        zone == "Central Edge" ~ 0.06,
        zone == "Long Range Central" ~ 0.04,
        TRUE ~ 0.02
      ),

      xg_above_average = xg - zone_xg_avg
    )
}

# Simulate shot data
set.seed(42)
shot_data <- tibble(
  shot_id = 1:200,
  player = sample(c("Player A", "Player B", "Player C"), 200, replace = TRUE),
  distance = runif(200, 3, 35),
  angle = runif(200, -45, 45),
  xg = case_when(
    distance <= 6 ~ runif(200, 0.35, 0.65),
    distance <= 18 ~ runif(200, 0.05, 0.35),
    TRUE ~ runif(200, 0.01, 0.08)
  )[1:200],
  goal = rbinom(200, 1, prob = pmin(xg, 0.8))
)

# Apply zone analysis
shot_analysis <- create_football_shot_chart(shot_data)

# Summarize by zone (like basketball shot chart analysis)
zone_summary <- shot_analysis %>%
  group_by(zone) %>%
  summarise(
    shots = n(),
    goals = sum(goal),
    total_xg = sum(xg),
    conversion_rate = mean(goal),
    avg_xg = mean(xg),
    xg_outperformance = mean(goal) - mean(xg),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_xg))

print("Shot Zone Analysis (Basketball-Style):")
print(zone_summary)

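# Player shot selection quality (like basketball shot selection metrics)
player_shot_quality <- shot_analysis %>%
  group_by(player) %>%
  summarise(
    shots = n(),
    avg_shot_xg = mean(xg),
    .groups = "drop"
  ) %>%
  # Rank across players after summarising; inside summarise() percent_rank()
  # would see a single value per group and return NaN
  mutate(shot_quality_percentile = percent_rank(avg_shot_xg) * 100)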

print("\nPlayer Shot Selection Quality:")
print(player_shot_quality)

Plus-Minus and Impact Metrics

Basketball's plus-minus metrics (RAPM, RPM, EPM) measure how much better a team performs when a player is on court. Football's limited substitutions, low scoring, and interdependent positions make this harder, but adapted versions can still provide value.

import pandas as pd
import numpy as np

# Adapting Basketball Plus-Minus to Football
# Simplified adjusted plus-minus (a full RAPM would fit a regularized regression)

def calculate_football_plus_minus(match_segments: pd.DataFrame) -> pd.DataFrame:
    """
    Football adaptation of basketball's plus-minus metrics.
    Uses xG differential instead of goals due to low-scoring nature.
    """

    results = match_segments.groupby("player_id").agg(
        minutes_on=("segment_minutes", "sum"),
        total_team_xg=("team_xg", "sum"),
        total_opponent_xg=("opponent_xg", "sum"),
        avg_teammate_quality=("teammate_avg_rating", "mean"),
        avg_opponent_quality=("opponent_avg_rating", "mean")
    ).reset_index()

    # Calculate per-90 metrics
    results["nineties"] = results["minutes_on"] / 90
    results["xg_for_per_90"] = results["total_team_xg"] / results["nineties"]
    results["xg_against_per_90"] = results["total_opponent_xg"] / results["nineties"]
    results["xg_diff_per_90"] = results["xg_for_per_90"] - results["xg_against_per_90"]

    # Adjusted plus-minus (control for teammate/opponent quality)
    results["adjusted_xg_diff"] = (
        results["xg_diff_per_90"] -
        (results["avg_teammate_quality"] - 8) * 0.1 +
        (results["avg_opponent_quality"] - 8) * 0.1
    )

    return results


# Simulate match segment data
np.random.seed(123)
n_segments = 500

match_segments = pd.DataFrame({
    "segment_id": range(1, n_segments + 1),
    "player_id": np.random.choice([f"Player_{i}" for i in range(1, 21)], n_segments),
    "segment_minutes": np.random.uniform(5, 45, n_segments),
    # xG is continuous, so draw it from gamma distributions (means 0.40 and 0.35)
    "team_xg": np.random.gamma(2.0, 0.2, n_segments),
    "opponent_xg": np.random.gamma(2.0, 0.175, n_segments),
    "teammate_avg_rating": np.random.normal(7.5, 0.8, n_segments),
    "opponent_avg_rating": np.random.normal(7.5, 0.8, n_segments)
})

# Calculate plus-minus
plus_minus_results = calculate_football_plus_minus(match_segments)

# Filter for minimum minutes and show top performers
qualified = plus_minus_results[plus_minus_results["minutes_on"] >= 400]
top_performers = qualified.nlargest(10, "adjusted_xg_diff")

print("Football Plus-Minus Rankings (xG Differential per 90)")
print("=" * 60)
print(top_performers[["player_id", "minutes_on", "xg_diff_per_90", "adjusted_xg_diff"]]
      .to_string(index=False))

library(tidyverse)

# Adapting Basketball Plus-Minus to Football
# Simplified adjusted plus-minus (a full RAPM would fit a regularized regression)

calculate_football_plus_minus <- function(match_segments) {
  # Football adaptation requires different approach due to:
  # 1. Fewer substitutions (typically 3-5 per match)
  # 2. Lower scoring (goals vs points)
  # 3. More interdependent positions

  # Use xG differential instead of goal differential
  # Break matches into segments based on substitutions

  match_segments %>%
    group_by(player_id) %>%
    summarise(
      minutes_on = sum(segment_minutes),
      xg_for_per_90 = sum(team_xg) / (sum(segment_minutes) / 90),
      xg_against_per_90 = sum(opponent_xg) / (sum(segment_minutes) / 90),
      xg_diff_per_90 = xg_for_per_90 - xg_against_per_90,

      # Control for teammate/opponent quality (simplified RAPM)
      avg_teammate_quality = mean(teammate_avg_rating),
      avg_opponent_quality = mean(opponent_avg_rating),

      # Adjusted plus-minus
      adjusted_xg_diff = xg_diff_per_90 -
                         (avg_teammate_quality - 8) * 0.1 +
                         (avg_opponent_quality - 8) * 0.1,
      .groups = "drop"
    )
}

# Simulate match segment data
set.seed(123)
n_segments <- 500

match_segments <- tibble(
  segment_id = 1:n_segments,
  player_id = sample(paste0("Player_", 1:20), n_segments, replace = TRUE),
  segment_minutes = runif(n_segments, 5, 45),
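  # xG is continuous, so draw it from gamma distributions (means 0.40 and 0.35)
  team_xg = rgamma(n_segments, shape = 2, scale = 0.2),
  opponent_xg = rgamma(n_segments, shape = 2, scale = 0.175),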
  teammate_avg_rating = rnorm(n_segments, mean = 7.5, sd = 0.8),
  opponent_avg_rating = rnorm(n_segments, mean = 7.5, sd = 0.8)
)

# Calculate plus-minus
plus_minus_results <- calculate_football_plus_minus(match_segments)

# Top performers
print("Football Plus-Minus Rankings (xG Differential per 90):")
print(plus_minus_results %>%
        filter(minutes_on >= 400) %>%
        arrange(desc(adjusted_xg_diff)) %>%
        head(10) %>%
        select(player_id, minutes_on, xg_diff_per_90, adjusted_xg_diff))
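
The adjustment above subtracts a fixed correction for teammate and opponent ratings. The "regularized" part of RAPM refers to something different: a ridge regression that estimates every player's effect at once from who was on the pitch in each segment. The following is a minimal sketch of that idea on synthetic data, under stated assumptions: the function name `fit_rapm`, the penalty `lam = 10`, the three-a-side segments, and the "true" effects are all invented for illustration, and a real model would use full eleven-player lineups across a league.

```python
import numpy as np


def fit_rapm(X: np.ndarray, y: np.ndarray, minutes: np.ndarray,
             lam: float = 10.0) -> np.ndarray:
    """
    Ridge-regularized adjusted plus-minus.

    X: (segments x players) matrix, +1 if the player is on the pitch for the
       reference side, -1 if on the other side, 0 if not involved.
    y: xG differential per segment, from the reference side's perspective.
    minutes: segment lengths, used as regression weights.
    lam: ridge penalty (hypothetical value; normally tuned by cross-validation).
    """
    w = minutes / minutes.mean()                      # normalized weights
    A = X.T @ (w[:, None] * X) + lam * np.eye(X.shape[1])
    b = X.T @ (w * y)
    return np.linalg.solve(A, b)                      # closed-form ridge solution


# Synthetic check: can the fit recover known player effects?
rng = np.random.default_rng(0)
n_players, n_segments, side = 10, 400, 3
true_effect = np.linspace(-0.4, 0.4, n_players)       # invented ground truth

X = np.zeros((n_segments, n_players))
for s in range(n_segments):
    on_pitch = rng.choice(n_players, size=2 * side, replace=False)
    X[s, on_pitch[:side]] = 1.0                        # reference side
    X[s, on_pitch[side:]] = -1.0                       # opposing side

minutes = rng.uniform(5, 45, n_segments)
y = X @ true_effect + rng.normal(0, 0.5, n_segments)   # noisy xG differential

estimates = fit_rapm(X, y, minutes)
correlation = np.corrcoef(estimates, true_effect)[0, 1]
print(f"Correlation between estimated and true effects: {correlation:.2f}")
```

The penalty shrinks estimates toward zero, which matters most for players with few minutes or for teammates who almost always appear together. That situation is far more common in football than in basketball, so the choice of `lam` tends to have a larger influence on the results.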

Lessons from Hockey Analytics

Hockey analytics shares many characteristics with football: low-scoring games, continuous flow, and positional fluidity. Hockey's expected goals models and possession metrics (Corsi, Fenwick) have directly influenced football analytics development.

import pandas as pd
import numpy as np

# Hockey Analytics Concepts Applied to Football
# Corsi -> Territorial Dominance Metrics

def calculate_football_corsi(match_data: pd.DataFrame) -> pd.DataFrame:
    """
    Apply hockey's Corsi concept to football.
    Measures territorial dominance through attacking actions.
    """

    result = match_data.copy()

    # Football "Corsi" - territorial actions
    result["team_corsi"] = (
        result["shots"] + result["shots_blocked"] + result["crosses_attempted"] +
        result["final_third_entries"] + result["box_entries"]
    )

    result["opponent_corsi"] = (
        result["opponent_shots"] + result["opponent_shots_blocked"] +
        result["opponent_crosses_attempted"] + result["opponent_final_third_entries"] +
        result["opponent_box_entries"]
    )

    # Corsi For percentage (CF%)
    result["corsi_for_pct"] = (
        result["team_corsi"] / (result["team_corsi"] + result["opponent_corsi"]) * 100
    )

    # High-danger chances (like hockey slot shots)
    result["high_danger_cf"] = result["shots_inside_box"] + result["headers_inside_box"]
    result["high_danger_ca"] = (
        result["opponent_shots_inside_box"] + result["opponent_headers_inside_box"]
    )
    result["high_danger_cf_pct"] = (
        result["high_danger_cf"] / (result["high_danger_cf"] + result["high_danger_ca"]) * 100
    )

    # PDO (shooting % + save %) - regression indicator, baseline ~100
    result["shooting_pct"] = result["goals"] / result["shots"] * 100
    result["save_pct"] = (1 - result["opponent_goals"] / result["opponent_shots"]) * 100
    result["pdo"] = result["shooting_pct"] + result["save_pct"]

    return result


# Generate example match data
np.random.seed(42)
n_matches = 20

match_data = pd.DataFrame({
    "match_id": range(1, n_matches + 1),
    "team": "Example FC",
    "shots": np.random.poisson(14, n_matches),
    "shots_blocked": np.random.poisson(4, n_matches),
    "crosses_attempted": np.random.poisson(18, n_matches),
    "final_third_entries": np.random.poisson(35, n_matches),
    "box_entries": np.random.poisson(12, n_matches),
    "shots_inside_box": np.random.poisson(8, n_matches),
    "headers_inside_box": np.random.poisson(2, n_matches),
    "goals": np.random.poisson(1.5, n_matches),
    "opponent_shots": np.random.poisson(12, n_matches),
    "opponent_shots_blocked": np.random.poisson(3, n_matches),
    "opponent_crosses_attempted": np.random.poisson(15, n_matches),
    "opponent_final_third_entries": np.random.poisson(30, n_matches),
    "opponent_box_entries": np.random.poisson(10, n_matches),
    "opponent_shots_inside_box": np.random.poisson(6, n_matches),
    "opponent_headers_inside_box": np.random.poisson(1, n_matches),
    "opponent_goals": np.random.poisson(1.2, n_matches)
})

# Apply hockey-style analysis
hockey_style_analysis = calculate_football_corsi(match_data)

# Season summary
print("Season Summary (Hockey-Style Metrics)")
print("=" * 50)
print(f"Matches: {len(hockey_style_analysis)}")
print(f"Avg Corsi For %: {hockey_style_analysis['corsi_for_pct'].mean():.1f}%")
print(f"Avg High Danger CF%: {hockey_style_analysis['high_danger_cf_pct'].mean():.1f}%")
print(f"Avg PDO: {hockey_style_analysis['pdo'].mean():.1f}")
print(f"Goals For: {hockey_style_analysis['goals'].sum()}")
print(f"Goals Against: {hockey_style_analysis['opponent_goals'].sum()}")

# PDO analysis
print("\n\nPDO Analysis (values far from 100 = regression candidate):")
hockey_style_analysis["pdo_deviation"] = abs(hockey_style_analysis["pdo"] - 100)
print(hockey_style_analysis.nlargest(5, "pdo_deviation")[
    ["match_id", "corsi_for_pct", "pdo", "goals", "opponent_goals"]
].to_string(index=False))

library(tidyverse)

# Hockey Analytics Concepts Applied to Football
# Corsi -> Territorial Dominance Metrics

# Hockey: Corsi = All shot attempts (shots on goal + missed + blocked)
# Football adaptation: All attacking actions in final third

calculate_football_corsi <- function(match_data) {
  match_data %>%
    mutate(
      # Football "Corsi" - territorial actions
      team_corsi = shots + shots_blocked + crosses_attempted +
                   final_third_entries + box_entries,
      opponent_corsi = opponent_shots + opponent_shots_blocked +
                       opponent_crosses_attempted + opponent_final_third_entries +
                       opponent_box_entries,

      # Corsi For percentage (CF%)
      corsi_for_pct = team_corsi / (team_corsi + opponent_corsi) * 100,

      # Hockey-style expected goals model adaptations
      # High-danger chances (like hockey slot shots)
      high_danger_cf = shots_inside_box + headers_inside_box,
      high_danger_ca = opponent_shots_inside_box + opponent_headers_inside_box,
      high_danger_cf_pct = high_danger_cf / (high_danger_cf + high_danger_ca) * 100,

      # PDO (shooting % + save %) - regression indicator
      shooting_pct = goals / shots * 100,
      save_pct = (1 - opponent_goals / opponent_shots) * 100,
      pdo = shooting_pct + save_pct  # Both terms are percentages, so baseline is ~100
    )
}

# Example match data (seeded for reproducibility)
set.seed(42)
match_data <- tibble(
  match_id = 1:20,
  team = "Example FC",
  shots = rpois(20, 14),
  shots_blocked = rpois(20, 4),
  crosses_attempted = rpois(20, 18),
  final_third_entries = rpois(20, 35),
  box_entries = rpois(20, 12),
  shots_inside_box = rpois(20, 8),
  headers_inside_box = rpois(20, 2),
  goals = rpois(20, 1.5),
  opponent_shots = rpois(20, 12),
  opponent_shots_blocked = rpois(20, 3),
  opponent_crosses_attempted = rpois(20, 15),
  opponent_final_third_entries = rpois(20, 30),
  opponent_box_entries = rpois(20, 10),
  opponent_shots_inside_box = rpois(20, 6),
  opponent_headers_inside_box = rpois(20, 1),
  opponent_goals = rpois(20, 1.2)
)

# Apply hockey-style analysis
hockey_style_analysis <- calculate_football_corsi(match_data)

# Season summary
season_summary <- hockey_style_analysis %>%
  summarise(
    matches = n(),
    avg_corsi_for_pct = mean(corsi_for_pct, na.rm = TRUE),
    avg_high_danger_cf_pct = mean(high_danger_cf_pct, na.rm = TRUE),
    avg_pdo = mean(pdo, na.rm = TRUE),
    goals_for = sum(goals),
    goals_against = sum(opponent_goals)
  )

print("Season Summary (Hockey-Style Metrics):")
print(season_summary)

# PDO analysis - teams with extreme PDO likely to regress
print("\nPDO Analysis (values far from 100 indicate luck/regression candidate):")
print(hockey_style_analysis %>%
        select(match_id, corsi_for_pct, pdo, goals, opponent_goals) %>%
        arrange(desc(abs(pdo - 100))) %>%
        head(5))

Key Hockey Concept: PDO

PDO (named after the online handle of the hockey commenter who proposed it) is the sum of a team's shooting percentage and save percentage. Across a league, every goal scored is also a goal conceded, so the average is 100 by construction, and in hockey team PDO regresses strongly toward 100 over time. Teams with a high PDO are often "lucky" and due for regression.

Football Application: Teams that score well above their xG (or concede well below their xG against) for extended periods are likely outperforming sustainable levels. This is valuable for betting markets and projection systems.
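
As a rough illustration of this check, the sketch below computes season-level PDO and relative xG over-performance for a few teams and flags those beyond a cut-off. The column names, the 15% threshold, and the team totals are assumptions made for the example, not calibrated values.

```python
import pandas as pd


def regression_flags(teams: pd.DataFrame, threshold: float = 0.15) -> pd.DataFrame:
    """Flag teams whose goals deviate from xG by more than `threshold` (relative)."""
    out = teams.copy()

    # PDO: shooting % plus save %, both on a 0-100 scale (baseline ~100)
    out["shooting_pct"] = out["goals"] / out["shots"] * 100
    out["save_pct"] = (1 - out["goals_against"] / out["shots_against"]) * 100
    out["pdo"] = out["shooting_pct"] + out["save_pct"]

    # Relative over- or under-performance against expected goals
    out["xg_overperformance"] = (out["goals"] - out["xg"]) / out["xg"]
    out["regression_candidate"] = out["xg_overperformance"].abs() > threshold

    return out


# Invented season totals for three hypothetical teams
teams = pd.DataFrame({
    "team": ["Team A", "Team B", "Team C"],
    "goals": [30, 20, 12],
    "xg": [22.0, 21.0, 18.0],
    "shots": [200, 210, 190],
    "goals_against": [15, 22, 25],
    "shots_against": [180, 200, 190],
})

flags = regression_flags(teams)
print(flags[["team", "pdo", "xg_overperformance", "regression_candidate"]]
      .to_string(index=False))
```

Team A (scoring well above xG) and Team C (scoring well below) are flagged, while Team B is not. The threshold should shrink as the sample grows, since a 15% gap over five matches is much weaker evidence than the same gap over a full season.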

Lessons from American Football Analytics

American football's discrete play structure has enabled sophisticated play-by-play analysis. Expected Points Added (EPA) and Win Probability models offer frameworks applicable to football's set pieces and game state analysis.

import pandas as pd
import numpy as np

# American Football Concepts Applied to Soccer
# EPA (Expected Points Added) -> Expected Goals State Model

def calculate_football_epa(events_data: pd.DataFrame) -> pd.DataFrame:
    """
    Adapt NFL's Expected Points Added to football.
    Measures value added by actions based on field position changes.
    """

    result = events_data.copy()

    # Expected goals from current state (pre-action)
    zone_xg = {
        "opponent_box": 0.15,
        "opponent_final_third": 0.05,
        "middle_third": 0.02,
        "own_final_third": 0.005
    }

    result_zone_xg = {
        "goal": 1.0,
        "opponent_box": 0.15,
        "opponent_final_third": 0.05,
        "middle_third": 0.02,
        "own_final_third": 0.005,
        "turnover": -0.03
    }

    result["pre_xg_state"] = result["pitch_zone"].map(zone_xg).fillna(0.01)
    result["post_xg_state"] = result["result_zone"].map(result_zone_xg).fillna(0.01)

    # EPA equivalent: Change in expected goals
    result["xg_added"] = result["post_xg_state"] - result["pre_xg_state"]

    # Game state adjustment
    def game_state_multiplier(row):
        if row["minute"] >= 80 and row["goal_diff"] < 0:
            return 1.3  # Chasing late
        elif row["minute"] >= 80 and row["goal_diff"] > 0:
            return 0.7  # Protecting lead
        elif row["goal_diff"] <= -2:
            return 1.2  # Need goals
        elif row["goal_diff"] >= 2:
            return 0.8  # Comfortable
        return 1.0

    result["game_state_multiplier"] = result.apply(game_state_multiplier, axis=1)
    result["adjusted_xg_added"] = result["xg_added"] * result["game_state_multiplier"]

    return result


def calculate_win_probability(minute: int, goal_diff: int, home: bool = True) -> dict:
    """
    NFL-style win probability model adapted for football.
    """

    remaining_minutes = 90 - minute
    goals_per_minute = 0.014  # per team: ~1.25 goals per 90 (~2.5 per match)

    # Expected goals remaining
    home_factor = 1.1 if home else 0.9
    team_expected = remaining_minutes * goals_per_minute * home_factor
    opponent_expected = remaining_minutes * goals_per_minute * (2 - home_factor)

    # Current advantage in expected final goals
    expected_final_diff = goal_diff + team_expected - opponent_expected

    # Convert to win probability (logistic function)
    win_prob = 1 / (1 + np.exp(-expected_final_diff * 0.8))
    draw_prob = max(0, 0.25 - abs(expected_final_diff) * 0.05) * \
                (1 - abs(minute - 45) / 90)

    return {
        "win": win_prob * (1 - draw_prob),
        "draw": draw_prob,
        "loss": (1 - win_prob) * (1 - draw_prob)
    }


# Example: Win probability at different game states
scenarios = [
    {"scenario": "Start (0-0)", "minute": 0, "goal_diff": 0},
    {"scenario": "Down 1 at HT", "minute": 45, "goal_diff": -1},
    {"scenario": "Up 1 at 75'", "minute": 75, "goal_diff": 1},
    {"scenario": "Down 2 at 80'", "minute": 80, "goal_diff": -2}
]

print("Win Probability by Game State")
print("=" * 60)
print(f"{'Scenario':<20} {'Win%':>10} {'Draw%':>10} {'Loss%':>10}")
print("-" * 60)

for s in scenarios:
    probs = calculate_win_probability(s["minute"], s["goal_diff"])
    print(f"{s['scenario']:<20} {probs['win']*100:>9.1f}% {probs['draw']*100:>9.1f}% {probs['loss']*100:>9.1f}%")
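
The heuristic above uses a logistic curve with a separate draw term, so its three outcomes are not derived from a single scoring model. A common alternative is to treat each side's remaining goals as independent Poisson counts and add up the probability of every possible scoreline. The sketch below shows that approach under stated assumptions: the function names and the default scoring rates (0.015 and 0.013 goals per minute) are illustrative, and independence between the two teams' scoring is a simplification.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals when lam goals are expected."""
    return exp(-lam) * lam ** k / factorial(k)


def poisson_win_probability(minute: int, goal_diff: int,
                            team_rate: float = 0.015,
                            opp_rate: float = 0.013,
                            max_goals: int = 10) -> dict:
    """
    Win/draw/loss probabilities from independent Poisson scoring.

    team_rate and opp_rate are goals per minute (assumed values for this sketch).
    max_goals truncates the scoreline grid; probabilities are renormalized.
    """
    remaining = max(0, 90 - minute)
    lam_team = remaining * team_rate
    lam_opp = remaining * opp_rate

    win = draw = loss = 0.0
    for i in range(max_goals + 1):          # further goals for the team
        for j in range(max_goals + 1):      # further goals for the opponent
            p = poisson_pmf(i, lam_team) * poisson_pmf(j, lam_opp)
            final_diff = goal_diff + i - j
            if final_diff > 0:
                win += p
            elif final_diff == 0:
                draw += p
            else:
                loss += p

    total = win + draw + loss               # slightly below 1 due to truncation
    return {"win": win / total, "draw": draw / total, "loss": loss / total}


for label, minute, diff in [("Start (0-0)", 0, 0), ("Up 1 at 75'", 75, 1),
                            ("Level at 90'", 90, 0)]:
    probs = poisson_win_probability(minute, diff)
    print(f"{label:<15} win={probs['win']:.3f} "
          f"draw={probs['draw']:.3f} loss={probs['loss']:.3f}")
```

With no time remaining and the scores level, this model returns a certain draw, which is a useful sanity check for any win probability model. Game-state effects (teams chasing or protecting a lead) could be added by making the two rates depend on `goal_diff` and `minute`.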

library(tidyverse)

# American Football Concepts Applied to Soccer
# EPA (Expected Points Added) -> Expected Goals State Model

# NFL EPA: Value of play based on down, distance, field position
# Football adaptation: Value based on pitch position, game state, time

calculate_football_epa <- function(events_data) {
  events_data %>%
    mutate(
      # Expected goals from current state (pre-action)
      pre_xg_state = case_when(
        pitch_zone == "opponent_box" ~ 0.15,
        pitch_zone == "opponent_final_third" ~ 0.05,
        pitch_zone == "middle_third" ~ 0.02,
        pitch_zone == "own_final_third" ~ 0.005,
        TRUE ~ 0.01
      ),

      # Expected goals from resulting state (post-action)
      post_xg_state = case_when(
        result_zone == "goal" ~ 1.0,
        result_zone == "opponent_box" ~ 0.15,
        result_zone == "opponent_final_third" ~ 0.05,
        result_zone == "middle_third" ~ 0.02,
        result_zone == "own_final_third" ~ 0.005,
        result_zone == "turnover" ~ -0.03,  # Opponent possession value
        TRUE ~ 0.01
      ),

      # EPA equivalent: Change in expected goals
      xg_added = post_xg_state - pre_xg_state,

      # Adjust for game state (like NFL situation-adjusted EPA)
      game_state_multiplier = case_when(
        minute >= 80 & goal_diff < 0 ~ 1.3,   # Chasing late
        minute >= 80 & goal_diff > 0 ~ 0.7,   # Protecting lead
        goal_diff <= -2 ~ 1.2,                 # Need goals
        goal_diff >= 2 ~ 0.8,                  # Comfortable
        TRUE ~ 1.0
      ),

      adjusted_xg_added = xg_added * game_state_multiplier
    )
}

# Win Probability Model (NFL-style)
calculate_win_probability <- function(minute, goal_diff, home = TRUE) {
  # Simplified win probability based on game state
  # Based on historical data relationships

  remaining_minutes <- 90 - minute
  goals_per_minute <- 0.028  # League average ~2.5 goals per game, both teams combined
  team_rate <- goals_per_minute / 2  # Per-team scoring rate

  # Expected goals remaining
  team_expected <- remaining_minutes * team_rate * ifelse(home, 1.1, 0.9)
  opponent_expected <- remaining_minutes * team_rate * ifelse(home, 0.9, 1.1)

  # Current advantage in "expected final goals"
  expected_final_diff <- goal_diff + team_expected - opponent_expected

  # Convert to win probability (using logistic function)
  win_prob <- 1 / (1 + exp(-expected_final_diff * 0.8))
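  # Draw share shrinks as the expected margin grows; the time factor
  # peaks at half-time and falls to 0.5 at kickoff and full time
  draw_prob <- max(0, 0.25 - abs(expected_final_diff) * 0.05) *
               (1 - abs(minute - 45) / 90)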

  list(
    win = win_prob * (1 - draw_prob),
    draw = draw_prob,
    loss = (1 - win_prob) * (1 - draw_prob)
  )
}

# Example: Win probability at different game states
scenarios <- tibble(
  scenario = c("Start (0-0)", "Down 1 at HT", "Up 1 at 75'", "Down 2 at 80'"),
  minute = c(0, 45, 75, 80),
  goal_diff = c(0, -1, 1, -2)
)

win_probs <- scenarios %>%
  rowwise() %>%
  mutate(
    probs = list(calculate_win_probability(minute, goal_diff)),
    win = probs$win,
    draw = probs$draw,
    loss = probs$loss
  ) %>%
  ungroup() %>%
  select(scenario, minute, goal_diff, win, draw, loss)

print("Win Probability by Game State:")
print(win_probs)

Building a Unified Cross-Sport Framework

The best football analytics practitioners draw insights from multiple sports. Here we present a unified framework that synthesizes lessons from baseball, basketball, hockey, and American football into a cohesive analytical approach.

script

import pandas as pd
import numpy as np

# Unified Cross-Sport Analytics Framework for Football

class CrossSportAnalytics:
    """
    Unified framework synthesizing analytics concepts from
    baseball, basketball, hockey, and American football.
    """

    def __init__(self):
        print("Cross-Sport Analytics Framework initialized")

    def calculate_total_value(self, player_data: pd.DataFrame) -> pd.DataFrame:
        """
        Baseball-style comprehensive value metric (WAR equivalent).
        """
        result = player_data.copy()

        # Offensive value (baseball-style linear weights)
        result["offensive_value"] = (
            result["npxg_per_90"] * 0.4 +
            result["xa_per_90"] * 0.3 +
            result["key_passes_per_90"] * 0.05
        )

        # Defensive value
        result["defensive_value"] = (
            result["tackles_won_per_90"] * 0.05 +
            result["interceptions_per_90"] * 0.04 +
            result["pressures_per_90"] * 0.02
        )

        # Progression value
        result["progression_value"] = (
            result["progressive_passes_per_90"] * 0.02 +
            result["progressive_carries_per_90"] * 0.015
        )

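        # Total value scaled by 90s played. No replacement-level baseline is
        # subtracted here, so this is a simplified stand-in for true GAR.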
        result["total_gar"] = (
            (result["offensive_value"] + result["defensive_value"] +
             result["progression_value"]) * (result["minutes"] / 90)
        )

        return result

    def calculate_spatial_efficiency(self, shot_data: pd.DataFrame) -> pd.DataFrame:
        """
        Basketball-style spatial efficiency analysis.
        """
        shot_data = shot_data.copy()

        def classify_zone(distance):
            if distance <= 6:
                return "high_value"
            elif distance <= 18:
                return "medium_value"
            return "low_value"

        shot_data["zone"] = shot_data["distance"].apply(classify_zone)

        return shot_data.groupby(["player_id", "zone"]).agg(
            shots=("shot_id", "count"),
            xg=("xg", "sum"),
            goals=("goal", "sum")
        ).reset_index().assign(
            efficiency=lambda df: df["goals"] / df["xg"].replace(0, np.nan)
        )

    def calculate_territorial_dominance(self, match_data: pd.DataFrame) -> pd.DataFrame:
        """
        Hockey-style territorial dominance metrics (Corsi adaptation).
        """
        result = match_data.copy()

        result["team_territory"] = (
            result["final_third_entries"] + result["box_entries"] + result["shots"]
        )
        result["opponent_territory"] = (
            result["opp_final_third_entries"] + result["opp_box_entries"] +
            result["opp_shots"]
        )
        result["territorial_cf"] = (
            result["team_territory"] /
            (result["team_territory"] + result["opponent_territory"])
        )

        # PDO for regression analysis: shooting % plus save %, both on a
        # 0-100 scale, so the league as a whole sums to 100
        result["pdo"] = (
            (result["goals"] / result["shots"] * 100) +
            ((result["opp_shots"] - result["opp_goals"]) / result["opp_shots"] * 100)
        )

        return result

    def calculate_game_state_value(self, events_data: pd.DataFrame) -> pd.DataFrame:
        """
        NFL-style game state value model.
        """
        result = events_data.copy()

        # State-based expected value
        zone_values = {
            "opponent_box": 0.15,
            "opponent_third": 0.05,
            "middle_third": 0.02
        }
        result["state_value"] = result["zone"].map(zone_values).fillna(0.005)

        # Game state adjustment
        def state_multiplier(row):
            if row["minute"] >= 75 and row["goal_diff"] < 0:
                return 1.3
            elif row["minute"] >= 75 and row["goal_diff"] > 0:
                return 0.7
            return 1.0

        result["state_multiplier"] = result.apply(state_multiplier, axis=1)
        result["adjusted_value"] = result["state_value"] * result["state_multiplier"]

        return result


# Usage example
framework = CrossSportAnalytics()

# Example player data
sample_player = pd.DataFrame({
    "player": ["Star Forward"],
    "minutes": [2800],
    "npxg_per_90": [0.55],
    "xa_per_90": [0.18],
    "key_passes_per_90": [2.1],
    "tackles_won_per_90": [0.9],
    "interceptions_per_90": [0.5],
    "pressures_per_90": [18.5],
    "progressive_passes_per_90": [3.2],
    "progressive_carries_per_90": [5.8]
})

result = framework.calculate_total_value(sample_player)

print("Player Total Value (GAR) Analysis")
print("=" * 60)
print(result[["player", "offensive_value", "defensive_value",
              "progression_value", "total_gar"]].to_string(index=False))

library(tidyverse)
library(R6)

# Unified Cross-Sport Analytics Framework for Football
CrossSportAnalytics <- R6Class("CrossSportAnalytics",
  public = list(
    player_data = NULL,
    match_data = NULL,

    initialize = function() {
      message("Cross-Sport Analytics Framework initialized")
    },

    # Baseball: WAR-style comprehensive value metric
    calculate_total_value = function(player_data) {
      player_data %>%
        mutate(
          # Offensive value (baseball-style linear weights)
          offensive_value = npxg_per_90 * 0.4 + xa_per_90 * 0.3 +
                           key_passes_per_90 * 0.05,

          # Defensive value
          defensive_value = tackles_won_per_90 * 0.05 +
                           interceptions_per_90 * 0.04 +
                           pressures_per_90 * 0.02,

          # Progression value
          progression_value = progressive_passes_per_90 * 0.02 +
                             progressive_carries_per_90 * 0.015,

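          # Total value scaled by 90s played. No replacement-level baseline is
          # subtracted here, so this is a simplified stand-in for true GAR.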
          total_gar = (offensive_value + defensive_value + progression_value) *
                      (minutes / 90)
        )
    },

    # Basketball: Spatial efficiency analysis
    calculate_spatial_efficiency = function(shot_data) {
      shot_data %>%
        mutate(
          zone = case_when(
            distance <= 6 ~ "high_value",
            distance <= 18 ~ "medium_value",
            TRUE ~ "low_value"
          )
        ) %>%
        group_by(player_id, zone) %>%
        summarise(
          shots = n(),
          xg = sum(xg),
          goals = sum(goal),
          efficiency = mean(goal) / mean(xg),
          .groups = "drop"
        )
    },

    # Hockey: Territorial dominance metrics
    calculate_territorial_dominance = function(match_data) {
      match_data %>%
        mutate(
          team_territory = final_third_entries + box_entries + shots,
          opponent_territory = opp_final_third_entries + opp_box_entries + opp_shots,
          territorial_cf = team_territory / (team_territory + opponent_territory),

          # PDO for regression analysis: shooting % plus save %, both on a
          # 0-100 scale, so the league as a whole sums to 100
          pdo = (goals / shots * 100) + ((opp_shots - opp_goals) / opp_shots * 100)
        )
    },

    # NFL: Game state value model
    calculate_game_state_value = function(events_data) {
      events_data %>%
        mutate(
          # State-based expected value
          state_value = case_when(
            zone == "opponent_box" ~ 0.15,
            zone == "opponent_third" ~ 0.05,
            zone == "middle_third" ~ 0.02,
            TRUE ~ 0.005
          ),

          # Game state adjustment
          state_multiplier = case_when(
            minute >= 75 & goal_diff < 0 ~ 1.3,
            minute >= 75 & goal_diff > 0 ~ 0.7,
            TRUE ~ 1.0
          ),

          adjusted_value = state_value * state_multiplier
        )
    },

    # Integrated player evaluation
    evaluate_player = function(player_id, player_data, shot_data, match_data) {
      total_value <- self$calculate_total_value(
        # !! forces the function argument, not a same-named column
        player_data %>% filter(player == !!player_id)
      )

      spatial <- self$calculate_spatial_efficiency(
        shot_data %>% filter(player_id == !!player_id)
      )

      list(
        player = player_id,
        total_gar = sum(total_value$total_gar),
        offensive_contribution = sum(total_value$offensive_value),
        defensive_contribution = sum(total_value$defensive_value),
        high_value_shot_efficiency = spatial %>%
          filter(zone == "high_value") %>%
          pull(efficiency) %>%
          first()
      )
    }
  )
)

# Usage example
framework <- CrossSportAnalytics$new()

# Example player data
sample_player <- tibble(
  player = "Star Forward",
  minutes = 2800,
  npxg_per_90 = 0.55,
  xa_per_90 = 0.18,
  key_passes_per_90 = 2.1,
  tackles_won_per_90 = 0.9,
  interceptions_per_90 = 0.5,
  pressures_per_90 = 18.5,
  progressive_passes_per_90 = 3.2,
  progressive_carries_per_90 = 5.8
)

result <- framework$calculate_total_value(sample_player)
print("Player Total Value (GAR) Analysis:")
print(result %>% select(player, offensive_value, defensive_value,
                        progression_value, total_gar))
| Concept | Origin Sport | Football Application | Key Insight |
| --- | --- | --- | --- |
| WAR/Replacement Level | Baseball | Goals Above Replacement (GAR) | Value players against realistic alternatives |
| Shot Charts/Zones | Basketball | xG Maps with Zone Analysis | Shot location matters as much as volume |
| Plus-Minus (RAPM) | Basketball | On-field xG Differential | Team performance with player on/off field |
| Corsi/Fenwick | Hockey | Territorial Dominance % | Possession proxy through shot attempts |
| PDO | Hockey | Conversion + Save Rate Index | Identify luck/regression candidates |
| EPA | American Football | xG State Change | Value actions by resulting game state |
| Win Probability | American Football | Live Match Win Probability | Real-time outcome likelihood |
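
The Plus-Minus row is the one concept in the table without code in this chapter. A minimal sketch of a raw on/off xG differential follows, assuming match time has already been split into fixed-lineup segments; the `Segment` record and its field names are hypothetical, and the numbers are illustrative only. This is the unadjusted split, not RAPM, which additionally regresses out the influence of teammates and opponents.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A stretch of play with a fixed lineup (hypothetical record layout)."""
    minutes: float
    xg_for: float
    xg_against: float
    player_on: bool  # True if the player of interest was on the pitch


def xg_diff_per_90(segments):
    """Team xG differential per 90 minutes over the given segments."""
    minutes = sum(s.minutes for s in segments)
    if minutes == 0:
        return None  # no playing time to evaluate
    diff = sum(s.xg_for - s.xg_against for s in segments)
    return diff / minutes * 90


def on_off_xg_differential(segments):
    """Compare team xG differential with the player on vs. off the pitch."""
    on = xg_diff_per_90([s for s in segments if s.player_on])
    off = xg_diff_per_90([s for s in segments if not s.player_on])
    net = on - off if on is not None and off is not None else None
    return {"on": on, "off": off, "net": net}


# Illustrative numbers only, not real match data
segments = [
    Segment(minutes=60, xg_for=1.2, xg_against=0.6, player_on=True),
    Segment(minutes=30, xg_for=0.3, xg_against=0.5, player_on=False),
    Segment(minutes=90, xg_for=1.5, xg_against=1.0, player_on=True),
]

result = on_off_xg_differential(segments)
print(result)
```

A positive `net` means the team's xG differential was better with the player on the pitch. Small "off" samples make this number noisy, which is the main reason basketball moved from raw plus-minus to regularized versions.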

Practice Exercises

Exercise 58.1: Build a Football WAR Model

Create a comprehensive Goals Above Replacement (GAR) model that accounts for position-specific contributions. Include offensive, defensive, and possession components weighted by position.

Hint: Define a replacement level for each position (forwards need higher offensive output to clear it). Weight the components by position: forwards emphasize offensive value, defenders emphasize defensive value. Use percentiles to normalize across position pools.
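
One possible starting point for this exercise is sketched below. It assumes each player's offensive, defensive, and possession components are already expressed per 90 in goal-equivalent units, and it sets the replacement level at the 20th percentile of each position pool. The weights in `POSITION_WEIGHTS`, the 20th-percentile cutoff, and the player records are all hypothetical choices for illustration, not values from this chapter.

```python
from statistics import quantiles

# Hypothetical weights per position: (offensive, defensive, possession)
POSITION_WEIGHTS = {
    "FW": (0.60, 0.10, 0.30),
    "DF": (0.15, 0.55, 0.30),
}


def value_per_90(player):
    """Position-weighted value per 90, in goal-equivalent units."""
    w_off, w_def, w_pos = POSITION_WEIGHTS[player["position"]]
    return (w_off * player["offensive"]
            + w_def * player["defensive"]
            + w_pos * player["possession"])


def replacement_levels(players, percentile=20):
    """Replacement value per 90 for each position pool."""
    pools = {}
    for p in players:
        pools.setdefault(p["position"], []).append(value_per_90(p))
    levels = {}
    for position, values in pools.items():
        if len(values) < 2:
            levels[position] = values[0]  # too few players to interpolate
        else:
            cuts = quantiles(values, n=100, method="inclusive")
            levels[position] = cuts[percentile - 1]
    return levels


def goals_above_replacement(players, percentile=20):
    """Value above the position's replacement level, scaled by 90s played."""
    levels = replacement_levels(players, percentile)
    return {
        p["name"]: (value_per_90(p) - levels[p["position"]]) * p["minutes"] / 90
        for p in players
    }


# Illustrative players only
players = [
    {"name": "A", "position": "FW", "minutes": 2700,
     "offensive": 0.60, "defensive": 0.05, "possession": 0.20},
    {"name": "B", "position": "FW", "minutes": 1800,
     "offensive": 0.40, "defensive": 0.05, "possession": 0.15},
    {"name": "C", "position": "FW", "minutes": 900,
     "offensive": 0.25, "defensive": 0.04, "possession": 0.10},
    {"name": "D", "position": "DF", "minutes": 2700,
     "offensive": 0.05, "defensive": 0.50, "possession": 0.20},
    {"name": "E", "position": "DF", "minutes": 1800,
     "offensive": 0.04, "defensive": 0.35, "possession": 0.15},
    {"name": "F", "position": "DF", "minutes": 900,
     "offensive": 0.02, "defensive": 0.20, "possession": 0.10},
]

levels = replacement_levels(players)
gar = goals_above_replacement(players)
for name, value in gar.items():
    print(f"{name}: {value:+.2f}")
```

Because the baseline is computed within each position pool, a defender is compared only with other defenders, which is the point of the position-specific replacement level.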

Exercise 58.2: PDO Regression Analysis

Analyze PDO (shooting percentage + save percentage) across a league season. Identify teams with extreme PDO values and track their subsequent performance to check whether they regress to the mean.

Hint: Calculate rolling PDO over 5-10 match windows. Flag teams whose PDO sits more than 1.5 standard deviations from the mean. Track their points per game in subsequent windows, then plot PDO against future performance to visualize the regression effect.
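
The rolling calculation and the 1.5 standard deviation flag can be sketched as follows. This is a minimal version under stated assumptions: one team's matches arrive in chronological order as dictionaries with `goals`, `shots`, `opp_goals`, and `opp_shots` keys (the same names used for match data earlier in the chapter), PDO is on a 0-100 scale, and the mean and standard deviation come from the supplied windows rather than from a league-wide distribution. The function names and the sample numbers are illustrative.

```python
from statistics import mean, pstdev


def rolling_pdo(matches, window=5):
    """PDO over each rolling window, pooling shots and goals within it."""
    values = []
    for end in range(window, len(matches) + 1):
        chunk = matches[end - window:end]
        goals = sum(m["goals"] for m in chunk)
        shots = sum(m["shots"] for m in chunk)
        opp_goals = sum(m["opp_goals"] for m in chunk)
        opp_shots = sum(m["opp_shots"] for m in chunk)
        if shots == 0 or opp_shots == 0:
            values.append(None)  # PDO is undefined without shots
            continue
        shooting_pct = 100 * goals / shots
        save_pct = 100 * (opp_shots - opp_goals) / opp_shots
        values.append(shooting_pct + save_pct)
    return values


def flag_extremes(pdo_values, threshold=1.5):
    """Mark windows more than `threshold` standard deviations from the mean."""
    valid = [v for v in pdo_values if v is not None]
    if len(valid) < 2:
        return [False] * len(pdo_values)
    mu, sigma = mean(valid), pstdev(valid)
    if sigma == 0:
        return [False] * len(pdo_values)
    return [v is not None and abs(v - mu) / sigma > threshold
            for v in pdo_values]


# Illustrative matches: a finishing hot streak in the last two games
team_goals = [1, 1, 1, 1, 1, 1, 4, 4]
matches = [
    {"goals": g, "shots": 10, "opp_goals": 1, "opp_shots": 10}
    for g in team_goals
]

pdo = rolling_pdo(matches, window=3)
flags = flag_extremes(pdo)
for value, flagged in zip(pdo, flags):
    print(f"PDO {value:6.1f}  extreme: {flagged}")
```

Pooling goals and shots inside each window, rather than averaging per-match PDO, keeps a single low-shot match from dominating the window. The next step in the exercise, tracking points per game after each flagged window, needs match results that this sketch does not include.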

Exercise 58.3: Win Probability Model

Build a live win probability model using historical match data. The model should update based on goals, time remaining, home/away status, and red cards. Calibrate against historical outcomes.

Hint: Start with a logistic regression on goal differential, minutes remaining, and a home flag. Add an interaction term (goal_diff × minutes_remaining). Fit the coefficients on historical data, then validate with calibration plots (predicted vs. actual win rates in probability bins).
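
The fitting and calibration steps can be sketched with the standard library alone. Everything here is an assumption made for illustration: the training rows are synthetic (generated from made-up coefficients in `TRUE_W`), the target is a binary win/no-win label rather than a three-way outcome, red cards are left out, and plain batch gradient descent stands in for whatever solver a statistics package would use. With real data, the synthetic block would be replaced by historical match states.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1 + ez)


def features(goal_diff, minutes_remaining, home):
    """Intercept, main effects, and the goal_diff x time interaction."""
    frac = minutes_remaining / 90  # scale time to 0-1
    return [1.0, float(goal_diff), frac, float(home), goal_diff * frac]


def score(w, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def fit_logistic(X, y, lr=0.3, epochs=300):
    """Batch gradient descent on the mean log-loss."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = score(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict(w, goal_diff, minutes_remaining, home):
    return score(w, features(goal_diff, minutes_remaining, home))


def calibration_table(probs, outcomes, bins=5):
    """Mean predicted probability vs. observed win rate per bin."""
    table = [{"n": 0, "pred": 0.0, "wins": 0} for _ in range(bins)]
    for p, y in zip(probs, outcomes):
        b = min(int(p * bins), bins - 1)
        table[b]["n"] += 1
        table[b]["pred"] += p
        table[b]["wins"] += y
    return [
        {"bin": i, "n": t["n"],
         "mean_pred": t["pred"] / t["n"],
         "win_rate": t["wins"] / t["n"]}
        for i, t in enumerate(table) if t["n"] > 0
    ]


# Synthetic stand-in for historical match states (not real data)
random.seed(0)
TRUE_W = [-0.6, 2.5, 0.0, 0.4, -1.5]  # made-up generating coefficients
X, y = [], []
for _ in range(800):
    x = features(random.choice([-2, -1, 0, 1, 2]),
                 random.randint(0, 90),
                 random.random() < 0.5)
    X.append(x)
    y.append(1 if random.random() < score(TRUE_W, x) else 0)

w = fit_logistic(X, y)
table = calibration_table([score(w, x) for x in X], y)
for row in table:
    print(f"bin {row['bin']}: n={row['n']:4d} "
          f"predicted={row['mean_pred']:.2f} actual={row['win_rate']:.2f}")
```

A well-calibrated model shows `predicted` and `actual` close to each other in every bin. For a fair check, the calibration table should be built on matches held out from fitting, which this sketch skips for brevity.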

Chapter Summary

Understanding how other sports solved analytical challenges provides valuable shortcuts for football analytics development. The best practitioners draw from multiple traditions while adapting concepts to football's unique characteristics.