Learning from Other Sports
Football analytics can accelerate its development by learning from analytics revolutions in other sports. Baseball, basketball, American football, and hockey have all faced similar challenges and developed innovative solutions that translate to the beautiful game.
Cross-Sport Insight
Baseball's Moneyball revolution, basketball's spatial analytics, and hockey's expected goals models all offer valuable lessons for football analytics practitioners.
The Evolution of Sports Analytics
import pandas as pd
from tabulate import tabulate
# Sports Analytics Timeline and Maturity Assessment
sports_analytics_evolution = pd.DataFrame({
"sport": ["Baseball (MLB)", "Basketball (NBA)", "American Football (NFL)",
"Hockey (NHL)", "Football (Soccer)"],
"analytics_start": [1970, 2002, 2010, 2012, 2012],
"major_breakthrough": ["Sabermetrics/Moneyball", "SportVU Tracking",
"NFL Next Gen Stats", "Expected Goals", "Opta/StatsBomb"],
"current_maturity": ["Very High", "High", "High", "Medium-High", "Medium"],
"key_metrics": ["WAR, OPS+, FIP", "RAPTOR, EPM, PIE", "EPA, CPOE, Win Rate",
"xG, Corsi, GSAA", "xG, xA, PPDA"],
"tracking_adoption": ["2015 (Statcast)", "2013 (SportVU)", "2016 (Zebra)",
"2020 (Puck/Player)", "2019 (Limited)"],
"open_data": ["High (Baseball Reference)", "Medium (NBA API)",
"Low (Limited)", "Medium (NHL API)", "Medium (StatsBomb)"]
})
print("Sports Analytics Maturity Comparison")
print("=" * 80)
print(tabulate(sports_analytics_evolution, headers="keys", tablefmt="grid", showindex=False))
# Summary statistics
print("\n\nKey Insights:")
print(f"- Earliest analytics adoption: Baseball ({sports_analytics_evolution['analytics_start'].min()})")
print("- Most recent tracking adoption: Hockey/Football (2019-2020)")
print("- Football has medium maturity but fastest growth potential")
library(tidyverse)
library(gt)
# Sports Analytics Timeline and Maturity Assessment
sports_analytics_evolution <- tibble(
sport = c("Baseball (MLB)", "Basketball (NBA)", "American Football (NFL)",
"Hockey (NHL)", "Football (Soccer)"),
analytics_start = c(1970, 2002, 2010, 2012, 2012),
major_breakthrough = c("Sabermetrics/Moneyball", "SportVU Tracking",
"NFL Next Gen Stats", "Expected Goals", "Opta/StatsBomb"),
current_maturity = c("Very High", "High", "High", "Medium-High", "Medium"),
key_metrics = c("WAR, OPS+, FIP", "RAPTOR, EPM, PIE", "EPA, CPOE, Win Rate",
"xG, Corsi, GSAA", "xG, xA, PPDA"),
tracking_adoption = c("2015 (Statcast)", "2013 (SportVU)", "2016 (Zebra)",
"2020 (Puck/Player)", "2019 (Limited)"),
open_data = c("High (Baseball Reference)", "Medium (NBA API)",
"Low (Limited)", "Medium (NHL API)", "Medium (StatsBomb)")
)
# Create comparison table
sports_analytics_evolution %>%
gt() %>%
tab_header(
title = "Sports Analytics Maturity Comparison",
subtitle = "Evolution and current state across major sports"
) %>%
cols_label(
sport = "Sport",
analytics_start = "Analytics Era Start",
major_breakthrough = "Key Breakthrough",
current_maturity = "Maturity Level",
key_metrics = "Signature Metrics",
tracking_adoption = "Tracking Data",
open_data = "Data Accessibility"
) %>%
tab_style(
style = cell_fill(color = "#E8F5E9"),
locations = cells_body(rows = sport == "Football (Soccer)")
)
Lessons from Baseball Analytics
Baseball's sabermetrics revolution offers the longest track record of analytics adoption in professional sports. Key lessons include the importance of isolating individual contribution, the power of market inefficiency exploitation, and the value of open data for ecosystem growth.
Baseball metrics:
- WAR - Wins Above Replacement
- OBP - On-Base Percentage
- FIP - Fielding Independent Pitching
- BABIP - Batting Average on Balls in Play
- wOBA - Weighted On-Base Average
Football counterparts:
- xG Added - Value Above Average
- xG per Shot - Shot Quality
- PSxG - Post-Shot Expected Goals
- Conversion Rate - Finishing Variance
- Non-Penalty xG - Core Attacking Value
import pandas as pd
import numpy as np
# Concept Translation: Baseball to Football
# WAR (Wins Above Replacement) -> Goals Above Replacement (GAR)
def calculate_football_gar(player_data: pd.DataFrame) -> pd.DataFrame:
    """
    Football adaptation of baseball's WAR concept.
    Calculates Goals Above Replacement for football players.
    """
    # Define replacement level by position
    replacement_levels = {
        "Forward": 0.15,
        "Midfielder": 0.08,
        "Defender": 0.02,
        "Goalkeeper": -0.05
    }
    results = player_data.copy()
    # Get replacement level for each player
    results["replacement_level"] = results["position"].map(replacement_levels)
    # Calculate 90s played
    results["nineties"] = results["minutes"] / 90
    # Offensive contribution (xG + xA above replacement)
    results["offensive_gar"] = (
        (results["npxg_per_90"] + results["xa_per_90"] - results["replacement_level"])
        * results["nineties"]
    )
    # Defensive contribution
    results["defensive_gar"] = (
        (results["tackles_won_per_90"] * 0.05 +
         results["interceptions_per_90"] * 0.04 +
         results["blocks_per_90"] * 0.03)
        * results["nineties"]
    )
    # Possession contribution
    results["possession_gar"] = (
        (results["progressive_passes_per_90"] * 0.02 +
         results["progressive_carries_per_90"] * 0.015)
        * results["nineties"]
    )
    # Total Goals Above Replacement
    results["total_gar"] = (
        results["offensive_gar"] +
        results["defensive_gar"] +
        results["possession_gar"]
    )
    # Convert to Wins (roughly 2.5 goals per win)
    results["war_equivalent"] = results["total_gar"] / 2.5
    return results
# Example player data
example_players = pd.DataFrame({
"player": ["Elite Forward", "Good Midfielder", "Solid Defender", "Average GK"],
"position": ["Forward", "Midfielder", "Defender", "Goalkeeper"],
"minutes": [2800, 3000, 2500, 3200],
"npxg_per_90": [0.65, 0.15, 0.05, 0.00],
"xa_per_90": [0.25, 0.20, 0.08, 0.02],
"tackles_won_per_90": [0.8, 2.1, 3.5, 0.1],
"interceptions_per_90": [0.5, 1.8, 2.8, 0.2],
"blocks_per_90": [0.3, 0.8, 1.5, 0.0],
"progressive_passes_per_90": [2.5, 5.8, 4.2, 3.5],
"progressive_carries_per_90": [4.2, 3.5, 1.8, 0.1]
})
# Calculate GAR
gar_results = calculate_football_gar(example_players)
print("Goals Above Replacement (GAR) Analysis")
print("=" * 60)
print(gar_results[["player", "position", "total_gar", "war_equivalent"]]
.sort_values("total_gar", ascending=False)
.to_string(index=False))
# Breakdown by component
print("\n\nGAR Component Breakdown:")
print(gar_results[["player", "offensive_gar", "defensive_gar", "possession_gar"]]
.to_string(index=False))
library(tidyverse)
# Concept Translation: Baseball to Football
# WAR (Wins Above Replacement) -> Goals Above Replacement (GAR)
calculate_football_gar <- function(player_data) {
  # Football adaptation of WAR concept
  player_data %>%
    mutate(
      # Replacement level by position (goals added per 90 for a replacement player),
      # read from the position column so no separate argument is needed
      replacement_level = case_when(
        position == "Forward" ~ 0.15,
        position == "Midfielder" ~ 0.08,
        position == "Defender" ~ 0.02,
        position == "Goalkeeper" ~ -0.05,
        TRUE ~ 0.05
      ),
      # Offensive contribution (xG + xA above replacement)
      offensive_gar = (npxg_per_90 + xa_per_90 - replacement_level) * (minutes / 90),
      # Defensive contribution (defensive actions value)
      defensive_gar = (tackles_won_per_90 * 0.05 +
                         interceptions_per_90 * 0.04 +
                         blocks_per_90 * 0.03) * (minutes / 90),
      # Possession contribution
      possession_gar = (progressive_passes_per_90 * 0.02 +
                          progressive_carries_per_90 * 0.015) * (minutes / 90),
      # Total Goals Above Replacement
      total_gar = offensive_gar + defensive_gar + possession_gar,
      # Convert to Wins (roughly 2.5 goals per win)
      war_equivalent = total_gar / 2.5
    )
}
# Example player data
example_players <- tibble(
player = c("Elite Forward", "Good Midfielder", "Solid Defender", "Average GK"),
position = c("Forward", "Midfielder", "Defender", "Goalkeeper"),
minutes = c(2800, 3000, 2500, 3200),
npxg_per_90 = c(0.65, 0.15, 0.05, 0.00),
xa_per_90 = c(0.25, 0.20, 0.08, 0.02),
tackles_won_per_90 = c(0.8, 2.1, 3.5, 0.1),
interceptions_per_90 = c(0.5, 1.8, 2.8, 0.2),
blocks_per_90 = c(0.3, 0.8, 1.5, 0.0),
progressive_passes_per_90 = c(2.5, 5.8, 4.2, 3.5),
progressive_carries_per_90 = c(4.2, 3.5, 1.8, 0.1)
)
# Calculate GAR for each player
gar_results <- example_players %>%
rowwise() %>%
mutate(
replacement_level = case_when(
position == "Forward" ~ 0.15,
position == "Midfielder" ~ 0.08,
position == "Defender" ~ 0.02,
position == "Goalkeeper" ~ -0.05
)
) %>%
ungroup() %>%
mutate(
offensive_gar = (npxg_per_90 + xa_per_90 - replacement_level) * (minutes / 90),
defensive_gar = (tackles_won_per_90 * 0.05 +
interceptions_per_90 * 0.04 +
blocks_per_90 * 0.03) * (minutes / 90),
possession_gar = (progressive_passes_per_90 * 0.02 +
progressive_carries_per_90 * 0.015) * (minutes / 90),
total_gar = offensive_gar + defensive_gar + possession_gar,
war_equivalent = total_gar / 2.5
)
print("Goals Above Replacement (GAR) Analysis:")
print(gar_results %>%
select(player, position, total_gar, war_equivalent) %>%
arrange(desc(total_gar)))
Key Baseball Lesson: Market Inefficiencies
Billy Beane's A's found value in on-base percentage when other teams overvalued batting average. In football, similar inefficiencies exist:
- Players from smaller leagues are often undervalued
- Defensive contributions are harder to measure, creating value opportunities
- Age curves differ from perception (peak years vary by position)
- Set-piece specialists add value not captured in market prices
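One rough way to look for the kind of inefficiency listed above is to compare each player's price with the price that their output would predict. The sketch below fits a straight line of market value against Goals Above Replacement and ranks players by residual. The player names, GAR figures, and market values are all invented for illustration, and a single-variable least-squares fit is an assumption made for brevity, not a recommended valuation model.

```python
from statistics import mean

# Hypothetical players: (name, goals above replacement, market value in EUR millions).
# Every figure here is invented for illustration only.
players = [
    ("Player A", 12.0, 60.0),
    ("Player B", 9.0, 20.0),
    ("Player C", 6.0, 30.0),
    ("Player D", 4.0, 8.0),
    ("Player E", 2.0, 12.0),
]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


gar = [p[1] for p in players]
value = [p[2] for p in players]
intercept, slope = fit_line(gar, value)

# Residual = actual price minus the price the fitted line implies for that output.
# Negative residuals flag players who look cheap relative to their production.
residuals = {
    name: price - (intercept + slope * g)
    for name, g, price in players
}
undervalued = sorted(residuals, key=residuals.get)

for name in undervalued:
    print(f"{name}: residual {residuals[name]:+.1f}m")
```

In practice the fit would need controls for age, league, position, and contract length before a negative residual could be read as a genuine bargain.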
Lessons from Basketball Analytics
Basketball's spatial revolution transformed how teams evaluate players and tactics. The NBA's adoption of tracking data (SportVU, then Second Spectrum) created entirely new analytical frameworks that football is now beginning to adapt.
import pandas as pd
import numpy as np
# Basketball Spatial Analytics Concepts Applied to Football
# Shot Charts -> Shot Maps with xG
def create_football_shot_chart(shots_data: pd.DataFrame) -> pd.DataFrame:
    """
    Apply basketball-style spatial analysis to football shots.
    Creates zones and calculates value above average.
    """
    result = shots_data.copy()

    # Zone classification (inspired by basketball court zones)
    def classify_zone(row):
        if row["distance"] <= 6:
            return "Six-Yard Box"
        elif row["distance"] <= 18 and abs(row["angle"]) < 30:
            return "Central Penalty Area"
        elif row["distance"] <= 18:
            return "Wide Penalty Area"
        elif row["distance"] <= 25 and abs(row["angle"]) < 25:
            return "Central Edge"
        elif row["distance"] <= 30:
            return "Long Range Central"
        else:
            return "Long Range Wide"

    result["zone"] = result.apply(classify_zone, axis=1)
    # Zone average xG (league benchmarks)
    zone_avg_xg = {
        "Six-Yard Box": 0.45,
        "Central Penalty Area": 0.22,
        "Wide Penalty Area": 0.08,
        "Central Edge": 0.06,
        "Long Range Central": 0.04,
        "Long Range Wide": 0.02
    }
    result["zone_xg_avg"] = result["zone"].map(zone_avg_xg)
    result["xg_above_average"] = result["xg"] - result["zone_xg_avg"]
    return result
# Simulate shot data
np.random.seed(42)
n_shots = 200
shot_data = pd.DataFrame({
"shot_id": range(1, n_shots + 1),
"player": np.random.choice(["Player A", "Player B", "Player C"], n_shots),
"distance": np.random.uniform(3, 35, n_shots),
"angle": np.random.uniform(-45, 45, n_shots)
})
# Generate xG based on distance
shot_data["xg"] = shot_data["distance"].apply(
lambda d: np.random.uniform(0.35, 0.65) if d <= 6
else (np.random.uniform(0.05, 0.35) if d <= 18
else np.random.uniform(0.01, 0.08))
)
# Simulate goals
shot_data["goal"] = np.random.binomial(1, shot_data["xg"].clip(upper=0.8))
# Apply zone analysis
shot_analysis = create_football_shot_chart(shot_data)
# Summarize by zone (like basketball shot chart analysis)
zone_summary = shot_analysis.groupby("zone").agg(
shots=("shot_id", "count"),
goals=("goal", "sum"),
total_xg=("xg", "sum"),
conversion_rate=("goal", "mean"),
avg_xg=("xg", "mean")
).reset_index()
zone_summary["xg_outperformance"] = zone_summary["conversion_rate"] - zone_summary["avg_xg"]
zone_summary = zone_summary.sort_values("avg_xg", ascending=False)
print("Shot Zone Analysis (Basketball-Style)")
print("=" * 70)
print(zone_summary.to_string(index=False))
# Player shot selection quality
player_shot_quality = shot_analysis.groupby("player").agg(
shots=("shot_id", "count"),
avg_shot_xg=("xg", "mean")
).reset_index()
player_shot_quality["shot_quality_percentile"] = (
player_shot_quality["avg_shot_xg"].rank(pct=True) * 100
)
print("\n\nPlayer Shot Selection Quality:")
print(player_shot_quality.to_string(index=False))
library(tidyverse)
# Basketball Spatial Analytics Concepts Applied to Football
# Shot Charts -> Shot Maps with xG
create_football_shot_chart <- function(shots_data) {
# Basketball pioneered spatial shot analysis
# Football adaptation with expected goals context
shots_data %>%
mutate(
# Zone classification (inspired by basketball court zones)
zone = case_when(
distance <= 6 ~ "Six-Yard Box",
distance <= 18 & abs(angle) < 30 ~ "Central Penalty Area",
distance <= 18 ~ "Wide Penalty Area",
distance <= 25 & abs(angle) < 25 ~ "Central Edge",
distance <= 30 ~ "Long Range Central",
TRUE ~ "Long Range Wide"
),
# Value added vs league average (like basketball eFG% vs average)
zone_xg_avg = case_when(
zone == "Six-Yard Box" ~ 0.45,
zone == "Central Penalty Area" ~ 0.22,
zone == "Wide Penalty Area" ~ 0.08,
zone == "Central Edge" ~ 0.06,
zone == "Long Range Central" ~ 0.04,
TRUE ~ 0.02
),
xg_above_average = xg - zone_xg_avg
)
}
# Simulate shot data
set.seed(42)
shot_data <- tibble(
shot_id = 1:200,
player = sample(c("Player A", "Player B", "Player C"), 200, replace = TRUE),
distance = runif(200, 3, 35),
angle = runif(200, -45, 45),
xg = case_when(
distance <= 6 ~ runif(200, 0.35, 0.65),
distance <= 18 ~ runif(200, 0.05, 0.35),
TRUE ~ runif(200, 0.01, 0.08)
)[1:200],
goal = rbinom(200, 1, prob = pmin(xg, 0.8))
)
# Apply zone analysis
shot_analysis <- create_football_shot_chart(shot_data)
# Summarize by zone (like basketball shot chart analysis)
zone_summary <- shot_analysis %>%
group_by(zone) %>%
summarise(
shots = n(),
goals = sum(goal),
total_xg = sum(xg),
conversion_rate = mean(goal),
avg_xg = mean(xg),
xg_outperformance = mean(goal) - mean(xg),
.groups = "drop"
) %>%
arrange(desc(avg_xg))
print("Shot Zone Analysis (Basketball-Style):")
print(zone_summary)
# Player shot selection quality (like basketball shot selection metrics)
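player_shot_quality <- shot_analysis %>%
  group_by(player) %>%
  summarise(
    shots = n(),
    avg_shot_xg = mean(xg),
    .groups = "drop"
  ) %>%
  # Rank across players after summarising; inside summarise() each group holds
  # a single mean, so percent_rank() would return NaN
  mutate(shot_quality_percentile = percent_rank(avg_shot_xg) * 100)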
print("\nPlayer Shot Selection Quality:")
print(player_shot_quality)
Plus-Minus and Impact Metrics
Basketball's plus-minus metrics (RAPM, RPM, EPM) measure how much better a team performs when a player is on court. Football's limited substitutions and low scoring make this harder, but adapted versions can still provide value.
import pandas as pd
import numpy as np
# Adapting Basketball Plus-Minus to Football
# RAPM (Regularized Adjusted Plus-Minus) Football Version
def calculate_football_plus_minus(match_segments: pd.DataFrame) -> pd.DataFrame:
    """
    Football adaptation of basketball's plus-minus metrics.
    Uses xG differential instead of goals due to low-scoring nature.
    """
    results = match_segments.groupby("player_id").agg(
        minutes_on=("segment_minutes", "sum"),
        total_team_xg=("team_xg", "sum"),
        total_opponent_xg=("opponent_xg", "sum"),
        avg_teammate_quality=("teammate_avg_rating", "mean"),
        avg_opponent_quality=("opponent_avg_rating", "mean")
    ).reset_index()
    # Calculate per-90 metrics
    results["nineties"] = results["minutes_on"] / 90
    results["xg_for_per_90"] = results["total_team_xg"] / results["nineties"]
    results["xg_against_per_90"] = results["total_opponent_xg"] / results["nineties"]
    results["xg_diff_per_90"] = results["xg_for_per_90"] - results["xg_against_per_90"]
    # Adjusted plus-minus (control for teammate/opponent quality)
    results["adjusted_xg_diff"] = (
        results["xg_diff_per_90"] -
        (results["avg_teammate_quality"] - 8) * 0.1 +
        (results["avg_opponent_quality"] - 8) * 0.1
    )
    return results
# Simulate match segment data
np.random.seed(123)
n_segments = 500
match_segments = pd.DataFrame({
"segment_id": range(1, n_segments + 1),
"player_id": np.random.choice([f"Player_{i}" for i in range(1, 21)], n_segments),
"segment_minutes": np.random.uniform(5, 45, n_segments),
"team_xg": np.random.poisson(0.4, n_segments),
"opponent_xg": np.random.poisson(0.35, n_segments),
"teammate_avg_rating": np.random.normal(7.5, 0.8, n_segments),
"opponent_avg_rating": np.random.normal(7.5, 0.8, n_segments)
})
# Calculate plus-minus
plus_minus_results = calculate_football_plus_minus(match_segments)
# Filter for minimum minutes and show top performers
qualified = plus_minus_results[plus_minus_results["minutes_on"] >= 400]
top_performers = qualified.nlargest(10, "adjusted_xg_diff")
print("Football Plus-Minus Rankings (xG Differential per 90)")
print("=" * 60)
print(top_performers[["player_id", "minutes_on", "xg_diff_per_90", "adjusted_xg_diff"]]
.to_string(index=False))
library(tidyverse)
# Adapting Basketball Plus-Minus to Football
# RAPM (Regularized Adjusted Plus-Minus) Football Version
calculate_football_plus_minus <- function(match_segments) {
# Football adaptation requires different approach due to:
# 1. Fewer substitutions (typically 3-5 per match)
# 2. Lower scoring (goals vs points)
# 3. More interdependent positions
# Use xG differential instead of goal differential
# Break matches into segments based on substitutions
match_segments %>%
group_by(player_id) %>%
summarise(
minutes_on = sum(segment_minutes),
xg_for_per_90 = sum(team_xg) / (sum(segment_minutes) / 90),
xg_against_per_90 = sum(opponent_xg) / (sum(segment_minutes) / 90),
xg_diff_per_90 = xg_for_per_90 - xg_against_per_90,
# Control for teammate/opponent quality (simplified RAPM)
avg_teammate_quality = mean(teammate_avg_rating),
avg_opponent_quality = mean(opponent_avg_rating),
# Adjusted plus-minus
adjusted_xg_diff = xg_diff_per_90 -
(avg_teammate_quality - 8) * 0.1 +
(avg_opponent_quality - 8) * 0.1,
.groups = "drop"
)
}
# Simulate match segment data
set.seed(123)
n_segments <- 500
match_segments <- tibble(
segment_id = 1:n_segments,
player_id = sample(paste0("Player_", 1:20), n_segments, replace = TRUE),
segment_minutes = runif(n_segments, 5, 45),
team_xg = rpois(n_segments, lambda = 0.4),
opponent_xg = rpois(n_segments, lambda = 0.35),
teammate_avg_rating = rnorm(n_segments, mean = 7.5, sd = 0.8),
opponent_avg_rating = rnorm(n_segments, mean = 7.5, sd = 0.8)
)
# Calculate plus-minus
plus_minus_results <- calculate_football_plus_minus(match_segments)
# Top performers
print("Football Plus-Minus Rankings (xG Differential per 90):")
print(plus_minus_results %>%
filter(minutes_on >= 400) %>%
arrange(desc(adjusted_xg_diff)) %>%
head(10) %>%
select(player_id, minutes_on, xg_diff_per_90, adjusted_xg_diff))
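The "regularized" part of RAPM refers to a ridge regression: each segment becomes one row with an indicator per player on the pitch, the target is the segment's xG differential per 90, and a penalty shrinks noisy player coefficients toward zero. The following is a minimal pure-Python sketch of that idea. The segment data, player labels, the minutes weighting, and the penalty value `lam` are all assumptions chosen for illustration; a production model would also encode opposing players and use a linear algebra library.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_plus_minus(segments, players, lam=50.0):
    """Minimise sum_i w_i * (y_i - x_i . beta)^2 + lam * |beta|^2.

    segments: iterable of (players_on_pitch, minutes, xg_diff_per_90).
    Minutes act as weights, so longer segments count for more.
    """
    n = len(players)
    index = {p: i for i, p in enumerate(players)}
    xtx = [[0.0] * n for _ in range(n)]
    xty = [0.0] * n
    for on_pitch, minutes, xg_diff_per_90 in segments:
        row = [0.0] * n
        for p in on_pitch:
            row[index[p]] = 1.0
        for i in range(n):
            xty[i] += minutes * row[i] * xg_diff_per_90
            for j in range(n):
                xtx[i][j] += minutes * row[i] * row[j]
    for i in range(n):
        xtx[i][i] += lam  # ridge penalty keeps the system well conditioned
    return dict(zip(players, solve_linear_system(xtx, xty)))


# Hypothetical segments: (players on the pitch, minutes, team xG diff per 90).
players = ["P1", "P2", "P3", "P4"]
segments = [
    ({"P1", "P2", "P3"}, 30, 0.9),
    ({"P1", "P2", "P4"}, 25, 0.4),
    ({"P1", "P3", "P4"}, 35, 0.6),
    ({"P2", "P3", "P4"}, 20, -0.3),
    ({"P1", "P2", "P3"}, 40, 1.1),
]

ratings = ridge_plus_minus(segments, players)
for player, rating in sorted(ratings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{player}: {rating:+.3f}")
```

Raising `lam` pulls every rating closer to zero, which is the usual trade-off: less noise for players with few minutes, at the cost of understating genuinely large effects.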
Lessons from Hockey Analytics
Hockey analytics shares many characteristics with football: low-scoring games, continuous flow, and positional fluidity. Hockey's expected goals models and possession metrics (Corsi, Fenwick) have directly influenced football analytics development.
import pandas as pd
import numpy as np
# Hockey Analytics Concepts Applied to Football
# Corsi -> Territorial Dominance Metrics
def calculate_football_corsi(match_data: pd.DataFrame) -> pd.DataFrame:
    """
    Apply hockey's Corsi concept to football.
    Measures territorial dominance through attacking actions.
    """
    result = match_data.copy()
    # Football "Corsi" - territorial actions
    result["team_corsi"] = (
        result["shots"] + result["shots_blocked"] + result["crosses_attempted"] +
        result["final_third_entries"] + result["box_entries"]
    )
    result["opponent_corsi"] = (
        result["opponent_shots"] + result["opponent_shots_blocked"] +
        result["opponent_crosses_attempted"] + result["opponent_final_third_entries"] +
        result["opponent_box_entries"]
    )
    # Corsi For percentage (CF%)
    result["corsi_for_pct"] = (
        result["team_corsi"] / (result["team_corsi"] + result["opponent_corsi"]) * 100
    )
    # High-danger chances (like hockey slot shots)
    result["high_danger_cf"] = result["shots_inside_box"] + result["headers_inside_box"]
    result["high_danger_ca"] = (
        result["opponent_shots_inside_box"] + result["opponent_headers_inside_box"]
    )
    result["high_danger_cf_pct"] = (
        result["high_danger_cf"] / (result["high_danger_cf"] + result["high_danger_ca"]) * 100
    )
    # PDO (shooting % + save %) - regression indicator
    result["shooting_pct"] = result["goals"] / result["shots"] * 100
    result["save_pct"] = (1 - result["opponent_goals"] / result["opponent_shots"]) * 100
    # Plain sum: equals 100 when both sides convert shots at the same rate
    result["pdo"] = result["shooting_pct"] + result["save_pct"]
    return result
# Generate example match data
np.random.seed(42)
n_matches = 20
match_data = pd.DataFrame({
"match_id": range(1, n_matches + 1),
"team": "Example FC",
"shots": np.random.poisson(14, n_matches),
"shots_blocked": np.random.poisson(4, n_matches),
"crosses_attempted": np.random.poisson(18, n_matches),
"final_third_entries": np.random.poisson(35, n_matches),
"box_entries": np.random.poisson(12, n_matches),
"shots_inside_box": np.random.poisson(8, n_matches),
"headers_inside_box": np.random.poisson(2, n_matches),
"goals": np.random.poisson(1.5, n_matches),
"opponent_shots": np.random.poisson(12, n_matches),
"opponent_shots_blocked": np.random.poisson(3, n_matches),
"opponent_crosses_attempted": np.random.poisson(15, n_matches),
"opponent_final_third_entries": np.random.poisson(30, n_matches),
"opponent_box_entries": np.random.poisson(10, n_matches),
"opponent_shots_inside_box": np.random.poisson(6, n_matches),
"opponent_headers_inside_box": np.random.poisson(1, n_matches),
"opponent_goals": np.random.poisson(1.2, n_matches)
})
# Apply hockey-style analysis
hockey_style_analysis = calculate_football_corsi(match_data)
# Season summary
print("Season Summary (Hockey-Style Metrics)")
print("=" * 50)
print(f"Matches: {len(hockey_style_analysis)}")
print(f"Avg Corsi For %: {hockey_style_analysis['corsi_for_pct'].mean():.1f}%")
print(f"Avg High Danger CF%: {hockey_style_analysis['high_danger_cf_pct'].mean():.1f}%")
print(f"Avg PDO: {hockey_style_analysis['pdo'].mean():.1f}")
print(f"Goals For: {hockey_style_analysis['goals'].sum()}")
print(f"Goals Against: {hockey_style_analysis['opponent_goals'].sum()}")
# PDO analysis
print("\n\nPDO Analysis (values far from 100 = regression candidate):")
hockey_style_analysis["pdo_deviation"] = abs(hockey_style_analysis["pdo"] - 100)
print(hockey_style_analysis.nlargest(5, "pdo_deviation")[
["match_id", "corsi_for_pct", "pdo", "goals", "opponent_goals"]
].to_string(index=False))
library(tidyverse)
# Hockey Analytics Concepts Applied to Football
# Corsi -> Territorial Dominance Metrics
# Hockey: Corsi = All shot attempts (shots on goal + missed + blocked)
# Football adaptation: All attacking actions in final third
calculate_football_corsi <- function(match_data) {
match_data %>%
mutate(
# Football "Corsi" - territorial actions
team_corsi = shots + shots_blocked + crosses_attempted +
final_third_entries + box_entries,
opponent_corsi = opponent_shots + opponent_shots_blocked +
opponent_crosses_attempted + opponent_final_third_entries +
opponent_box_entries,
# Corsi For percentage (CF%)
corsi_for_pct = team_corsi / (team_corsi + opponent_corsi) * 100,
# Hockey-style expected goals model adaptations
# High-danger chances (like hockey slot shots)
high_danger_cf = shots_inside_box + headers_inside_box,
high_danger_ca = opponent_shots_inside_box + opponent_headers_inside_box,
high_danger_cf_pct = high_danger_cf / (high_danger_cf + high_danger_ca) * 100,
# PDO (shooting % + save %) - regression indicator
shooting_pct = goals / shots * 100,
save_pct = (1 - opponent_goals / opponent_shots) * 100,
pdo = shooting_pct + save_pct # Equals 100 when both sides convert at the same rate
)
}
# Example match data (seeded so the simulated figures are reproducible)
set.seed(42)
match_data <- tibble(
match_id = 1:20,
team = "Example FC",
shots = rpois(20, 14),
shots_blocked = rpois(20, 4),
crosses_attempted = rpois(20, 18),
final_third_entries = rpois(20, 35),
box_entries = rpois(20, 12),
shots_inside_box = rpois(20, 8),
headers_inside_box = rpois(20, 2),
goals = rpois(20, 1.5),
opponent_shots = rpois(20, 12),
opponent_shots_blocked = rpois(20, 3),
opponent_crosses_attempted = rpois(20, 15),
opponent_final_third_entries = rpois(20, 30),
opponent_box_entries = rpois(20, 10),
opponent_shots_inside_box = rpois(20, 6),
opponent_headers_inside_box = rpois(20, 1.5),
opponent_goals = rpois(20, 1.2)
)
# Apply hockey-style analysis
hockey_style_analysis <- calculate_football_corsi(match_data)
# Season summary
season_summary <- hockey_style_analysis %>%
summarise(
matches = n(),
avg_corsi_for_pct = mean(corsi_for_pct, na.rm = TRUE),
avg_high_danger_cf_pct = mean(high_danger_cf_pct, na.rm = TRUE),
avg_pdo = mean(pdo, na.rm = TRUE),
goals_for = sum(goals),
goals_against = sum(opponent_goals)
)
print("Season Summary (Hockey-Style Metrics):")
print(season_summary)
# PDO analysis - teams with extreme PDO likely to regress
print("\nPDO Analysis (values far from 100 indicate luck/regression candidate):")
print(hockey_style_analysis %>%
select(match_id, corsi_for_pct, pdo, goals, opponent_goals) %>%
arrange(desc(abs(pdo - 100))) %>%
head(5))
Key Hockey Concept: PDO
PDO (named after a hockey analytics blogger) is the sum of shooting percentage and save percentage. In hockey, PDO regresses strongly to 100 over time. Teams with high PDO are often "lucky" and due for regression.
Football Application: Teams whose goals significantly exceed their xG over extended periods are likely performing above a sustainable level. This is valuable for betting markets and projection systems.
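As a small worked illustration of both ideas, the sketch below computes PDO on the 100-point scale from season totals and sets it beside the gap between goals and xG. The two teams and all of their totals are hypothetical, and using all shots (not only shots on target, as hockey uses shots on goal) is an assumption of this sketch.

```python
def pdo(goals_for, shots_for, goals_against, shots_against):
    """PDO on a 100-point scale: shooting % plus save %."""
    shooting_pct = 100 * goals_for / shots_for
    save_pct = 100 * (1 - goals_against / shots_against)
    return shooting_pct + save_pct


# Hypothetical season totals, invented for illustration.
teams = {
    "Team A": {"goals": 62, "shots": 480, "xg": 51.0,
               "goals_against": 38, "shots_against": 430},
    "Team B": {"goals": 45, "shots": 470, "xg": 52.5,
               "goals_against": 50, "shots_against": 440},
}

report = {}
for name, t in teams.items():
    team_pdo = pdo(t["goals"], t["shots"], t["goals_against"], t["shots_against"])
    xg_gap = t["goals"] - t["xg"]  # positive = scoring more than chance quality implies
    report[name] = (team_pdo, xg_gap)
    print(f"{name}: PDO {team_pdo:.1f}, goals minus xG {xg_gap:+.1f}")
```

A team sitting well above 100 with a large positive goals-minus-xG gap is the profile the text describes as a regression candidate; how large a gap counts as meaningful depends on sample size and is not settled by this sketch.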
Lessons from American Football Analytics
American football's discrete play structure has enabled sophisticated play-by-play analysis. Expected Points Added (EPA) and Win Probability models offer frameworks applicable to football's set pieces and game state analysis.
import pandas as pd
import numpy as np
# American Football Concepts Applied to Soccer
# EPA (Expected Points Added) -> Expected Goals State Model
def calculate_football_epa(events_data: pd.DataFrame) -> pd.DataFrame:
    """
    Adapt NFL's Expected Points Added to football.
    Measures value added by actions based on field position changes.
    """
    result = events_data.copy()
    # Expected goals from current state (pre-action)
    zone_xg = {
        "opponent_box": 0.15,
        "opponent_final_third": 0.05,
        "middle_third": 0.02,
        "own_final_third": 0.005
    }
    result_zone_xg = {
        "goal": 1.0,
        "opponent_box": 0.15,
        "opponent_final_third": 0.05,
        "middle_third": 0.02,
        "own_final_third": 0.005,
        "turnover": -0.03
    }
    result["pre_xg_state"] = result["pitch_zone"].map(zone_xg).fillna(0.01)
    result["post_xg_state"] = result["result_zone"].map(result_zone_xg).fillna(0.01)
    # EPA equivalent: Change in expected goals
    result["xg_added"] = result["post_xg_state"] - result["pre_xg_state"]

    # Game state adjustment
    def game_state_multiplier(row):
        if row["minute"] >= 80 and row["goal_diff"] < 0:
            return 1.3  # Chasing late
        elif row["minute"] >= 80 and row["goal_diff"] > 0:
            return 0.7  # Protecting lead
        elif row["goal_diff"] <= -2:
            return 1.2  # Need goals
        elif row["goal_diff"] >= 2:
            return 0.8  # Comfortable
        return 1.0

    result["game_state_multiplier"] = result.apply(game_state_multiplier, axis=1)
    result["adjusted_xg_added"] = result["xg_added"] * result["game_state_multiplier"]
    return result
def calculate_win_probability(minute: int, goal_diff: int, home: bool = True) -> dict:
    """
    NFL-style win probability model adapted for football.
    """
    remaining_minutes = 90 - minute
    # Per-team scoring rate: ~2.5 goals per match shared between two teams
    goals_per_minute = 2.5 / 2 / 90
    # Expected goals remaining
    home_factor = 1.1 if home else 0.9
    team_expected = remaining_minutes * goals_per_minute * home_factor
    opponent_expected = remaining_minutes * goals_per_minute * (2 - home_factor)
    # Current advantage in expected final goals
    expected_final_diff = goal_diff + team_expected - opponent_expected
    # Convert to win probability (logistic function)
    win_prob = 1 / (1 + np.exp(-expected_final_diff * 0.8))
    # Heuristic draw weight that peaks at half-time
    draw_prob = max(0, 0.25 - abs(expected_final_diff) * 0.05) * \
        (1 - abs(minute - 45) / 90)
    return {
        "win": win_prob * (1 - draw_prob),
        "draw": draw_prob,
        "loss": (1 - win_prob) * (1 - draw_prob)
    }
# Example: Win probability at different game states
scenarios = [
{"scenario": "Start (0-0)", "minute": 0, "goal_diff": 0},
{"scenario": "Down 1 at HT", "minute": 45, "goal_diff": -1},
{"scenario": "Up 1 at 75'", "minute": 75, "goal_diff": 1},
{"scenario": "Down 2 at 80'", "minute": 80, "goal_diff": -2}
]
print("Win Probability by Game State")
print("=" * 60)
print(f"{'Scenario':<20} {'Win%':>10} {'Draw%':>10} {'Loss%':>10}")
print("-" * 60)
for s in scenarios:
    probs = calculate_win_probability(s["minute"], s["goal_diff"])
    print(f"{s['scenario']:<20} {probs['win']*100:>9.1f}% {probs['draw']*100:>9.1f}% {probs['loss']*100:>9.1f}%")
library(tidyverse)
# American Football Concepts Applied to Soccer
# EPA (Expected Points Added) -> Expected Goals State Model
# NFL EPA: Value of play based on down, distance, field position
# Football adaptation: Value based on pitch position, game state, time
calculate_football_epa <- function(events_data) {
events_data %>%
mutate(
# Expected goals from current state (pre-action)
pre_xg_state = case_when(
pitch_zone == "opponent_box" ~ 0.15,
pitch_zone == "opponent_final_third" ~ 0.05,
pitch_zone == "middle_third" ~ 0.02,
pitch_zone == "own_final_third" ~ 0.005,
TRUE ~ 0.01
),
# Expected goals from resulting state (post-action)
post_xg_state = case_when(
result_zone == "goal" ~ 1.0,
result_zone == "opponent_box" ~ 0.15,
result_zone == "opponent_final_third" ~ 0.05,
result_zone == "middle_third" ~ 0.02,
result_zone == "own_final_third" ~ 0.005,
result_zone == "turnover" ~ -0.03, # Opponent possession value
TRUE ~ 0.01
),
# EPA equivalent: Change in expected goals
xg_added = post_xg_state - pre_xg_state,
# Adjust for game state (like NFL situation-adjusted EPA)
game_state_multiplier = case_when(
minute >= 80 & goal_diff < 0 ~ 1.3, # Chasing late
minute >= 80 & goal_diff > 0 ~ 0.7, # Protecting lead
goal_diff <= -2 ~ 1.2, # Need goals
goal_diff >= 2 ~ 0.8, # Comfortable
TRUE ~ 1.0
),
adjusted_xg_added = xg_added * game_state_multiplier
)
}
# Win Probability Model (NFL-style)
calculate_win_probability <- function(minute, goal_diff, home = TRUE) {
# Simplified win probability based on game state
# Based on historical data relationships
remaining_minutes <- 90 - minute
goals_per_minute <- 2.5 / 2 / 90 # Per-team rate: ~2.5 goals per match shared between two teams
# Expected goals remaining
team_expected <- remaining_minutes * goals_per_minute * ifelse(home, 1.1, 0.9)
opponent_expected <- remaining_minutes * goals_per_minute * ifelse(home, 0.9, 1.1)
# Current advantage in "expected final goals"
expected_final_diff <- goal_diff + team_expected - opponent_expected
# Convert to win probability (using logistic function)
win_prob <- 1 / (1 + exp(-expected_final_diff * 0.8))
draw_prob <- max(0, 0.25 - abs(expected_final_diff) * 0.05) *
(1 - abs(minute - 45) / 90) # Heuristic draw weight that peaks at half-time
list(
win = win_prob * (1 - draw_prob),
draw = draw_prob,
loss = (1 - win_prob) * (1 - draw_prob)
)
}
# Example: Win probability at different game states
scenarios <- tibble(
scenario = c("Start (0-0)", "Down 1 at HT", "Up 1 at 75'", "Down 2 at 80'"),
minute = c(0, 45, 75, 80),
goal_diff = c(0, -1, 1, -2)
)
win_probs <- scenarios %>%
rowwise() %>%
mutate(
probs = list(calculate_win_probability(minute, goal_diff)),
win = probs$win,
draw = probs$draw,
loss = probs$loss
) %>%
ungroup() %>%
select(scenario, minute, goal_diff, win, draw, loss)
print("Win Probability by Game State:")
print(win_probs)
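The logistic model above is a heuristic. A common alternative, sketched here under stated assumptions, treats each side's remaining goals as independent Poisson counts and sums the probability of every final scoreline. The function name `outcome_probabilities`, the default scoring rates per 90 minutes, and the `max_goals` truncation are all hypothetical choices for illustration; real models usually estimate team-specific rates and adjust for game state.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(minute, goal_diff, team_rate=1.4, opp_rate=1.1, max_goals=10):
    """Win/draw/loss probabilities from independent Poisson goal counts.

    team_rate and opp_rate are expected goals per 90 minutes (assumed values).
    goal_diff is the current score difference from the team's point of view.
    """
    remaining = max(0, 90 - minute) / 90
    lam_team = team_rate * remaining
    lam_opp = opp_rate * remaining
    win = draw = loss = 0.0
    for team_goals in range(max_goals + 1):
        for opp_goals in range(max_goals + 1):
            p = poisson_pmf(team_goals, lam_team) * poisson_pmf(opp_goals, lam_opp)
            final_diff = goal_diff + team_goals - opp_goals
            if final_diff > 0:
                win += p
            elif final_diff == 0:
                draw += p
            else:
                loss += p
    total = win + draw + loss  # renormalise for the truncated tail beyond max_goals
    return {"win": win / total, "draw": draw / total, "loss": loss / total}


for label, minute, diff in [("Start (0-0)", 0, 0), ("Down 1 at HT", 45, -1),
                            ("Up 1 at 75'", 75, 1), ("Down 2 at 80'", 80, -2)]:
    probs = outcome_probabilities(minute, diff)
    print(f"{label:<15} win {probs['win']:.3f}  draw {probs['draw']:.3f}  loss {probs['loss']:.3f}")
```

One property worth noting is that this version behaves sensibly at the boundary: with no time remaining, all probability falls on the current result.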
Building a Unified Cross-Sport Framework
The best football analytics practitioners draw insights from multiple sports. Here we present a unified framework that synthesizes lessons from baseball, basketball, hockey, and American football into a cohesive analytical approach.
import pandas as pd
import numpy as np


# Unified Cross-Sport Analytics Framework for Football
class CrossSportAnalytics:
    """
    Unified framework synthesizing analytics concepts from
    baseball, basketball, hockey, and American football.
    """

    def __init__(self):
        print("Cross-Sport Analytics Framework initialized")

    def calculate_total_value(self, player_data: pd.DataFrame) -> pd.DataFrame:
        """
        Baseball-style comprehensive value metric (WAR equivalent).
        """
        result = player_data.copy()

        # Offensive value (baseball-style linear weights)
        result["offensive_value"] = (
            result["npxg_per_90"] * 0.4 +
            result["xa_per_90"] * 0.3 +
            result["key_passes_per_90"] * 0.05
        )

        # Defensive value
        result["defensive_value"] = (
            result["tackles_won_per_90"] * 0.05 +
            result["interceptions_per_90"] * 0.04 +
            result["pressures_per_90"] * 0.02
        )

        # Progression value
        result["progression_value"] = (
            result["progressive_passes_per_90"] * 0.02 +
            result["progressive_carries_per_90"] * 0.015
        )

        # Total Goals Above Replacement
        result["total_gar"] = (
            (result["offensive_value"] + result["defensive_value"] +
             result["progression_value"]) * (result["minutes"] / 90)
        )
        return result

    def calculate_spatial_efficiency(self, shot_data: pd.DataFrame) -> pd.DataFrame:
        """
        Basketball-style spatial efficiency analysis.
        """
        shot_data = shot_data.copy()

        def classify_zone(distance):
            if distance <= 6:
                return "high_value"
            elif distance <= 18:
                return "medium_value"
            return "low_value"

        shot_data["zone"] = shot_data["distance"].apply(classify_zone)

        return shot_data.groupby(["player_id", "zone"]).agg(
            shots=("shot_id", "count"),
            xg=("xg", "sum"),
            goals=("goal", "sum")
        ).reset_index().assign(
            # Goals per unit of xG; undefined (NaN) when a zone has zero xG
            efficiency=lambda df: df["goals"] / df["xg"].replace(0, np.nan)
        )

    def calculate_territorial_dominance(self, match_data: pd.DataFrame) -> pd.DataFrame:
        """
        Hockey-style territorial dominance metrics (Corsi adaptation).
        """
        result = match_data.copy()

        result["team_territory"] = (
            result["final_third_entries"] + result["box_entries"] + result["shots"]
        )
        result["opponent_territory"] = (
            result["opp_final_third_entries"] + result["opp_box_entries"] +
            result["opp_shots"]
        )
        result["territorial_cf"] = (
            result["team_territory"] /
            (result["team_territory"] + result["opponent_territory"])
        )

        # PDO for regression analysis: shooting % + save %, both on a 0-100 scale
        result["pdo"] = (
            (result["goals"] / result["shots"] * 100) +
            ((result["opp_shots"] - result["opp_goals"]) / result["opp_shots"] * 100)
        )
        return result

    def calculate_game_state_value(self, events_data: pd.DataFrame) -> pd.DataFrame:
        """
        NFL-style game state value model.
        """
        result = events_data.copy()

        # State-based expected value
        zone_values = {
            "opponent_box": 0.15,
            "opponent_third": 0.05,
            "middle_third": 0.02
        }
        result["state_value"] = result["zone"].map(zone_values).fillna(0.005)

        # Game state adjustment
        def state_multiplier(row):
            if row["minute"] >= 75 and row["goal_diff"] < 0:
                return 1.3
            elif row["minute"] >= 75 and row["goal_diff"] > 0:
                return 0.7
            return 1.0

        result["state_multiplier"] = result.apply(state_multiplier, axis=1)
        result["adjusted_value"] = result["state_value"] * result["state_multiplier"]
        return result


# Usage example
framework = CrossSportAnalytics()

# Example player data
sample_player = pd.DataFrame({
    "player": ["Star Forward"],
    "minutes": [2800],
    "npxg_per_90": [0.55],
    "xa_per_90": [0.18],
    "key_passes_per_90": [2.1],
    "tackles_won_per_90": [0.9],
    "interceptions_per_90": [0.5],
    "pressures_per_90": [18.5],
    "progressive_passes_per_90": [3.2],
    "progressive_carries_per_90": [5.8]
})

result = framework.calculate_total_value(sample_player)
print("Player Total Value (GAR) Analysis")
print("=" * 60)
print(result[["player", "offensive_value", "defensive_value",
              "progression_value", "total_gar"]].to_string(index=False))
library(tidyverse)
library(R6)

# Unified Cross-Sport Analytics Framework for Football
CrossSportAnalytics <- R6Class("CrossSportAnalytics",
  public = list(
    player_data = NULL,
    match_data = NULL,

    initialize = function() {
      message("Cross-Sport Analytics Framework initialized")
    },

    # Baseball: WAR-style comprehensive value metric
    calculate_total_value = function(player_data) {
      player_data %>%
        mutate(
          # Offensive value (baseball-style linear weights)
          offensive_value = npxg_per_90 * 0.4 + xa_per_90 * 0.3 +
            key_passes_per_90 * 0.05,
          # Defensive value
          defensive_value = tackles_won_per_90 * 0.05 +
            interceptions_per_90 * 0.04 +
            pressures_per_90 * 0.02,
          # Progression value
          progression_value = progressive_passes_per_90 * 0.02 +
            progressive_carries_per_90 * 0.015,
          # Total Goals Above Replacement
          total_gar = (offensive_value + defensive_value + progression_value) *
            (minutes / 90)
        )
    },

    # Basketball: Spatial efficiency analysis
    calculate_spatial_efficiency = function(shot_data) {
      shot_data %>%
        mutate(
          zone = case_when(
            distance <= 6 ~ "high_value",
            distance <= 18 ~ "medium_value",
            TRUE ~ "low_value"
          )
        ) %>%
        group_by(player_id, zone) %>%
        summarise(
          shots = n(),
          xg = sum(xg),
          goals = sum(goal),
          # Goals per unit of xG within the zone
          efficiency = mean(goal) / mean(xg),
          .groups = "drop"
        )
    },

    # Hockey: Territorial dominance metrics
    calculate_territorial_dominance = function(match_data) {
      match_data %>%
        mutate(
          team_territory = final_third_entries + box_entries + shots,
          opponent_territory = opp_final_third_entries + opp_box_entries + opp_shots,
          territorial_cf = team_territory / (team_territory + opponent_territory),
          # PDO for regression analysis: shooting % + save %, both on a 0-100 scale
          pdo = (goals / shots * 100) + ((opp_shots - opp_goals) / opp_shots * 100)
        )
    },

    # NFL: Game state value model
    calculate_game_state_value = function(events_data) {
      events_data %>%
        mutate(
          # State-based expected value
          state_value = case_when(
            zone == "opponent_box" ~ 0.15,
            zone == "opponent_third" ~ 0.05,
            zone == "middle_third" ~ 0.02,
            TRUE ~ 0.005
          ),
          # Game state adjustment
          state_multiplier = case_when(
            minute >= 75 & goal_diff < 0 ~ 1.3,
            minute >= 75 & goal_diff > 0 ~ 0.7,
            TRUE ~ 1.0
          ),
          adjusted_value = state_value * state_multiplier
        )
    },

    # Integrated player evaluation
    evaluate_player = function(player_id, player_data, shot_data, match_data) {
      total_value <- self$calculate_total_value(
        player_data %>% filter(player == player_id)
      )
      spatial <- self$calculate_spatial_efficiency(
        shot_data %>% filter(player_id == !!player_id)
      )

      list(
        player = player_id,
        total_gar = sum(total_value$total_gar),
        offensive_contribution = sum(total_value$offensive_value),
        defensive_contribution = sum(total_value$defensive_value),
        high_value_shot_efficiency = spatial %>%
          filter(zone == "high_value") %>%
          pull(efficiency) %>%
          first()
      )
    }
  )
)

# Usage example
framework <- CrossSportAnalytics$new()

# Example player data
sample_player <- tibble(
  player = "Star Forward",
  minutes = 2800,
  npxg_per_90 = 0.55,
  xa_per_90 = 0.18,
  key_passes_per_90 = 2.1,
  tackles_won_per_90 = 0.9,
  interceptions_per_90 = 0.5,
  pressures_per_90 = 18.5,
  progressive_passes_per_90 = 3.2,
  progressive_carries_per_90 = 5.8
)

result <- framework$calculate_total_value(sample_player)
print("Player Total Value (GAR) Analysis:")
print(result %>% select(player, offensive_value, defensive_value,
                        progression_value, total_gar))
| Concept | Origin Sport | Football Application | Key Insight |
|---|---|---|---|
| WAR/Replacement Level | Baseball | Goals Above Replacement (GAR) | Value players against realistic alternatives |
| Shot Charts/Zones | Basketball | xG Maps with Zone Analysis | Shot location matters as much as volume |
| Plus-Minus (RAPM) | Basketball | On-field xG Differential | Team performance with player on/off field |
| Corsi/Fenwick | Hockey | Territorial Dominance % | Possession proxy through shot attempts |
| PDO | Hockey | Conversion + Save Rate Index | Identify luck/regression candidates |
| EPA | American Football | xG State Change | Value actions by resulting game state |
| Win Probability | American Football | Live Match Win Probability | Real-time outcome likelihood |
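The plus-minus row in the table has no counterpart in the framework class, so a minimal sketch of an on/off xG differential may help. It assumes a hypothetical `segments` table in which each row is a stretch of play with unchanged line-ups. The column names (`minutes`, `xg_for`, `xg_against`, `on_pitch`) and the sample values are illustrative only. This is a raw on/off comparison, not a regularised RAPM, so it makes no adjustment for teammates or opponents.

```python
import pandas as pd


def net_xg_per_90(rows: pd.DataFrame) -> float:
    """Net xG (for minus against) per 90 minutes over the given segments."""
    minutes = rows["minutes"].sum()
    if minutes == 0:
        return float("nan")
    return (rows["xg_for"].sum() - rows["xg_against"].sum()) / minutes * 90


def on_off_xg_differential(segments: pd.DataFrame) -> pd.Series:
    """Team net xG per 90 with the player on the pitch versus off it."""
    on = net_xg_per_90(segments[segments["on_pitch"]])
    off = net_xg_per_90(segments[~segments["on_pitch"]])
    return pd.Series({"on_per_90": on, "off_per_90": off, "on_off_diff": on - off})


# Made-up segments for one player's team (illustrative values only)
segments = pd.DataFrame({
    "minutes":    [60, 30, 90, 45, 45],
    "xg_for":     [1.2, 0.3, 1.8, 0.9, 0.4],
    "xg_against": [0.6, 0.5, 0.9, 0.3, 0.7],
    "on_pitch":   [True, False, True, True, False],
})
summary = on_off_xg_differential(segments)
print(summary.round(3))
```

Normalising by minutes before taking the difference matters, since most players spend far more time on the pitch than off it and raw totals would not be comparable.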
Practice Exercises
Exercise 58.1: Build a Football WAR Model
Create a comprehensive Goals Above Replacement (GAR) model that accounts for position-specific contributions. Include offensive, defensive, and possession components weighted by position.
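One possible starting point, sketched under stated assumptions: take the three component columns produced by `calculate_total_value`, weight them by position, and subtract a per-position replacement level. The weight table, the `position` column, the use of the 20th percentile as a replacement proxy, and the sample players are all hypothetical. A fuller solution would fit the weights to data and define replacement level from players who are realistically available.

```python
import pandas as pd

COMPONENTS = ["offensive_value", "defensive_value", "progression_value"]

# Hypothetical position weights (assumed for illustration, not fitted)
POSITION_WEIGHTS = pd.DataFrame(
    {
        "FW": [1.0, 0.5, 0.8],
        "MF": [0.8, 0.8, 1.0],
        "DF": [0.5, 1.0, 0.8],
    },
    index=COMPONENTS,
).T  # rows = positions, columns = components


def position_adjusted_gar(players: pd.DataFrame,
                          replacement_quantile: float = 0.2) -> pd.DataFrame:
    """Minutes-scaled value above a per-position replacement level."""
    out = players.copy()
    # Look up each player's weight row by position, then take the weighted sum
    weights = POSITION_WEIGHTS.loc[out["position"], COMPONENTS].to_numpy()
    out["weighted_per_90"] = (out[COMPONENTS].to_numpy() * weights).sum(axis=1)
    # Replacement level: a low quantile of per-90 value within each position
    out["replacement_per_90"] = out.groupby("position")["weighted_per_90"].transform(
        lambda s: s.quantile(replacement_quantile)
    )
    out["gar"] = (
        (out["weighted_per_90"] - out["replacement_per_90"]) * out["minutes"] / 90
    )
    return out


# Made-up players (illustrative values only)
players = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E", "F"],
    "position": ["FW", "FW", "MF", "MF", "DF", "DF"],
    "minutes": [2700, 900, 2400, 1200, 3000, 600],
    "offensive_value": [0.30, 0.10, 0.15, 0.08, 0.04, 0.02],
    "defensive_value": [0.10, 0.08, 0.30, 0.20, 0.45, 0.30],
    "progression_value": [0.15, 0.05, 0.25, 0.10, 0.12, 0.08],
})
gar = position_adjusted_gar(players)
print(gar[["player", "position", "weighted_per_90",
           "replacement_per_90", "gar"]].round(2))
```

Unlike `total_gar` in the framework class, which scales raw per-90 value by minutes, this version can go negative for players who fall below the replacement baseline.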
Exercise 58.2: PDO Regression Analysis
Analyze PDO (shooting percentage + save percentage) across a league season. Identify teams with extreme PDO values and track their subsequent performance to test whether they regress to the mean.
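A split-half comparison is one simple way to begin: compute PDO for each team in the first and second halves of the season and check whether extreme values drift back toward 100. The sketch below assumes a hypothetical table with one row per team and half, reusing the framework's column names (`goals`, `shots`, `opp_goals`, `opp_shots`). The three teams and their numbers are made up for illustration.

```python
import pandas as pd


def add_pdo(df: pd.DataFrame) -> pd.DataFrame:
    """Add shooting %, save %, and PDO (their sum) on a 0-100 scale each."""
    out = df.copy()
    out["shooting_pct"] = out["goals"] / out["shots"] * 100
    out["save_pct"] = (out["opp_shots"] - out["opp_goals"]) / out["opp_shots"] * 100
    out["pdo"] = out["shooting_pct"] + out["save_pct"]
    return out


# Made-up half-season totals (illustrative values only)
halves = pd.DataFrame({
    "team": ["A", "B", "C", "A", "B", "C"],
    "half": ["first"] * 3 + ["second"] * 3,
    "goals": [30, 20, 12, 24, 21, 18],
    "shots": [200, 200, 200, 200, 200, 200],
    "opp_goals": [15, 20, 28, 20, 21, 22],
    "opp_shots": [200, 200, 200, 200, 200, 200],
})

# One row per team, one PDO column per half
pdo = add_pdo(halves).pivot(index="team", columns="half", values="pdo")
pdo["change"] = pdo["second"] - pdo["first"]
# True when the second-half PDO sits closer to 100 than the first-half PDO
pdo["moved_toward_100"] = (pdo["second"] - 100).abs() < (pdo["first"] - 100).abs()
print(pdo.round(1))
```

On real data, the correlation between first-half and second-half PDO across all teams is a more informative summary than the per-team flag, which is shown here only to make the idea concrete.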
Exercise 58.3: Win Probability Model
Build a live win probability model using historical match data. The model should update based on goals, time remaining, home/away status, and red cards. Calibrate against historical outcomes.
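The calibration step can be sketched independently of whichever model is chosen: bin the predicted win probabilities, then compare the mean prediction in each bin with the observed win rate. The helper name `calibration_table` and the bin count are assumptions. The data here is simulated so that predictions are calibrated by construction, standing in for real model output and match results.

```python
import numpy as np
import pandas as pd


def calibration_table(predicted, outcome, n_bins: int = 5) -> pd.DataFrame:
    """Mean predicted win probability versus observed win rate per bin."""
    df = pd.DataFrame({"predicted": predicted, "won": outcome})
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    df["bin"] = pd.cut(df["predicted"], bins=bins, include_lowest=True)
    table = df.groupby("bin", observed=True).agg(
        n=("won", "size"),
        mean_predicted=("predicted", "mean"),
        observed_rate=("won", "mean"),
    )
    # Positive gap: the model under-predicts wins in that bin
    table["gap"] = table["observed_rate"] - table["mean_predicted"]
    return table.reset_index()


# Simulated stand-in for model output: outcomes drawn with the stated
# probability, so the predictions are calibrated by construction
rng = np.random.default_rng(0)
predicted = rng.uniform(0.0, 1.0, size=20_000)
outcome = (rng.uniform(0.0, 1.0, size=20_000) < predicted).astype(int)

table = calibration_table(predicted, outcome)
brier = float(np.mean((predicted - outcome) ** 2))
print(table.to_string(index=False, float_format="{:.3f}".format))
print(f"Brier score: {brier:.3f}")
```

With a real model, systematic gaps in particular bins (for example, over-confidence above 0.8) point to where the game-state features or the draw handling need work.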
Chapter Summary
Key Takeaways
- Baseball teaches us about replacement level thinking, market inefficiencies, and the value of open data ecosystems
- Basketball pioneered spatial analytics and plus-minus metrics that adapt well to football's continuous flow
- Hockey developed low-scoring game analytics including xG models and PDO regression concepts
- American Football offers state-based value models and win probability frameworks
- A unified framework combining these insights creates more robust football analytics
- Cross-sport learning accelerates football analytics maturity without repeating other sports' mistakes
Understanding how other sports solved analytical challenges provides valuable shortcuts for football analytics development. The best practitioners draw from multiple traditions while adapting concepts to football's unique characteristics.