Chapter 3

Football Data Sources

Learning Objectives
  • Understand the landscape of football data providers
  • Access and use StatsBomb open data
  • Scrape data from FBref, Understat, and Transfermarkt
  • Know when to use free vs. commercial data sources
  • Build your personal football database

3.1 The Football Data Landscape

The football data ecosystem has grown dramatically in recent years. Understanding what's available—and where to find it—is the first step to becoming a proficient analyst.

Categories of Football Data

Free Data Sources
  • StatsBomb Open Data: Event data
  • FBref: Aggregate stats
  • Understat: xG data
  • Transfermarkt: Market values
  • Football-Data.co.uk: Historical results
Commercial Data Sources
  • StatsBomb: Premium event data
  • Opta (Stats Perform): Industry standard
  • Second Spectrum: Tracking data
  • SkillCorner: Broadcast tracking
  • Wyscout: Video + data
The Good News

You can learn professional-level football analytics using only free data sources. The techniques you'll learn in this textbook apply equally to commercial data.

3.2 StatsBomb Open Data

StatsBomb's open data initiative is a game-changer for learning football analytics. They provide free, detailed event data for select competitions that would cost thousands of dollars commercially.

Available Competitions (as of 2024)

  • FIFA World Cup (2018, 2022): Full tournament coverage
  • UEFA Euro (2020): Complete event data
  • FA Women's Super League (2018-2021): Multiple seasons
  • UEFA Women's Euro (2022): Full tournament
  • La Liga (2004-2020, Messi/Barcelona era): Historical treasure
  • Champions League (select finals): Big matches
statsbomb_access.py
from statsbombpy import sb
import pandas as pd

# 1. GET ALL AVAILABLE COMPETITIONS
competitions = sb.competitions()
print("Available competitions:")
print(competitions[['competition_name', 'season_name', 'competition_gender']].drop_duplicates())

# 2. GET MATCHES FOR A COMPETITION
# World Cup 2022
wc_matches = sb.matches(competition_id=43, season_id=106)
print(f"\nWorld Cup 2022 matches: {len(wc_matches)}")
print(wc_matches[['match_date', 'home_team', 'away_team', 'home_score', 'away_score']].head())

# 3. GET EVENTS FOR A MATCH
events = sb.events(match_id=3869685)  # WC Final
print(f"\nEvents in WC Final: {len(events)}")

# 4. GET LINEUPS
lineups = sb.lineups(match_id=3869685)
print("\nArgentina starting XI:")
argentina_lineup = lineups['Argentina']
starters = argentina_lineup[argentina_lineup['positions'].apply(
    lambda x: x[0]['start_reason'] == 'Starting XI' if x else False
)]
print(starters[['player_name', 'jersey_number']].head(11))

# 5. GET 360 FREEZE FRAMES (where available)
# Provides positions of ALL players at moment of event
try:
    frames = sb.frames(match_id=3869685)
    print(f"\n360 frames available: {len(frames)}")
except Exception:
    print("\n360 data not available for this match")

# 6. BATCH DOWNLOAD - All events for a competition
def get_all_competition_events(competition_id, season_id):
    """Download all events for a competition."""
    matches = sb.matches(competition_id=competition_id, season_id=season_id)
    all_events = []

    for idx, match in matches.iterrows():
        events = sb.events(match_id=match['match_id'])
        events['match_id'] = match['match_id']
        events['competition'] = match['competition']
        all_events.append(events)
        print(f"Downloaded match {idx+1}/{len(matches)}")

    return pd.concat(all_events, ignore_index=True)

# Uncomment to download (takes a few minutes)
# wc_events = get_all_competition_events(43, 106)
statsbomb_access.R
library(StatsBombR)
library(tidyverse)

# 1. GET ALL AVAILABLE COMPETITIONS
competitions <- FreeCompetitions()
cat("Available competitions:\n")
competitions %>%
  select(competition_name, season_name, competition_gender) %>%
  distinct() %>%
  print()

# 2. GET MATCHES FOR A COMPETITION
# World Cup 2022
wc_comp <- competitions %>%
  filter(competition_id == 43, season_id == 106)
wc_matches <- FreeMatches(wc_comp)
cat(sprintf("\nWorld Cup 2022 matches: %d\n", nrow(wc_matches)))
wc_matches %>%
  select(match_date, home_team.home_team_name, away_team.away_team_name,
         home_score, away_score) %>%
  head() %>%
  print()

# 3. GET EVENTS FOR A MATCH
events <- get.matchFree(wc_matches %>% filter(match_id == 3869685))
cat(sprintf("\nEvents in WC Final: %d\n", nrow(events)))

# 4. GET LINEUPS
lineups <- get.lineupsFree(wc_matches %>% filter(match_id == 3869685))
cat("\nArgentina starting XI:\n")
lineups %>%
  filter(team.name == "Argentina") %>%
  head(11) %>%
  select(player.name, jersey_number) %>%
  print()

# 5. BATCH DOWNLOAD - All events for a competition
get_all_competition_events <- function(comp_df) {
  matches <- FreeMatches(comp_df)
  all_events <- free_allevents(MatchesDF = matches)
  return(all_events)
}

# Download all World Cup 2022 events
# wc_events <- get_all_competition_events(wc_comp)

What Makes StatsBomb Data Special

  • Pressure events - Other providers often don't track pressing
  • Carry events - Ball progression while dribbling
  • Detailed shot info - Body part, technique, first-time shot
  • 360 freeze frames - Positions of all 22 players
  • High-quality xG - Among the best public models
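Pressures and carries can be counted straight from the events table. A minimal sketch with toy data standing in for the DataFrame returned by `sb.events()` (which includes `type` and `team` columns among many others):

```python
import pandas as pd

# Toy stand-in for a StatsBomb events DataFrame
events = pd.DataFrame({
    "type": ["Pass", "Pressure", "Carry", "Pressure", "Shot", "Carry"],
    "team": ["Argentina", "France", "Argentina", "France", "Argentina", "France"],
})

# Count pressing and carrying actions per team
counts = (events[events["type"].isin(["Pressure", "Carry"])]
          .groupby(["team", "type"]).size().unstack(fill_value=0))
print(counts)
```

The same two lines work unchanged on a real match's events, which is exactly why having Pressure and Carry rows in the data matters.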

3.3 FBref and Sports Reference

FBref (part of Sports Reference) provides free aggregate statistics powered by Opta data. It's the go-to source for comparing players across entire seasons.

fbref_access.py
import soccerdata as sd

# Initialize FBref scraper
fbref = sd.FBref(leagues="ENG-Premier League", seasons="2023-2024")

# 1. GET PLAYER SEASON STATS
# Standard stats (goals, assists, minutes)
standard = fbref.read_player_season_stats(stat_type="standard")
print("Standard stats columns:")
print(standard.columns.tolist()[:15])

# 2. SHOOTING STATS
shooting = fbref.read_player_season_stats(stat_type="shooting")
top_shooters = shooting.nlargest(10, ('Expected', 'xG'))
print("\nTop xG players:")
print(top_shooters[[('Shooting', 'Gls'), ('Expected', 'xG')]].head(10))

# 3. PASSING STATS
passing = fbref.read_player_season_stats(stat_type="passing")

# 4. DEFENSIVE STATS
defense = fbref.read_player_season_stats(stat_type="defense")

# 5. POSSESSION STATS (touches, carries, etc.)
possession = fbref.read_player_season_stats(stat_type="possession")

# 6. TEAM STATS
team_stats = fbref.read_team_season_stats(stat_type="standard")
print("\nTeam standings:")
print(team_stats.head())

# 7. MATCH SCHEDULES AND RESULTS
schedule = fbref.read_schedule()
print(f"\nMatches in schedule: {len(schedule)}")

# 8. COMBINE MULTIPLE STAT TYPES
# Merge standard and shooting stats
combined = standard.join(shooting, rsuffix='_shoot')
print(f"\nCombined stats shape: {combined.shape}")
fbref_access.R
library(worldfootballR)
library(tidyverse)

# 1. GET PLAYER SEASON STATS
# Big 5 leagues standard stats
standard <- fb_big5_advanced_season_stats(
  season_end_year = 2024,
  stat_type = "standard",
  team_or_player = "player"
)
cat("Standard stats columns:\n")
print(names(standard)[1:15])

# Filter to Premier League
pl_standard <- standard %>% filter(Comp == "Premier League")

# 2. SHOOTING STATS
shooting <- fb_big5_advanced_season_stats(
  season_end_year = 2024,
  stat_type = "shooting",
  team_or_player = "player"
) %>%
  filter(Comp == "Premier League")

cat("\nTop xG players:\n")
shooting %>%
  arrange(desc(xG)) %>%
  select(Player, Squad, Gls, xG) %>%
  head(10) %>%
  print()

# 3. PASSING STATS
passing <- fb_big5_advanced_season_stats(
  season_end_year = 2024,
  stat_type = "passing",
  team_or_player = "player"
)

# 4. TEAM STATS
team_stats <- fb_big5_advanced_season_stats(
  season_end_year = 2024,
  stat_type = "standard",
  team_or_player = "team"
)
cat("\nTeam stats:\n")
team_stats %>%
  filter(Comp == "Premier League") %>%
  head() %>%
  print()

# 5. MATCH RESULTS
matches <- fb_match_results(
  country = "ENG",
  gender = "M",
  season_end_year = 2024,
  tier = "1st"
)
cat(sprintf("\nPremier League matches: %d\n", nrow(matches)))

FBref Stat Categories

  • Standard: Goals, assists, minutes, starts, xG, xAG
  • Shooting: Shots, SoT%, xG, npxG, Goals - xG
  • Passing: Completion %, progressive passes, key passes, xA
  • Defense: Tackles, blocks, interceptions, clearances
  • Possession: Touches, carries, progressive carries, dribbles
  • Miscellaneous: Fouls, cards, aerials, recoveries
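Because these tables report season totals, comparing players with different minutes usually means normalizing per 90. A minimal sketch with made-up numbers (real FBref column names vary by stat_type):

```python
import pandas as pd

# Toy season totals; FBref's actual columns differ per stat table
players = pd.DataFrame({
    "player": ["A", "B"],
    "minutes": [2700, 900],
    "npxG": [13.5, 6.0],
})

# Per-90 rate: total divided by the number of 90-minute blocks played
players["npxG_per90"] = players["npxG"] / (players["minutes"] / 90)
print(players)
```

Player B's smaller total hides a higher per-90 rate (0.60 vs 0.45), which is the whole point of the normalization.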

3.4 Understat

Understat provides shot-level xG data for the top 5 European leagues plus the Russian Premier League, going back to 2014. It's excellent for historical xG analysis.
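The typical shot-level workflow is summing xG per player and comparing it to actual goals. A sketch on simulated shots (the column names mirror Understat's `player`, `xG`, and `result` fields, but the values here are invented):

```python
import pandas as pd

# Simulated shot-level data in Understat's shape
shots = pd.DataFrame({
    "player": ["X", "X", "X", "Y", "Y"],
    "xG": [0.10, 0.45, 0.05, 0.30, 0.70],
    "result": ["Goal", "MissedShots", "Goal", "Goal", "SavedShot"],
})

# Goals vs expected goals per player
perf = (shots.assign(goal=(shots["result"] == "Goal").astype(int))
        .groupby("player")[["goal", "xG"]].sum())
perf["diff"] = perf["goal"] - perf["xG"]  # positive = overperforming xG
print(perf)
```

A positive `diff` sustained over many shots suggests above-average finishing; over a handful of shots it is mostly noise.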

understat_access.py
import soccerdata as sd
# Or use: from understatapi import UnderstatClient

# Initialize Understat scraper
understat = sd.Understat(leagues="ENG-Premier League", seasons="2023-2024")

# 1. GET PLAYER STATS
player_stats = understat.read_player_season_stats()
print("Top scorers with xG:")
print(player_stats.nlargest(10, 'goals')[['player', 'team', 'goals', 'xG', 'shots']])

# 2. GET TEAM STATS
team_stats = understat.read_team_season_stats()
print("\nTeam xG stats:")
print(team_stats[['team', 'xG', 'xGA', 'scored', 'missed']])

# 3. GET SHOT-LEVEL DATA
# This is Understat's killer feature
shot_data = understat.read_shot_events()
print(f"\nTotal shots this season: {len(shot_data)}")

# Analyze a single player's shots
haaland_shots = shot_data[shot_data['player'] == 'Erling Haaland']
print(f"\nHaaland shots: {len(haaland_shots)}")
print(f"Haaland total xG: {haaland_shots['xG'].sum():.2f}")
print(f"Haaland goals: {(haaland_shots['result'] == 'Goal').sum()}")

# 4. ANALYZE SHOT QUALITY
avg_xg_by_result = shot_data.groupby('result')['xG'].mean()
print("\nAverage xG by shot result:")
print(avg_xg_by_result)
understat_access.R
library(worldfootballR)
library(tidyverse)

# 1. GET PLAYER STATS FROM UNDERSTAT
player_stats <- understat_league_season_shots(
  league = "EPL",
  season_start_year = 2023
)

cat("Top scorers with xG:\n")
player_stats %>%
  group_by(player) %>%
  summarise(
    shots = n(),
    goals = sum(result == "Goal"),
    xG = sum(xG)
  ) %>%
  arrange(desc(goals)) %>%
  head(10) %>%
  print()

# 2. GET TEAM-LEVEL xG
team_stats <- understat_team_stats_breakdown(
  team_url = "https://understat.com/team/Manchester_City/2023"
)
cat("\nTeam breakdown:\n")
print(team_stats)

# 3. ANALYZE SHOT DISTRIBUTION
cat("\nShot analysis:\n")
player_stats %>%
  filter(str_detect(player, "Haaland")) %>%
  summarise(
    shots = n(),
    goals = sum(result == "Goal"),
    xG = sum(xG),
    avg_xG = mean(xG)
  ) %>%
  print()

# 4. XG BY SHOT RESULT
player_stats %>%
  group_by(result) %>%
  summarise(
    count = n(),
    avg_xG = mean(xG)
  ) %>%
  arrange(desc(avg_xG)) %>%
  print()

3.5 Transfermarkt

Transfermarkt is the definitive source for player market values, transfer history, contract information, and squad compositions. Essential for business-side analytics.
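One practical wrinkle: scraped market values sometimes arrive as display strings like "€80.00m" rather than numbers. The parser below is an illustrative assumption about that formatting, not a function from soccerdata or worldfootballR:

```python
import re

def parse_market_value(text):
    """Convert a Transfermarkt-style value string ('€80.00m', '€500k') to euros."""
    match = re.match(r"€?([\d.]+)\s*([mk]?)", text.strip().lower())
    if not match:
        return None  # e.g. 'Free transfer' or missing values
    number, suffix = float(match.group(1)), match.group(2)
    multiplier = {"m": 1_000_000, "k": 1_000, "": 1}[suffix]
    return number * multiplier

print(parse_market_value("€80.00m"))  # 80000000.0
print(parse_market_value("€500k"))    # 500000.0
```

Always check which form your scraper returns before summing squad values; mixing parsed and unparsed columns is a common source of silent errors.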

transfermarkt_access.py
import soccerdata as sd

# Initialize Transfermarkt scraper
tm = sd.Transfermarkt(leagues="ENG-Premier League", seasons="2023-2024")

# 1. GET PLAYER MARKET VALUES
player_values = tm.read_player_market_values()
print("Most valuable players:")
print(player_values.nlargest(10, 'market_value')[['player', 'team', 'market_value', 'age']])

# 2. GET TEAM MARKET VALUES
team_values = player_values.groupby('team')['market_value'].sum().sort_values(ascending=False)
print("\nMost valuable squads:")
print(team_values.head(10))

# 3. AGE PROFILE
avg_age = player_values.groupby('team')['age'].mean().sort_values()
print("\nYoungest squads:")
print(avg_age.head(5))

# 4. VALUE PER AGE
value_by_age = player_values.groupby('age')['market_value'].agg(['mean', 'count'])
print("\nValue by age (with sufficient sample):")
print(value_by_age[value_by_age['count'] > 10].head(10))
transfermarkt_access.R
library(worldfootballR)
library(tidyverse)

# 1. GET PLAYER MARKET VALUES
player_values <- tm_player_market_values(
  country_name = "England",
  start_year = 2023
)

cat("Most valuable players:\n")
player_values %>%
  arrange(desc(player_market_value_euro)) %>%
  select(player_name, squad, player_market_value_euro, player_age) %>%
  head(10) %>%
  print()

# 2. GET TEAM MARKET VALUES
team_values <- player_values %>%
  group_by(squad) %>%
  summarise(total_value = sum(player_market_value_euro, na.rm = TRUE)) %>%
  arrange(desc(total_value))

cat("\nMost valuable squads:\n")
team_values %>% head(10) %>% print()

# 3. GET TRANSFER HISTORY
transfers <- tm_team_transfers(
  team_url = "https://www.transfermarkt.com/manchester-city/startseite/verein/281"
)
cat("\nRecent transfers:\n")
transfers %>% head(10) %>% print()

3.6 Commercial Data Providers

While free data is excellent for learning, professional analysts often work with commercial data. Understanding what's available helps you appreciate the full landscape.

Opta (Stats Perform)

The industry standard for event data. Used by most major leagues, broadcasters, and clubs.

  • Every on-ball action tracked
  • Used by Premier League, La Liga, Serie A
  • FBref data is powered by Opta
Second Spectrum

Premier League's official tracking data provider since 2019.

  • Optical tracking (cameras)
  • 25 frames per second
  • All 22 players + ball position
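To see why 25 frames per second matters, note that player speed falls straight out of consecutive position samples. A hedged sketch (pitch coordinates in metres and the 25 fps rate are assumptions; real Second Spectrum schemas differ):

```python
import math

FPS = 25  # assumed sampling rate: 25 position samples per second

def speed_m_s(p0, p1, fps=FPS):
    """Instantaneous speed from two consecutive (x, y) positions in metres."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    return math.hypot(dx, dy) * fps  # distance per frame * frames per second

# A player covering 0.3 m between frames is moving at 7.5 m/s (sprinting)
print(speed_m_s((0.0, 30.0), (0.3, 30.0)))
```

In practice you would smooth over several frames before reporting speeds, since single-frame differences amplify tracking jitter.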
SkillCorner

Tracking data derived from broadcast video feeds.

  • More accessible than optical tracking
  • Covers 20+ leagues
  • Physical metrics (speed, distance)
Wyscout

Combined video and data platform popular with scouts.

  • Video clips of every action
  • Global coverage (100+ leagues)
  • Used by 1000+ clubs worldwide
Pricing Note

Commercial data subscriptions typically start at $10,000-50,000/year for basic packages. Full tracking data can cost $100,000+/year. Academic licenses are sometimes available at reduced rates.

3.7 Data Source Visualizations

Visualizing data availability and coverage helps you choose the right source for your analysis.

Data Coverage Comparison Chart

# Data source feature comparison
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Create comparison matrix
data = {
    "Source": ["StatsBomb", "FBref", "Understat", "Transfermarkt", "WhoScored"],
    "Event Data": [5, 2, 3, 0, 2],
    "xG Models": [5, 4, 5, 0, 4],
    "Historical": [3, 5, 4, 5, 4],
    "Market Values": [0, 0, 0, 5, 0],
    "Free Access": [4, 5, 5, 5, 4]
}
df = pd.DataFrame(data)
df.set_index("Source", inplace=True)

# Create heatmap
fig, ax = plt.subplots(figsize=(10, 6))
heatmap = sns.heatmap(df, annot=True, fmt="d", cmap="Greens",
                      cbar=True, linewidths=2, linecolor="white",
                      annot_kws={"fontsize": 14, "fontweight": "bold"})

ax.set_title("Football Data Source Comparison\nRatings: 0 (None) to 5 (Excellent)",
             fontsize=16, fontweight="bold", pad=20)
ax.set_xlabel("Feature Category", fontsize=12)
ax.set_ylabel("Data Source", fontsize=12)
plt.xticks(rotation=45, ha="right")

plt.tight_layout()
plt.show()

# Data source feature comparison
library(tidyverse)
library(ggplot2)

# Create comparison matrix
data_sources <- data.frame(
  Source = c("StatsBomb", "FBref", "Understat", "Transfermarkt", "WhoScored"),
  Event_Data = c(5, 2, 3, 0, 2),
  xG_Models = c(5, 4, 5, 0, 4),
  Historical = c(3, 5, 4, 5, 4),
  Market_Values = c(0, 0, 0, 5, 0),
  Free_Access = c(4, 5, 5, 5, 4)
)

# Reshape for plotting
data_long <- data_sources %>%
  pivot_longer(cols = -Source, names_to = "Feature", values_to = "Score")

# Create heatmap
ggplot(data_long, aes(x = Feature, y = Source, fill = Score)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = Score), color = "white", fontface = "bold", size = 5) +
  scale_fill_gradient(low = "#CCCCCC", high = "#1B5E20", limits = c(0, 5)) +
  labs(
    title = "Football Data Source Comparison",
    subtitle = "Ratings from 0 (None) to 5 (Excellent)",
    x = "Feature Category",
    y = "Data Source",
    fill = "Rating"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    axis.text = element_text(size = 11),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )

Competition Coverage by Source

# Competition coverage bar chart
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Data
sources = ["StatsBomb Free", "FBref", "Understat", "Transfermarkt"]
competitions = ["Premier League", "La Liga", "Serie A", "Bundesliga", "Ligue 1"]

data = {
    "StatsBomb Free": [0, 16, 0, 0, 0],
    "FBref": [30, 30, 30, 30, 30],
    "Understat": [10, 10, 10, 10, 10],
    "Transfermarkt": [50, 50, 50, 50, 50]
}

# Plot
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(competitions))
width = 0.2
colors = ["#E63946", "#457B9D", "#2A9D8F", "#E9C46A"]

for i, (source, color) in enumerate(zip(sources, colors)):
    offset = width * (i - 1.5)
    ax.bar(x + offset, data[source], width, label=source, color=color)

ax.set_xlabel("Competition", fontsize=12)
ax.set_ylabel("Seasons Available", fontsize=12)
ax.set_title("Historical Data Coverage by League\nApproximate seasons of data available",
             fontsize=16, fontweight="bold")
ax.set_xticks(x)
ax.set_xticklabels(competitions, rotation=45, ha="right")
ax.legend(loc="upper right")

plt.tight_layout()
plt.show()

# Competition coverage bar chart
library(ggplot2)

coverage <- data.frame(
  Source = rep(c("StatsBomb Free", "FBref", "Understat", "Transfermarkt"), each = 5),
  Competition = rep(c("Premier League", "La Liga", "Serie A", "Bundesliga", "Ligue 1"), 4),
  Seasons = c(
    # StatsBomb Free
    0, 16, 0, 0, 0,
    # FBref
    30, 30, 30, 30, 30,
    # Understat
    10, 10, 10, 10, 10,
    # Transfermarkt
    50, 50, 50, 50, 50
  )
)

ggplot(coverage, aes(x = Competition, y = Seasons, fill = Source)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c(
    "StatsBomb Free" = "#E63946",
    "FBref" = "#457B9D",
    "Understat" = "#2A9D8F",
    "Transfermarkt" = "#E9C46A"
  )) +
  labs(
    title = "Historical Data Coverage by League",
    subtitle = "Approximate seasons of data available",
    x = "Competition",
    y = "Seasons Available",
    fill = "Data Source"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

Data Freshness Timeline

# Data update frequency timeline
import matplotlib.pyplot as plt
import numpy as np

sources = ["StatsBomb", "FBref", "Understat", "Transfermarkt", "WhoScored"]
delays = [24, 2, 1, 168, 1]  # Hours
types = ["Batch", "Near-realtime", "Realtime", "Weekly", "Realtime"]

colors = {
    "Realtime": "#2A9D8F",
    "Near-realtime": "#457B9D",
    "Batch": "#E9C46A",
    "Weekly": "#E76F51"
}

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(sources, delays, color=[colors[t] for t in types])

# Add value labels
for bar, delay in zip(bars, delays):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
            f"{delay}h", ha="center", fontweight="bold")

ax.set_yscale("log")
ax.set_xlabel("Data Source", fontsize=12)
ax.set_ylabel("Delay (hours, log scale)", fontsize=12)
ax.set_title("Data Update Frequency by Source\nHours after match end until data available",
             fontsize=16, fontweight="bold")

# Legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=c, label=t) for t, c in colors.items()]
ax.legend(handles=legend_elements, loc="upper right")

plt.tight_layout()
plt.show()

# Data update frequency timeline
library(ggplot2)

update_freq <- data.frame(
  Source = c("StatsBomb", "FBref", "Understat", "Transfermarkt", "WhoScored"),
  Delay_Hours = c(24, 2, 1, 168, 1),  # Hours after match
  Update_Type = c("Batch", "Near-realtime", "Realtime", "Weekly", "Realtime")
)

ggplot(update_freq, aes(x = reorder(Source, Delay_Hours), y = Delay_Hours, fill = Update_Type)) +
  geom_col() +
  geom_text(aes(label = paste0(Delay_Hours, "h")), vjust = -0.5, fontface = "bold") +
  scale_fill_manual(values = c(
    "Realtime" = "#2A9D8F",
    "Near-realtime" = "#457B9D",
    "Batch" = "#E9C46A",
    "Weekly" = "#E76F51"
  )) +
  scale_y_log10() +
  labs(
    title = "Data Update Frequency by Source",
    subtitle = "Hours after match end until data available",
    x = "Data Source",
    y = "Delay (hours, log scale)",
    fill = "Update Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    legend.position = "bottom"
  )

3.8 Practice Exercises

Practice accessing data from different sources with these exercises.

Exercise 3.1: Explore StatsBomb Competitions

Task: Load all available StatsBomb free competitions, filter to men's competitions only, and count how many seasons are available for each competition.

# Solution 3.1: StatsBomb competition exploration
from statsbombpy import sb
import pandas as pd
import matplotlib.pyplot as plt

# Load all competitions
comps = sb.competitions()

# Filter to men's competitions and count seasons
mens = comps[comps["competition_gender"] == "male"]
summary = mens.groupby("competition_name").agg(
    seasons=("season_name", "nunique"),
    years=("season_name", lambda x: f"{min(x)} - {max(x)}")
).reset_index().sort_values("seasons", ascending=False)

print(summary)

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(summary["competition_name"], summary["seasons"], color="#1B5E20")
ax.set_xlabel("Seasons Available")
ax.set_ylabel("Competition")
ax.set_title("StatsBomb Free Data: Seasons by Competition")
plt.tight_layout()
plt.show()

# Solution 3.1: StatsBomb competition exploration
library(StatsBombR)
library(tidyverse)

# Load all competitions
comps <- FreeCompetitions()

# Filter to men's competitions and count seasons
mens_summary <- comps %>%
  filter(competition_gender == "male") %>%
  group_by(competition_name) %>%
  summarise(
    seasons = n(),
    years = paste(range(season_name), collapse = " - "),
    .groups = "drop"
  ) %>%
  arrange(desc(seasons))

print(mens_summary)

# Visualize
ggplot(mens_summary, aes(x = reorder(competition_name, seasons), y = seasons)) +
  geom_col(fill = "#1B5E20") +
  coord_flip() +
  labs(title = "StatsBomb Free Data: Seasons by Competition",
       x = "Competition", y = "Seasons Available") +
  theme_minimal()
Exercise 3.2: Compare xG Sources

Task: For a single match, compare xG values from StatsBomb with those from Understat (if available). Calculate the correlation and visualize the differences.

# Solution 3.2: xG source comparison (conceptual example)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Simulated comparison data
np.random.seed(42)
xg_comparison = pd.DataFrame({
    "shot_id": range(1, 21),
    "statsbomb_xg": np.random.uniform(0.02, 0.65, 20),
    "understat_xg": np.random.uniform(0.02, 0.65, 20)
})
xg_comparison["difference"] = xg_comparison["statsbomb_xg"] - xg_comparison["understat_xg"]

# Correlation
cor_value = xg_comparison["statsbomb_xg"].corr(xg_comparison["understat_xg"])
print(f"Correlation: {cor_value:.3f}")

# Scatter plot
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(xg_comparison["statsbomb_xg"], xg_comparison["understat_xg"],
           s=80, alpha=0.7, color="#1B5E20", edgecolors="black")
ax.plot([0, 0.7], [0, 0.7], "r--", label="Perfect agreement")
ax.set_xlabel("StatsBomb xG", fontsize=12)
ax.set_ylabel("Understat xG", fontsize=12)
ax.set_title(f"xG Model Comparison\nCorrelation: {cor_value:.3f}", fontsize=14)
ax.set_aspect("equal")
ax.legend()
plt.tight_layout()
plt.show()

# Solution 3.2: xG source comparison (conceptual example)
library(tidyverse)

# Simulated comparison data (in practice you'd merge real data)
xg_comparison <- data.frame(
  shot_id = 1:20,
  statsbomb_xg = runif(20, 0.02, 0.65),
  understat_xg = runif(20, 0.02, 0.65)
) %>%
  mutate(difference = statsbomb_xg - understat_xg)

# Correlation
cor_value <- cor(xg_comparison$statsbomb_xg, xg_comparison$understat_xg)
cat("Correlation:", round(cor_value, 3), "\n")

# Scatter plot comparison
ggplot(xg_comparison, aes(x = statsbomb_xg, y = understat_xg)) +
  geom_point(size = 3, alpha = 0.7, color = "#1B5E20") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "xG Model Comparison: StatsBomb vs Understat",
    subtitle = paste("Correlation:", round(cor_value, 3)),
    x = "StatsBomb xG",
    y = "Understat xG"
  ) +
  theme_minimal() +
  coord_equal()
Exercise 3.3: Build a Local Database

Task: Download all World Cup 2022 match events and save them to a local CSV/Parquet file for faster future access.

# Solution 3.3: Build local database
from statsbombpy import sb
import pandas as pd

# Get World Cup 2022 matches
matches = sb.matches(competition_id=43, season_id=106)
print(f"Downloading {len(matches)} matches...")

# Download all events
all_events = []
for i, match_id in enumerate(matches["match_id"]):
    print(f"Match {i+1}/{len(matches)}")
    events = sb.events(match_id=match_id)
    events["match_id"] = match_id
    all_events.append(events)

all_events = pd.concat(all_events, ignore_index=True)
print(f"Total events: {len(all_events)}")

# Save to CSV
all_events.to_csv("wc2022_events.csv", index=False)
print("Saved to wc2022_events.csv")

# For faster loading, use Parquet (requires pyarrow)
# pip install pyarrow
all_events.to_parquet("wc2022_events.parquet", index=False)
print("Saved to wc2022_events.parquet")

# Future use: just load directly
# events = pd.read_parquet("wc2022_events.parquet")

# Solution 3.3: Build local database
library(StatsBombR)
library(tidyverse)

# Get World Cup 2022
comp <- FreeCompetitions() %>%
  filter(competition_id == 43, season_id == 106)
matches <- FreeMatches(comp)

# Download all events
cat("Downloading", nrow(matches), "matches...\n")
all_events <- map_dfr(1:nrow(matches), function(i) {
  cat("Match", i, "/", nrow(matches), "\n")
  events <- get.matchFree(matches[i, ])
  events$match_id <- matches$match_id[i]
  return(events)
})

cat("Total events:", nrow(all_events), "\n")

# Save to CSV
write_csv(all_events, "wc2022_events.csv")
cat("Saved to wc2022_events.csv\n")

# For faster loading, use RDS
saveRDS(all_events, "wc2022_events.rds")

# Future use: just load directly
# events <- readRDS("wc2022_events.rds")

3.9 Summary

In this chapter, you learned:

Key Takeaways
  1. StatsBomb Open Data is the best free source for detailed event data
  2. FBref provides comprehensive aggregate stats powered by Opta
  3. Understat is excellent for shot-level xG analysis
  4. Transfermarkt is the go-to for market values and transfers
  5. Different sources have different strengths - choose based on your analysis needs
  6. Build local databases to speed up your workflow and reduce API calls
  7. You can learn professional analytics techniques using only free data
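Takeaway 6 in code: a small load-or-fetch helper caches a download locally so repeated runs skip the network. A sketch; the fetch function and file name here are placeholders you would swap for a real statsbombpy or soccerdata call:

```python
import os
import pandas as pd

def load_or_fetch(path, fetch_fn):
    """Return the cached CSV at `path` if it exists; otherwise fetch and cache it."""
    if os.path.exists(path):
        return pd.read_csv(path)
    df = fetch_fn()          # in practice: an API call or scrape
    df.to_csv(path, index=False)
    return df

# Placeholder fetch standing in for a real download
df = load_or_fetch("demo_cache.csv", lambda: pd.DataFrame({"x": [1, 2, 3]}))
print(df)
```

The first run pays the download cost; every run after that reads the local file, which also keeps you from hammering free APIs.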
What's Next

In Chapter 4: Data Visualization for Football, we'll learn to create compelling visualizations including shot maps, pass networks, and heat maps.