Chapter 3

Football Data Sources

Learning Objectives
  • Understand the landscape of football data providers
  • Access and use StatsBomb open data
  • Scrape data from FBref, Understat, and Transfermarkt
  • Know when to use free vs. commercial data sources
  • Build your personal football database

3.1 The Football Data Landscape

The football data ecosystem has grown dramatically in recent years. Understanding what's available—and where to find it—is the first step to becoming a proficient analyst.

Categories of Football Data

Free Data Sources
  • StatsBomb Open Data: Event data
  • FBref: Aggregate stats
  • Understat: xG data
  • Transfermarkt: Market values
  • Football-Data.co.uk: Historical results
Commercial Data Sources
  • StatsBomb: Premium event data
  • Opta (Stats Perform): Industry standard
  • Second Spectrum: Tracking data
  • SkillCorner: Broadcast tracking
  • Wyscout: Video + data
The Good News

You can learn professional-level football analytics using only free data sources. The techniques you'll learn in this textbook apply equally to commercial data.

3.2 StatsBomb Open Data

StatsBomb's open data initiative is a game-changer for learning football analytics. They provide free, detailed event data for select competitions that would cost thousands of dollars commercially.

Available Competitions (as of 2024)

  • FIFA World Cup (2018, 2022): Full tournament coverage
  • UEFA Euro (2020): Complete event data
  • FA Women's Super League (2018-2021): Multiple seasons
  • UEFA Women's Euro (2022): Full tournament
  • La Liga (2004-2020, Messi/Barcelona era): Historical treasure
  • Champions League (select finals): Big matches
statsbomb_access.py
from statsbombpy import sb
import pandas as pd

# 1. GET ALL AVAILABLE COMPETITIONS
competitions = sb.competitions()
print("Available competitions:")
print(competitions[['competition_name', 'season_name', 'competition_gender']].drop_duplicates())

# 2. GET MATCHES FOR A COMPETITION
# World Cup 2022
wc_matches = sb.matches(competition_id=43, season_id=106)
print(f"\nWorld Cup 2022 matches: {len(wc_matches)}")
print(wc_matches[['match_date', 'home_team', 'away_team', 'home_score', 'away_score']].head())

# 3. GET EVENTS FOR A MATCH
events = sb.events(match_id=3869685)  # WC Final
print(f"\nEvents in WC Final: {len(events)}")

# 4. GET LINEUPS
lineups = sb.lineups(match_id=3869685)
print("\nArgentina starting XI:")
argentina_lineup = lineups['Argentina']
starters = argentina_lineup[argentina_lineup['positions'].apply(
    lambda x: x[0]['start_reason'] == 'Starting XI' if x else False
)]
print(starters[['player_name', 'jersey_number']].head(11))

# 5. GET 360 FREEZE FRAMES (where available)
# Provides positions of ALL players at moment of event
try:
    frames = sb.frames(match_id=3869685)
    print(f"\n360 frames available: {len(frames)}")
except Exception:
    print("\n360 data not available for this match")

# 6. BATCH DOWNLOAD - All events for a competition
def get_all_competition_events(competition_id, season_id):
    """Download all events for a competition."""
    matches = sb.matches(competition_id=competition_id, season_id=season_id)
    all_events = []

    for idx, match in matches.iterrows():
        events = sb.events(match_id=match['match_id'])
        events['match_id'] = match['match_id']
        events['competition'] = match['competition']
        all_events.append(events)
        print(f"Downloaded match {idx+1}/{len(matches)}")

    return pd.concat(all_events, ignore_index=True)

# Uncomment to download (takes a few minutes)
# wc_events = get_all_competition_events(43, 106)
statsbomb_access.R
library(StatsBombR)
library(tidyverse)

# 1. GET ALL AVAILABLE COMPETITIONS
competitions <- FreeCompetitions()
cat("Available competitions:\n")
competitions %>%
  select(competition_name, season_name, competition_gender) %>%
  distinct() %>%
  print()

# 2. GET MATCHES FOR A COMPETITION
# World Cup 2022
wc_comp <- competitions %>%
  filter(competition_id == 43, season_id == 106)
wc_matches <- FreeMatches(wc_comp)
cat(sprintf("\nWorld Cup 2022 matches: %d\n", nrow(wc_matches)))
wc_matches %>%
  select(match_date, home_team.home_team_name, away_team.away_team_name,
         home_score, away_score) %>%
  head() %>%
  print()

# 3. GET EVENTS FOR A MATCH
events <- get.matchFree(wc_matches %>% filter(match_id == 3869685))
cat(sprintf("\nEvents in WC Final: %d\n", nrow(events)))

# 4. GET LINEUPS
lineups <- get.lineupsFree(wc_matches %>% filter(match_id == 3869685))
cat("\nArgentina starting XI:\n")
lineups %>%
  filter(team.name == "Argentina") %>%
  head(11) %>%
  select(player.name, jersey_number) %>%
  print()

# 5. BATCH DOWNLOAD - All events for a competition
get_all_competition_events <- function(comp_df) {
  matches <- FreeMatches(comp_df)
  all_events <- free_allevents(MatchesDF = matches)
  return(all_events)
}

# Download all World Cup 2022 events
# wc_events <- get_all_competition_events(wc_comp)

What Makes StatsBomb Data Special

  • Pressure events - Other providers often don't track pressing
  • Carry events - Ball progression while dribbling
  • Detailed shot info - Body part, technique, first-time shot
  • 360 freeze frames - Positions of all 22 players
  • High-quality xG - Among the best public models
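Pressures and carries can be counted straight from the events table. A minimal sketch with toy data standing in for the DataFrame returned by `sb.events()` (which includes `type` and `team` columns among many others):

```python
import pandas as pd

# Toy stand-in for a StatsBomb events DataFrame
events = pd.DataFrame({
    "type": ["Pass", "Pressure", "Carry", "Pressure", "Shot", "Carry"],
    "team": ["Argentina", "France", "Argentina", "France", "Argentina", "France"],
})

# Count pressing and carrying actions per team
counts = (events[events["type"].isin(["Pressure", "Carry"])]
          .groupby(["team", "type"]).size().unstack(fill_value=0))
print(counts)
```

The same two lines work unchanged on a real match's events, which is exactly why having Pressure and Carry rows in the data matters.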

3.3 FBref and Sports Reference

FBref (part of Sports Reference) provides free aggregate statistics powered by Opta data. It's the go-to source for comparing players across entire seasons.

fbref_access.py
import soccerdata as sd

# Initialize FBref scraper
fbref = sd.FBref(leagues="ENG-Premier League", seasons="2023-2024")

# 1. GET PLAYER SEASON STATS
# Standard stats (goals, assists, minutes)
standard = fbref.read_player_season_stats(stat_type="standard")
print("Standard stats columns:")
print(standard.columns.tolist()[:15])

# 2. SHOOTING STATS
shooting = fbref.read_player_season_stats(stat_type="shooting")
top_shooters = shooting.nlargest(10, ('Expected', 'xG'))
print("\nTop xG players:")
print(top_shooters[[('Shooting', 'Gls'), ('Expected', 'xG')]].head(10))

# 3. PASSING STATS
passing = fbref.read_player_season_stats(stat_type="passing")

# 4. DEFENSIVE STATS
defense = fbref.read_player_season_stats(stat_type="defense")

# 5. POSSESSION STATS (touches, carries, etc.)
possession = fbref.read_player_season_stats(stat_type="possession")

# 6. TEAM STATS
team_stats = fbref.read_team_season_stats(stat_type="standard")
print("\nTeam standings:")
print(team_stats.head())

# 7. MATCH SCHEDULES AND RESULTS
schedule = fbref.read_schedule()
print(f"\nMatches in schedule: {len(schedule)}")

# 8. COMBINE MULTIPLE STAT TYPES
# Merge standard and shooting stats
combined = standard.join(shooting, rsuffix='_shoot')
print(f"\nCombined stats shape: {combined.shape}")
fbref_access.R
library(worldfootballR)
library(tidyverse)

# 1. GET PLAYER SEASON STATS
# Big 5 leagues standard stats
standard <- fb_big5_advanced_season_stats(
  season_end_year = 2024,
  stat_type = "standard",
  team_or_player = "player"
)
cat("Standard stats columns:\n")
print(names(standard)[1:15])

# Filter to Premier League
pl_standard <- standard %>% filter(Comp == "Premier League")

# 2. SHOOTING STATS
shooting <- fb_big5_advanced_season_stats(
  season_end_year = 2024,
  stat_type = "shooting",
  team_or_player = "player"
) %>%
  filter(Comp == "Premier League")

cat("\nTop xG players:\n")
shooting %>%
  arrange(desc(xG)) %>%
  select(Player, Squad, Gls, xG) %>%
  head(10) %>%
  print()

# 3. PASSING STATS
passing <- fb_big5_advanced_season_stats(
  season_end_year = 2024,
  stat_type = "passing",
  team_or_player = "player"
)

# 4. TEAM STATS
team_stats <- fb_big5_advanced_season_stats(
  season_end_year = 2024,
  stat_type = "standard",
  team_or_player = "team"
)
cat("\nTeam stats:\n")
team_stats %>%
  filter(Comp == "Premier League") %>%
  head() %>%
  print()

# 5. MATCH RESULTS
matches <- fb_match_results(
  country = "ENG",
  gender = "M",
  season_end_year = 2024,
  tier = "1st"
)
cat(sprintf("\nPremier League matches: %d\n", nrow(matches)))

FBref Stat Categories

  • Standard: Goals, assists, minutes, starts, xG, xAG
  • Shooting: Shots, SoT%, xG, npxG, Goals - xG
  • Passing: Completion %, progressive passes, key passes, xA
  • Defense: Tackles, blocks, interceptions, clearances
  • Possession: Touches, carries, progressive carries, dribbles
  • Miscellaneous: Fouls, cards, aerials, recoveries
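Because these tables report season totals, comparing players with different minutes usually means normalizing per 90. A minimal sketch with made-up numbers (real FBref column names vary by stat_type):

```python
import pandas as pd

# Toy season totals; FBref's actual columns differ per stat table
players = pd.DataFrame({
    "player": ["A", "B"],
    "minutes": [2700, 900],
    "npxG": [13.5, 6.0],
})

# Per-90 rate: total divided by the number of 90-minute blocks played
players["npxG_per90"] = players["npxG"] / (players["minutes"] / 90)
print(players)
```

Player B's smaller total hides a higher per-90 rate (0.60 vs 0.45), which is the whole point of the normalization.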

3.4 Understat

Understat provides shot-level xG data for the top 5 European leagues plus the Russian Premier League, going back to 2014. It's excellent for historical xG analysis.
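The typical shot-level workflow is summing xG per player and comparing it to actual goals. A sketch on simulated shots (the column names mirror Understat's `player`, `xG`, and `result` fields, but the values here are invented):

```python
import pandas as pd

# Simulated shot-level data in Understat's shape
shots = pd.DataFrame({
    "player": ["X", "X", "X", "Y", "Y"],
    "xG": [0.10, 0.45, 0.05, 0.30, 0.70],
    "result": ["Goal", "MissedShots", "Goal", "Goal", "SavedShot"],
})

# Goals vs expected goals per player
perf = (shots.assign(goal=(shots["result"] == "Goal").astype(int))
        .groupby("player")[["goal", "xG"]].sum())
perf["diff"] = perf["goal"] - perf["xG"]  # positive = overperforming xG
print(perf)
```

A positive `diff` sustained over many shots suggests above-average finishing; over a handful of shots it is mostly noise.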

understat_access.py
import soccerdata as sd
# Or use: from understatapi import UnderstatClient

# Initialize Understat scraper
understat = sd.Understat(leagues="ENG-Premier League", seasons="2023-2024")

# 1. GET PLAYER STATS
player_stats = understat.read_player_season_stats()
print("Top scorers with xG:")
print(player_stats.nlargest(10, 'goals')[['player', 'team', 'goals', 'xG', 'shots']])

# 2. GET TEAM STATS
team_stats = understat.read_team_season_stats()
print("\nTeam xG stats:")
print(team_stats[['team', 'xG', 'xGA', 'scored', 'missed']])

# 3. GET SHOT-LEVEL DATA
# This is Understat's killer feature
shot_data = understat.read_shot_events()
print(f"\nTotal shots this season: {len(shot_data)}")

# Analyze a single player's shots
haaland_shots = shot_data[shot_data['player'] == 'Erling Haaland']
print(f"\nHaaland shots: {len(haaland_shots)}")
print(f"Haaland total xG: {haaland_shots['xG'].sum():.2f}")
print(f"Haaland goals: {(haaland_shots['result'] == 'Goal').sum()}")

# 4. ANALYZE SHOT QUALITY
avg_xg_by_result = shot_data.groupby('result')['xG'].mean()
print("\nAverage xG by shot result:")
print(avg_xg_by_result)
understat_access.R
library(worldfootballR)
library(tidyverse)

# 1. GET PLAYER STATS FROM UNDERSTAT
player_stats <- understat_league_season_shots(
  league = "EPL",
  season_start_year = 2023
)

cat("Top scorers with xG:\n")
player_stats %>%
  group_by(player) %>%
  summarise(
    shots = n(),
    goals = sum(result == "Goal"),
    xG = sum(xG)
  ) %>%
  arrange(desc(goals)) %>%
  head(10) %>%
  print()

# 2. GET TEAM-LEVEL xG
team_stats <- understat_team_stats_breakdown(
  team_url = "https://understat.com/team/Manchester_City/2023"
)
cat("\nTeam breakdown:\n")
print(team_stats)

# 3. ANALYZE SHOT DISTRIBUTION
cat("\nShot analysis:\n")
player_stats %>%
  filter(str_detect(player, "Haaland")) %>%
  summarise(
    shots = n(),
    goals = sum(result == "Goal"),
    xG = sum(xG),
    avg_xG = mean(xG)
  ) %>%
  print()

# 4. XG BY SHOT RESULT
player_stats %>%
  group_by(result) %>%
  summarise(
    count = n(),
    avg_xG = mean(xG)
  ) %>%
  arrange(desc(avg_xG)) %>%
  print()

3.5 Transfermarkt

Transfermarkt is the definitive source for player market values, transfer history, contract information, and squad compositions. Essential for business-side analytics.
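One practical wrinkle: scraped market values sometimes arrive as display strings like "€80.00m" rather than numbers. The parser below is an illustrative assumption about that formatting, not a function from soccerdata or worldfootballR:

```python
import re

def parse_market_value(text):
    """Convert a Transfermarkt-style value string ('€80.00m', '€500k') to euros."""
    match = re.match(r"€?([\d.]+)\s*([mk]?)", text.strip().lower())
    if not match:
        return None  # e.g. 'Free transfer' or missing values
    number, suffix = float(match.group(1)), match.group(2)
    multiplier = {"m": 1_000_000, "k": 1_000, "": 1}[suffix]
    return number * multiplier

print(parse_market_value("€80.00m"))  # 80000000.0
print(parse_market_value("€500k"))    # 500000.0
```

Always check which form your scraper returns before summing squad values; mixing parsed and unparsed columns is a common source of silent errors.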

transfermarkt_access.py
import soccerdata as sd

# Initialize Transfermarkt scraper
tm = sd.Transfermarkt(leagues="ENG-Premier League", seasons="2023-2024")

# 1. GET PLAYER MARKET VALUES
player_values = tm.read_player_market_values()
print("Most valuable players:")
print(player_values.nlargest(10, 'market_value')[['player', 'team', 'market_value', 'age']])

# 2. GET TEAM MARKET VALUES
team_values = player_values.groupby('team')['market_value'].sum().sort_values(ascending=False)
print("\nMost valuable squads:")
print(team_values.head(10))

# 3. AGE PROFILE
avg_age = player_values.groupby('team')['age'].mean().sort_values()
print("\nYoungest squads:")
print(avg_age.head(5))

# 4. VALUE PER AGE
value_by_age = player_values.groupby('age')['market_value'].agg(['mean', 'count'])
print("\nValue by age (with sufficient sample):")
print(value_by_age[value_by_age['count'] > 10].head(10))
transfermarkt_access.R
library(worldfootballR)
library(tidyverse)

# 1. GET PLAYER MARKET VALUES
player_values <- tm_player_market_values(
  country_name = "England",
  start_year = 2023
)

cat("Most valuable players:\n")
player_values %>%
  arrange(desc(player_market_value_euro)) %>%
  select(player_name, squad, player_market_value_euro, player_age) %>%
  head(10) %>%
  print()

# 2. GET TEAM MARKET VALUES
team_values <- player_values %>%
  group_by(squad) %>%
  summarise(total_value = sum(player_market_value_euro, na.rm = TRUE)) %>%
  arrange(desc(total_value))

cat("\nMost valuable squads:\n")
team_values %>% head(10) %>% print()

# 3. GET TRANSFER HISTORY
transfers <- tm_team_transfers(
  team_url = "https://www.transfermarkt.com/manchester-city/startseite/verein/281"
)
cat("\nRecent transfers:\n")
transfers %>% head(10) %>% print()

3.6 Commercial Data Providers

While free data is excellent for learning, professional analysts often work with commercial data. Understanding what's available helps you appreciate the full landscape.

Opta (Stats Perform)

The industry standard for event data. Used by most major leagues, broadcasters, and clubs.

  • Every on-ball action tracked
  • Used by Premier League, La Liga, Serie A
  • FBref data is powered by Opta
Second Spectrum

Premier League's official tracking data provider since 2019.

  • Optical tracking (cameras)
  • 25 frames per second
  • All 22 players + ball position
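To see why 25 frames per second matters, note that player speed falls straight out of consecutive position samples. A hedged sketch (pitch coordinates in metres and the 25 fps rate are assumptions; real Second Spectrum schemas differ):

```python
import math

FPS = 25  # assumed sampling rate: 25 position samples per second

def speed_m_s(p0, p1, fps=FPS):
    """Instantaneous speed from two consecutive (x, y) positions in metres."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    return math.hypot(dx, dy) * fps  # distance per frame * frames per second

# A player covering 0.3 m between frames is moving at 7.5 m/s (sprinting)
print(speed_m_s((0.0, 30.0), (0.3, 30.0)))
```

In practice you would smooth over several frames before reporting speeds, since single-frame differences amplify tracking jitter.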
SkillCorner

Tracking data derived from broadcast video feeds.

  • More accessible than optical tracking
  • Covers 20+ leagues
  • Physical metrics (speed, distance)
Wyscout

Combined video and data platform popular with scouts.

  • Video clips of every action
  • Global coverage (100+ leagues)
  • Used by 1000+ clubs worldwide
Pricing Note

Commercial data subscriptions typically start at $10,000-50,000/year for basic packages. Full tracking data can cost $100,000+/year. Academic licenses are sometimes available at reduced rates.

3.7 Data Source Visualizations

Visualizing data availability and coverage helps you choose the right source for your analysis.

Data Coverage Comparison Chart

# Data source feature comparison
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Create comparison matrix
data = {
    "Source": ["StatsBomb", "FBref", "Understat", "Transfermarkt", "WhoScored"],
    "Event Data": [5, 2, 3, 0, 2],
    "xG Models": [5, 4, 5, 0, 4],
    "Historical": [3, 5, 4, 5, 4],
    "Market Values": [0, 0, 0, 5, 0],
    "Free Access": [4, 5, 5, 5, 4]
}
df = pd.DataFrame(data)
df.set_index("Source", inplace=True)

# Create heatmap
fig, ax = plt.subplots(figsize=(10, 6))
heatmap = sns.heatmap(df, annot=True, fmt="d", cmap="Greens",
                      cbar=True, linewidths=2, linecolor="white",
                      annot_kws={"fontsize": 14, "fontweight": "bold"})

ax.set_title("Football Data Source Comparison\nRatings: 0 (None) to 5 (Excellent)",
             fontsize=16, fontweight="bold", pad=20)
ax.set_xlabel("Feature Category", fontsize=12)
ax.set_ylabel("Data Source", fontsize=12)
plt.xticks(rotation=45, ha="right")

plt.tight_layout()
plt.show()

# Data source feature comparison
library(tidyverse)
library(ggplot2)

# Create comparison matrix
data_sources <- data.frame(
  Source = c("StatsBomb", "FBref", "Understat", "Transfermarkt", "WhoScored"),
  Event_Data = c(5, 2, 3, 0, 2),
  xG_Models = c(5, 4, 5, 0, 4),
  Historical = c(3, 5, 4, 5, 4),
  Market_Values = c(0, 0, 0, 5, 0),
  Free_Access = c(4, 5, 5, 5, 4)
)

# Reshape for plotting
data_long <- data_sources %>%
  pivot_longer(cols = -Source, names_to = "Feature", values_to = "Score")

# Create heatmap
ggplot(data_long, aes(x = Feature, y = Source, fill = Score)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = Score), color = "white", fontface = "bold", size = 5) +
  scale_fill_gradient(low = "#CCCCCC", high = "#1B5E20", limits = c(0, 5)) +
  labs(
    title = "Football Data Source Comparison",
    subtitle = "Ratings from 0 (None) to 5 (Excellent)",
    x = "Feature Category",
    y = "Data Source",
    fill = "Rating"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    axis.text = element_text(size = 11),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )

Competition Coverage by Source

# Competition coverage bar chart
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Data
sources = ["StatsBomb Free", "FBref", "Understat", "Transfermarkt"]
competitions = ["Premier League", "La Liga", "Serie A", "Bundesliga", "Ligue 1"]

data = {
    "StatsBomb Free": [0, 16, 0, 0, 0],
    "FBref": [30, 30, 30, 30, 30],
    "Understat": [10, 10, 10, 10, 10],
    "Transfermarkt": [50, 50, 50, 50, 50]
}

# Plot
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(competitions))
width = 0.2
colors = ["#E63946", "#457B9D", "#2A9D8F", "#E9C46A"]

for i, (source, color) in enumerate(zip(sources, colors)):
    offset = width * (i - 1.5)
    ax.bar(x + offset, data[source], width, label=source, color=color)

ax.set_xlabel("Competition", fontsize=12)
ax.set_ylabel("Seasons Available", fontsize=12)
ax.set_title("Historical Data Coverage by League\nApproximate seasons of data available",
             fontsize=16, fontweight="bold")
ax.set_xticks(x)
ax.set_xticklabels(competitions, rotation=45, ha="right")
ax.legend(loc="upper right")

plt.tight_layout()
plt.show()

# Competition coverage bar chart
library(ggplot2)

coverage <- data.frame(
  Source = rep(c("StatsBomb Free", "FBref", "Understat", "Transfermarkt"), each = 5),
  Competition = rep(c("Premier League", "La Liga", "Serie A", "Bundesliga", "Ligue 1"), 4),
  Seasons = c(
    # StatsBomb Free
    0, 16, 0, 0, 0,
    # FBref
    30, 30, 30, 30, 30,
    # Understat
    10, 10, 10, 10, 10,
    # Transfermarkt
    50, 50, 50, 50, 50
  )
)

ggplot(coverage, aes(x = Competition, y = Seasons, fill = Source)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c(
    "StatsBomb Free" = "#E63946",
    "FBref" = "#457B9D",
    "Understat" = "#2A9D8F",
    "Transfermarkt" = "#E9C46A"
  )) +
  labs(
    title = "Historical Data Coverage by League",
    subtitle = "Approximate seasons of data available",
    x = "Competition",
    y = "Seasons Available",
    fill = "Data Source"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

Data Freshness Timeline

# Data update frequency timeline
import matplotlib.pyplot as plt
import numpy as np

sources = ["StatsBomb", "FBref", "Understat", "Transfermarkt", "WhoScored"]
delays = [24, 2, 1, 168, 1]  # Hours
types = ["Batch", "Near-realtime", "Realtime", "Weekly", "Realtime"]

colors = {
    "Realtime": "#2A9D8F",
    "Near-realtime": "#457B9D",
    "Batch": "#E9C46A",
    "Weekly": "#E76F51"
}

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(sources, delays, color=[colors[t] for t in types])

# Add value labels
for bar, delay in zip(bars, delays):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
            f"{delay}h", ha="center", fontweight="bold")

ax.set_yscale("log")
ax.set_xlabel("Data Source", fontsize=12)
ax.set_ylabel("Delay (hours, log scale)", fontsize=12)
ax.set_title("Data Update Frequency by Source\nHours after match end until data available",
             fontsize=16, fontweight="bold")

# Legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=c, label=t) for t, c in colors.items()]
ax.legend(handles=legend_elements, loc="upper right")

plt.tight_layout()
plt.show()

# Data update frequency timeline
library(ggplot2)

update_freq <- data.frame(
  Source = c("StatsBomb", "FBref", "Understat", "Transfermarkt", "WhoScored"),
  Delay_Hours = c(24, 2, 1, 168, 1),  # Hours after match
  Update_Type = c("Batch", "Near-realtime", "Realtime", "Weekly", "Realtime")
)

ggplot(update_freq, aes(x = reorder(Source, Delay_Hours), y = Delay_Hours, fill = Update_Type)) +
  geom_col() +
  geom_text(aes(label = paste0(Delay_Hours, "h")), vjust = -0.5, fontface = "bold") +
  scale_fill_manual(values = c(
    "Realtime" = "#2A9D8F",
    "Near-realtime" = "#457B9D",
    "Batch" = "#E9C46A",
    "Weekly" = "#E76F51"
  )) +
  scale_y_log10() +
  labs(
    title = "Data Update Frequency by Source",
    subtitle = "Hours after match end until data available",
    x = "Data Source",
    y = "Delay (hours, log scale)",
    fill = "Update Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    legend.position = "bottom"
  )

3.8 Practice Exercises

Practice accessing data from different sources with these exercises.

Exercise 3.1: Explore StatsBomb Competitions

Task: Load all available StatsBomb free competitions, filter to men's competitions only, and count how many seasons are available for each competition.

# Solution 3.1: StatsBomb competition exploration
from statsbombpy import sb
import pandas as pd
import matplotlib.pyplot as plt

# Load all competitions
comps = sb.competitions()

# Filter to men's competitions and count seasons
mens = comps[comps["competition_gender"] == "male"]
summary = mens.groupby("competition_name").agg(
    seasons=("season_name", "nunique"),
    years=("season_name", lambda x: f"{min(x)} - {max(x)}")
).reset_index().sort_values("seasons", ascending=False)

print(summary)

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(summary["competition_name"], summary["seasons"], color="#1B5E20")
ax.set_xlabel("Seasons Available")
ax.set_ylabel("Competition")
ax.set_title("StatsBomb Free Data: Seasons by Competition")
plt.tight_layout()
plt.show()

# Solution 3.1: StatsBomb competition exploration
library(StatsBombR)
library(tidyverse)

# Load all competitions
comps <- FreeCompetitions()

# Filter to men's competitions and count seasons
mens_summary <- comps %>%
  filter(competition_gender == "male") %>%
  group_by(competition_name) %>%
  summarise(
    seasons = n(),
    years = paste(range(season_name), collapse = " - "),
    .groups = "drop"
  ) %>%
  arrange(desc(seasons))

print(mens_summary)

# Visualize
ggplot(mens_summary, aes(x = reorder(competition_name, seasons), y = seasons)) +
  geom_col(fill = "#1B5E20") +
  coord_flip() +
  labs(title = "StatsBomb Free Data: Seasons by Competition",
       x = "Competition", y = "Seasons Available") +
  theme_minimal()
Exercise 3.2: Compare xG Sources

Task: For a single match, compare xG values from StatsBomb with those from Understat (if available). Calculate the correlation and visualize the differences.

# Solution 3.2: xG source comparison (conceptual example)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Simulated comparison data
np.random.seed(42)
xg_comparison = pd.DataFrame({
    "shot_id": range(1, 21),
    "statsbomb_xg": np.random.uniform(0.02, 0.65, 20),
    "understat_xg": np.random.uniform(0.02, 0.65, 20)
})
xg_comparison["difference"] = xg_comparison["statsbomb_xg"] - xg_comparison["understat_xg"]

# Correlation
cor_value = xg_comparison["statsbomb_xg"].corr(xg_comparison["understat_xg"])
print(f"Correlation: {cor_value:.3f}")

# Scatter plot
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(xg_comparison["statsbomb_xg"], xg_comparison["understat_xg"],
           s=80, alpha=0.7, color="#1B5E20", edgecolors="black")
ax.plot([0, 0.7], [0, 0.7], "r--", label="Perfect agreement")
ax.set_xlabel("StatsBomb xG", fontsize=12)
ax.set_ylabel("Understat xG", fontsize=12)
ax.set_title(f"xG Model Comparison\nCorrelation: {cor_value:.3f}", fontsize=14)
ax.set_aspect("equal")
ax.legend()
plt.tight_layout()
plt.show()

# Solution 3.2: xG source comparison (conceptual example)
library(tidyverse)

# Simulated comparison data (in practice you'd merge real data)
xg_comparison <- data.frame(
  shot_id = 1:20,
  statsbomb_xg = runif(20, 0.02, 0.65),
  understat_xg = runif(20, 0.02, 0.65)
) %>%
  mutate(difference = statsbomb_xg - understat_xg)

# Correlation
cor_value <- cor(xg_comparison$statsbomb_xg, xg_comparison$understat_xg)
cat("Correlation:", round(cor_value, 3), "\n")

# Scatter plot comparison
ggplot(xg_comparison, aes(x = statsbomb_xg, y = understat_xg)) +
  geom_point(size = 3, alpha = 0.7, color = "#1B5E20") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "xG Model Comparison: StatsBomb vs Understat",
    subtitle = paste("Correlation:", round(cor_value, 3)),
    x = "StatsBomb xG",
    y = "Understat xG"
  ) +
  theme_minimal() +
  coord_equal()
Exercise 3.3: Build a Local Database

Task: Download all World Cup 2022 match events and save them to a local CSV/Parquet file for faster future access.

# Solution 3.3: Build local database
from statsbombpy import sb
import pandas as pd

# Get World Cup 2022 matches
matches = sb.matches(competition_id=43, season_id=106)
print(f"Downloading {len(matches)} matches...")

# Download all events
all_events = []
for i, match_id in enumerate(matches["match_id"]):
    print(f"Match {i+1}/{len(matches)}")
    events = sb.events(match_id=match_id)
    events["match_id"] = match_id
    all_events.append(events)

all_events = pd.concat(all_events, ignore_index=True)
print(f"Total events: {len(all_events)}")

# Save to CSV
all_events.to_csv("wc2022_events.csv", index=False)
print("Saved to wc2022_events.csv")

# For faster loading, use Parquet (requires pyarrow)
# pip install pyarrow
all_events.to_parquet("wc2022_events.parquet", index=False)
print("Saved to wc2022_events.parquet")

# Future use: just load directly
# events = pd.read_parquet("wc2022_events.parquet")

# Solution 3.3: Build local database
library(StatsBombR)
library(tidyverse)

# Get World Cup 2022
comp <- FreeCompetitions() %>%
  filter(competition_id == 43, season_id == 106)
matches <- FreeMatches(comp)

# Download all events
cat("Downloading", nrow(matches), "matches...\n")
all_events <- map_dfr(1:nrow(matches), function(i) {
  cat("Match", i, "/", nrow(matches), "\n")
  events <- get.matchFree(matches[i, ])
  events$match_id <- matches$match_id[i]
  return(events)
})

cat("Total events:", nrow(all_events), "\n")

# Save to CSV
write_csv(all_events, "wc2022_events.csv")
cat("Saved to wc2022_events.csv\n")

# For faster loading, use RDS
saveRDS(all_events, "wc2022_events.rds")

# Future use: just load directly
# events <- readRDS("wc2022_events.rds")

3.9 Summary

In this chapter, you learned:

Key Takeaways
  1. StatsBomb Open Data is the best free source for detailed event data
  2. FBref provides comprehensive aggregate stats powered by Opta
  3. Understat is excellent for shot-level xG analysis
  4. Transfermarkt is the go-to for market values and transfers
  5. Different sources have different strengths - choose based on your analysis needs
  6. Build local databases to speed up your workflow and reduce API calls
  7. You can learn professional analytics techniques using only free data
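Takeaway 6 in code: a small load-or-fetch helper caches a download locally so repeated runs skip the network. A sketch; the fetch function and file name here are placeholders you would swap for a real statsbombpy or soccerdata call:

```python
import os
import pandas as pd

def load_or_fetch(path, fetch_fn):
    """Return the cached CSV at `path` if it exists; otherwise fetch and cache it."""
    if os.path.exists(path):
        return pd.read_csv(path)
    df = fetch_fn()          # in practice: an API call or scrape
    df.to_csv(path, index=False)
    return df

# Placeholder fetch standing in for a real download
df = load_or_fetch("demo_cache.csv", lambda: pd.DataFrame({"x": [1, 2, 3]}))
print(df)
```

The first run pays the download cost; every run after that reads the local file, which also keeps you from hammering free APIs.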
What's Next

In Chapter 4: Data Visualization for Football, we'll learn to create compelling visualizations including shot maps, pass networks, and heat maps.