Learning Objectives
- Understand the structure of football event data
- Master DataFrame operations for football analysis
- Work with different pitch coordinate systems
- Filter, aggregate, and transform match data
- Handle missing data and join multiple datasets
2.1 Understanding Football Data Structures
Before we can analyze football data effectively, we need to understand how it's organized. Event data follows a consistent structure that, once mastered, unlocks powerful analysis capabilities.
The Anatomy of an Event
Every action in a football match is recorded as an "event." Each event contains multiple pieces of information that allow us to reconstruct and analyze what happened.
from statsbombpy import sb
import pandas as pd
# Load a match
events = sb.events(match_id=3869685)
# Look at a single pass event in detail
pass_events = events[events['type'] == 'Pass']
sample_pass = pass_events.iloc[0]
print("=== Anatomy of a Pass Event ===\n")
# Core identifiers
print("IDENTIFIERS:")
print(f" Event ID: {sample_pass['id']}")
print(f" Match ID: {sample_pass['match_id']}")
print(f" Index: {sample_pass['index']}")
# Temporal information
print("\nTIMING:")
print(f" Period: {sample_pass['period']}")
print(f" Minute: {sample_pass['minute']}")
print(f" Second: {sample_pass['second']}")
print(f" Timestamp: {sample_pass['timestamp']}")
# Who and what
print("\nACTION:")
print(f" Type: {sample_pass['type']}")
print(f" Player: {sample_pass['player']}")
print(f" Team: {sample_pass['team']}")
print(f" Position: {sample_pass['position']}")
# Spatial information
print("\nLOCATION:")
print(f" Start: {sample_pass['location']}")
print(f" End: {sample_pass['pass_end_location']}")
# Pass-specific details
print("\nPASS DETAILS:")
print(f" Recipient: {sample_pass['pass_recipient']}")
print(f" Length: {sample_pass['pass_length']:.1f}")
print(f" Angle: {sample_pass['pass_angle']:.2f} radians")
print(f" Height: {sample_pass['pass_height']}")
print(f" Body Part: {sample_pass['pass_body_part']}")
library(StatsBombR)
library(tidyverse)
# Load a match
events <- get.matchFree(data.frame(match_id = 3869685)) %>%
  cleanlocations() # adds the location.x / location.y columns used below
# Look at a single pass event in detail
pass_events <- events %>% filter(type.name == "Pass")
sample_pass <- pass_events %>% slice(1)
cat("=== Anatomy of a Pass Event ===\n\n")
# Core identifiers
cat("IDENTIFIERS:\n")
cat(sprintf(" Event ID: %s\n", sample_pass$id))
cat(sprintf(" Index: %d\n", sample_pass$index))
# Temporal information
cat("\nTIMING:\n")
cat(sprintf(" Period: %d\n", sample_pass$period))
cat(sprintf(" Minute: %d\n", sample_pass$minute))
cat(sprintf(" Second: %d\n", sample_pass$second))
# Who and what
cat("\nACTION:\n")
cat(sprintf(" Type: %s\n", sample_pass$type.name))
cat(sprintf(" Player: %s\n", sample_pass$player.name))
cat(sprintf(" Team: %s\n", sample_pass$team.name))
cat(sprintf(" Position: %s\n", sample_pass$position.name))
# Spatial information
cat("\nLOCATION:\n")
cat(sprintf(" Start: (%.1f, %.1f)\n", sample_pass$location.x, sample_pass$location.y))
cat(sprintf(" End: (%.1f, %.1f)\n", sample_pass$pass.end_location.x,
sample_pass$pass.end_location.y))
# Pass-specific details
cat("\nPASS DETAILS:\n")
cat(sprintf(" Recipient: %s\n", sample_pass$pass.recipient.name))
cat(sprintf(" Length: %.1f\n", sample_pass$pass.length))
cat(sprintf(" Angle: %.2f radians\n", sample_pass$pass.angle))
=== Anatomy of a Pass Event ===
IDENTIFIERS:
Event ID: 8f3a9b2c-...
Match ID: 3869685
Index: 5
TIMING:
Period: 1
Minute: 0
Second: 4
Timestamp: 00:00:04.123
ACTION:
Type: Pass
Player: Lionel Andrés Messi Cuccittini
Team: Argentina
Position: Right Wing
LOCATION:
Start: [60.0, 40.0]
End: [55.0, 35.0]
PASS DETAILS:
Recipient: Julian Alvarez
Length: 7.1
Angle: -2.36 radians
Height: Ground Pass
Body Part: Right Foot
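The length and angle fields can be sanity-checked against the start and end coordinates. The sketch below assumes that length is the straight-line distance in pitch units and that the angle is `atan2` of the y and x displacement, in radians. The helper name `pass_geometry` and the sample coordinates are illustrative and not part of any StatsBomb API.

```python
import math

def pass_geometry(start, end):
    """Return (length, angle) for a pass from start to end, both [x, y]."""
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    length = math.hypot(dx, dy)   # straight-line distance
    angle = math.atan2(dy, dx)    # radians, 0 = straight along the x-axis
    return length, angle

# Made-up pass: 10 units forward and 10 units across
length, angle = pass_geometry([60.0, 40.0], [70.0, 50.0])
print(f"Length: {length:.1f}, Angle: {angle:.2f} radians")
```

Comparing recomputed values with the stored ones for a handful of passes is a quick way to confirm you are reading the location columns correctly.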
Event Types in Football Data
StatsBomb data includes dozens of event types. Here are the most common:
- Pass - Ball transferred between players
- Carry - Ball moved while dribbling
- Shot - Attempt on goal
- Dribble - Take-on attempt
- Pressure - Pressing opponent
- Tackle - Attempting to win ball
- Interception - Cutting out pass
- Block - Blocking shot/pass
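- Ball Receipt* - Receiving a pass (the asterisk is part of the type name)
- Ball Recovery - Winning loose ball
- Clearance - Clearing the ball
- Foul Committed / Foul Won - Foul committed or won (recorded as two separate event types)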
2.2 Working with DataFrames
DataFrames are the fundamental data structure for football analytics. Whether you use pandas (Python) or tidyverse (R), mastering DataFrame operations is essential.
Essential DataFrame Operations
import pandas as pd
from statsbombpy import sb
# Load events
events = sb.events(match_id=3869685)
# 1. VIEWING DATA
print("Shape:", events.shape) # (rows, columns)
print("\nFirst 5 rows:")
print(events.head())
# 2. SELECTING COLUMNS
# Single column
players = events['player']
# Multiple columns
subset = events[['player', 'team', 'type', 'minute']]
# 3. DATA TYPES
print("\nColumn types:")
print(events.dtypes.head(10))
# 4. BASIC STATISTICS
print("\nNumeric column stats:")
print(events[['minute', 'second']].describe())
# 5. UNIQUE VALUES
print("\nUnique event types:")
print(events['type'].unique())
# 6. VALUE COUNTS
print("\nEvent type distribution:")
print(events['type'].value_counts().head(10))
# 7. SORTING
# Sort by minute and second
sorted_events = events.sort_values(['minute', 'second'])
# Sort descending
top_xg_shots = events[events['type'] == 'Shot'].sort_values(
'shot_statsbomb_xg', ascending=False
).head(5)
library(StatsBombR)
library(tidyverse)
# Load events
events <- get.matchFree(data.frame(match_id = 3869685))
# 1. VIEWING DATA
cat("Dimensions:", nrow(events), "x", ncol(events), "\n")
cat("\nFirst 5 rows:\n")
print(head(events, 5))
# 2. SELECTING COLUMNS
# Single column
players <- events$player.name
# Multiple columns (tidyverse way)
subset <- events %>%
select(player.name, team.name, type.name, minute)
# 3. DATA TYPES
cat("\nColumn types:\n")
print(sapply(events[1:10], class))
# 4. BASIC STATISTICS
cat("\nNumeric column stats:\n")
events %>%
select(minute, second) %>%
summary() %>%
print()
# 5. UNIQUE VALUES
cat("\nUnique event types:\n")
print(unique(events$type.name))
# 6. VALUE COUNTS
cat("\nEvent type distribution:\n")
events %>%
count(type.name, sort = TRUE) %>%
head(10) %>%
print()
# 7. SORTING
# Sort by minute and second
sorted_events <- events %>% arrange(minute, second)
# Sort descending - top xG shots
top_xg_shots <- events %>%
filter(type.name == "Shot") %>%
arrange(desc(shot.statsbomb_xg)) %>%
head(5)
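These operations can also be practised without downloading a match. The sketch below builds a small hand-made DataFrame whose column names mirror the statsbombpy columns used above; every value in it is invented for illustration.

```python
import pandas as pd

# Hand-made stand-in for a few event rows (values are invented)
toy_events = pd.DataFrame({
    "minute": [0, 0, 3, 12, 12, 44],
    "second": [4, 9, 30, 5, 41, 58],
    "team": ["Argentina", "Argentina", "France", "France", "Argentina", "France"],
    "type": ["Pass", "Carry", "Pass", "Shot", "Pass", "Shot"],
    "shot_statsbomb_xg": [None, None, None, 0.08, None, 0.31],
})

print(toy_events.shape)                    # (rows, columns)
print(toy_events["type"].value_counts())   # frequency of each event type

# Highest-xG shot first
top_shots = (
    toy_events[toy_events["type"] == "Shot"]
    .sort_values("shot_statsbomb_xg", ascending=False)
)
print(top_shots[["minute", "team", "shot_statsbomb_xg"]])
```

Because the frame is tiny, you can check each result by eye before running the same calls on a full match.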
2.3 Pitch Coordinate Systems
Different data providers use different coordinate systems. Understanding these is crucial for accurate visualization and analysis.
Important
Always check which coordinate system your data uses! Mixing up coordinate systems is one of the most common mistakes in football analytics.
Common Coordinate Systems
| Provider | X Range | Y Range | Origin | Notes |
|---|---|---|---|---|
| StatsBomb | 0 - 120 | 0 - 80 | Top-left | Y increases downward; teams always attack left-to-right in data |
| Opta | 0 - 100 | 0 - 100 | Bottom-left | Percentage-based system |
| Wyscout | 0 - 100 | 0 - 100 | Top-left | Y-axis inverted from Opta |
| UEFA | 0 - 105 | 0 - 68 | Bottom-left | Meters (standard pitch size) |
Converting Between Coordinate Systems
import pandas as pd
import numpy as np
def convert_statsbomb_to_opta(x, y):
    """Convert StatsBomb (120x80, top-left origin) to Opta (100x100, bottom-left origin)."""
    opta_x = (x / 120) * 100
    opta_y = ((80 - y) / 80) * 100  # Flip Y: StatsBomb y increases downward
    return opta_x, opta_y

def convert_opta_to_statsbomb(x, y):
    """Convert Opta (100x100, bottom-left origin) to StatsBomb (120x80, top-left origin)."""
    sb_x = (x / 100) * 120
    sb_y = ((100 - y) / 100) * 80  # Flip Y: Opta y increases upward
    return sb_x, sb_y

def convert_wyscout_to_statsbomb(x, y):
    """Convert Wyscout (100x100) to StatsBomb (120x80).

    Both systems use a top-left origin, so only rescaling is needed.
    """
    sb_x = (x / 100) * 120
    sb_y = (y / 100) * 80
    return sb_x, sb_y
# Example: Convert a StatsBomb shot location
shot_x, shot_y = 108.0, 36.0 # Near the penalty spot
opta_x, opta_y = convert_statsbomb_to_opta(shot_x, shot_y)
print(f"StatsBomb: ({shot_x}, {shot_y})")
print(f"Opta: ({opta_x:.1f}, {opta_y:.1f})")
# Convert to meters (standard 105x68 pitch)
meters_x = (shot_x / 120) * 105
meters_y = (shot_y / 80) * 68
print(f"Meters: ({meters_x:.1f}m, {meters_y:.1f}m)")
# Calculate distance to goal center (in meters)
# Scale each axis separately: x and y use different scale factors
goal_x, goal_y = 120, 40 # Goal center in StatsBomb coords
dx_m = (goal_x - shot_x) / 120 * 105
dy_m = (goal_y - shot_y) / 80 * 68
distance_meters = np.sqrt(dx_m**2 + dy_m**2)
print(f"\nDistance to goal: {distance_meters:.1f} meters")
library(tidyverse)
# Conversion functions
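# StatsBomb (top-left origin) to Opta (bottom-left origin)
convert_statsbomb_to_opta <- function(x, y) {
  opta_x <- (x / 120) * 100
  opta_y <- ((80 - y) / 80) * 100 # Flip Y: StatsBomb y increases downward
  return(list(x = opta_x, y = opta_y))
}
# Opta (bottom-left origin) to StatsBomb (top-left origin)
convert_opta_to_statsbomb <- function(x, y) {
  sb_x <- (x / 100) * 120
  sb_y <- ((100 - y) / 100) * 80 # Flip Y: Opta y increases upward
  return(list(x = sb_x, y = sb_y))
}
# Wyscout and StatsBomb both use a top-left origin, so only rescale
convert_wyscout_to_statsbomb <- function(x, y) {
  sb_x <- (x / 100) * 120
  sb_y <- (y / 100) * 80
  return(list(x = sb_x, y = sb_y))
}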
# Example: Convert a StatsBomb shot location
shot_x <- 108.0
shot_y <- 36.0 # Near the penalty spot
opta <- convert_statsbomb_to_opta(shot_x, shot_y)
cat(sprintf("StatsBomb: (%.1f, %.1f)\n", shot_x, shot_y))
cat(sprintf("Opta: (%.1f, %.1f)\n", opta$x, opta$y))
# Convert to meters (standard 105x68 pitch)
meters_x <- (shot_x / 120) * 105
meters_y <- (shot_y / 80) * 68
cat(sprintf("Meters: (%.1fm, %.1fm)\n", meters_x, meters_y))
# Calculate distance to goal center (in meters)
# Scale each axis separately: x and y use different scale factors
goal_x <- 120
goal_y <- 40 # Goal center in StatsBomb coords
dx_m <- (goal_x - shot_x) / 120 * 105
dy_m <- (goal_y - shot_y) / 80 * 68
distance_meters <- sqrt(dx_m^2 + dy_m^2)
cat(sprintf("\nDistance to goal: %.1f meters\n", distance_meters))
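The pairwise converters above all follow one pattern: rescale each axis, then optionally mirror y. The sketch below expresses that pattern once. `PitchSpec` and `rescale` are hypothetical names introduced here, the axis ranges come from the table in this section, and whether to set `flip_y` is left to the caller because it depends on the origins of the two systems involved.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PitchSpec:
    """Axis lengths of a coordinate system (hypothetical helper type)."""
    x_max: float
    y_max: float

STATSBOMB = PitchSpec(120, 80)
OPTA = PitchSpec(100, 100)
METERS = PitchSpec(105, 68)

def rescale(x, y, src, dst, flip_y=False):
    """Rescale (x, y) from src to dst; optionally mirror the y-axis."""
    new_x = x / src.x_max * dst.x_max
    new_y = y / src.y_max * dst.y_max
    if flip_y:
        new_y = dst.y_max - new_y  # mirror around the horizontal midline
    return new_x, new_y

# Same shot location as in the example above
print(rescale(108.0, 36.0, STATSBOMB, METERS))
print(rescale(108.0, 36.0, STATSBOMB, OPTA, flip_y=True))
```

Keeping the flip as an explicit argument makes the origin assumption visible at each call site rather than hiding it inside a function name.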
2.4 Filtering and Selecting Events
Filtering is how we isolate the specific events we want to analyze. Master these techniques to quickly extract exactly the data you need.
from statsbombpy import sb
import pandas as pd
events = sb.events(match_id=3869685)
# 1. FILTER BY EVENT TYPE
shots = events[events['type'] == 'Shot']
passes = events[events['type'] == 'Pass']
print(f"Shots: {len(shots)}, Passes: {len(passes)}")
# 2. FILTER BY TEAM
argentina_events = events[events['team'] == 'Argentina']
france_events = events[events['team'] == 'France']
# 3. FILTER BY PLAYER
messi_events = events[events['player'] == 'Lionel Andrés Messi Cuccittini']
print(f"Messi events: {len(messi_events)}")
# 4. FILTER BY TIME
# First half only
first_half = events[events['period'] == 1]
# Last 15 minutes of regular time
late_events = events[(events['minute'] >= 75) & (events['minute'] < 90)]
# 5. FILTER BY LOCATION (final third)
# StatsBomb: x > 80 is final third
final_third = events[events['location'].apply(
lambda loc: loc[0] > 80 if isinstance(loc, list) else False
)]
# 6. MULTIPLE CONDITIONS
# Messi's successful passes in the final third
messi_passes_final_third = events[
(events['player'] == 'Lionel Andrés Messi Cuccittini') &
(events['type'] == 'Pass') &
(events['pass_outcome'].isna()) & # Successful pass = no outcome
(events['location'].apply(lambda loc: loc[0] > 80 if isinstance(loc, list) else False))
]
print(f"Messi final third passes (successful): {len(messi_passes_final_third)}")
# 7. FILTER BY OUTCOME
# Successful passes only (pass_outcome is null for successful passes)
successful_passes = events[
(events['type'] == 'Pass') &
(events['pass_outcome'].isna())
]
# Goals only
goals = events[
(events['type'] == 'Shot') &
(events['shot_outcome'] == 'Goal')
]
print(f"Goals in match: {len(goals)}")
library(StatsBombR)
library(tidyverse)
events <- get.matchFree(data.frame(match_id = 3869685)) %>%
  cleanlocations() # adds the location.x / location.y columns used below
# 1. FILTER BY EVENT TYPE
shots <- events %>% filter(type.name == "Shot")
passes <- events %>% filter(type.name == "Pass")
cat(sprintf("Shots: %d, Passes: %d\n", nrow(shots), nrow(passes)))
# 2. FILTER BY TEAM
argentina_events <- events %>% filter(team.name == "Argentina")
france_events <- events %>% filter(team.name == "France")
# 3. FILTER BY PLAYER
messi_events <- events %>%
filter(str_detect(player.name, "Messi"))
cat(sprintf("Messi events: %d\n", nrow(messi_events)))
# 4. FILTER BY TIME
# First half only
first_half <- events %>% filter(period == 1)
# Last 15 minutes of regular time
late_events <- events %>% filter(minute >= 75, minute < 90)
# 5. FILTER BY LOCATION (final third)
# StatsBomb: x > 80 is final third
final_third <- events %>% filter(location.x > 80)
# 6. MULTIPLE CONDITIONS
# Messi's successful passes in the final third
messi_passes_final_third <- events %>%
filter(
str_detect(player.name, "Messi"),
type.name == "Pass",
is.na(pass.outcome.name), # Successful pass
location.x > 80
)
cat(sprintf("Messi final third passes (successful): %d\n",
nrow(messi_passes_final_third)))
# 7. FILTER BY OUTCOME
# Successful passes only
successful_passes <- events %>%
filter(type.name == "Pass", is.na(pass.outcome.name))
# Goals only
goals <- events %>%
filter(type.name == "Shot", shot.outcome.name == "Goal")
cat(sprintf("Goals in match: %d\n", nrow(goals)))
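In the Python version, each location filter re-runs a lambda over the list-valued `location` column. One alternative, sketched below on invented toy data, is to split the lists into numeric `x` and `y` columns once and then filter with ordinary comparisons. The `x` and `y` column names are an assumption of this sketch, not columns that statsbombpy provides.

```python
import pandas as pd

# Toy events; "Half Start" stands in for an event with no location
toy = pd.DataFrame({
    "type": ["Pass", "Shot", "Half Start", "Pass"],
    "location": [[45.0, 30.0], [102.5, 38.0], None, [88.0, 12.0]],
})

# .str[i] indexes into each list and yields NaN where location is missing
toy["x"] = toy["location"].str[0]
toy["y"] = toy["location"].str[1]

# NaN > 80 evaluates to False, so rows without a location drop out
final_third = toy[toy["x"] > 80]
print(final_third[["type", "x", "y"]])
```

After the split, location conditions combine with other filters using `&` in the same way as the type and team conditions.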
2.5 Aggregating Match Statistics
Aggregation transforms event-level data into summary statistics. This is essential for comparing players, teams, and matches.
from statsbombpy import sb
import pandas as pd
events = sb.events(match_id=3869685)
# 1. BASIC AGGREGATION BY TEAM
team_stats = events.groupby('team').agg(
total_events=('type', 'count'),
passes=('type', lambda x: (x == 'Pass').sum()),
shots=('type', lambda x: (x == 'Shot').sum()),
pressures=('type', lambda x: (x == 'Pressure').sum())
)
print("Team Statistics:")
print(team_stats)
# 2. SHOT STATISTICS BY TEAM
shots = events[events['type'] == 'Shot']
shot_stats = shots.groupby('team').agg(
total_shots=('type', 'count'),
goals=('shot_outcome', lambda x: (x == 'Goal').sum()),
total_xG=('shot_statsbomb_xg', 'sum'),
avg_xG=('shot_statsbomb_xg', 'mean'),
on_target=('shot_outcome', lambda x: x.isin(['Goal', 'Saved']).sum())
).round(2)
print("\nShot Statistics:")
print(shot_stats)
# 3. PLAYER-LEVEL STATISTICS
player_stats = events.groupby(['team', 'player']).agg(
total_actions=('type', 'count'),
passes=('type', lambda x: (x == 'Pass').sum()),
shots=('type', lambda x: (x == 'Shot').sum()),
xG=('shot_statsbomb_xg', 'sum')
).reset_index()
# Top 10 players by actions
print("\nTop 10 Players by Actions:")
print(player_stats.nlargest(10, 'total_actions')[['player', 'team', 'total_actions']])
# 4. PASS COMPLETION BY PLAYER
passes = events[events['type'] == 'Pass']
pass_stats = passes.groupby('player').agg(
total_passes=('type', 'count'),
successful=('pass_outcome', lambda x: x.isna().sum()),
).assign(
completion_rate=lambda df: (df['successful'] / df['total_passes'] * 100).round(1)
).sort_values('total_passes', ascending=False)
print("\nPass Completion Rates (min 20 passes):")
print(pass_stats[pass_stats['total_passes'] >= 20].head(10))
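# 5. TIME-BASED AGGREGATION (events per 15 minutes)
# right=False gives bins [0, 15), [15, 30), ... so minute 0 is included;
# the bins run past 120 to cover extra-time stoppage
events['time_bin'] = pd.cut(events['minute'], bins=range(0, 150, 15), right=False)
time_stats = events.groupby(['team', 'time_bin'], observed=True).agg(
    events=('type', 'count'),
    shots=('type', lambda x: (x == 'Shot').sum())
)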
print("\nEvents by 15-minute intervals:")
print(time_stats.head(10))
library(StatsBombR)
library(tidyverse)
events <- get.matchFree(data.frame(match_id = 3869685))
# 1. BASIC AGGREGATION BY TEAM
team_stats <- events %>%
group_by(team.name) %>%
summarise(
total_events = n(),
passes = sum(type.name == "Pass"),
shots = sum(type.name == "Shot"),
pressures = sum(type.name == "Pressure")
)
cat("Team Statistics:\n")
print(team_stats)
# 2. SHOT STATISTICS BY TEAM
shot_stats <- events %>%
filter(type.name == "Shot") %>%
group_by(team.name) %>%
summarise(
total_shots = n(),
goals = sum(shot.outcome.name == "Goal", na.rm = TRUE),
total_xG = sum(shot.statsbomb_xg, na.rm = TRUE),
avg_xG = mean(shot.statsbomb_xg, na.rm = TRUE),
on_target = sum(shot.outcome.name %in% c("Goal", "Saved"), na.rm = TRUE)
) %>%
mutate(across(where(is.numeric), ~round(., 2)))
cat("\nShot Statistics:\n")
print(shot_stats)
# 3. PLAYER-LEVEL STATISTICS
player_stats <- events %>%
group_by(team.name, player.name) %>%
summarise(
total_actions = n(),
passes = sum(type.name == "Pass"),
shots = sum(type.name == "Shot"),
xG = sum(shot.statsbomb_xg, na.rm = TRUE),
.groups = "drop"
)
# Top 10 players by actions
cat("\nTop 10 Players by Actions:\n")
player_stats %>%
arrange(desc(total_actions)) %>%
head(10) %>%
select(player.name, team.name, total_actions) %>%
print()
# 4. PASS COMPLETION BY PLAYER
pass_stats <- events %>%
filter(type.name == "Pass") %>%
group_by(player.name) %>%
summarise(
total_passes = n(),
successful = sum(is.na(pass.outcome.name))
) %>%
mutate(completion_rate = round(successful / total_passes * 100, 1)) %>%
arrange(desc(total_passes))
cat("\nPass Completion Rates (min 20 passes):\n")
pass_stats %>%
filter(total_passes >= 20) %>%
head(10) %>%
print()
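When the goal is only to count event types per team, a cross-tabulation is a compact alternative to writing one lambda per event type. The sketch below uses `pd.crosstab` on invented toy data; it is an illustration of the idea, not a replacement for the named aggregations above.

```python
import pandas as pd

toy = pd.DataFrame({
    "team": ["Argentina", "Argentina", "France", "Argentina", "France", "France"],
    "type": ["Pass", "Shot", "Pass", "Pass", "Pressure", "Pass"],
})

# Rows are teams, columns are event types, cells are counts (0 where absent)
counts = pd.crosstab(toy["team"], toy["type"])
print(counts)
```

The result has one column for every event type present in the data, so you can select just the columns you need afterwards.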
2.6 Handling Missing Data
Missing data is common in football datasets. Understanding why data is missing and how to handle it properly is crucial for accurate analysis.
Types of Missing Data in Football
| Scenario | Example | How to Handle |
|---|---|---|
| Intentionally Missing | pass_outcome is null for successful passes | This is by design - null means success |
| Not Applicable | shot_statsbomb_xg for non-shot events | Filter first, then analyze |
| Data Not Collected | xG not available in older datasets | Exclude or calculate your own |
| Tracking Issues | Player location temporarily lost | Interpolate or exclude |
import pandas as pd
from statsbombpy import sb
events = sb.events(match_id=3869685)
# 1. CHECK FOR MISSING VALUES
print("Missing values per column:")
print(events.isnull().sum().sort_values(ascending=False).head(20))
# 2. UNDERSTAND MISSING PATTERNS
# For shots: check xG availability
shots = events[events['type'] == 'Shot']
print(f"\nShots with xG: {shots['shot_statsbomb_xg'].notna().sum()}")
print(f"Shots without xG: {shots['shot_statsbomb_xg'].isna().sum()}")
# 3. PASS OUTCOME - null means SUCCESS
passes = events[events['type'] == 'Pass']
successful = passes['pass_outcome'].isna().sum()
unsuccessful = passes['pass_outcome'].notna().sum()
print(f"\nPass outcomes:")
print(f" Successful (null): {successful}")
print(f" Unsuccessful: {unsuccessful}")
print(f" Completion rate: {successful/(successful+unsuccessful)*100:.1f}%")
# 4. FILLING MISSING VALUES (when appropriate)
# Example: Fill missing xG with 0 for non-shot events
events['shot_xg_filled'] = events['shot_statsbomb_xg'].fillna(0)
# Example: Fill missing player positions with 'Unknown'
events['position_filled'] = events['position'].fillna('Unknown')
# 5. DROPPING ROWS WITH MISSING CRITICAL DATA
# Only analyze events with complete location data
events_with_location = events.dropna(subset=['location'])
print(f"\nEvents with location: {len(events_with_location)}")
print(f"Events without location: {len(events) - len(events_with_location)}")
library(StatsBombR)
library(tidyverse)
events <- get.matchFree(data.frame(match_id = 3869685)) %>%
  cleanlocations() # adds the location.x / location.y columns used below
# 1. CHECK FOR MISSING VALUES
cat("Missing values per column (top 20):\n")
events %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "column", values_to = "missing") %>%
arrange(desc(missing)) %>%
head(20) %>%
print()
# 2. UNDERSTAND MISSING PATTERNS
# For shots: check xG availability
shots <- events %>% filter(type.name == "Shot")
cat(sprintf("\nShots with xG: %d\n", sum(!is.na(shots$shot.statsbomb_xg))))
cat(sprintf("Shots without xG: %d\n", sum(is.na(shots$shot.statsbomb_xg))))
# 3. PASS OUTCOME - NA means SUCCESS
passes <- events %>% filter(type.name == "Pass")
successful <- sum(is.na(passes$pass.outcome.name))
unsuccessful <- sum(!is.na(passes$pass.outcome.name))
cat("\nPass outcomes:\n")
cat(sprintf(" Successful (NA): %d\n", successful))
cat(sprintf(" Unsuccessful: %d\n", unsuccessful))
cat(sprintf(" Completion rate: %.1f%%\n",
successful/(successful+unsuccessful)*100))
# 4. FILLING MISSING VALUES
# Example: Fill missing xG with 0
events <- events %>%
mutate(shot_xg_filled = replace_na(shot.statsbomb_xg, 0))
# Example: Fill missing positions
events <- events %>%
mutate(position_filled = replace_na(position.name, "Unknown"))
# 5. DROPPING ROWS WITH MISSING DATA
events_with_location <- events %>%
filter(!is.na(location.x), !is.na(location.y))
cat(sprintf("\nEvents with location: %d\n", nrow(events_with_location)))
cat(sprintf("Events without location: %d\n",
nrow(events) - nrow(events_with_location)))
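The order of filtering and filling matters for averages. The sketch below, on invented toy data, compares mean xG computed over shots only with the mean after filling every non-shot row with 0.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Pass", "Shot", "Pass", "Shot", "Carry"],
    "shot_statsbomb_xg": [None, 0.10, None, 0.30, None],
})

# Filter first: average over shots only
mean_per_shot = toy.loc[toy["type"] == "Shot", "shot_statsbomb_xg"].mean()

# Fill first: every non-shot row now contributes a 0 to the average
mean_after_fill = toy["shot_statsbomb_xg"].fillna(0).mean()

print(f"Mean xG per shot:     {mean_per_shot:.2f}")
print(f"Mean after fillna(0): {mean_after_fill:.2f}")
```

Sums are unaffected by the zeros, so a filled column is safe for totals but misleading for means.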
2.7 Joining Multiple Datasets
Often you'll need to combine data from multiple sources or merge match-level data with player information. Understanding joins is essential.
import pandas as pd
from statsbombpy import sb
# Load multiple matches
matches = sb.matches(competition_id=43, season_id=106) # World Cup 2022
# Load events for multiple matches
all_events = []
for match_id in matches['match_id'].head(5): # First 5 matches
    events = sb.events(match_id=match_id)
    events['match_id'] = match_id
    all_events.append(events)
all_events = pd.concat(all_events, ignore_index=True)
print(f"Total events across 5 matches: {len(all_events)}")
# JOIN: Add match info to events
match_info = matches[['match_id', 'home_team', 'away_team', 'home_score', 'away_score']]
events_with_match = all_events.merge(match_info, on='match_id', how='left')
print("\nEvents with match info:")
print(events_with_match[['match_id', 'home_team', 'away_team', 'type', 'player']].head())
# AGGREGATE: Player stats across all matches
player_tournament_stats = all_events.groupby('player').agg(
matches=('match_id', 'nunique'),
total_actions=('type', 'count'),
goals=('shot_outcome', lambda x: (x == 'Goal').sum()),
xG=('shot_statsbomb_xg', 'sum'),
passes=('type', lambda x: (x == 'Pass').sum())
).reset_index().sort_values('total_actions', ascending=False)
print("\nPlayer tournament stats (top 10 by actions):")
print(player_tournament_stats.head(10))
# Create match result column
events_with_match['match_result'] = events_with_match.apply(
lambda row: 'Win' if (
(row['team'] == row['home_team'] and row['home_score'] > row['away_score']) or
(row['team'] == row['away_team'] and row['away_score'] > row['home_score'])
) else ('Loss' if (
(row['team'] == row['home_team'] and row['home_score'] < row['away_score']) or
(row['team'] == row['away_team'] and row['away_score'] < row['home_score'])
) else 'Draw'),
axis=1
)
library(StatsBombR)
library(tidyverse)
# Load World Cup 2022
comp <- FreeCompetitions() %>%
filter(competition_id == 43, season_id == 106)
matches <- FreeMatches(comp)
# Load events for first 5 matches
match_ids <- head(matches$match_id, 5)
all_events <- map_dfr(match_ids, function(mid) {
events <- get.matchFree(data.frame(match_id = mid))
events$match_id <- mid
return(events)
})
cat(sprintf("Total events across 5 matches: %d\n", nrow(all_events)))
# JOIN: Add match info to events
match_info <- matches %>%
select(match_id, home_team.home_team_name, away_team.away_team_name,
home_score, away_score)
events_with_match <- all_events %>%
left_join(match_info, by = "match_id")
cat("\nEvents with match info:\n")
events_with_match %>%
select(match_id, home_team.home_team_name, away_team.away_team_name,
type.name, player.name) %>%
head() %>%
print()
# AGGREGATE: Player stats across all matches
player_tournament_stats <- all_events %>%
group_by(player.name) %>%
summarise(
matches = n_distinct(match_id),
total_actions = n(),
goals = sum(shot.outcome.name == "Goal", na.rm = TRUE),
xG = sum(shot.statsbomb_xg, na.rm = TRUE),
passes = sum(type.name == "Pass"),
.groups = "drop"
) %>%
arrange(desc(total_actions))
cat("\nPlayer tournament stats (top 10 by actions):\n")
player_tournament_stats %>% head(10) %>% print()
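The examples above use a left join, which keeps every event even when no match row is found. The sketch below contrasts left and inner joins on invented toy data, and uses two `DataFrame.merge` options, `validate` and `indicator`, to check the join. The tiny `events` and `matches` frames are stand-ins, not real StatsBomb rows.

```python
import pandas as pd

events = pd.DataFrame({
    "match_id": [1, 1, 2, 3],
    "type": ["Pass", "Shot", "Pass", "Pass"],
})
matches = pd.DataFrame({
    "match_id": [1, 2],
    "home_team": ["Argentina", "France"],
})

# validate raises an error if match_id is duplicated in matches;
# indicator adds a _merge column showing where each row found a partner
left = events.merge(matches, on="match_id", how="left",
                    validate="many_to_one", indicator=True)
inner = events.merge(matches, on="match_id", how="inner")

print(left)
print(f"Left join rows: {len(left)}, inner join rows: {len(inner)}")
```

Rows marked `left_only` are events whose match is missing from the lookup table; an inner join would drop them silently.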
2.8 Summary
In this chapter, you learned:
Key Concepts
- The structure of football event data
- Different pitch coordinate systems
- How to identify and handle missing data
- When to use different types of joins
Technical Skills
- Filtering events by type, team, player, time, location
- Aggregating statistics at player and team level
- Converting between coordinate systems
- Joining and combining multiple datasets
What's Next
In Chapter 3: The Football Data Ecosystem, we'll explore where to find football data—from free open sources to commercial providers—and how to access each one.
2.9 Data Wrangling Visualizations
After wrangling your data, visualization helps verify your transformations and explore patterns. Here are essential visualizations for data wrangling workflows.
Event Distribution Bar Chart
Visualize the distribution of event types to understand match dynamics:
# Event type distribution bar chart
from statsbombpy import sb
import matplotlib.pyplot as plt
import seaborn as sns
# Load match data
events = sb.events(match_id=3869685)
# Count events by type
event_counts = events["type"].value_counts().head(15)
# Create bar chart
fig, ax = plt.subplots(figsize=(10, 8))
colors = plt.cm.Greens(event_counts / event_counts.max())
bars = ax.barh(event_counts.index, event_counts.values, color=colors)
ax.set_xlabel("Count", fontsize=12)
ax.set_ylabel("Event Type", fontsize=12)
ax.set_title("Event Type Distribution\nWorld Cup 2022 Final", fontsize=16, fontweight="bold")
# Add value labels
for bar, val in zip(bars, event_counts.values):
    ax.text(val + 5, bar.get_y() + bar.get_height()/2, str(val),
            va="center", fontsize=10)
plt.tight_layout()
plt.show()
# Event type distribution bar chart
library(StatsBombR)
library(tidyverse)
# Load match data
events <- get.matchFree(data.frame(match_id = 3869685)) %>%
  cleanlocations() # adds the location.x / location.y columns used in later plots
# Count events by type
event_counts <- events %>%
count(type.name, sort = TRUE) %>%
head(15)
# Create bar chart
ggplot(event_counts, aes(x = reorder(type.name, n), y = n, fill = n)) +
geom_col() +
coord_flip() +
scale_fill_gradient(low = "#90EE90", high = "#1B5E20") +
labs(
title = "Event Type Distribution",
subtitle = "World Cup 2022 Final",
x = "Event Type",
y = "Count"
) +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 16),
axis.text = element_text(size = 10)
)
Team Action Comparison
Compare event counts between teams with a grouped bar chart:
# Team comparison grouped bar chart
import pandas as pd
import matplotlib.pyplot as plt
# Filter key events
key_events = events[events["type"].isin(["Pass", "Shot", "Pressure", "Dribble", "Tackle"])]
team_counts = key_events.groupby(["team", "type"]).size().unstack(fill_value=0)
# Create grouped bar chart
fig, ax = plt.subplots(figsize=(12, 6))
x = range(len(team_counts.columns))
width = 0.35
teams = team_counts.index.tolist()
colors = {"Argentina": "#75AADB", "France": "#002395"}
for i, team in enumerate(teams):
    offset = width * (i - 0.5)
    bars = ax.bar([xi + offset for xi in x], team_counts.loc[team],
                  width, label=team, color=colors.get(team, "gray"))
ax.set_xlabel("Event Type", fontsize=12)
ax.set_ylabel("Count", fontsize=12)
ax.set_title("Team Action Comparison\nKey Event Types", fontsize=16, fontweight="bold")
ax.set_xticks(x)
ax.set_xticklabels(team_counts.columns, rotation=45, ha="right")
ax.legend()
plt.tight_layout()
plt.show()
# Team comparison grouped bar chart
team_events <- events %>%
filter(type.name %in% c("Pass", "Shot", "Pressure", "Dribble", "Tackle")) %>%
count(team.name, type.name)
ggplot(team_events, aes(x = type.name, y = n, fill = team.name)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("Argentina" = "#75AADB", "France" = "#002395")) +
labs(
title = "Team Action Comparison",
subtitle = "Key Event Types",
x = "Event Type",
y = "Count",
fill = "Team"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
legend.position = "bottom"
)
Time-Series Event Flow
Visualize how events unfold over match time:
# Events over time line chart
import matplotlib.pyplot as plt
import pandas as pd
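# Create time bins [0, 5), [5, 10), ... so that minute 0 is included
passes = events[events["type"] == "Pass"].copy()
passes["time_bin"] = pd.cut(passes["minute"], bins=range(0, 135, 5), right=False)
# Count by team and time (observed=False keeps empty bins on the x-axis)
time_counts = passes.groupby(["team", "time_bin"], observed=False).size().unstack(level=0, fill_value=0)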
# Plot
fig, ax = plt.subplots(figsize=(14, 6))
colors = {"Argentina": "#75AADB", "France": "#002395"}
for team in time_counts.columns:
    ax.plot(range(len(time_counts)), time_counts[team],
            marker="o", label=team, color=colors.get(team, "gray"), linewidth=2)
ax.set_xlabel("Time Period", fontsize=12)
ax.set_ylabel("Pass Count", fontsize=12)
ax.set_title("Passing Activity Over Match Time", fontsize=16, fontweight="bold")
ax.set_xticks(range(len(time_counts)))
ax.set_xticklabels([str(x) for x in time_counts.index], rotation=45, ha="right", fontsize=8)
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Events over time line chart
events_timeline <- events %>%
filter(type.name %in% c("Pass", "Shot")) %>%
mutate(time_bin = cut(minute, breaks = seq(0, 125, 5), right = FALSE)) %>% # [0, 5) keeps minute 0
count(team.name, type.name, time_bin) %>%
filter(!is.na(time_bin))
ggplot(events_timeline, aes(x = time_bin, y = n, color = team.name, group = team.name)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
facet_wrap(~type.name, scales = "free_y") +
scale_color_manual(values = c("Argentina" = "#75AADB", "France" = "#002395")) +
labs(
title = "Event Flow Over Match Time",
x = "Time Period (minutes)",
y = "Count",
color = "Team"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
plot.title = element_text(face = "bold"),
legend.position = "bottom"
)
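Bin edges deserve a quick check whenever match minutes are binned, because minute 0 sits exactly on the first edge. The sketch below, using invented minute values, compares the default right-closed intervals of `pd.cut` with left-closed intervals.

```python
import pandas as pd

minutes = pd.Series([0, 4, 5, 14, 15])
edges = [0, 5, 10, 15, 20]

right_closed = pd.cut(minutes, bins=edges)              # (0, 5], (5, 10], ...
left_closed = pd.cut(minutes, bins=edges, right=False)  # [0, 5), [5, 10), ...

print(pd.DataFrame({
    "minute": minutes,
    "right_closed": right_closed,
    "left_closed": left_closed,
}))
```

With right-closed bins, minute 0 falls outside every interval and becomes missing; with left-closed bins it lands in the first interval.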
Coordinate Distribution Scatter Plot
Verify coordinate data by plotting event locations:
# Event location scatter plot
from mplsoccer import Pitch
import matplotlib.pyplot as plt
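# Filter events with locations and split the [x, y] lists into columns
locations = events[events["location"].notna()].copy()
locations["x"] = locations["location"].str[0]
locations["y"] = locations["location"].str[1]
locations = locations[locations["type"].isin(["Pass", "Shot", "Carry"])]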
# Create pitch for each team
teams = locations["team"].unique()
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
colors = {"Pass": "#87CEEB", "Shot": "#FFD700", "Carry": "#FF6B6B"}
pitch = Pitch(pitch_color="#1B5E20", line_color="white")
for ax, team in zip(axes, teams):
    pitch.draw(ax=ax)
    team_data = locations[locations["team"] == team]
    for event_type, color in colors.items():
        type_data = team_data[team_data["type"] == event_type]
        ax.scatter(type_data["x"], type_data["y"], c=color,
                   alpha=0.4, s=20, label=event_type)
    ax.set_title(team, fontsize=14, fontweight="bold")
    ax.legend(loc="upper left", fontsize=8)
fig.suptitle("Event Locations by Type and Team", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.show()
# Event location scatter plot
library(ggsoccer)
event_locations <- events %>%
filter(!is.na(location.x), !is.na(location.y)) %>%
filter(type.name %in% c("Pass", "Shot", "Carry"))
ggplot(event_locations) +
annotate_pitch(dimensions = pitch_statsbomb, colour = "white", fill = "#1B5E20") + # match the 120x80 data
geom_point(aes(x = location.x, y = location.y, color = type.name),
alpha = 0.4, size = 1.5) +
scale_color_manual(values = c("Pass" = "#87CEEB", "Shot" = "#FFD700", "Carry" = "#FF6B6B")) +
facet_wrap(~team.name) +
theme_pitch() +
coord_flip() +
labs(
title = "Event Locations by Type and Team",
color = "Event Type"
) +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom"
)
2.10 Practice Exercises
Apply your data wrangling skills with these exercises. Each includes hints and full solutions.
Exercise 2.1: Filter and Aggregate Player Stats
Task: Create a summary of passing statistics for all players who attempted at least 30 passes in a match. Include: total passes, completed passes, completion rate, and progressive passes.
- Filter for Pass events first
- Group by player name
- Successful passes have null pass_outcome in StatsBomb
- Progressive passes move the ball 10+ yards toward goal
# Solution 2.1: Player passing summary
from statsbombpy import sb
import pandas as pd
events = sb.events(match_id=3869685)
# Filter passes
passes = events[events["type"] == "Pass"].copy()
# Calculate metrics
passes["is_complete"] = passes["pass_outcome"].isna()
passes["is_progressive"] = (
(passes["pass_end_location"].apply(lambda x: x[0] if isinstance(x, list) else 0) -
passes["location"].apply(lambda x: x[0] if isinstance(x, list) else 0)) >= 10
)
# Aggregate by player (progressive counts completed progressive passes only)
passes["progressive_complete"] = passes["is_progressive"] & passes["is_complete"]
player_passing = passes.groupby(["player", "team"]).agg(
    total_passes=("type", "count"),
    completed=("is_complete", "sum"),
    progressive=("progressive_complete", "sum")
).reset_index()
# Filter and calculate rate
player_passing = player_passing[player_passing["total_passes"] >= 30]
player_passing["completion_rate"] = round(
player_passing["completed"] / player_passing["total_passes"] * 100, 1
)
print(player_passing.sort_values("completion_rate", ascending=False))
# Solution 2.1: Player passing summary
library(StatsBombR)
library(tidyverse)
events <- get.matchFree(data.frame(match_id = 3869685)) %>%
cleanlocations() # split list-column locations into location.x, pass.end_location.x, etc.
player_passing <- events %>%
filter(type.name == "Pass") %>%
mutate(
is_complete = is.na(pass.outcome.name),
is_progressive = (pass.end_location.x - location.x) >= 10 &
pass.end_location.x >= 60
) %>%
group_by(player.name, team.name) %>%
summarise(
total_passes = n(),
completed = sum(is_complete),
progressive = sum(is_progressive & is_complete, na.rm = TRUE),
.groups = "drop"
) %>%
filter(total_passes >= 30) %>%
mutate(
completion_rate = round(completed / total_passes * 100, 1)
) %>%
arrange(desc(completion_rate))
print(player_passing)
Exercise 2.2: Convert Coordinates and Plot
Task: Take shot data in StatsBomb coordinates (120x80) and convert it to meters on a 105x68 m pitch. Then calculate the distance from each shot to the center of the goal and plot that distance against xG.
- Conversion: meters_x = sb_x / 120 * 105
- Conversion: meters_y = sb_y / 80 * 68
- Goal center in meters: (105, 34)
- Distance formula: sqrt((x2-x1)² + (y2-y1)²)
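The conversion and distance formulas above can be checked by hand on a single point before applying them to a whole DataFrame. The sketch below uses only the standard library; the helper names `sb_to_meters` and `distance_to_goal_m` are hypothetical, chosen for this illustration, and the test point is the penalty spot, which sits at (108, 40) on the 120x80 StatsBomb grid.

```python
import math

# Pitch sizes: StatsBomb grid (120 x 80) and the 105 x 68 m pitch from the exercise.
SB_LENGTH, SB_WIDTH = 120.0, 80.0
M_LENGTH, M_WIDTH = 105.0, 68.0
GOAL_CENTER_M = (105.0, 34.0)


def sb_to_meters(sb_x, sb_y):
    """Rescale a StatsBomb (x, y) location to meters (hypothetical helper)."""
    return sb_x / SB_LENGTH * M_LENGTH, sb_y / SB_WIDTH * M_WIDTH


def distance_to_goal_m(sb_x, sb_y):
    """Straight-line distance in meters from a StatsBomb location to the goal center."""
    x_m, y_m = sb_to_meters(sb_x, sb_y)
    return math.hypot(GOAL_CENTER_M[0] - x_m, GOAL_CENTER_M[1] - y_m)


# Penalty spot: 12 units from the goal line, centered across the width.
spot_x_m, spot_y_m = sb_to_meters(108, 40)
spot_distance = distance_to_goal_m(108, 40)
print(round(spot_x_m, 2), round(spot_y_m, 2), round(spot_distance, 2))
```

The penalty spot comes out at about 10.5 m from goal rather than the regulation 11 m, because a uniform rescale treats every pitch as exactly 105x68 m. That is usually fine for comparing shots, but worth remembering when quoting absolute distances.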
# Solution 2.2: Coordinate conversion and distance
from statsbombpy import sb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
events = sb.events(match_id=3869685)
shots = events[events["type"] == "Shot"].copy()
# Extract coordinates
shots["sb_x"] = shots["location"].apply(lambda x: x[0] if isinstance(x, list) else None)
shots["sb_y"] = shots["location"].apply(lambda x: x[1] if isinstance(x, list) else None)
# Convert to meters
shots["meters_x"] = shots["sb_x"] / 120 * 105
shots["meters_y"] = shots["sb_y"] / 80 * 68
# Goal center
goal_x, goal_y = 105, 34
# Calculate distance
shots["distance_to_goal"] = np.sqrt(
(goal_x - shots["meters_x"])**2 +
(goal_y - shots["meters_y"])**2
)
# Summary stats
print(f"Average shot distance: {shots['distance_to_goal'].mean():.1f} meters")
print(f"Closest shot: {shots['distance_to_goal'].min():.1f} meters")
print(f"Furthest shot: {shots['distance_to_goal'].max():.1f} meters")
# Plot
fig, ax = plt.subplots(figsize=(10, 6))
colors = shots["shot_outcome"].apply(lambda x: "#FFD700" if x == "Goal" else "#888888")
ax.scatter(shots["distance_to_goal"], shots["shot_statsbomb_xg"],
c=colors, s=80, alpha=0.7, edgecolors="black")
ax.set_xlabel("Distance to Goal (meters)", fontsize=12)
ax.set_ylabel("xG", fontsize=12)
ax.set_title("Shot Distance vs Expected Goals", fontsize=16, fontweight="bold")
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Solution 2.2: Coordinate conversion and distance
library(StatsBombR)
library(tidyverse)
events <- get.matchFree(data.frame(match_id = 3869685)) %>%
cleanlocations() # split the list-column location into location.x and location.y
shots <- events %>%
filter(type.name == "Shot") %>%
select(player.name, location.x, location.y, shot.statsbomb_xg, shot.outcome.name) %>%
mutate(
# Convert to meters
meters_x = location.x / 120 * 105,
meters_y = location.y / 80 * 68,
# Goal center (end line center)
goal_x = 105,
goal_y = 34,
# Calculate distance
distance_to_goal = sqrt((goal_x - meters_x)^2 + (goal_y - meters_y)^2)
) %>%
arrange(distance_to_goal)
# Summary
cat("Average shot distance:", round(mean(shots$distance_to_goal), 1), "meters\n")
cat("Closest shot:", round(min(shots$distance_to_goal), 1), "meters\n")
cat("Furthest shot:", round(max(shots$distance_to_goal), 1), "meters\n")
# Plot distance vs xG
ggplot(shots, aes(x = distance_to_goal, y = shot.statsbomb_xg)) +
geom_point(aes(color = shot.outcome.name == "Goal"), size = 3, alpha = 0.7) +
geom_smooth(method = "loess", se = TRUE, color = "#1B5E20") +
scale_color_manual(values = c("FALSE" = "#888888", "TRUE" = "#FFD700"),
labels = c("No Goal", "Goal")) +
labs(
title = "Shot Distance vs Expected Goals",
x = "Distance to Goal (meters)",
y = "xG",
color = "Result"
) +
theme_minimal()
Exercise 2.3: Join Multiple Matches
Task: Load events from 5 World Cup 2022 matches and create a tournament-level player summary with total goals, xG, and shots across all matches.
- World Cup 2022: competition_id=43, season_id=106
- Loop through match IDs to load events
- Add match_id column before combining
- Aggregate shots across all matches
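The combine-then-aggregate pattern in these hints can be rehearsed offline on stand-in tables. This is a sketch under stated assumptions: the two small DataFrames, the match IDs 101 and 102, and the player labels are invented, and only the column names (`shot_outcome`, `shot_statsbomb_xg`) follow the StatsBomb naming used in this chapter.

```python
import pandas as pd

# Stand-ins for per-match shot tables; all values are invented for illustration.
match_tables = {
    101: pd.DataFrame({
        "player": ["A", "A", "B"],
        "shot_outcome": ["Goal", "Saved", "Off T"],
        "shot_statsbomb_xg": [0.30, 0.10, 0.05],
    }),
    102: pd.DataFrame({
        "player": ["A", "B"],
        "shot_outcome": ["Blocked", "Goal"],
        "shot_statsbomb_xg": [0.20, 0.40],
    }),
}

# Tag each table with its match ID before stacking them into one table.
frames = []
for match_id, shots in match_tables.items():
    frames.append(shots.assign(match_id=match_id))
all_shots = pd.concat(frames, ignore_index=True)

# One row per player across all matches.
summary = all_shots.groupby("player").agg(
    matches=("match_id", "nunique"),
    shots=("shot_outcome", "size"),
    goals=("shot_outcome", lambda s: (s == "Goal").sum()),
    xG=("shot_statsbomb_xg", "sum"),
)
print(summary)
```

Adding `match_id` before the concat is what makes `nunique` meaningful afterwards; without it, there is no way to tell which match a stacked row came from.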
# Solution 2.3: Multi-match player summary
from statsbombpy import sb
import pandas as pd
# Get World Cup 2022 matches
matches = sb.matches(competition_id=43, season_id=106)
# Load first 5 matches
all_events = []
for match_id in matches["match_id"].head(5):
    events = sb.events(match_id=match_id)
    events["match_id"] = match_id
    all_events.append(events)
all_events = pd.concat(all_events, ignore_index=True)
# Filter shots and aggregate
shots = all_events[all_events["type"] == "Shot"]
player_summary = shots.groupby("player").agg(
matches=("match_id", "nunique"),
shots=("type", "count"),
goals=("shot_outcome", lambda x: (x == "Goal").sum()),
xG=("shot_statsbomb_xg", "sum")
).reset_index()
player_summary["xG_diff"] = player_summary["goals"] - player_summary["xG"]
player_summary["shots_per_match"] = round(
player_summary["shots"] / player_summary["matches"], 2
)
# Filter and sort
player_summary = player_summary[player_summary["shots"] >= 3]
player_summary = player_summary.sort_values("xG", ascending=False)
print(player_summary.head(15))
# Solution 2.3: Multi-match player summary
library(StatsBombR)
library(tidyverse)
# Get matches
comp <- FreeCompetitions() %>%
filter(competition_id == 43, season_id == 106)
matches <- FreeMatches(comp)
# Load first 5 matches
match_ids <- head(matches$match_id, 5)
all_events <- map_dfr(match_ids, function(mid) {
events <- get.matchFree(data.frame(match_id = mid))
events$match_id <- mid
return(events)
})
# Player tournament summary
player_summary <- all_events %>%
filter(type.name == "Shot") %>%
group_by(player.name) %>%
summarise(
matches = n_distinct(match_id),
shots = n(),
goals = sum(shot.outcome.name == "Goal"),
xG = sum(shot.statsbomb_xg, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
xG_diff = goals - xG,
shots_per_match = round(shots / matches, 2)
) %>%
filter(shots >= 3) %>%
arrange(desc(xG))
print(player_summary, n = 15)
Key Takeaways
- Every event has context - identifiers, timing, location, and action-specific details
- Coordinate systems vary - always check your data provider's documentation
- Missing data isn't always bad - a null pass_outcome marks a completed pass in StatsBomb data
- Filter before aggregating - get the right subset of data first
- Visualize your wrangled data - charts help verify transformations are correct
- Practice makes perfect - the more you work with football data, the more intuitive it becomes