Chapter 2: Data Wrangling for Football

2.1 Understanding Football Data Structures

Before we can analyze football data effectively, we need to understand how it's organized. Event data follows a consistent structure that, once mastered, unlocks powerful analysis capabilities.

The Anatomy of an Event

Every action in a football match is recorded as an "event." Each event contains multiple pieces of information that allow us to reconstruct and analyze what happened.

event_structure.py

from statsbombpy import sb
import pandas as pd

# Load a match
events = sb.events(match_id=3869685)

# Look at a single pass event in detail
pass_events = events[events['type'] == 'Pass']
sample_pass = pass_events.iloc[0]

print("=== Anatomy of a Pass Event ===\n")

# Core identifiers
print("IDENTIFIERS:")
print(f"  Event ID: {sample_pass['id']}")
print(f"  Match ID: {sample_pass['match_id']}")
print(f"  Index: {sample_pass['index']}")

# Temporal information
print("\nTIMING:")
print(f"  Period: {sample_pass['period']}")
print(f"  Minute: {sample_pass['minute']}")
print(f"  Second: {sample_pass['second']}")
print(f"  Timestamp: {sample_pass['timestamp']}")

# Who and what
print("\nACTION:")
print(f"  Type: {sample_pass['type']}")
print(f"  Player: {sample_pass['player']}")
print(f"  Team: {sample_pass['team']}")
print(f"  Position: {sample_pass['position']}")

# Spatial information
print("\nLOCATION:")
print(f"  Start: {sample_pass['location']}")
print(f"  End: {sample_pass['pass_end_location']}")

# Pass-specific details
print("\nPASS DETAILS:")
print(f"  Recipient: {sample_pass['pass_recipient']}")
print(f"  Length: {sample_pass['pass_length']:.1f}")
print(f"  Angle: {sample_pass['pass_angle']:.2f} radians")
print(f"  Height: {sample_pass['pass_height']}")
print(f"  Body Part: {sample_pass['pass_body_part']}")
            

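library(StatsBombR)
library(tidyverse)

# Load a match; cleanlocations() unpacks the location lists into the
# location.x / location.y columns used below
events <- get.matchFree(data.frame(match_id = 3869685)) %>%
  cleanlocations()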

# Look at a single pass event in detail
pass_events <- events %>% filter(type.name == "Pass")
sample_pass <- pass_events %>% slice(1)

cat("=== Anatomy of a Pass Event ===\n\n")

# Core identifiers
cat("IDENTIFIERS:\n")
cat(sprintf("  Event ID: %s\n", sample_pass$id))
cat(sprintf("  Index: %d\n", sample_pass$index))

# Temporal information
cat("\nTIMING:\n")
cat(sprintf("  Period: %d\n", sample_pass$period))
cat(sprintf("  Minute: %d\n", sample_pass$minute))
cat(sprintf("  Second: %d\n", sample_pass$second))

# Who and what
cat("\nACTION:\n")
cat(sprintf("  Type: %s\n", sample_pass$type.name))
cat(sprintf("  Player: %s\n", sample_pass$player.name))
cat(sprintf("  Team: %s\n", sample_pass$team.name))
cat(sprintf("  Position: %s\n", sample_pass$position.name))

# Spatial information
cat("\nLOCATION:\n")
cat(sprintf("  Start: (%.1f, %.1f)\n", sample_pass$location.x, sample_pass$location.y))
cat(sprintf("  End: (%.1f, %.1f)\n", sample_pass$pass.end_location.x,
            sample_pass$pass.end_location.y))

# Pass-specific details
cat("\nPASS DETAILS:\n")
cat(sprintf("  Recipient: %s\n", sample_pass$pass.recipient.name))
cat(sprintf("  Length: %.1f\n", sample_pass$pass.length))
cat(sprintf("  Angle: %.2f radians\n", sample_pass$pass.angle))
            

Output

=== Anatomy of a Pass Event ===

IDENTIFIERS:
  Event ID: 8f3a9b2c-...
  Match ID: 3869685
  Index: 5

TIMING:
  Period: 1
  Minute: 0
  Second: 4
  Timestamp: 00:00:04.123

ACTION:
  Type: Pass
  Player: Lionel Andrés Messi Cuccittini
  Team: Argentina
  Position: Right Wing

LOCATION:
  Start: [60.0, 40.0]
  End: [55.0, 35.0]

PASS DETAILS:
  Recipient: Julián Álvarez
  Length: 7.1
  Angle: -2.36 radians
  Height: Ground Pass
  Body Part: Right Foot
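
Column names such as pass_end_location (Python) or pass.end_location.x (R) come from flattening a nested record. The sketch below illustrates the idea with a hand-written dictionary; the record, its IDs, and its values are hypothetical stand-ins, not real match data, and pd.json_normalize is used only to show the flattening, not as a description of how statsbombpy works internally.

```python
import pandas as pd

# Hypothetical event record (illustrative values only) that mimics the
# nested shape of a raw event before it is flattened into columns.
raw_event = {
    "id": "example-id",
    "index": 5,
    "period": 1,
    "minute": 0,
    "second": 4,
    "type": {"id": 1, "name": "Pass"},
    "team": {"id": 2, "name": "Example FC"},
    "player": {"id": 3, "name": "Example Player"},
    "location": [60.0, 40.0],
    "pass": {
        "length": 7.1,
        "end_location": [55.0, 35.0],
        "height": {"id": 4, "name": "Ground Pass"},
    },
}

# Nested dictionaries become prefixed columns; lists such as `location`
# stay as single list-valued cells.
flat = pd.json_normalize(raw_event, sep="_")

print(sorted(flat.columns))
print(flat.loc[0, "pass_height_name"])
```

Passing sep="." instead would produce dotted names in the style of the R columns used in this chapter.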

Event Types in Football Data

StatsBomb data includes dozens of event types. Here are the most common:

Attacking

Pass - Ball transferred between players
Carry - Ball moved while dribbling
Shot - Attempt on goal
Dribble - Take-on attempt

Defensive

Pressure - Pressing opponent
Duel - Contesting the ball (tackles are recorded as a Duel subtype, not as their own event type)
Interception - Cutting out pass
Block - Blocking shot/pass

Other

Ball Receipt* - Receiving a pass (the asterisk is part of the type name in the data)
Ball Recovery - Winning loose ball
Clearance - Clearing the ball
Foul Committed / Foul Won - Recorded as two separate event types
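
A grouping like this can be attached to the data as a new column. The sketch below uses a small hand-made DataFrame and a partial mapping; the category labels and the choice to send every unmapped type to "Other" are illustrative assumptions, not part of the StatsBomb specification.

```python
import pandas as pd

# Partial, illustrative mapping from event type to a broad category.
EVENT_CATEGORIES = {
    "Pass": "Attacking",
    "Carry": "Attacking",
    "Shot": "Attacking",
    "Dribble": "Attacking",
    "Pressure": "Defensive",
    "Interception": "Defensive",
    "Block": "Defensive",
}

# Toy events table standing in for a real match.
events = pd.DataFrame(
    {"type": ["Pass", "Pass", "Shot", "Pressure", "Clearance", "Carry", "Block"]}
)

# Types missing from the mapping come back as NaN and are labelled "Other".
events["category"] = events["type"].map(EVENT_CATEGORIES).fillna("Other")
category_counts = events["category"].value_counts()

print(category_counts)
```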

2.2 Working with DataFrames

DataFrames are the fundamental data structure for football analytics. Whether you use pandas (Python) or tidyverse (R), mastering DataFrame operations is essential.

Essential DataFrame Operations

dataframe_basics.py

                import pandas as pd
from statsbombpy import sb

# Load events
events = sb.events(match_id=3869685)

# 1. VIEWING DATA
print("Shape:", events.shape)  # (rows, columns)
print("\nFirst 5 rows:")
print(events.head())

# 2. SELECTING COLUMNS
# Single column
players = events['player']

# Multiple columns
subset = events[['player', 'team', 'type', 'minute']]

# 3. DATA TYPES
print("\nColumn types:")
print(events.dtypes.head(10))

# 4. BASIC STATISTICS
print("\nNumeric column stats:")
print(events[['minute', 'second']].describe())

# 5. UNIQUE VALUES
print("\nUnique event types:")
print(events['type'].unique())

# 6. VALUE COUNTS
print("\nEvent type distribution:")
print(events['type'].value_counts().head(10))

# 7. SORTING
# Sort chronologically (period first, because stoppage-time minutes overlap across periods)
sorted_events = events.sort_values(['period', 'minute', 'second'])

# Sort descending
top_xg_shots = events[events['type'] == 'Shot'].sort_values(
    'shot_statsbomb_xg', ascending=False
).head(5)
            

library(StatsBombR)
library(tidyverse)

# Load events
events <- get.matchFree(data.frame(match_id = 3869685))

# 1. VIEWING DATA
cat("Dimensions:", nrow(events), "x", ncol(events), "\n")
cat("\nFirst 5 rows:\n")
print(head(events, 5))

# 2. SELECTING COLUMNS
# Single column
players <- events$player.name

# Multiple columns (tidyverse way)
subset <- events %>%
  select(player.name, team.name, type.name, minute)

# 3. DATA TYPES
cat("\nColumn types:\n")
print(sapply(events[1:10], class))

# 4. BASIC STATISTICS
cat("\nNumeric column stats:\n")
events %>%
  select(minute, second) %>%
  summary() %>%
  print()

# 5. UNIQUE VALUES
cat("\nUnique event types:\n")
print(unique(events$type.name))

# 6. VALUE COUNTS
cat("\nEvent type distribution:\n")
events %>%
  count(type.name, sort = TRUE) %>%
  head(10) %>%
  print()

# 7. SORTING
# Sort chronologically (period first, because stoppage-time minutes overlap across periods)
sorted_events <- events %>% arrange(period, minute, second)

# Sort descending - top xG shots
top_xg_shots <- events %>%
  filter(type.name == "Shot") %>%
  arrange(desc(shot.statsbomb_xg)) %>%
  head(5)
            

2.3 Pitch Coordinate Systems

Different data providers use different coordinate systems. Understanding these is crucial for accurate visualization and analysis.

Important

Always check which coordinate system your data uses! Mixing up coordinate systems is one of the most common mistakes in football analytics.

Common Coordinate Systems

Provider	X Range	Y Range	Origin	Notes
StatsBomb	0 - 120	0 - 80	Top-left	Teams always attack left-to-right in data; Y increases downward
Opta	0 - 100	0 - 100	Bottom-left	Percentage-based system
Wyscout	0 - 100	0 - 100	Top-left	Y-axis inverted from Opta
UEFA	0 - 105	0 - 68	Bottom-left	Meters (standard pitch size)
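
One way to keep conversions consistent is to describe each system once and convert through normalised pitch fractions. The sketch below is a minimal version of that idea. PitchSpec, SPECS, and convert are hypothetical names introduced here, and the origin conventions are assumptions (StatsBomb and Wyscout measuring Y from the top touchline, Opta and the metre grid from the bottom) that should be checked against each provider's documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PitchSpec:
    x_max: float
    y_max: float
    y_origin_top: bool  # True when y = 0 lies on the top touchline


# Assumed conventions; verify against your provider's documentation.
SPECS = {
    "statsbomb": PitchSpec(120, 80, True),
    "opta": PitchSpec(100, 100, False),
    "wyscout": PitchSpec(100, 100, True),
    "metres": PitchSpec(105, 68, False),
}


def convert(x, y, source, target):
    """Convert one (x, y) point between two coordinate systems in SPECS."""
    src, dst = SPECS[source], SPECS[target]
    # Express the point as fractions of pitch length and width.
    fx = x / src.x_max
    fy = y / src.y_max
    # Flip Y only when the two systems measure it from opposite touchlines.
    if src.y_origin_top != dst.y_origin_top:
        fy = 1 - fy
    return fx * dst.x_max, fy * dst.y_max


opta_x, opta_y = convert(108.0, 36.0, "statsbomb", "opta")
print(f"Opta: ({opta_x:.1f}, {opta_y:.1f})")
```

Adding a provider then means adding one entry to SPECS rather than writing a new pair of functions.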

Converting Between Coordinate Systems

coordinate_conversion.py

import numpy as np

def convert_statsbomb_to_opta(x, y):
    """Convert StatsBomb (120x80, origin top-left) to Opta (100x100, origin bottom-left)."""
    opta_x = (x / 120) * 100
    opta_y = ((80 - y) / 80) * 100  # Flip Y: the origins sit on opposite touchlines
    return opta_x, opta_y

def convert_opta_to_statsbomb(x, y):
    """Convert Opta (100x100, origin bottom-left) to StatsBomb (120x80, origin top-left)."""
    sb_x = (x / 100) * 120
    sb_y = ((100 - y) / 100) * 80  # Flip Y
    return sb_x, sb_y

def convert_wyscout_to_statsbomb(x, y):
    """Convert Wyscout (100x100) to StatsBomb (120x80); both use a top-left origin."""
    sb_x = (x / 100) * 120
    sb_y = (y / 100) * 80  # No flip needed, only rescaling
    return sb_x, sb_y

# Example: Convert a StatsBomb shot location
shot_x, shot_y = 108.0, 36.0  # Near the penalty spot

opta_x, opta_y = convert_statsbomb_to_opta(shot_x, shot_y)
print(f"StatsBomb: ({shot_x}, {shot_y})")
print(f"Opta:      ({opta_x:.1f}, {opta_y:.1f})")

# Convert to meters (standard 105x68 pitch, origin bottom-left)
meters_x = (shot_x / 120) * 105
meters_y = ((80 - shot_y) / 80) * 68
print(f"Meters:    ({meters_x:.1f}m, {meters_y:.1f}m)")

# Calculate distance to goal center
# Scale each axis separately: one unit of x and one unit of y
# cover slightly different distances in meters
goal_x, goal_y = 120, 40  # Goal center in StatsBomb coords
dx_meters = (goal_x - shot_x) / 120 * 105
dy_meters = (goal_y - shot_y) / 80 * 68
distance_meters = np.hypot(dx_meters, dy_meters)
print(f"\nDistance to goal: {distance_meters:.1f} meters")
            

library(tidyverse)

# Conversion functions
# StatsBomb and Wyscout measure Y from the top touchline; Opta measures it from the bottom.
convert_statsbomb_to_opta <- function(x, y) {
  opta_x <- (x / 120) * 100
  opta_y <- ((80 - y) / 80) * 100  # Flip Y
  return(list(x = opta_x, y = opta_y))
}

convert_opta_to_statsbomb <- function(x, y) {
  sb_x <- (x / 100) * 120
  sb_y <- ((100 - y) / 100) * 80  # Flip Y
  return(list(x = sb_x, y = sb_y))
}

convert_wyscout_to_statsbomb <- function(x, y) {
  sb_x <- (x / 100) * 120
  sb_y <- (y / 100) * 80  # Same origin, so only rescale
  return(list(x = sb_x, y = sb_y))
}

# Example: Convert a StatsBomb shot location
shot_x <- 108.0
shot_y <- 36.0  # Near the penalty spot

opta <- convert_statsbomb_to_opta(shot_x, shot_y)
cat(sprintf("StatsBomb: (%.1f, %.1f)\n", shot_x, shot_y))
cat(sprintf("Opta:      (%.1f, %.1f)\n", opta$x, opta$y))

# Convert to meters (standard 105x68 pitch, origin bottom-left)
meters_x <- (shot_x / 120) * 105
meters_y <- ((80 - shot_y) / 80) * 68
cat(sprintf("Meters:    (%.1fm, %.1fm)\n", meters_x, meters_y))

# Calculate distance to goal center
# Scale each axis separately: one unit of x and one unit of y
# cover slightly different distances in meters
goal_x <- 120
goal_y <- 40  # Goal center in StatsBomb coords
dx_meters <- (goal_x - shot_x) / 120 * 105
dy_meters <- (goal_y - shot_y) / 80 * 68
distance_meters <- sqrt(dx_meters^2 + dy_meters^2)
cat(sprintf("\nDistance to goal: %.1f meters\n", distance_meters))
            

2.4 Filtering and Selecting Events

Filtering is how we isolate the specific events we want to analyze. Master these techniques to quickly extract exactly the data you need.

filtering_events.py

from statsbombpy import sb
import pandas as pd

events = sb.events(match_id=3869685)

# 1. FILTER BY EVENT TYPE
shots = events[events['type'] == 'Shot']
passes = events[events['type'] == 'Pass']
print(f"Shots: {len(shots)}, Passes: {len(passes)}")

# 2. FILTER BY TEAM
argentina_events = events[events['team'] == 'Argentina']
france_events = events[events['team'] == 'France']

# 3. FILTER BY PLAYER
messi_events = events[events['player'] == 'Lionel Andrés Messi Cuccittini']
print(f"Messi events: {len(messi_events)}")

# 4. FILTER BY TIME
# First half only
first_half = events[events['period'] == 1]

# Last 15 minutes of regular time
late_events = events[(events['minute'] >= 75) & (events['minute'] < 90)]

# 5. FILTER BY LOCATION (final third)
# StatsBomb: x > 80 is final third
final_third = events[events['location'].apply(
    lambda loc: loc[0] > 80 if isinstance(loc, list) else False
)]

# 6. MULTIPLE CONDITIONS
# Messi's successful passes in the final third
messi_passes_final_third = events[
    (events['player'] == 'Lionel Andrés Messi Cuccittini') &
    (events['type'] == 'Pass') &
    (events['pass_outcome'].isna()) &  # Successful pass = no outcome
    (events['location'].apply(lambda loc: loc[0] > 80 if isinstance(loc, list) else False))
]
print(f"Messi final third passes (successful): {len(messi_passes_final_third)}")

# 7. FILTER BY OUTCOME
# Successful passes only (pass_outcome is null for successful passes)
successful_passes = events[
    (events['type'] == 'Pass') &
    (events['pass_outcome'].isna())
]

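# Goals only (periods 1-4; period 5 is the penalty shootout,
# whose kicks are also recorded as Shot events; own goals are not Shot events)
goals = events[
    (events['type'] == 'Shot') &
    (events['shot_outcome'] == 'Goal') &
    (events['period'] <= 4)
]
print(f"Goals in match: {len(goals)}")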
            

library(StatsBombR)
library(tidyverse)

# cleanlocations() adds the location.x / location.y columns used below
events <- get.matchFree(data.frame(match_id = 3869685)) %>%
  cleanlocations()

# 1. FILTER BY EVENT TYPE
shots <- events %>% filter(type.name == "Shot")
passes <- events %>% filter(type.name == "Pass")
cat(sprintf("Shots: %d, Passes: %d\n", nrow(shots), nrow(passes)))

# 2. FILTER BY TEAM
argentina_events <- events %>% filter(team.name == "Argentina")
france_events <- events %>% filter(team.name == "France")

# 3. FILTER BY PLAYER
messi_events <- events %>%
  filter(str_detect(player.name, "Messi"))
cat(sprintf("Messi events: %d\n", nrow(messi_events)))

# 4. FILTER BY TIME
# First half only
first_half <- events %>% filter(period == 1)

# Last 15 minutes of regular time
late_events <- events %>% filter(minute >= 75, minute < 90)

# 5. FILTER BY LOCATION (final third)
# StatsBomb: x > 80 is final third
final_third <- events %>% filter(location.x > 80)

# 6. MULTIPLE CONDITIONS
# Messi's successful passes in the final third
messi_passes_final_third <- events %>%
  filter(
    str_detect(player.name, "Messi"),
    type.name == "Pass",
    is.na(pass.outcome.name),  # Successful pass
    location.x > 80
  )
cat(sprintf("Messi final third passes (successful): %d\n",
            nrow(messi_passes_final_third)))

# 7. FILTER BY OUTCOME
# Successful passes only
successful_passes <- events %>%
  filter(type.name == "Pass", is.na(pass.outcome.name))

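# Goals only (periods 1-4; period 5 is the penalty shootout,
# whose kicks are also recorded as Shot events)
goals <- events %>%
  filter(type.name == "Shot", shot.outcome.name == "Goal", period <= 4)
cat(sprintf("Goals in match: %d\n", nrow(goals)))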
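
Location filters become simpler once the [x, y] list is split into numeric columns. The sketch below shows one way to do that on a toy DataFrame. add_xy is a hypothetical helper written for this example, not a statsbombpy function, and the rows are made-up values.

```python
import numpy as np
import pandas as pd


def add_xy(events: pd.DataFrame, column: str = "location") -> pd.DataFrame:
    """Return a copy with <column>_x and <column>_y split out of a list column."""
    out = events.copy()
    # Rows without a location (for example administrative events) get NaN.
    out[f"{column}_x"] = [v[0] if isinstance(v, list) else np.nan for v in out[column]]
    out[f"{column}_y"] = [v[1] if isinstance(v, list) else np.nan for v in out[column]]
    return out


# Toy events table; the third row has no location.
events = pd.DataFrame(
    {
        "type": ["Pass", "Shot", "Half Start", "Pass"],
        "location": [[30.0, 40.0], [105.0, 38.0], np.nan, [85.0, 10.0]],
    }
)

events = add_xy(events)

# Comparisons against NaN are False, so rows without a location drop out.
final_third = events[events["location_x"] > 80]
print(final_third[["type", "location_x", "location_y"]])
```

Splitting once and filtering on plain columns avoids repeating the isinstance check in every condition.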
            

2.5 Aggregating Match Statistics

Aggregation transforms event-level data into summary statistics. This is essential for comparing players, teams, and matches.

aggregating_stats.py

from statsbombpy import sb
import pandas as pd

events = sb.events(match_id=3869685)

# 1. BASIC AGGREGATION BY TEAM
team_stats = events.groupby('team').agg(
    total_events=('type', 'count'),
    passes=('type', lambda x: (x == 'Pass').sum()),
    shots=('type', lambda x: (x == 'Shot').sum()),
    pressures=('type', lambda x: (x == 'Pressure').sum())
)
print("Team Statistics:")
print(team_stats)

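# 2. SHOT STATISTICS BY TEAM
# Exclude period 5: penalty-shootout kicks are also recorded as Shot events
shots = events[(events['type'] == 'Shot') & (events['period'] <= 4)]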
shot_stats = shots.groupby('team').agg(
    total_shots=('type', 'count'),
    goals=('shot_outcome', lambda x: (x == 'Goal').sum()),
    total_xG=('shot_statsbomb_xg', 'sum'),
    avg_xG=('shot_statsbomb_xg', 'mean'),
    on_target=('shot_outcome', lambda x: x.isin(['Goal', 'Saved']).sum())
).round(2)
print("\nShot Statistics:")
print(shot_stats)

# 3. PLAYER-LEVEL STATISTICS
player_stats = events.groupby(['team', 'player']).agg(
    total_actions=('type', 'count'),
    passes=('type', lambda x: (x == 'Pass').sum()),
    shots=('type', lambda x: (x == 'Shot').sum()),
    xG=('shot_statsbomb_xg', 'sum')
).reset_index()

# Top 10 players by actions
print("\nTop 10 Players by Actions:")
print(player_stats.nlargest(10, 'total_actions')[['player', 'team', 'total_actions']])

# 4. PASS COMPLETION BY PLAYER
passes = events[events['type'] == 'Pass']
pass_stats = passes.groupby('player').agg(
    total_passes=('type', 'count'),
    successful=('pass_outcome', lambda x: x.isna().sum()),
).assign(
    completion_rate=lambda df: (df['successful'] / df['total_passes'] * 100).round(1)
).sort_values('total_passes', ascending=False)

print("\nPass Completion Rates (min 20 passes):")
print(pass_stats[pass_stats['total_passes'] >= 20].head(10))

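# 5. TIME-BASED AGGREGATION (events per 15 minutes)
# right=False gives bins [0, 15), [15, 30), ... so minute 0 is not dropped
events['time_bin'] = pd.cut(events['minute'], bins=range(0, 150, 15), right=False)
time_stats = events.groupby(['team', 'time_bin'], observed=True).agg(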
    events=('type', 'count'),
    shots=('type', lambda x: (x == 'Shot').sum())
)
print("\nEvents by 15-minute intervals:")
print(time_stats.head(10))
            

library(StatsBombR)
library(tidyverse)

events <- get.matchFree(data.frame(match_id = 3869685))

# 1. BASIC AGGREGATION BY TEAM
team_stats <- events %>%
  group_by(team.name) %>%
  summarise(
    total_events = n(),
    passes = sum(type.name == "Pass"),
    shots = sum(type.name == "Shot"),
    pressures = sum(type.name == "Pressure")
  )
cat("Team Statistics:\n")
print(team_stats)

# 2. SHOT STATISTICS BY TEAM
# Exclude period 5: penalty-shootout kicks are also recorded as Shot events
shot_stats <- events %>%
  filter(type.name == "Shot", period <= 4) %>%
  group_by(team.name) %>%
  summarise(
    total_shots = n(),
    goals = sum(shot.outcome.name == "Goal", na.rm = TRUE),
    total_xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    avg_xG = mean(shot.statsbomb_xg, na.rm = TRUE),
    on_target = sum(shot.outcome.name %in% c("Goal", "Saved"), na.rm = TRUE)
  ) %>%
  mutate(across(where(is.numeric), ~round(., 2)))
cat("\nShot Statistics:\n")
print(shot_stats)

# 3. PLAYER-LEVEL STATISTICS
player_stats <- events %>%
  group_by(team.name, player.name) %>%
  summarise(
    total_actions = n(),
    passes = sum(type.name == "Pass"),
    shots = sum(type.name == "Shot"),
    xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    .groups = "drop"
  )

# Top 10 players by actions
cat("\nTop 10 Players by Actions:\n")
player_stats %>%
  arrange(desc(total_actions)) %>%
  head(10) %>%
  select(player.name, team.name, total_actions) %>%
  print()

# 4. PASS COMPLETION BY PLAYER
pass_stats <- events %>%
  filter(type.name == "Pass") %>%
  group_by(player.name) %>%
  summarise(
    total_passes = n(),
    successful = sum(is.na(pass.outcome.name))
  ) %>%
  mutate(completion_rate = round(successful / total_passes * 100, 1)) %>%
  arrange(desc(total_passes))

cat("\nPass Completion Rates (min 20 passes):\n")
pass_stats %>%
  filter(total_passes >= 20) %>%
  head(10) %>%
  print()
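
When the goal is simply a count of event types per team, a cross-tabulation is a compact alternative to writing one lambda per type. The sketch below uses a toy DataFrame with made-up team labels; pd.crosstab is a standard pandas function.

```python
import pandas as pd

# Toy events table standing in for a real match.
events = pd.DataFrame(
    {
        "team": ["A", "A", "A", "B", "B", "B", "B"],
        "type": ["Pass", "Pass", "Shot", "Pass", "Pressure", "Shot", "Shot"],
    }
)

# Rows are teams, columns are event types, cells are counts.
# Combinations that never occur are filled with 0.
counts = pd.crosstab(events["team"], events["type"])
print(counts)
```

The named-aggregation form shown earlier remains the better fit when counts must sit alongside sums or means in the same table.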
            

2.6 Handling Missing Data

Missing data is common in football datasets. Understanding why data is missing and how to handle it properly is crucial for accurate analysis.

Types of Missing Data in Football

Scenario	Example	How to Handle
Intentionally Missing	pass_outcome is null for successful passes	This is by design - null means success
Not Applicable	shot_statsbomb_xg for non-shot events	Filter first, then analyze
Data Not Collected	xG not available in older datasets	Exclude or calculate your own
Tracking Issues	Player location temporarily lost	Interpolate or exclude
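
The first two rows of the table interact: a null pass_outcome means success for a pass, but it is also null for every non-pass event simply because the field does not apply. The sketch below, on made-up rows, turns the implicit encoding into an explicit column that is only defined for passes. The column name pass_completed is a hypothetical choice for this example.

```python
import numpy as np
import pandas as pd

# Toy events table: three passes (one incomplete), one shot, one pressure.
events = pd.DataFrame(
    {
        "type": ["Pass", "Pass", "Shot", "Pass", "Pressure"],
        "pass_outcome": [np.nan, "Incomplete", np.nan, np.nan, np.nan],
    }
)

is_pass = events["type"] == "Pass"

# Nullable boolean: True/False for passes, <NA> where the field does not apply.
events["pass_completed"] = (
    events["pass_outcome"].isna().astype("boolean").where(is_pass)
)

# Missing values are skipped, so only passes enter the rate.
completion_rate = events["pass_completed"].mean()

# Naive version: treats every null as a success, including non-pass rows.
naive_rate = events["pass_outcome"].isna().mean()

print(f"Completion rate (passes only): {completion_rate:.3f}")
print(f"Naive rate (all rows):         {naive_rate:.3f}")
```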

handling_missing.py

import pandas as pd
from statsbombpy import sb

events = sb.events(match_id=3869685)

# 1. CHECK FOR MISSING VALUES
print("Missing values per column:")
print(events.isnull().sum().sort_values(ascending=False).head(20))

# 2. UNDERSTAND MISSING PATTERNS
# For shots: check xG availability
shots = events[events['type'] == 'Shot']
print(f"\nShots with xG: {shots['shot_statsbomb_xg'].notna().sum()}")
print(f"Shots without xG: {shots['shot_statsbomb_xg'].isna().sum()}")

# 3. PASS OUTCOME - null means SUCCESS
passes = events[events['type'] == 'Pass']
successful = passes['pass_outcome'].isna().sum()
unsuccessful = passes['pass_outcome'].notna().sum()
print(f"\nPass outcomes:")
print(f"  Successful (null): {successful}")
print(f"  Unsuccessful: {unsuccessful}")
print(f"  Completion rate: {successful/(successful+unsuccessful)*100:.1f}%")

# 4. FILLING MISSING VALUES (when appropriate)
# Example: Fill missing xG with 0 for non-shot events
events['shot_xg_filled'] = events['shot_statsbomb_xg'].fillna(0)

# Example: Fill missing player positions with 'Unknown'
events['position_filled'] = events['position'].fillna('Unknown')

# 5. DROPPING ROWS WITH MISSING CRITICAL DATA
# Only analyze events with complete location data
events_with_location = events.dropna(subset=['location'])
print(f"\nEvents with location: {len(events_with_location)}")
print(f"Events without location: {len(events) - len(events_with_location)}")
            

library(StatsBombR)
library(tidyverse)

# cleanlocations() adds the location.x / location.y columns used below
events <- get.matchFree(data.frame(match_id = 3869685)) %>%
  cleanlocations()

# 1. CHECK FOR MISSING VALUES
cat("Missing values per column (top 20):\n")
events %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "missing") %>%
  arrange(desc(missing)) %>%
  head(20) %>%
  print()

# 2. UNDERSTAND MISSING PATTERNS
# For shots: check xG availability
shots <- events %>% filter(type.name == "Shot")
cat(sprintf("\nShots with xG: %d\n", sum(!is.na(shots$shot.statsbomb_xg))))
cat(sprintf("Shots without xG: %d\n", sum(is.na(shots$shot.statsbomb_xg))))

# 3. PASS OUTCOME - NA means SUCCESS
passes <- events %>% filter(type.name == "Pass")
successful <- sum(is.na(passes$pass.outcome.name))
unsuccessful <- sum(!is.na(passes$pass.outcome.name))
cat("\nPass outcomes:\n")
cat(sprintf("  Successful (NA): %d\n", successful))
cat(sprintf("  Unsuccessful: %d\n", unsuccessful))
cat(sprintf("  Completion rate: %.1f%%\n",
            successful/(successful+unsuccessful)*100))

# 4. FILLING MISSING VALUES
# Example: Fill missing xG with 0
events <- events %>%
  mutate(shot_xg_filled = replace_na(shot.statsbomb_xg, 0))

# Example: Fill missing positions
events <- events %>%
  mutate(position_filled = replace_na(position.name, "Unknown"))

# 5. DROPPING ROWS WITH MISSING DATA
events_with_location <- events %>%
  filter(!is.na(location.x), !is.na(location.y))
cat(sprintf("\nEvents with location: %d\n", nrow(events_with_location)))
cat(sprintf("Events without location: %d\n",
            nrow(events) - nrow(events_with_location)))
            

2.7 Joining Multiple Datasets

Often you'll need to combine data from multiple sources or merge match-level data with player information. Understanding joins is essential.
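
The join type decides which rows survive when keys do not match on both sides. The sketch below compares left, inner, and outer joins on two toy tables with made-up IDs; indicator and validate are standard pandas merge options, used here as optional safety checks.

```python
import pandas as pd

# Toy tables: match 3 has events but no match record,
# match 4 has a match record but no events.
events = pd.DataFrame(
    {"match_id": [1, 1, 2, 3], "type": ["Pass", "Shot", "Pass", "Pass"]}
)
matches = pd.DataFrame(
    {"match_id": [1, 2, 4], "home_team": ["A", "C", "E"]}
)

# Left join: keep every event; unmatched events get NaN match info.
# validate raises if match_id is not unique in `matches`;
# indicator adds a _merge column showing where each row came from.
left = events.merge(
    matches, on="match_id", how="left", validate="many_to_one", indicator=True
)

# Inner join: keep only events whose match is known.
inner = events.merge(matches, on="match_id", how="inner")

# Outer join: keep everything from both sides.
outer = events.merge(matches, on="match_id", how="outer")

print(len(left), len(inner), len(outer))
print(left[left["_merge"] == "left_only"])
```

A left join from events to match info is usually the safe default for event analysis, because it never silently drops events.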

joining_data.py

import pandas as pd
from statsbombpy import sb

# Load multiple matches
matches = sb.matches(competition_id=43, season_id=106)  # World Cup 2022

# Load events for multiple matches
all_events = []
for match_id in matches['match_id'].head(5):  # First 5 matches
    events = sb.events(match_id=match_id)
    events['match_id'] = match_id
    all_events.append(events)

all_events = pd.concat(all_events, ignore_index=True)
print(f"Total events across 5 matches: {len(all_events)}")

# JOIN: Add match info to events
match_info = matches[['match_id', 'home_team', 'away_team', 'home_score', 'away_score']]
events_with_match = all_events.merge(match_info, on='match_id', how='left')

print("\nEvents with match info:")
print(events_with_match[['match_id', 'home_team', 'away_team', 'type', 'player']].head())

# AGGREGATE: Player stats across all matches
player_tournament_stats = all_events.groupby('player').agg(
    matches=('match_id', 'nunique'),
    total_actions=('type', 'count'),
    goals=('shot_outcome', lambda x: (x == 'Goal').sum()),
    xG=('shot_statsbomb_xg', 'sum'),
    passes=('type', lambda x: (x == 'Pass').sum())
).reset_index().sort_values('total_actions', ascending=False)

print("\nPlayer tournament stats (top 10 by actions):")
print(player_tournament_stats.head(10))

# Create match result column from the perspective of each event's team
# (a match decided on penalties shows as a draw if the scores exclude the shootout)
is_home = events_with_match['team'] == events_with_match['home_team']
home_margin = events_with_match['home_score'] - events_with_match['away_score']
team_margin = home_margin.where(is_home, -home_margin)
events_with_match['match_result'] = team_margin.map(
    lambda margin: 'Win' if margin > 0 else ('Loss' if margin < 0 else 'Draw')
)
            

library(StatsBombR)
library(tidyverse)

# Load World Cup 2022
comp <- FreeCompetitions() %>%
  filter(competition_id == 43, season_id == 106)
matches <- FreeMatches(comp)

# Load events for first 5 matches
match_ids <- head(matches$match_id, 5)
all_events <- map_dfr(match_ids, function(mid) {
  events <- get.matchFree(data.frame(match_id = mid))
  events$match_id <- mid
  return(events)
})

cat(sprintf("Total events across 5 matches: %d\n", nrow(all_events)))

# JOIN: Add match info to events
match_info <- matches %>%
  select(match_id, home_team.home_team_name, away_team.away_team_name,
         home_score, away_score)

events_with_match <- all_events %>%
  left_join(match_info, by = "match_id")

cat("\nEvents with match info:\n")
events_with_match %>%
  select(match_id, home_team.home_team_name, away_team.away_team_name,
         type.name, player.name) %>%
  head() %>%
  print()

# AGGREGATE: Player stats across all matches
player_tournament_stats <- all_events %>%
  group_by(player.name) %>%
  summarise(
    matches = n_distinct(match_id),
    total_actions = n(),
    goals = sum(shot.outcome.name == "Goal", na.rm = TRUE),
    xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    passes = sum(type.name == "Pass"),
    .groups = "drop"
  ) %>%
  arrange(desc(total_actions))

cat("\nPlayer tournament stats (top 10 by actions):\n")
player_tournament_stats %>% head(10) %>% print()
            

2.8 Summary

In this chapter, you learned:

Key Concepts

The structure of football event data
Different pitch coordinate systems
How to identify and handle missing data
When to use different types of joins

Technical Skills

Filtering events by type, team, player, time, location
Aggregating statistics at player and team level
Converting between coordinate systems
Joining and combining multiple datasets

What's Next

In Chapter 3: The Football Data Ecosystem, we'll explore where to find football data, from free open sources to commercial providers, and how to access each one.

2.9 Data Wrangling Visualizations

After wrangling your data, visualization helps verify your transformations and explore patterns. Here are essential visualizations for data wrangling workflows.

Event Distribution Bar Chart

Visualize the distribution of event types to understand match dynamics:

script

# Event type distribution bar chart
from statsbombpy import sb
import matplotlib.pyplot as plt

# Load match data
events = sb.events(match_id=3869685)

# Count events by type
event_counts = events["type"].value_counts().head(15)

# Create bar chart
fig, ax = plt.subplots(figsize=(10, 8))
colors = plt.cm.Greens(event_counts / event_counts.max())

bars = ax.barh(event_counts.index, event_counts.values, color=colors)
ax.set_xlabel("Count", fontsize=12)
ax.set_ylabel("Event Type", fontsize=12)
ax.set_title("Event Type Distribution\nWorld Cup 2022 Final", fontsize=16, fontweight="bold")

# Add value labels
for bar, val in zip(bars, event_counts.values):
    ax.text(val + 5, bar.get_y() + bar.get_height()/2, str(val),
            va="center", fontsize=10)

plt.tight_layout()
plt.show()

# Event type distribution bar chart
library(StatsBombR)
library(tidyverse)

# Load match data
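events <- get.matchFree(data.frame(match_id = 3869685)) %>%
  cleanlocations()  # adds location.x / location.y, used by the scatter plot later in this section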

# Count events by type
event_counts <- events %>%
  count(type.name, sort = TRUE) %>%
  head(15)

# Create bar chart
ggplot(event_counts, aes(x = reorder(type.name, n), y = n, fill = n)) +
  geom_col() +
  coord_flip() +
  scale_fill_gradient(low = "#90EE90", high = "#1B5E20") +
  labs(
    title = "Event Type Distribution",
    subtitle = "World Cup 2022 Final",
    x = "Event Type",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold", size = 16),
    axis.text = element_text(size = 10)
  )

Team Action Comparison

Compare event counts between teams with a grouped bar chart:

script

# Team comparison grouped bar chart
import pandas as pd
import matplotlib.pyplot as plt

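# Filter key events (reuses `events` loaded in the previous script;
# tackles are recorded under the "Duel" event type in StatsBomb data)
key_events = events[events["type"].isin(["Pass", "Shot", "Pressure", "Dribble", "Duel"])]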
team_counts = key_events.groupby(["team", "type"]).size().unstack(fill_value=0)

# Create grouped bar chart
fig, ax = plt.subplots(figsize=(12, 6))
x = range(len(team_counts.columns))
width = 0.35

teams = team_counts.index.tolist()
colors = {"Argentina": "#75AADB", "France": "#002395"}

for i, team in enumerate(teams):
    offset = width * (i - 0.5)
    bars = ax.bar([xi + offset for xi in x], team_counts.loc[team],
                  width, label=team, color=colors.get(team, "gray"))

ax.set_xlabel("Event Type", fontsize=12)
ax.set_ylabel("Count", fontsize=12)
ax.set_title("Team Action Comparison\nKey Event Types", fontsize=16, fontweight="bold")
ax.set_xticks(x)
ax.set_xticklabels(team_counts.columns, rotation=45, ha="right")
ax.legend()

plt.tight_layout()
plt.show()

# Team comparison grouped bar chart
team_events <- events %>%
  # Tackles are recorded under the "Duel" event type in StatsBomb data
  filter(type.name %in% c("Pass", "Shot", "Pressure", "Dribble", "Duel")) %>%
  count(team.name, type.name)

ggplot(team_events, aes(x = type.name, y = n, fill = team.name)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Argentina" = "#75AADB", "France" = "#002395")) +
  labs(
    title = "Team Action Comparison",
    subtitle = "Key Event Types",
    x = "Event Type",
    y = "Count",
    fill = "Team"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    legend.position = "bottom"
  )

Time-Series Event Flow

Visualize how events unfold over match time:

script

# Events over time line chart
import matplotlib.pyplot as plt
import pandas as pd

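# Create 5-minute bins; right=False keeps minute 0 in the first bin
passes = events[events["type"] == "Pass"].copy()
passes["time_bin"] = pd.cut(passes["minute"], bins=range(0, 135, 5), right=False)

# Count by team and time (observed=False keeps empty bins so the time axis stays continuous)
time_counts = passes.groupby(["team", "time_bin"], observed=False).size().unstack(level=0, fill_value=0)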

# Plot
fig, ax = plt.subplots(figsize=(14, 6))
colors = {"Argentina": "#75AADB", "France": "#002395"}

for team in time_counts.columns:
    ax.plot(range(len(time_counts)), time_counts[team],
            marker="o", label=team, color=colors.get(team, "gray"), linewidth=2)

ax.set_xlabel("Time Period", fontsize=12)
ax.set_ylabel("Pass Count", fontsize=12)
ax.set_title("Passing Activity Over Match Time", fontsize=16, fontweight="bold")
ax.set_xticks(range(len(time_counts)))
ax.set_xticklabels([str(x) for x in time_counts.index], rotation=45, ha="right", fontsize=8)
ax.legend()
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Events over time line chart
events_timeline <- events %>%
  filter(type.name %in% c("Pass", "Shot")) %>%
  # right = FALSE gives bins [0, 5), [5, 10), ... so minute 0 is not dropped
  mutate(time_bin = cut(minute, breaks = seq(0, 130, 5), right = FALSE)) %>%
  count(team.name, type.name, time_bin) %>%
  filter(!is.na(time_bin))

ggplot(events_timeline, aes(x = time_bin, y = n, color = team.name, group = team.name)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  facet_wrap(~type.name, scales = "free_y") +
  scale_color_manual(values = c("Argentina" = "#75AADB", "France" = "#002395")) +
  labs(
    title = "Event Flow Over Match Time",
    x = "Time Period (minutes)",
    y = "Count",
    color = "Team"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    plot.title = element_text(face = "bold"),
    legend.position = "bottom"
  )

Coordinate Distribution Scatter Plot

Verify coordinate data by plotting event locations:

script

# Event location scatter plot
from mplsoccer import Pitch
import matplotlib.pyplot as plt

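# Filter events with locations (reuses `events` loaded earlier in this section)
# statsbombpy stores positions as [x, y] lists in `location`, so unpack them first
locations = events[events["type"].isin(["Pass", "Shot", "Carry"])].copy()
locations = locations[locations["location"].notna()]
locations["x"] = locations["location"].str[0].astype(float)
locations["y"] = locations["location"].str[1].astype(float)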

# Create pitch for each team
teams = locations["team"].unique()
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

colors = {"Pass": "#87CEEB", "Shot": "#FFD700", "Carry": "#FF6B6B"}
pitch = Pitch(pitch_color="#1B5E20", line_color="white")

for ax, team in zip(axes, teams):
    pitch.draw(ax=ax)
    team_data = locations[locations["team"] == team]

    for event_type, color in colors.items():
        type_data = team_data[team_data["type"] == event_type]
        ax.scatter(type_data["x"], type_data["y"], c=color,
                   alpha=0.4, s=20, label=event_type)

    ax.set_title(team, fontsize=14, fontweight="bold")
    ax.legend(loc="upper left", fontsize=8)

fig.suptitle("Event Locations by Type and Team", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.show()

# Event location scatter plot
library(ggsoccer)

event_locations <- events %>%
  filter(!is.na(location.x), !is.na(location.y)) %>%
  filter(type.name %in% c("Pass", "Shot", "Carry"))

ggplot(event_locations) +
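  # ggsoccer draws an Opta-sized pitch by default; request StatsBomb dimensions
  annotate_pitch(dimensions = pitch_statsbomb, colour = "white", fill = "#1B5E20") +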
  geom_point(aes(x = location.x, y = location.y, color = type.name),
             alpha = 0.4, size = 1.5) +
  scale_color_manual(values = c("Pass" = "#87CEEB", "Shot" = "#FFD700", "Carry" = "#FF6B6B")) +
  facet_wrap(~team.name) +
  theme_pitch() +
  coord_flip() +
  labs(
    title = "Event Locations by Type and Team",
    color = "Event Type"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )
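
A plot catches coordinate problems visually; a numeric bounds check can complement it. The sketch below assumes StatsBomb's 120x80 pitch and uses a hypothetical helper, `find_out_of_bounds`, which is not part of statsbombpy or mplsoccer. It works on plain (x, y) tuples with made-up sample values, so it runs without downloading any data.

```python
# Hypothetical helper (not part of statsbombpy or mplsoccer): flag coordinates
# that fall outside the StatsBomb pitch, assumed here to span 0-120 by 0-80.
PITCH_LENGTH = 120.0
PITCH_WIDTH = 80.0


def find_out_of_bounds(points, length=PITCH_LENGTH, width=PITCH_WIDTH):
    """Return (index, x, y) for every point lying outside the pitch."""
    bad = []
    for i, (x, y) in enumerate(points):
        if not (0 <= x <= length and 0 <= y <= width):
            bad.append((i, x, y))
    return bad


# Made-up sample coordinates for illustration only
sample_points = [(60.0, 40.0), (108.0, 40.0), (125.0, 30.0), (50.0, -2.0)]
outliers = find_out_of_bounds(sample_points)
print(outliers)
```

With the `locations` DataFrame from the plotting code, the same check could be run as `find_out_of_bounds(zip(locations["x"], locations["y"]))`, assuming the `x` and `y` columns exist as they do there.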

2.10 Practice Exercises

Apply your data wrangling skills with these exercises. Each includes hints and full solutions.

Exercise 2.1: Filter and Aggregate Player Stats

Task: Create a summary of passing statistics for all players who attempted at least 30 passes in a match. Include: total passes, completed passes, completion rate, and progressive passes.

Filter for Pass events first
Group by player name
Successful passes have null pass_outcome in StatsBomb
Progressive passes move the ball 10+ yards toward goal
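
Before working on real match data, the hints above can be tried on a tiny table whose answers are easy to check by hand. This is a minimal sketch with made-up passes; the column names `start_x` and `end_x` are illustrative assumptions (StatsBomb stores locations as `[x, y]` lists), and "progressive" is taken to mean a gain of at least 10 yards along x, counted only when the pass is completed.

```python
import pandas as pd

# Made-up passes for illustration; real StatsBomb data has many more columns.
passes = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B"],
    "pass_outcome": [None, "Incomplete", None, None, "Out"],
    "start_x": [30.0, 50.0, 70.0, 40.0, 60.0],
    "end_x": [45.0, 65.0, 75.0, 48.0, 90.0],
})

# A missing pass_outcome marks a completed pass
passes["is_complete"] = passes["pass_outcome"].isna()

# Assumption: progressive = at least 10 yards gained along x, and completed
gain = passes["end_x"] - passes["start_x"]
passes["is_progressive"] = (gain >= 10) & passes["is_complete"]

# One row per player with counts and completion rate
summary = passes.groupby("player").agg(
    total_passes=("is_complete", "count"),
    completed=("is_complete", "sum"),
    progressive=("is_progressive", "sum"),
)
summary["completion_rate"] = (
    summary["completed"] / summary["total_passes"] * 100
).round(1)
print(summary)
```

Player A completes 2 of 3 passes, one of them progressive; player B completes 1 of 2, neither progressive, because the 30-yard pass went out of play.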

script

# Solution 2.1: Player passing summary
from statsbombpy import sb
import pandas as pd

events = sb.events(match_id=3869685)

# Filter passes
passes = events[events["type"] == "Pass"].copy()

# Extract the x-coordinate of a [x, y] location (NaN when it is missing)
def x_coord(loc):
    return loc[0] if isinstance(loc, list) else float("nan")

# Calculate metrics: a null pass_outcome marks a completed pass
passes["is_complete"] = passes["pass_outcome"].isna()
passes["is_progressive"] = (
    passes["pass_end_location"].apply(x_coord) - passes["location"].apply(x_coord)
) >= 10
passes["is_progressive_complete"] = passes["is_progressive"] & passes["is_complete"]

# Aggregate by player (progressive counts completed progressive passes only)
player_passing = passes.groupby(["player", "team"]).agg(
    total_passes=("type", "count"),
    completed=("is_complete", "sum"),
    progressive=("is_progressive_complete", "sum")
).reset_index()

# Filter and calculate rate
player_passing = player_passing[player_passing["total_passes"] >= 30]
player_passing["completion_rate"] = round(
    player_passing["completed"] / player_passing["total_passes"] * 100, 1
)

print(player_passing.sort_values("completion_rate", ascending=False))

# Solution 2.1: Player passing summary
library(StatsBombR)
library(tidyverse)

events <- get.matchFree(data.frame(match_id = 3869685))

player_passing <- events %>%
  filter(type.name == "Pass") %>%
  mutate(
    is_complete = is.na(pass.outcome.name),
    is_progressive = (pass.end_location.x - location.x) >= 10
  ) %>%
  group_by(player.name, team.name) %>%
  summarise(
    total_passes = n(),
    completed = sum(is_complete),
    progressive = sum(is_progressive & is_complete, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(total_passes >= 30) %>%
  mutate(
    completion_rate = round(completed / total_passes * 100, 1)
  ) %>%
  arrange(desc(completion_rate))

print(player_passing)

Exercise 2.2: Convert Coordinates and Plot

Task: Take shot data in StatsBomb coordinates (120x80) and convert it to meters on a 105x68 pitch. Then calculate the distance from each shot to the center of the goal and plot that distance against xG.

Conversion: meters_x = sb_x / 120 * 105
Conversion: meters_y = sb_y / 80 * 68
Goal center in meters: (105, 34)
Distance formula: sqrt((x2-x1)² + (y2-y1)²)
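
The conversion and distance formulas above can be checked on a known landmark before they are applied to match data. The sketch below uses two hypothetical helpers, `sb_to_meters` and `distance_to_goal`, and tests them on the penalty spot, which sits at (108, 40) in StatsBomb coordinates.

```python
import math

# Assumed dimensions: StatsBomb's 120x80 grid rescaled to a 105x68 m pitch
SB_LENGTH, SB_WIDTH = 120.0, 80.0
M_LENGTH, M_WIDTH = 105.0, 68.0


def sb_to_meters(sb_x, sb_y):
    """Rescale StatsBomb coordinates to meters on a 105x68 pitch."""
    return sb_x / SB_LENGTH * M_LENGTH, sb_y / SB_WIDTH * M_WIDTH


def distance_to_goal(sb_x, sb_y):
    """Euclidean distance in meters from a location to the goal center (105, 34)."""
    x, y = sb_to_meters(sb_x, sb_y)
    return math.hypot(M_LENGTH - x, M_WIDTH / 2 - y)


# Penalty spot in StatsBomb coordinates
penalty_x, penalty_y = sb_to_meters(108, 40)
penalty_distance = distance_to_goal(108, 40)
print(penalty_x, penalty_y, round(penalty_distance, 2))
```

The penalty spot maps to (94.5, 34.0), giving a distance of 10.5 m rather than the regulation 12 yards (about 11 m). The gap arises because the linear rescaling treats 120 units as 105 m, so distances in meters are approximations tied to the assumed pitch size.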

script

# Solution 2.2: Coordinate conversion and distance
from statsbombpy import sb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

events = sb.events(match_id=3869685)

shots = events[events["type"] == "Shot"].copy()

# Extract coordinates
shots["sb_x"] = shots["location"].apply(lambda x: x[0] if isinstance(x, list) else None)
shots["sb_y"] = shots["location"].apply(lambda x: x[1] if isinstance(x, list) else None)

# Convert to meters
shots["meters_x"] = shots["sb_x"] / 120 * 105
shots["meters_y"] = shots["sb_y"] / 80 * 68

# Goal center
goal_x, goal_y = 105, 34

# Calculate distance
shots["distance_to_goal"] = np.sqrt(
    (goal_x - shots["meters_x"])**2 +
    (goal_y - shots["meters_y"])**2
)

# Summary stats
print(f"Average shot distance: {shots['distance_to_goal'].mean():.1f} meters")
print(f"Closest shot: {shots['distance_to_goal'].min():.1f} meters")
print(f"Furthest shot: {shots['distance_to_goal'].max():.1f} meters")

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
colors = shots["shot_outcome"].apply(lambda x: "#FFD700" if x == "Goal" else "#888888")
ax.scatter(shots["distance_to_goal"], shots["shot_statsbomb_xg"],
           c=colors, s=80, alpha=0.7, edgecolors="black")

ax.set_xlabel("Distance to Goal (meters)", fontsize=12)
ax.set_ylabel("xG", fontsize=12)
ax.set_title("Shot Distance vs Expected Goals", fontsize=16, fontweight="bold")
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Solution 2.2: Coordinate conversion and distance
library(StatsBombR)
library(tidyverse)

events <- get.matchFree(data.frame(match_id = 3869685))

shots <- events %>%
  filter(type.name == "Shot") %>%
  select(player.name, location.x, location.y, shot.statsbomb_xg, shot.outcome.name) %>%
  mutate(
    # Convert to meters
    meters_x = location.x / 120 * 105,
    meters_y = location.y / 80 * 68,

    # Goal center (end line center)
    goal_x = 105,
    goal_y = 34,

    # Calculate distance
    distance_to_goal = sqrt((goal_x - meters_x)^2 + (goal_y - meters_y)^2)
  ) %>%
  arrange(distance_to_goal)

# Summary
cat("Average shot distance:", round(mean(shots$distance_to_goal), 1), "meters\n")
cat("Closest shot:", round(min(shots$distance_to_goal), 1), "meters\n")
cat("Furthest shot:", round(max(shots$distance_to_goal), 1), "meters\n")

# Plot distance vs xG
ggplot(shots, aes(x = distance_to_goal, y = shot.statsbomb_xg)) +
  geom_point(aes(color = shot.outcome.name == "Goal"), size = 3, alpha = 0.7) +
  geom_smooth(method = "loess", se = TRUE, color = "#1B5E20") +
  scale_color_manual(values = c("FALSE" = "#888888", "TRUE" = "#FFD700"),
                     labels = c("No Goal", "Goal")) +
  labs(
    title = "Shot Distance vs Expected Goals",
    x = "Distance to Goal (meters)",
    y = "xG",
    color = "Result"
  ) +
  theme_minimal()

Exercise 2.3: Join Multiple Matches

Task: Load events from 5 World Cup 2022 matches and create a tournament-level player summary with total goals, xG, and shots across all matches.

World Cup 2022: competition_id=43, season_id=106
Loop through match IDs to load events
Add match_id column before combining
Aggregate shots across all matches
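
The tag-then-combine pattern in these hints can be tried on made-up data first. In this minimal sketch the match IDs (1001, 1002), the players, and the `xg` column are all invented for illustration; they stand in for the frames that a loader would return for each match.

```python
import pandas as pd

# Made-up shot records standing in for two matches' event data
match_frames = {
    1001: pd.DataFrame({
        "player": ["A", "B"],
        "shot_outcome": ["Goal", "Saved"],
        "xg": [0.30, 0.10],
    }),
    1002: pd.DataFrame({
        "player": ["A", "A"],
        "shot_outcome": ["Off T", "Goal"],
        "xg": [0.05, 0.75],
    }),
}

# Tag each frame with its match ID, then stack them into one table
tagged = [frame.assign(match_id=mid) for mid, frame in match_frames.items()]
all_shots = pd.concat(tagged, ignore_index=True)

# One row per player across all matches
totals = all_shots.groupby("player").agg(
    matches=("match_id", "nunique"),
    shots=("shot_outcome", "count"),
    goals=("shot_outcome", lambda s: (s == "Goal").sum()),
    xg=("xg", "sum"),
)
print(totals)
```

Tagging before concatenating is what lets `nunique` count how many matches each player appeared in; without the `match_id` column, the combined rows could no longer be traced back to their match.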

script

# Solution 2.3: Multi-match player summary
from statsbombpy import sb
import pandas as pd

# Get World Cup 2022 matches
matches = sb.matches(competition_id=43, season_id=106)

# Load first 5 matches
all_events = []
for match_id in matches["match_id"].head(5):
    events = sb.events(match_id=match_id)
    events["match_id"] = match_id
    all_events.append(events)

all_events = pd.concat(all_events, ignore_index=True)

# Filter shots and aggregate
shots = all_events[all_events["type"] == "Shot"]

player_summary = shots.groupby("player").agg(
    matches=("match_id", "nunique"),
    shots=("type", "count"),
    goals=("shot_outcome", lambda x: (x == "Goal").sum()),
    xG=("shot_statsbomb_xg", "sum")
).reset_index()

player_summary["xG_diff"] = player_summary["goals"] - player_summary["xG"]
player_summary["shots_per_match"] = round(
    player_summary["shots"] / player_summary["matches"], 2
)

# Filter and sort
player_summary = player_summary[player_summary["shots"] >= 3]
player_summary = player_summary.sort_values("xG", ascending=False)

print(player_summary.head(15))

# Solution 2.3: Multi-match player summary
library(StatsBombR)
library(tidyverse)

# Get matches
comp <- FreeCompetitions() %>%
  filter(competition_id == 43, season_id == 106)
matches <- FreeMatches(comp)

# Load first 5 matches
match_ids <- head(matches$match_id, 5)
all_events <- map_dfr(match_ids, function(mid) {
  events <- get.matchFree(data.frame(match_id = mid))
  events$match_id <- mid
  return(events)
})

# Player tournament summary
player_summary <- all_events %>%
  filter(type.name == "Shot") %>%
  group_by(player.name) %>%
  summarise(
    matches = n_distinct(match_id),
    shots = n(),
    goals = sum(shot.outcome.name == "Goal"),
    xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    xG_diff = goals - xG,
    shots_per_match = round(shots / matches, 2)
  ) %>%
  filter(shots >= 3) %>%
  arrange(desc(xG))

print(player_summary, n = 15)
