Capstone - Complete Analytics System
Learning Objectives
- Understand what soccer analytics is and why it matters in modern football
- Learn about the different types of football data available
- Set up your R or Python development environment
- Load and explore your first football dataset
- Perform basic analysis on real match data
1.1 What is Soccer Analytics?
Soccer analytics is the systematic application of data analysis and statistical methods to understand, evaluate, and improve performance in association football.
At its core, soccer analytics answers questions that traditional observation alone cannot. While a scout might say a player "looks good," analytics can quantify exactly how good they are compared to their peers, and in what specific areas they excel or struggle.
The Three Pillars of Football Analytics
Performance Analysis
Measuring how well players and teams perform through metrics like expected goals (xG), pass completion rates, pressing intensity, and defensive actions.
Recruitment & Scouting
Identifying players who fit specific profiles, finding undervalued talent, and predicting future performance to make better transfer decisions.
Tactical Analysis
Understanding team playing styles, opponent weaknesses, set piece effectiveness, and in-game decision making through data-driven insights.
Why Analytics Matters in Modern Football
The adoption of analytics has transformed how football clubs operate. Here are some key reasons why data-driven decision making has become essential:
| Traditional Approach | Analytics Approach | Benefit |
|---|---|---|
| "He scores lots of goals" | "His xG outperformance is +3.2 this season" | Distinguishes skill from luck |
| "Good passer" | "Top 5% for progressive passes per 90" | Quantifiable comparison |
| "Works hard defensively" | "8.3 pressures per 90, 32% success rate" | Measures actual contribution |
| "£50m seems reasonable" | "Market value model suggests £35m" | Data-informed negotiations |
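The two calculations behind the table's "per 90" and "xG outperformance" figures can be sketched in a few lines. This is a minimal illustration, not a standard library API: the function names `per_90` and `xg_outperformance` and the season totals are hypothetical, chosen only so the arithmetic reproduces the kind of numbers shown in the table.

```python
def per_90(total: float, minutes_played: float) -> float:
    """Scale a raw total to a rate per 90 minutes played."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return total / minutes_played * 90


def xg_outperformance(goals: int, xg: float) -> float:
    """Goals scored minus expected goals; positive means above expectation."""
    return goals - xg


# Hypothetical season totals, invented for illustration only
pressures, minutes = 249, 2700
goals, xg = 15, 11.8

pressures_p90 = per_90(pressures, minutes)
finishing_delta = xg_outperformance(goals, xg)

print(f"Pressures per 90: {pressures_p90:.1f}")
print(f"Goals minus xG: {finishing_delta:+.1f}")
```

Normalizing by minutes rather than by matches keeps substitutes and regular starters comparable, which is why per-90 rates are the usual basis for peer comparison.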
Key Insight
Analytics doesn't replace traditional scouting and coaching expertise—it enhances it. The best football organizations combine data insights with human judgment to make better decisions.
1.2 The Analytics Revolution in Football
Football's analytics revolution began later than in sports such as baseball (whose data-driven turn was popularized by "Moneyball"), but it has accelerated rapidly in the past decade. Understanding this history helps put today's methods in context.
Timeline of Key Developments
1990s - Early Pioneers
Charles Reep's long-ball theories, built on match notation he began collecting by hand in the 1950s and later debunked, were among the earliest attempts at football analytics. In the 1990s, Opta began collecting basic match statistics. Most analysis was simple: shots, passes, possession.
2000s - Data Collection Expands
ProZone introduced video-based tracking. Clubs like Bolton Wanderers under Sam Allardyce began using data for set-piece analysis. Event data became more detailed but remained proprietary.
2012 - Expected Goals Emerges
Sam Green at Opta and others developed expected goals (xG) models. The metric changed how shots and chances are evaluated, and analysts on Twitter began sharing their findings publicly.
2017 - StatsBomb Open Data
StatsBomb releases free, detailed event data for select competitions. This democratizes football analytics, enabling students and hobbyists to learn with professional-grade data.
2018-Present - Mainstream Adoption
xG appears in TV broadcasts. Liverpool and Manchester City build world-class analytics departments. Brentford reaches the Premier League largely through data-driven recruitment. Tracking data becomes more accessible.
Case Study: Leicester City's 2015-16 Title
Leicester City's Premier League triumph wasn't just a fairytale—it was partly enabled by smart data use. Under Claudio Ranieri, Leicester identified that:
- Counter-attacking efficiency could compete with possession-based football
- Jamie Vardy's running statistics made him ideal for their direct style
- N'Golo Kanté's ball recovery numbers were elite before he was widely recognized
- Defensive compactness could be maintained without dominating possession
# Analyzing Leicester's 2015-16 season efficiency
import pandas as pd
# Leicester's key stats from that season
leicester_stats = {
'matches': 38,
'goals_scored': 68,
'goals_conceded': 36,
'xG_for': 55.4, # Expected goals created
'xG_against': 42.1, # Expected goals conceded
'possession_avg': 42.3, # Below league average!
'points': 81
}
# Calculate overperformance
xg_difference = leicester_stats['goals_scored'] - leicester_stats['xG_for']
xga_difference = leicester_stats['xG_against'] - leicester_stats['goals_conceded']
print(f"Goals vs xG: +{xg_difference:.1f} (clinical finishing)")
print(f"Conceded vs xGA: -{xga_difference:.1f} (excellent defending)")
print(f"Net xG overperformance: +{xg_difference + xga_difference:.1f}")
# This shows Leicester massively outperformed their underlying numbers
# They scored 12.6 more goals than expected and conceded 6.1 fewer
# Analyzing Leicester's 2015-16 season efficiency
library(tidyverse)
# Leicester's key stats from that season
leicester_stats <- tibble(
matches = 38,
goals_scored = 68,
goals_conceded = 36,
xG_for = 55.4, # Expected goals created
xG_against = 42.1, # Expected goals conceded
possession_avg = 42.3, # Below league average!
points = 81
)
# Calculate overperformance
leicester_stats <- leicester_stats %>%
mutate(
xg_overperformance = goals_scored - xG_for,
xga_overperformance = xG_against - goals_conceded,
net_overperformance = xg_overperformance + xga_overperformance
)
cat(sprintf("Goals vs xG: +%.1f (clinical finishing)\n",
leicester_stats$xg_overperformance))
cat(sprintf("Conceded vs xGA: -%.1f (excellent defending)\n",
leicester_stats$xga_overperformance))
cat(sprintf("Net xG overperformance: +%.1f\n",
leicester_stats$net_overperformance))
Goals vs xG: +12.6 (clinical finishing)
Conceded vs xGA: -6.1 (excellent defending)
Net xG overperformance: +18.7
Combining both ends of the pitch, Leicester outperformed their expected goal difference by 18.7 goals across the season: they scored 12.6 more than their xG and conceded 6.1 fewer than their xGA. While some of this is variance (luck), much came from Vardy and Mahrez's clinical finishing and Schmeichel's outstanding goalkeeping.
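One rough way to gauge how much of a gap like this could be variance is to ask how often a team with the same underlying numbers would reach the same totals by chance. The sketch below assumes season goal totals follow a Poisson distribution with mean equal to xG, which is a simplification rather than an established model of Leicester's season; the helper names are hypothetical.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_tail(k_min: int, lam: float) -> float:
    """P(X >= k_min) for X ~ Poisson(lam)."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(k_min))


# Figures taken from the leicester_stats example in this section
p_score_68_plus = poisson_tail(68, 55.4)           # 68 or more goals from 55.4 xG
p_concede_36_or_fewer = 1 - poisson_tail(37, 42.1)  # 36 or fewer from 42.1 xGA

print(f"P(68+ goals scored | xG = 55.4): {p_score_68_plus:.3f}")
print(f"P(36 or fewer conceded | xGA = 42.1): {p_concede_36_or_fewer:.3f}")
```

Small probabilities here would suggest the season was unusual under this assumption, but they cannot by themselves separate luck from finishing or goalkeeping skill.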
1.3 Questions Analytics Can Answer
Before diving into technical implementation, let's understand the types of questions football analytics can help answer. This will guide what skills and metrics you'll learn.
Player Evaluation
- How efficient is this striker's finishing?
- Which midfielder progresses the ball most effectively?
- Is this defender actually good, or protected by the system?
- How does player X compare to his positional peers?
- Is this goalkeeper's save rate sustainable?
Team Analysis
- What's this team's playing style?
- Where do they create chances from?
- How effectively do they press?
- What's their set piece effectiveness?
- Are their results sustainable?
Recruitment
- Who are similar players to our target?
- Is this player worth the asking price?
- How might they perform in our league?
- What's their development trajectory?
- Which young players are breakout candidates?
Tactical Insights
- Where is the opponent vulnerable?
- What formations work best against them?
- How should we build up against their press?
- Which substitutions would be most impactful?
- What set piece routines should we use?
1.4 Types of Football Data
Understanding the different types of football data is crucial before you start analyzing. Each type has different levels of detail, availability, and use cases.
1. Event Data
Event data records every on-ball action in a match: passes, shots, tackles, dribbles, fouls, and more. Each event includes:
- Location - x, y coordinates on the pitch
- Timestamp - when the event occurred
- Player - who performed the action
- Outcome - success/failure and additional details
- Qualifiers - additional context (body part, technique, etc.)
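To make the fields above concrete, here is a minimal sketch of a single event as a Python record. The `Event` class and its field names are hypothetical and do not match any provider's schema exactly; the goal position assumes StatsBomb-style coordinates on a 120 x 80 pitch, with the attacking goal centred at (120, 40).

```python
import math
from dataclasses import dataclass, field
from typing import Optional

# Assumed StatsBomb-style pitch: 120 long, 80 wide, goal centre at (120, 40)
GOAL_X, GOAL_Y = 120.0, 40.0


@dataclass
class Event:
    event_type: str                 # e.g. "Pass", "Shot", "Duel"
    player: str                     # who performed the action
    minute: int                     # timestamp, minute part
    second: int                     # timestamp, second part
    x: float                        # location along the pitch length
    y: float                        # location across the pitch width
    outcome: Optional[str] = None   # success/failure detail, if any
    qualifiers: dict = field(default_factory=dict)  # body part, technique, ...


def distance_to_goal(event: Event) -> float:
    """Straight-line distance from the event location to the goal centre."""
    return math.hypot(GOAL_X - event.x, GOAL_Y - event.y)


# Invented example record, not taken from a real match
shot = Event("Shot", "Example Player", 23, 14, x=108.0, y=40.0,
             outcome="Saved", qualifiers={"body_part": "Right Foot"})
print(f"{shot.player} shot from {distance_to_goal(shot):.1f} units out")
```

Derived features such as shot distance are computed from the raw location in exactly this way, which is why coordinates are the most important part of an event record.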
# Exploring event data structure with StatsBomb
from statsbombpy import sb
import pandas as pd
# Load a match - 2022 World Cup Final
events = sb.events(match_id=3869685)
# See what columns are available
print("Event data columns:")
print(events.columns.tolist()[:20]) # First 20 columns
# Count events by type
print("\nEvent types in the match:")
print(events['type'].value_counts().head(15))
# Example: Look at a single pass event
pass_event = events[events['type'] == 'Pass'].iloc[0]
print("\nSample pass event:")
print(f" Player: {pass_event['player']}")
print(f" Team: {pass_event['team']}")
print(f" Location: {pass_event['location']}")
print(f" Pass end location: {pass_event['pass_end_location']}")
print(f" Pass recipient: {pass_event['pass_recipient']}")
print(f" Minute: {pass_event['minute']}")
# Exploring event data structure with StatsBomb
library(StatsBombR)
library(tidyverse)
# Load a match - 2022 World Cup Final
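events <- get.matchFree(data.frame(match_id = 3869685))
# Split the list-type coordinate columns into location.x / location.y, etc.
events <- cleanlocations(events)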
# See what columns are available
cat("Event data columns:\n")
print(names(events)[1:20]) # First 20 columns
# Count events by type
cat("\nEvent types in the match:\n")
events %>%
count(type.name, sort = TRUE) %>%
head(15) %>%
print()
# Example: Look at a single pass event
pass_event <- events %>%
filter(type.name == "Pass") %>%
slice(1)
cat("\nSample pass event:\n")
cat(sprintf(" Player: %s\n", pass_event$player.name))
cat(sprintf(" Team: %s\n", pass_event$team.name))
cat(sprintf(" Location: %.1f, %.1f\n",
pass_event$location.x, pass_event$location.y))
cat(sprintf(" Minute: %d\n", pass_event$minute))
Event types in the match:
Pass 1247
Ball Receipt* 892
Carry 891
Pressure 298
Ball Recovery 124
Duel 108
Clearance 89
Block 58
Foul Committed 43
Shot 41
Interception 38
...
2. Tracking Data
Tracking data captures the position of all 22 players and the ball, typically 25 times per second. This creates incredibly rich datasets but requires specialized analysis techniques.
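A quick back-of-the-envelope sketch shows why tracking data needs different tooling from event data. It assumes exactly 90 minutes of play with no stoppage time and positions measured in metres; the function names and the three-frame sample track are hypothetical.

```python
import math

FRAME_RATE_HZ = 25      # samples per second, as described above
TRACKED_OBJECTS = 23    # 22 players plus the ball


def frames_in_match(minutes: int = 90) -> int:
    """Number of frames captured, ignoring stoppage time."""
    return FRAME_RATE_HZ * 60 * minutes


def speeds_m_per_s(positions):
    """Speed between consecutive frames for one object.

    positions: list of (x, y) tuples in metres, one per frame.
    """
    dt = 1 / FRAME_RATE_HZ
    return [math.dist(a, b) / dt for a, b in zip(positions, positions[1:])]


rows = frames_in_match() * TRACKED_OBJECTS
print(f"Position rows in one match: {rows:,}")

# Invented three-frame track: the player moves 0.3 m per frame
track = [(10.0, 20.0), (10.3, 20.0), (10.6, 20.0)]
speeds = speeds_m_per_s(track)
print([round(s, 2) for s in speeds])
```

A single match therefore produces millions of position rows, compared with a few thousand events, and metrics such as speed have to be derived from frame-to-frame differences.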
Tracking Data Availability
Tracking data is mostly proprietary and expensive. Providers like Second Spectrum and SkillCorner serve professional clubs. However, some public datasets exist for learning (Metrica Sports, Last Row datasets). We'll cover these in Chapter 21.
3. Aggregate Statistics
The most accessible form of football data. Sites like FBref provide season-level and match-level statistics including:
- Goals, assists, minutes played
- Shots, shots on target
- Pass completion percentages
- Tackles, interceptions, clearances
- Expected goals and expected assists
# Accessing aggregate stats from FBref
import soccerdata as sd
# Initialize FBref scraper
fbref = sd.FBref(leagues="ENG-Premier League", seasons="2023-2024")
# Get player season stats
player_stats = fbref.read_player_season_stats(stat_type="standard")
# Look at top scorers
top_scorers = player_stats.nlargest(10, ('Performance', 'Gls'))
print("Top 10 Premier League Scorers 2023-24:")
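# Columns are a two-level MultiIndex, so select them as (group, stat) tuples
print(top_scorers[[('Performance', 'Gls'), ('Expected', 'xG'), ('Expected', 'npxG')]].head(10))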
# Calculate goals vs xG for top scorers
top_scorers['xG_diff'] = top_scorers[('Performance', 'Gls')] - top_scorers[('Expected', 'xG')]
print("\nGoals minus xG (overperformance):")
print(top_scorers[['xG_diff']].head(10))
# Accessing aggregate stats from FBref
library(worldfootballR)
library(tidyverse)
# Get Premier League player stats
player_stats <- fb_big5_advanced_season_stats(
season_end_year = 2024,
stat_type = "standard",
team_or_player = "player"
) %>%
filter(Comp == "Premier League")
# Look at top scorers
top_scorers <- player_stats %>%
arrange(desc(Gls)) %>%
head(10) %>%
select(Player, Squad, Gls, xG, npxG)
print("Top 10 Premier League Scorers 2023-24:")
print(top_scorers)
# Calculate goals vs xG for top scorers
top_scorers <- top_scorers %>%
mutate(xG_diff = Gls - xG)
print("\nGoals minus xG (overperformance):")
print(select(top_scorers, Player, Gls, xG, xG_diff))
Data Comparison Table
| Data Type | Granularity | Accessibility | Best For |
|---|---|---|---|
| Event Data | Individual actions | Free (StatsBomb) to expensive (Opta) | Detailed match analysis, xG models |
| Tracking Data | 25 frames/second | Expensive, limited public access | Off-ball analysis, space control |
| Aggregate Stats | Match/season totals | Widely free (FBref, etc.) | Player comparison, trend analysis |
1.5 Setting Up Your Development Environment
Before we can analyze football data, we need to set up our tools. This textbook supports both Python and R—choose whichever you're more comfortable with, or learn both!
Python Setup (Recommended: Anaconda)
- Install Anaconda
Download from anaconda.com/download. Anaconda includes Python and many data science packages pre-installed.
- Create a virtual environment
# Open Anaconda Prompt or terminal
conda create -n soccer-analytics python=3.10
conda activate soccer-analytics
- Install essential packages
# Core data science
pip install pandas numpy matplotlib seaborn
# Soccer-specific
pip install mplsoccer statsbombpy soccerdata
# Machine learning (for later chapters)
pip install scikit-learn xgboost
# Jupyter for interactive analysis
pip install jupyter jupyterlab
- Verify installation
# Test that everything works
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mplsoccer import Pitch
from statsbombpy import sb
print("All packages installed successfully!")
print(f"Pandas version: {pd.__version__}")
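If one of the imports in the verification step fails, it helps to see every missing package at once instead of stopping at the first error. The following is a small optional sketch using only the standard library; the `REQUIRED` mapping and `check_packages` helper are hypothetical names, and the mapping is needed because a package's pip name and import name can differ (scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# pip name -> import name for the packages installed above
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "mplsoccer": "mplsoccer",
    "statsbombpy": "statsbombpy",
    "soccerdata": "soccerdata",
    "scikit-learn": "sklearn",
    "xgboost": "xgboost",
}


def check_packages(required):
    """Return {pip_name: True/False} without actually importing anything."""
    return {
        pip_name: find_spec(import_name) is not None
        for pip_name, import_name in required.items()
    }


status = check_packages(REQUIRED)
for pip_name, installed in status.items():
    print(f"{pip_name:<14} {'OK' if installed else 'MISSING'}")
```

Any package reported as MISSING can be installed with `pip install` inside the activated environment.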
R Setup (Recommended: RStudio)
- Install R
Download from cran.r-project.org
- Install RStudio
Download from posit.co/download/rstudio-desktop
- Install essential packages
# Core tidyverse packages
install.packages("tidyverse")
install.packages("lubridate")
# Soccer-specific
install.packages("worldfootballR")
install.packages("ggsoccer")
# StatsBomb package (from GitHub)
install.packages("devtools")
devtools::install_github("statsbomb/StatsBombR")
# Machine learning (for later chapters)
install.packages("tidymodels")
install.packages("xgboost")
- Verify installation
# Test that everything works
library(tidyverse)
library(StatsBombR)
library(ggsoccer)
print("All packages installed successfully!")
print(paste("R version:", R.version.string))
Recommended Development Setup
- Python: VS Code with Python extension + Jupyter notebooks
- R: RStudio with R Markdown for reproducible analysis
- Both: Git for version control of your analysis projects
1.6 Your First Football Analysis
Now let's put everything together and perform a real analysis. We'll analyze the 2022 FIFA World Cup Final between Argentina and France—one of the greatest matches ever played.
Step 1: Load the Match Data
from statsbombpy import sb
import pandas as pd
import matplotlib.pyplot as plt
# Load World Cup 2022 matches
competitions = sb.competitions()
world_cup = competitions[
(competitions['competition_name'] == 'FIFA World Cup') &
(competitions['season_name'] == '2022')
]
# Get all matches
matches = sb.matches(competition_id=43, season_id=106)
print(f"Total World Cup 2022 matches: {len(matches)}")
# Find the final
final = matches[matches['match_id'] == 3869685].iloc[0]
print(f"\nFinal: {final['home_team']} vs {final['away_team']}")
print(f"Score: {final['home_score']} - {final['away_score']}")
# Load all events from the final
events = sb.events(match_id=3869685)
print(f"\nTotal events in match: {len(events)}")
library(StatsBombR)
library(tidyverse)
# Load World Cup 2022 matches
competitions <- FreeCompetitions()
world_cup <- competitions %>%
filter(competition_name == "FIFA World Cup", season_name == "2022")
# Get all matches
matches <- FreeMatches(world_cup)
cat(sprintf("Total World Cup 2022 matches: %d\n", nrow(matches)))
# Find the final
final <- matches %>% filter(match_id == 3869685)
cat(sprintf("\nFinal: %s vs %s\n", final$home_team.home_team_name,
final$away_team.away_team_name))
cat(sprintf("Score: %d - %d\n", final$home_score, final$away_score))
# Load all events from the final
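events <- get.matchFree(final)
# Split the list-type coordinate columns into location.x / location.y, etc.
events <- cleanlocations(events)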
cat(sprintf("\nTotal events in match: %d\n", nrow(events)))
Total World Cup 2022 matches: 64
Final: Argentina vs France
Score: 3 - 3 (Argentina wins on penalties)
Total events in match: 3847
Step 2: Analyze Shots and Expected Goals
# Filter for shots only
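# Exclude the penalty shootout (StatsBomb period 5) so its kicks do not inflate goals and xG
shots = events[(events['type'] == 'Shot') & (events['period'] <= 4)].copy()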
print(f"Total shots in the match: {len(shots)}")
# Calculate shot statistics by team
shot_stats = shots.groupby('team').agg(
total_shots=('type', 'count'),
goals=('shot_outcome', lambda x: (x == 'Goal').sum()),
total_xG=('shot_statsbomb_xg', 'sum'),
shots_on_target=('shot_outcome', lambda x: x.isin(['Goal', 'Saved']).sum()),
avg_xG_per_shot=('shot_statsbomb_xg', 'mean')
).round(2)
print("\n=== Shot Analysis: World Cup 2022 Final ===")
print(shot_stats)
# Calculate xG difference (goals - xG)
for team in shot_stats.index:
    goals = shot_stats.loc[team, 'goals']
    xG = shot_stats.loc[team, 'total_xG']
    diff = goals - xG
    print(f"\n{team}: {goals} goals from {xG:.2f} xG ({'+' if diff > 0 else ''}{diff:.2f})")
# Filter for shots only
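# Exclude the penalty shootout (StatsBomb period 5) so its kicks do not inflate goals and xG
shots <- events %>% filter(type.name == "Shot", period <= 4)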
cat(sprintf("Total shots in the match: %d\n", nrow(shots)))
# Calculate shot statistics by team
shot_stats <- shots %>%
group_by(team.name) %>%
summarise(
total_shots = n(),
goals = sum(shot.outcome.name == "Goal", na.rm = TRUE),
total_xG = sum(shot.statsbomb_xg, na.rm = TRUE),
shots_on_target = sum(shot.outcome.name %in% c("Goal", "Saved"), na.rm = TRUE),
avg_xG_per_shot = mean(shot.statsbomb_xg, na.rm = TRUE)
) %>%
mutate(across(where(is.numeric), ~round(., 2)))
cat("\n=== Shot Analysis: World Cup 2022 Final ===\n")
print(shot_stats)
# Calculate xG difference
shot_stats %>%
mutate(xG_diff = goals - total_xG) %>%
select(team.name, goals, total_xG, xG_diff) %>%
print()
=== Shot Analysis: World Cup 2022 Final ===
total_shots goals total_xG shots_on_target avg_xG_per_shot
team
Argentina 21 3 2.77 10 0.13
France 20 3 2.44 9 0.12
Argentina: 3 goals from 2.77 xG (+0.23)
France: 3 goals from 2.44 xG (+0.56)
Step 3: Visualize the Shots on a Pitch
from mplsoccer import VerticalPitch
import matplotlib.pyplot as plt
# Extract shot coordinates
shots['x'] = shots['location'].apply(lambda loc: loc[0])
shots['y'] = shots['location'].apply(lambda loc: loc[1])
shots['is_goal'] = shots['shot_outcome'] == 'Goal'
# Create figure with two pitches (one per team)
fig, axes = plt.subplots(1, 2, figsize=(16, 10))
teams = ['Argentina', 'France']
colors = {'Argentina': '#75AADB', 'France': '#002654'}
for idx, team in enumerate(teams):
    pitch = VerticalPitch(
        pitch_type='statsbomb',
        half=True,
        pitch_color='#22312b',
        line_color='white'
    )
    pitch.draw(ax=axes[idx])
    team_shots = shots[shots['team'] == team]
    # Plot non-goals
    non_goals = team_shots[~team_shots['is_goal']]
    pitch.scatter(
        non_goals['x'], non_goals['y'],
        s=non_goals['shot_statsbomb_xg'] * 500 + 50,
        c=colors[team], alpha=0.5,
        edgecolors='white', linewidth=1,
        ax=axes[idx], label='Shot'
    )
    # Plot goals
    goals = team_shots[team_shots['is_goal']]
    pitch.scatter(
        goals['x'], goals['y'],
        s=goals['shot_statsbomb_xg'] * 500 + 50,
        c='#FFD700', alpha=1,
        edgecolors='white', linewidth=2,
        marker='*', ax=axes[idx], label='Goal'
    )
    # Add title with stats
    team_xg = team_shots['shot_statsbomb_xg'].sum()
    team_goals = team_shots['is_goal'].sum()
    axes[idx].set_title(
        f"{team}\n{team_goals} Goals | {team_xg:.2f} xG",
        fontsize=14, fontweight='bold', color='white'
    )
    axes[idx].legend(loc='lower right')
plt.suptitle('World Cup 2022 Final - Shot Map', fontsize=16, fontweight='bold', y=1.02)
fig.patch.set_facecolor('#22312b')
plt.tight_layout()
plt.savefig('world_cup_final_shots.png', dpi=150, bbox_inches='tight',
facecolor='#22312b', edgecolor='none')
plt.show()
library(ggsoccer)
library(ggplot2)
# Prepare shot data
shots_plot <- shots %>%
mutate(
is_goal = shot.outcome.name == "Goal",
xG = shot.statsbomb_xg
)
# Create shot map
ggplot(shots_plot) +
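# Use StatsBomb's 120 x 80 dimensions; ggsoccer's default pitch is 100 x 100
annotate_pitch(dimensions = pitch_statsbomb, colour = "white", fill = "#22312b") +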
geom_point(
aes(x = location.x, y = location.y,
size = xG,
color = is_goal,
shape = is_goal),
alpha = 0.7
) +
scale_color_manual(
values = c("FALSE" = "#75AADB", "TRUE" = "#FFD700"),
labels = c("Shot", "Goal")
) +
scale_shape_manual(values = c("FALSE" = 16, "TRUE" = 18)) +
scale_size_continuous(range = c(2, 10)) +
coord_flip(xlim = c(60, 120)) +
facet_wrap(~team.name) +
theme_pitch() +
theme(
plot.background = element_rect(fill = "#22312b"),
strip.text = element_text(color = "white", size = 12, face = "bold"),
legend.position = "bottom",
legend.text = element_text(color = "white"),
legend.title = element_text(color = "white")
) +
labs(
title = "World Cup 2022 Final - Shot Map",
subtitle = "Argentina 3-3 France (Argentina wins on penalties)",
size = "xG",
color = "Outcome"
)
ggsave("world_cup_final_shots.png", width = 12, height = 8, dpi = 150)
What This Analysis Tells Us
The shot map reveals several insights about the World Cup Final:
- Argentina's volume: More shots, more central locations, consistent threat
- France's efficiency: Mbappé's hat-trick came from fewer, but high-quality chances
- The xG story: 2.77 vs 2.44 xG suggests Argentina created slightly better chances overall
- Both teams finished well: Each scored more than their xG suggested (clinical finishing)
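Total xG hides how chances are distributed: a few big chances and many small ones can add up to the same number yet lead to different likely scorelines. A common way to explore this is to simulate the match many times, treating each shot as an independent coin flip weighted by its xG. The sketch below uses only the standard library; the function names and the per-shot xG lists are hypothetical and are not the real shots from the final.

```python
import random


def simulate_goals(shot_xgs, rng):
    """One simulated match: each shot scores with probability equal to its xG."""
    return sum(rng.random() < xg for xg in shot_xgs)


def outcome_probabilities(team_a_xgs, team_b_xgs, n_sims=10_000, seed=42):
    """Estimate win/draw/loss shares by repeated simulation."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    wins_a = draws = wins_b = 0
    for _ in range(n_sims):
        goals_a = simulate_goals(team_a_xgs, rng)
        goals_b = simulate_goals(team_b_xgs, rng)
        if goals_a > goals_b:
            wins_a += 1
        elif goals_a < goals_b:
            wins_b += 1
        else:
            draws += 1
    return {
        "team_a_win": wins_a / n_sims,
        "draw": draws / n_sims,
        "team_b_win": wins_b / n_sims,
    }


# Hypothetical per-shot xG values, invented for illustration
team_a = [0.76, 0.35, 0.12, 0.08, 0.05, 0.04]
team_b = [0.76, 0.20, 0.10, 0.06]

probs = outcome_probabilities(team_a, team_b)
for outcome, share in probs.items():
    print(f"{outcome}: {share:.1%}")
```

Treating shots as independent is itself an assumption (rebounds and game state violate it), so the resulting shares are best read as a rough guide rather than a forecast.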
1.7 Creating Advanced Visualizations
Now let's create more sophisticated visualizations that you'll use throughout your analytics career. We'll build an xG timeline, passing statistics chart, and player comparison radar.
xG Timeline Chart
An xG timeline shows how expected goals accumulate throughout a match. This reveals momentum shifts, key moments, and which team controlled the game at different phases.
from statsbombpy import sb
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Load World Cup Final data
events = sb.events(match_id=3869685)
# Exclude the penalty shootout (StatsBomb period 5) from the timeline
shots = events[(events["type"] == "Shot") & (events["period"] <= 4)].copy()
# Create xG timeline data
shots = shots.sort_values(["minute", "second"])
shots["is_goal"] = shots["shot_outcome"] == "Goal"
# Calculate cumulative xG for each team
argentina_shots = shots[shots["team"] == "Argentina"].copy()
france_shots = shots[shots["team"] == "France"].copy()
argentina_shots["cumulative_xG"] = argentina_shots["shot_statsbomb_xg"].cumsum()
france_shots["cumulative_xG"] = france_shots["shot_statsbomb_xg"].cumsum()
# Create the plot
fig, ax = plt.subplots(figsize=(14, 7))
# Starting points
ax.plot([0], [0], "o", color="#75AADB", markersize=0)
ax.plot([0], [0], "o", color="#002654", markersize=0)
# Argentina xG line (step plot)
arg_minutes = [0] + argentina_shots["minute"].tolist()
arg_xg = [0] + argentina_shots["cumulative_xG"].tolist()
ax.step(arg_minutes, arg_xg, where="post", linewidth=2.5,
color="#75AADB", label="Argentina", alpha=0.9)
# France xG line
fra_minutes = [0] + france_shots["minute"].tolist()
fra_xg = [0] + france_shots["cumulative_xG"].tolist()
ax.step(fra_minutes, fra_xg, where="post", linewidth=2.5,
color="#002654", label="France", alpha=0.9)
# Mark goals with stars
arg_goals = argentina_shots[argentina_shots["is_goal"]]
fra_goals = france_shots[france_shots["is_goal"]]
ax.scatter(arg_goals["minute"], arg_goals["cumulative_xG"],
marker="*", s=300, color="#75AADB", edgecolors="gold",
linewidth=2, zorder=5)
ax.scatter(fra_goals["minute"], fra_goals["cumulative_xG"],
marker="*", s=300, color="#002654", edgecolors="gold",
linewidth=2, zorder=5)
# Add period markers
ax.axvline(x=45, color="gray", linestyle="--", alpha=0.5)
ax.axvline(x=90, color="gray", linestyle="--", alpha=0.5)
ax.axvline(x=105, color="gray", linestyle=":", alpha=0.5)
ax.text(45, ax.get_ylim()[1], "HT", ha="center", fontsize=9)
ax.text(90, ax.get_ylim()[1], "FT", ha="center", fontsize=9)
ax.text(105, ax.get_ylim()[1], "ET", ha="center", fontsize=9)
# Styling
ax.set_xlabel("Minute", fontsize=12)
ax.set_ylabel("Cumulative xG", fontsize=12)
ax.set_title("xG Timeline: World Cup 2022 Final\nArgentina 3-3 France",
fontsize=14, fontweight="bold")
ax.legend(loc="upper left", fontsize=11)
ax.set_xlim(0, 125)
ax.set_ylim(0, max(arg_xg[-1], fra_xg[-1]) + 0.5)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("xg_timeline.png", dpi=150, bbox_inches="tight")
plt.show()
library(tidyverse)
library(StatsBombR)
# Load World Cup Final data
events <- get.matchFree(data.frame(match_id = 3869685))
# Exclude the penalty shootout (StatsBomb period 5) from the timeline
shots <- events %>% filter(type.name == "Shot", period <= 4)
# Create xG timeline data
xg_timeline <- shots %>%
arrange(minute, second) %>%
group_by(team.name) %>%
mutate(
cumulative_xG = cumsum(shot.statsbomb_xg),
is_goal = shot.outcome.name == "Goal"
) %>%
ungroup()
# Add starting point (0,0) for each team
start_points <- tibble(
team.name = c("Argentina", "France"),
minute = c(0, 0),
cumulative_xG = c(0, 0),
is_goal = c(FALSE, FALSE)
)
xg_timeline <- bind_rows(start_points, xg_timeline)
# Create the xG timeline plot
ggplot(xg_timeline, aes(x = minute, y = cumulative_xG, color = team.name)) +
# xG accumulation lines
geom_step(linewidth = 1.5, alpha = 0.8) +
# Goal markers
geom_point(
data = filter(xg_timeline, is_goal == TRUE),
aes(shape = team.name),
size = 5, stroke = 2
) +
# Styling
scale_color_manual(values = c("Argentina" = "#75AADB", "France" = "#002654")) +
scale_x_continuous(breaks = seq(0, 120, 15), limits = c(0, 125)) +
# Add halftime and fulltime lines
geom_vline(xintercept = c(45, 90), linetype = "dashed", alpha = 0.5) +
annotate("text", x = 45, y = max(xg_timeline$cumulative_xG) + 0.2,
label = "HT", size = 3) +
annotate("text", x = 90, y = max(xg_timeline$cumulative_xG) + 0.2,
label = "FT", size = 3) +
labs(
title = "xG Timeline: World Cup 2022 Final",
subtitle = "Argentina 3-3 France (Argentina wins on penalties)",
x = "Minute",
y = "Cumulative xG",
color = "Team",
shape = "Team"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom",
panel.grid.minor = element_blank()
)
ggsave("xg_timeline.png", width = 12, height = 6, dpi = 150)
Passing Statistics Bar Chart
Comparing team passing statistics helps understand playing styles. Let's create a professional bar chart comparing key passing metrics.
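Two of the metrics in this chart rest on simple conventions that are worth seeing in isolation: a pass counts as complete when it has no recorded outcome, and a pass reaches the final third when its end location lies beyond x = 80 on a 120-long pitch. The sketch below applies both rules to a handful of invented passes; the dictionary keys and the `pitch_third` helper are hypothetical, not provider field names.

```python
def pitch_third(x: float) -> str:
    """Classify a location along a 120-long pitch into thirds."""
    if x > 80:
        return "final"
    if x > 40:
        return "middle"
    return "defensive"


# Invented passes: outcome None means the pass was completed
passes = [
    {"end_x": 35.0, "outcome": None},
    {"end_x": 62.5, "outcome": None},
    {"end_x": 91.0, "outcome": "Incomplete"},
    {"end_x": 104.0, "outcome": None},
]

completed = sum(p["outcome"] is None for p in passes)
completion_pct = 100 * completed / len(passes)
final_third = sum(pitch_third(p["end_x"]) == "final" for p in passes)

print(f"Completion: {completion_pct:.1f}%")
print(f"Passes ending in the final third: {final_third}")
```

Note that this counts every pass ending in the final third, including incomplete ones; counting only completed entries is an equally reasonable definition and should be stated whenever the metric is reported.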
from statsbombpy import sb
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Load match data
events = sb.events(match_id=3869685)
passes = events[events["type"] == "Pass"].copy()
# Calculate passing statistics by team
def calc_pass_stats(team_passes):
    return {
        "Total Passes": len(team_passes),
        "Completion %": (team_passes["pass_outcome"].isna().sum() / len(team_passes)) * 100,
        "Progressive": team_passes["pass_progressive"].sum() if "pass_progressive" in team_passes else 0,
        "Final Third": (team_passes["pass_end_location"].apply(
            lambda x: x[0] > 80 if isinstance(x, list) else False
        ).sum()),
        "Key Passes": (team_passes["pass_shot_assist"].fillna(False).sum() +
                       team_passes["pass_goal_assist"].fillna(False).sum())
    }
argentina_stats = calc_pass_stats(passes[passes["team"] == "Argentina"])
france_stats = calc_pass_stats(passes[passes["team"] == "France"])
# Prepare data for plotting
metrics = list(argentina_stats.keys())
arg_values = list(argentina_stats.values())
fra_values = list(france_stats.values())
# Create grouped bar chart
x = np.arange(len(metrics))
width = 0.35
fig, ax = plt.subplots(figsize=(14, 7))
bars1 = ax.bar(x - width/2, arg_values, width, label="Argentina",
color="#75AADB", edgecolor="white", linewidth=1.5)
bars2 = ax.bar(x + width/2, fra_values, width, label="France",
color="#002654", edgecolor="white", linewidth=1.5)
# Add value labels on bars
def add_labels(bars):
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f"{height:.1f}",
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha="center", va="bottom", fontsize=10)
add_labels(bars1)
add_labels(bars2)
# Styling
ax.set_xlabel("Metric", fontsize=12)
ax.set_ylabel("Value", fontsize=12)
ax.set_title("Passing Comparison: World Cup 2022 Final\n" +
"Argentina dominated possession but France remained dangerous",
fontsize=14, fontweight="bold")
ax.set_xticks(x)
ax.set_xticklabels(metrics, fontsize=11)
ax.legend(fontsize=11)
ax.grid(axis="y", alpha=0.3)
ax.set_axisbelow(True)
plt.tight_layout()
plt.savefig("passing_comparison.png", dpi=150, bbox_inches="tight")
plt.show()
library(tidyverse)
library(StatsBombR)
# Load match data
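events <- get.matchFree(data.frame(match_id = 3869685))
# Split the list-type coordinate columns so pass.end_location.x is available
events <- cleanlocations(events)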
# Calculate passing statistics by team
pass_stats <- events %>%
filter(type.name == "Pass") %>%
group_by(team.name) %>%
summarise(
total_passes = n(),
successful_passes = sum(is.na(pass.outcome.name) | pass.outcome.name == "Complete"),
pass_completion = successful_passes / total_passes * 100,
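# The data may not include a progressive-pass flag; fall back to 0, as the Python version does
progressive_passes = if ("pass.progressive" %in% names(events)) sum(pass.progressive == TRUE, na.rm = TRUE) else 0,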
passes_final_third = sum(pass.end_location.x > 80, na.rm = TRUE),
key_passes = sum(pass.shot_assist == TRUE | pass.goal_assist == TRUE, na.rm = TRUE),
crosses = sum(pass.cross == TRUE, na.rm = TRUE),
long_balls = sum(pass.length > 30, na.rm = TRUE)
) %>%
pivot_longer(
cols = c(total_passes, pass_completion, progressive_passes,
passes_final_third, key_passes),
names_to = "metric",
values_to = "value"
)
# Create comparison bar chart
ggplot(pass_stats, aes(x = metric, y = value, fill = team.name)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7) +
geom_text(
aes(label = round(value, 1)),
position = position_dodge(width = 0.8),
vjust = -0.5, size = 3.5
) +
scale_fill_manual(values = c("Argentina" = "#75AADB", "France" = "#002654")) +
scale_x_discrete(
labels = c(
"total_passes" = "Total\nPasses",
"pass_completion" = "Completion\n%",
"progressive_passes" = "Progressive\nPasses",
"passes_final_third" = "Final Third\nPasses",
"key_passes" = "Key\nPasses"
)
) +
labs(
title = "Passing Comparison: World Cup 2022 Final",
subtitle = "Argentina dominated possession but France remained dangerous",
x = "",
y = "Value",
fill = "Team"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
axis.text.x = element_text(size = 10),
legend.position = "top",
panel.grid.major.x = element_blank()
)
ggsave("passing_comparison.png", width = 12, height = 7, dpi = 150)
Player Performance Radar Chart
Radar charts (also called spider charts) are excellent for comparing players across multiple dimensions simultaneously. Let's compare Messi and Mbappé from the final.
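Before plotting, a radar chart needs two pieces of preparation: every metric must be rescaled to a common range, and each metric needs an angle around the circle. The sketch below shows one simple scheme, scaling each metric so the larger of the two players' values equals 100. The function names and the sample values are hypothetical.

```python
import math


def normalize_to_max(values_a, values_b):
    """Scale paired metrics so the larger value in each pair becomes 100."""
    norm_a, norm_b = [], []
    for a, b in zip(values_a, values_b):
        top = max(a, b)
        # Guard against division by zero when neither player registered the stat
        norm_a.append(a / top * 100 if top > 0 else 0.0)
        norm_b.append(b / top * 100 if top > 0 else 0.0)
    return norm_a, norm_b


def radar_angles(n_axes: int):
    """Evenly spaced angles (radians) for n_axes spokes."""
    return [i / n_axes * 2 * math.pi for i in range(n_axes)]


# Invented values for four metrics, e.g. shots, xG, goals, dribbles
player_a = [4, 0.8, 2, 0]
player_b = [8, 1.6, 1, 0]

norm_a, norm_b = normalize_to_max(player_a, player_b)
angles = radar_angles(len(player_a))

print(norm_a)
print(norm_b)
print([round(angle, 2) for angle in angles])
```

Scaling against only two players exaggerates small differences, so radars built for scouting more often use percentiles against a larger peer group; the pairwise scaling here is a simplification for a single-match comparison.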
from statsbombpy import sb
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from math import pi
# Load match data
events = sb.events(match_id=3869685)
# Filter for Messi and Mbappe
players = ["Lionel Andrés Messi Cuccittini", "Kylian Mbappé Lottin"]
player_events = events[events["player"].isin(players)]
# Calculate stats for each player
def calc_player_stats(player_name, all_events):
    # Exclude the penalty shootout (StatsBomb period 5) so shootout kicks
    # are not counted as shots, goals, or xG
    pe = all_events[(all_events["player"] == player_name) &
                    (all_events["period"] <= 4)]
    shots = pe[pe["type"] == "Shot"]
    passes = pe[pe["type"] == "Pass"]
    dribbles = pe[pe["type"] == "Dribble"]
    return {
        "Shots": len(shots),
        "xG": shots["shot_statsbomb_xg"].sum(),
        "Goals": (shots["shot_outcome"] == "Goal").sum(),
        "Key Passes": (passes["pass_shot_assist"].fillna(False).sum() +
                       passes["pass_goal_assist"].fillna(False).sum()),
        "Dribbles": (dribbles["dribble_outcome"] == "Complete").sum(),
        "Touches": len(pe)
    }
messi_stats = calc_player_stats(players[0], events)
mbappe_stats = calc_player_stats(players[1], events)
# Normalize to percentages (max across both players = 100)
categories = list(messi_stats.keys())
messi_values = list(messi_stats.values())
mbappe_values = list(mbappe_stats.values())
# Normalize
max_values = [max(m, mb) for m, mb in zip(messi_values, mbappe_values)]
messi_norm = [v/mx*100 if mx > 0 else 0 for v, mx in zip(messi_values, max_values)]
mbappe_norm = [v/mx*100 if mx > 0 else 0 for v, mx in zip(mbappe_values, max_values)]
# Create radar chart
fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(polar=True))
# Calculate angles for each category
angles = [n / float(len(categories)) * 2 * pi for n in range(len(categories))]
angles += angles[:1] # Complete the loop
# Add data (closing the loop)
messi_norm += messi_norm[:1]
mbappe_norm += mbappe_norm[:1]
# Plot
ax.plot(angles, messi_norm, "o-", linewidth=2.5, color="#75AADB", label="Messi")
ax.fill(angles, messi_norm, alpha=0.25, color="#75AADB")
ax.plot(angles, mbappe_norm, "o-", linewidth=2.5, color="#002654", label="Mbappé")
ax.fill(angles, mbappe_norm, alpha=0.25, color="#002654")
# Set category labels
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories, fontsize=12)
# Add actual values as annotations
for i, (angle, m_val, mb_val) in enumerate(zip(angles[:-1], messi_values, mbappe_values)):
    ax.annotate(f"{m_val:.1f}", xy=(angle, messi_norm[i]+8),
                ha="center", fontsize=9, color="#75AADB")
    ax.annotate(f"{mb_val:.1f}", xy=(angle, mbappe_norm[i]-12),
                ha="center", fontsize=9, color="#002654")
ax.set_title("Messi vs Mbappé\nWorld Cup 2022 Final Performance",
fontsize=14, fontweight="bold", y=1.08)
ax.legend(loc="upper right", bbox_to_anchor=(1.15, 1.1), fontsize=11)
plt.tight_layout()
plt.savefig("player_radar.png", dpi=150, bbox_inches="tight")
plt.show()
# Print raw stats
print("\nRaw Statistics:")
print(f"Messi: {messi_stats}")
print(f"Mbappé: {mbappe_stats}")
library(tidyverse)
library(StatsBombR)
library(fmsb)
# Load match data
events <- get.matchFree(data.frame(match_id = 3869685))
# Calculate player stats
player_stats <- events %>%
filter(period < 5) %>% # exclude the penalty shootout (period 5)
group_by(player.name) %>%
summarise(
shots = sum(type.name == "Shot"),
xG = sum(shot.statsbomb_xg, na.rm = TRUE),
goals = sum(shot.outcome.name == "Goal", na.rm = TRUE),
passes = sum(type.name == "Pass"),
pass_completion = sum(type.name == "Pass" & is.na(pass.outcome.name)) /
sum(type.name == "Pass") * 100,
key_passes = sum(pass.shot_assist == TRUE | pass.goal_assist == TRUE, na.rm = TRUE),
dribbles = sum(type.name == "Dribble"),
successful_dribbles = sum(type.name == "Dribble" & dribble.outcome.name == "Complete", na.rm = TRUE),
touches = n()
) %>%
filter(player.name %in% c("Lionel Andrés Messi Cuccittini", "Kylian Mbappé Lottin"))
# Prepare radar data - normalize to 0-100 scale
radar_data <- player_stats %>%
select(player.name, shots, xG, goals, key_passes, successful_dribbles, touches) %>%
pivot_longer(-player.name, names_to = "metric", values_to = "value") %>%
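group_by(metric) %>%
# Guard against 0/0 when neither player registered the metric
mutate(normalized = if (max(value) > 0) value / max(value) * 100 else 0) %>%
ungroup() %>%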
select(player.name, metric, normalized) %>%
pivot_wider(names_from = metric, values_from = normalized)
# Create radar chart with fmsb
# Add max and min rows required by fmsb
radar_df <- rbind(
rep(100, 6), # max
rep(0, 6), # min
radar_data %>% filter(str_detect(player.name, "Messi")) %>% select(-player.name),
radar_data %>% filter(str_detect(player.name, "Mbappé")) %>% select(-player.name)
)
colnames(radar_df) <- c("Shots", "xG", "Goals", "Key Passes", "Dribbles", "Touches")
# Plot
colors <- c("#75AADB", "#002654")
png("player_radar.png", width = 800, height = 600, res = 150)
radarchart(
radar_df,
axistype = 1,
pcol = colors,
pfcol = alpha(colors, 0.3),
plwd = 3,
plty = 1,
cglcol = "grey",
cglty = 1,
axislabcol = "grey40",
vlcex = 0.9,
title = "Messi vs Mbappé - World Cup Final Performance"
)
legend("topright", legend = c("Messi", "Mbappé"),
col = colors, lwd = 3, bty = "n")
dev.off()
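The scaling step used in both radar examples can be isolated into a small helper. The sketch below is a minimal, library-free version; `normalize_pair` and the sample numbers are hypothetical and are not taken from the match data.

```python
def normalize_pair(stats_a, stats_b):
    """Scale two players' stats so the larger value in each category is 100."""
    norm_a, norm_b = {}, {}
    for key in stats_a:
        top = max(stats_a[key], stats_b[key])
        # Guard against 0/0 when neither player registered the stat
        norm_a[key] = stats_a[key] / top * 100 if top > 0 else 0.0
        norm_b[key] = stats_b[key] / top * 100 if top > 0 else 0.0
    return norm_a, norm_b


# Hypothetical values, for illustration only
player_a = {"Shots": 4, "xG": 1.0, "Dribbles": 0}
player_b = {"Shots": 8, "xG": 2.0, "Dribbles": 0}

norm_a, norm_b = normalize_pair(player_a, player_b)
print(norm_a)
print(norm_b)
```

Because the scale is relative to these two players only, the chart shows who led each category, not how either player ranks against a wider peer group.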
Heat Map Visualization
Heat maps show where players or teams concentrate their activity on the pitch. This is crucial for understanding positioning and tactical tendencies.
from statsbombpy import sb
import pandas as pd
import matplotlib.pyplot as plt
from mplsoccer import Pitch, VerticalPitch
import numpy as np
from scipy.stats import gaussian_kde
# Load match data
events = sb.events(match_id=3869685)
# Get Messi events with location
messi_events = events[events["player"].str.contains("Messi", na=False)].copy()
messi_events = messi_events[messi_events["location"].notna()]
# Extract x, y coordinates
messi_events["x"] = messi_events["location"].apply(lambda loc: loc[0])
messi_events["y"] = messi_events["location"].apply(lambda loc: loc[1])
# Create pitch
pitch = Pitch(pitch_type="statsbomb", pitch_color="#22312b",
line_color="white", linewidth=1)
fig, ax = pitch.draw(figsize=(12, 8))
# Create heat map using kernel density estimation
# (fill/thresh/levels replace the deprecated seaborn arguments shade/shade_lowest/n_levels)
pitch.kdeplot(
    messi_events["x"], messi_events["y"],
    ax=ax,
    cmap="YlOrRd",
    fill=True,
    thresh=0.05,
    levels=25,
    alpha=0.7
)
# Add scatter points
pitch.scatter(
messi_events["x"], messi_events["y"],
ax=ax,
s=20, color="white", alpha=0.3, edgecolors="none"
)
ax.set_title("Messi Touch Heat Map - World Cup 2022 Final\n" +
"Density of all touches throughout the match",
fontsize=14, fontweight="bold", color="white", y=1.02)
fig.patch.set_facecolor("#22312b")
plt.tight_layout()
plt.savefig("messi_heatmap.png", dpi=150, bbox_inches="tight",
facecolor="#22312b", edgecolor="none")
plt.show()
library(tidyverse)
library(StatsBombR)
library(ggsoccer)
# Load match data; allclean() adds the location.x / location.y columns used below
events <- get.matchFree(data.frame(match_id = 3869685)) %>%
allclean()
# Get Messi touches
messi_touches <- events %>%
filter(str_detect(player.name, "Messi")) %>%
filter(!is.na(location.x))
# Create heat map
ggplot(messi_touches, aes(x = location.x, y = location.y)) +
annotate_pitch(dimensions = pitch_statsbomb, colour = "white", fill = "#22312b") + # default is Opta's 100x100 pitch
stat_density_2d(
aes(fill = after_stat(level)),
geom = "polygon",
alpha = 0.7,
bins = 10
) +
scale_fill_gradient(low = "#75AADB", high = "#FFD700") +
geom_point(alpha = 0.3, color = "white", size = 1) +
coord_flip(xlim = c(0, 120), ylim = c(0, 80)) +
theme_pitch() +
theme(
plot.background = element_rect(fill = "#22312b"),
plot.title = element_text(color = "white", face = "bold", size = 14),
plot.subtitle = element_text(color = "white"),
legend.position = "none"
) +
labs(
title = "Messi Touch Heat Map - World Cup 2022 Final",
subtitle = "Density of all touches throughout the match"
)
ggsave("messi_heatmap.png", width = 12, height = 8, dpi = 150)
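As a simpler alternative to the kernel density plots above, touches can be counted per zone of a fixed grid. The sketch below assumes StatsBomb's 120x80 coordinate system; the function name `bin_touches`, the 6x4 grid, and the sample points are hypothetical choices for illustration.

```python
def bin_touches(points, n_cols=6, n_rows=4, length=120.0, width=80.0):
    """Count (x, y) touches per cell of an n_rows x n_cols grid over the pitch."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        # min() keeps points lying exactly on the far goal line or touchline in the last cell
        col = min(int(x * n_cols / length), n_cols - 1)
        row = min(int(y * n_rows / width), n_rows - 1)
        grid[row][col] += 1
    return grid


# Hypothetical touch locations, for illustration only
touches = [(10, 10), (100, 40), (105, 45), (119, 79), (120, 80)]
grid = bin_touches(touches)

for row in grid:
    print(row)
```

A grid like this is coarser than a density estimate, but the counts are easy to compare between players or matches.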
1.8 Practice Exercises
Now it's your turn to practice. Complete these exercises to solidify your understanding of the concepts covered in this chapter.
Exercise 1.1: Load Different Match Data
Task: Load data from a different World Cup 2022 match (e.g., Brazil vs Croatia quarter-final) and calculate basic shot statistics for both teams.
Steps:
- Find the match_id for Brazil vs Croatia
- Load all events from that match
- Filter for shots only
- Calculate total shots, shots on target, and total xG for each team
# Exercise 1.1 Solution
from statsbombpy import sb
import pandas as pd
# Load World Cup matches
matches = sb.matches(competition_id=43, season_id=106)
# Find Brazil vs Croatia
bra_cro = matches[
((matches["home_team"] == "Brazil") & (matches["away_team"] == "Croatia")) |
((matches["home_team"] == "Croatia") & (matches["away_team"] == "Brazil"))
].iloc[0]
print(f"Match ID: {bra_cro['match_id']}")
print(f"Score: {bra_cro['home_score']} - {bra_cro['away_score']}")
# Load events
events = sb.events(match_id=bra_cro["match_id"])
# Shot analysis (this tie went to penalties; period 5 is the shootout, so exclude it)
shots = events[(events["type"] == "Shot") & (events["period"] < 5)]
shot_stats = shots.groupby("team").agg(
total_shots=("type", "count"),
shots_on_target=("shot_outcome", lambda x: x.isin(["Goal", "Saved"]).sum()),
goals=("shot_outcome", lambda x: (x == "Goal").sum()),
total_xG=("shot_statsbomb_xg", "sum")
).round(2)
print("\nShot Statistics:")
print(shot_stats)
# Exercise 1.1 Solution
library(StatsBombR)
library(tidyverse)
# Load World Cup matches
matches <- FreeMatches(Competitions = FreeCompetitions() %>%
filter(competition_name == "FIFA World Cup", season_name == "2022"))
# Find Brazil vs Croatia
bra_cro <- matches %>%
filter((home_team.home_team_name == "Brazil" & away_team.away_team_name == "Croatia") |
(home_team.home_team_name == "Croatia" & away_team.away_team_name == "Brazil"))
cat(sprintf("Match ID: %d\n", bra_cro$match_id))
cat(sprintf("Score: %d - %d\n", bra_cro$home_score, bra_cro$away_score))
# Load events
events <- get.matchFree(bra_cro)
# Shot analysis
shot_stats <- events %>%
filter(type.name == "Shot", period < 5) %>% # period 5 is the penalty shootout
group_by(team.name) %>%
summarise(
total_shots = n(),
shots_on_target = sum(shot.outcome.name %in% c("Goal", "Saved"), na.rm = TRUE),
goals = sum(shot.outcome.name == "Goal", na.rm = TRUE),
total_xG = round(sum(shot.statsbomb_xg, na.rm = TRUE), 2)
)
print(shot_stats)
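To check the aggregation logic without downloading any data, the same summary can be sketched over plain dictionaries. The field names and sample shots below are hypothetical; the on-target rule (Goal or Saved) mirrors the solutions above, and period 5 is treated as the penalty shootout.

```python
from collections import defaultdict

ON_TARGET = {"Goal", "Saved"}  # same simplification as the solutions above


def summarise_shots(shots):
    """Aggregate shots, shots on target, goals and xG per team, ignoring shootout kicks."""
    summary = defaultdict(lambda: {"shots": 0, "on_target": 0, "goals": 0, "xg": 0.0})
    for shot in shots:
        if shot["period"] == 5:  # penalty shootout
            continue
        row = summary[shot["team"]]
        row["shots"] += 1
        row["on_target"] += shot["outcome"] in ON_TARGET
        row["goals"] += shot["outcome"] == "Goal"
        row["xg"] += shot["xg"]
    return dict(summary)


# Hypothetical shots, for illustration only
sample = [
    {"team": "A", "outcome": "Goal", "xg": 0.5, "period": 1},
    {"team": "A", "outcome": "Off T", "xg": 0.25, "period": 2},
    {"team": "B", "outcome": "Saved", "xg": 0.125, "period": 2},
    {"team": "B", "outcome": "Goal", "xg": 0.75, "period": 5},
]
result = summarise_shots(sample)
print(result)
```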
Exercise 1.2: Create a Pass Map
Task: Create a pass map showing all successful passes by a specific player (e.g., Enzo Fernández) in the World Cup Final.
Requirements:
- Filter for passes by the player
- Show only successful passes
- Draw arrows from pass start to end location
- Color-code by pass type (progressive vs. normal)
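The last requirement needs a working definition of "progressive", since definitions vary between providers. One possible sketch, assuming StatsBomb coordinates (the attacked goal is centred at x=120, y=40) and a hypothetical rule that a pass is progressive when it ends at least 25% closer to that goal:

```python
from math import hypot

GOAL_X, GOAL_Y = 120.0, 40.0  # centre of the attacked goal in StatsBomb coordinates


def is_progressive(start, end, ratio=0.75):
    """Return True if the pass ends within `ratio` of its starting distance to goal."""
    start_dist = hypot(GOAL_X - start[0], GOAL_Y - start[1])
    end_dist = hypot(GOAL_X - end[0], GOAL_Y - end[1])
    return end_dist <= ratio * start_dist


print(is_progressive((60, 40), (90, 40)))  # forward pass that halves the distance to goal
print(is_progressive((60, 40), (70, 40)))  # short forward pass
print(is_progressive((60, 40), (50, 40)))  # backward pass
```

The 25% threshold is an assumption chosen for illustration; a stricter or looser `ratio` will change how many passes are highlighted.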
# Exercise 1.2 Solution
from statsbombpy import sb
import matplotlib.pyplot as plt
from mplsoccer import Pitch
# Load World Cup Final
events = sb.events(match_id=3869685)
# Get Enzo Fernandez passes
enzo_passes = events[
(events["player"].str.contains("Enzo", na=False)) &
(events["type"] == "Pass") &
(events["pass_outcome"].isna()) # Successful passes
].copy()
# Extract coordinates
enzo_passes["x"] = enzo_passes["location"].apply(lambda x: x[0])
enzo_passes["y"] = enzo_passes["location"].apply(lambda x: x[1])
enzo_passes["end_x"] = enzo_passes["pass_end_location"].apply(lambda x: x[0] if isinstance(x, list) else None)
enzo_passes["end_y"] = enzo_passes["pass_end_location"].apply(lambda x: x[1] if isinstance(x, list) else None)
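# StatsBomb events have no progressive-pass flag, so derive one.
# Assumed definition: the pass ends at least 25% closer to the centre of the opposition goal (120, 40).
start_dist = ((120 - enzo_passes["x"]) ** 2 + (40 - enzo_passes["y"]) ** 2) ** 0.5
end_dist = ((120 - enzo_passes["end_x"]) ** 2 + (40 - enzo_passes["end_y"]) ** 2) ** 0.5
enzo_passes["is_progressive"] = end_dist <= 0.75 * start_dist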
# Create pitch
pitch = Pitch(pitch_type="statsbomb", pitch_color="#22312b", line_color="white")
fig, ax = pitch.draw(figsize=(12, 8))
# Plot normal passes
normal = enzo_passes[~enzo_passes["is_progressive"]]
pitch.arrows(
normal["x"], normal["y"], normal["end_x"], normal["end_y"],
ax=ax, color="#75AADB", width=2, headwidth=6, headlength=5, alpha=0.7
)
# Plot progressive passes
progressive = enzo_passes[enzo_passes["is_progressive"]]
pitch.arrows(
progressive["x"], progressive["y"], progressive["end_x"], progressive["end_y"],
ax=ax, color="#FFD700", width=2, headwidth=6, headlength=5, alpha=0.9
)
ax.set_title("Enzo Fernández Pass Map - World Cup 2022 Final",
fontsize=14, fontweight="bold", color="white")
# Add legend
ax.plot([], [], color="#75AADB", label="Normal Pass", linewidth=3)
ax.plot([], [], color="#FFD700", label="Progressive Pass", linewidth=3)
ax.legend(loc="lower right", facecolor="#22312b", labelcolor="white")
fig.patch.set_facecolor("#22312b")
plt.tight_layout()
plt.savefig("enzo_pass_map.png", dpi=150, bbox_inches="tight", facecolor="#22312b")
plt.show()
# Exercise 1.2 Solution
library(StatsBombR)
library(tidyverse)
library(ggsoccer)
# Load World Cup Final; allclean() adds the location.x / pass.end_location.x columns used below
events <- get.matchFree(data.frame(match_id = 3869685)) %>%
allclean()
# Get Enzo Fernandez passes
enzo_passes <- events %>%
filter(str_detect(player.name, "Enzo")) %>%
filter(type.name == "Pass") %>%
filter(is.na(pass.outcome.name) | pass.outcome.name == "Complete") %>%
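mutate(
# StatsBomb events have no progressive-pass flag, so derive one.
# Assumed definition: the pass ends at least 25% closer to the centre of the opposition goal (120, 40).
start_dist = sqrt((120 - location.x)^2 + (40 - location.y)^2),
end_dist = sqrt((120 - pass.end_location.x)^2 + (40 - pass.end_location.y)^2),
is_progressive = end_dist <= 0.75 * start_dist
)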
# Create pass map
ggplot(enzo_passes) +
annotate_pitch(dimensions = pitch_statsbomb, colour = "white", fill = "#22312b") + # default is Opta's 100x100 pitch
geom_segment(
aes(x = location.x, y = location.y,
xend = pass.end_location.x, yend = pass.end_location.y,
color = is_progressive),
arrow = arrow(length = unit(0.15, "cm"), type = "closed"),
alpha = 0.7, linewidth = 0.8
) +
scale_color_manual(
values = c("FALSE" = "#75AADB", "TRUE" = "#FFD700"),
labels = c("Normal Pass", "Progressive Pass")
) +
coord_flip(xlim = c(0, 120), ylim = c(0, 80)) +
theme_pitch() +
theme(
plot.background = element_rect(fill = "#22312b"),
plot.title = element_text(color = "white", face = "bold"),
legend.position = "bottom",
legend.text = element_text(color = "white"),
legend.title = element_blank()
) +
labs(title = "Enzo Fernández Pass Map - World Cup 2022 Final")
ggsave("enzo_pass_map.png", width = 12, height = 8, dpi = 150)
Exercise 1.3: Team Comparison Dashboard
Task: Create a multi-panel visualization comparing Argentina and France across multiple metrics from the World Cup Final.
Panels to include:
- Shot locations for both teams
- xG timeline
- Passing statistics bar chart
- Key player statistics table
# Exercise 1.3 Solution - Multi-panel Dashboard
from statsbombpy import sb
import matplotlib.pyplot as plt
from mplsoccer import Pitch, VerticalPitch
import pandas as pd
import numpy as np
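# Load data (period 5 is the penalty shootout; drop it so shots, goals and xG match the 3-3 scoreline)
events = sb.events(match_id=3869685)
events = events[events["period"] < 5]
shots = events[events["type"] == "Shot"].copy()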
shots["x"] = shots["location"].apply(lambda x: x[0])
shots["y"] = shots["location"].apply(lambda x: x[1])
shots["is_goal"] = shots["shot_outcome"] == "Goal"
# Create figure with subplots
fig = plt.figure(figsize=(18, 14))
# Panel 1: Argentina Shots
ax1 = fig.add_subplot(2, 3, 1)
pitch = VerticalPitch(pitch_type="statsbomb", half=True, pitch_color="#22312b", line_color="white")
pitch.draw(ax=ax1)
arg_shots = shots[shots["team"] == "Argentina"]
pitch.scatter(arg_shots["x"], arg_shots["y"],
s=arg_shots["shot_statsbomb_xg"]*500+50,
c=["#FFD700" if g else "#75AADB" for g in arg_shots["is_goal"]],
edgecolors="white", ax=ax1, alpha=0.8)
ax1.set_title("Argentina Shots", fontweight="bold")  # default dark text; white is unreadable on the light figure background
# Panel 2: France Shots
ax2 = fig.add_subplot(2, 3, 2)
pitch.draw(ax=ax2)
fra_shots = shots[shots["team"] == "France"]
pitch.scatter(fra_shots["x"], fra_shots["y"],
s=fra_shots["shot_statsbomb_xg"]*500+50,
c=["#FFD700" if g else "#002654" for g in fra_shots["is_goal"]],
edgecolors="white", ax=ax2, alpha=0.8)
ax2.set_title("France Shots", fontweight="bold")  # default dark text; white is unreadable on the light figure background
# Panel 3: xG Timeline
ax3 = fig.add_subplot(2, 3, 3)
for team, color in [("Argentina", "#75AADB"), ("France", "#002654")]:
    team_shots = shots[shots["team"] == team].sort_values("minute")
    cum_xg = [0] + team_shots["shot_statsbomb_xg"].cumsum().tolist()
    minutes = [0] + team_shots["minute"].tolist()
    ax3.step(minutes, cum_xg, where="post", label=team, color=color, linewidth=2)
    # Mark goals at the cumulative xG value reached by that shot
    goals = team_shots[team_shots["is_goal"]]
    if len(goals) > 0:
        goal_xg = team_shots["shot_statsbomb_xg"].cumsum()
        ax3.scatter(goals["minute"], goal_xg[goals.index],
                    marker="*", s=200, color=color, edgecolors="gold", linewidth=1.5, zorder=5)
ax3.set_xlabel("Minute")
ax3.set_ylabel("Cumulative xG")
ax3.set_title("xG Timeline", fontweight="bold")
ax3.legend()
ax3.grid(True, alpha=0.3)
# Panel 4-5: Team Statistics
ax4 = fig.add_subplot(2, 3, (4, 5))
stats = []
for team in ["Argentina", "France"]:
    team_shots = shots[shots["team"] == team]
    team_passes = events[(events["team"] == team) & (events["type"] == "Pass")]
    stats.append({
        "Team": team,
        "Shots": len(team_shots),
        "Goals": team_shots["is_goal"].sum(),
        "xG": round(team_shots["shot_statsbomb_xg"].sum(), 2),
        "Passes": len(team_passes),
        # A missing pass_outcome means the pass was completed
        "Pass %": round(team_passes["pass_outcome"].isna().mean() * 100, 1)
    })
stats_df = pd.DataFrame(stats)
ax4.axis("off")
table = ax4.table(cellText=stats_df.values, colLabels=stats_df.columns,
cellLoc="center", loc="center")
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(1.2, 2)
ax4.set_title("Match Statistics", fontweight="bold", y=0.7)
# Overall title
fig.suptitle("World Cup 2022 Final Dashboard\nArgentina 3-3 France (Argentina wins on penalties)",
fontsize=16, fontweight="bold", y=0.98)
fig.patch.set_facecolor("#f5f5f5")
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.savefig("match_dashboard.png", dpi=150, bbox_inches="tight")
plt.show()
# Exercise 1.3 Solution - Multi-panel Dashboard
library(StatsBombR)
library(tidyverse)
library(ggsoccer)
library(patchwork)
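# Load data; allclean() adds location.x / location.y, and period 5 (the penalty shootout) is dropped
events <- get.matchFree(data.frame(match_id = 3869685)) %>%
allclean() %>%
filter(period < 5)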
shots <- events %>% filter(type.name == "Shot")
passes <- events %>% filter(type.name == "Pass")
# Panel 1: Shot Map (Argentina)
p1 <- ggplot(filter(shots, team.name == "Argentina")) +
annotate_pitch(dimensions = pitch_statsbomb, colour = "white", fill = "#22312b") + # default is Opta's 100x100 pitch
geom_point(
aes(x = location.x, y = location.y,
size = shot.statsbomb_xg,
color = shot.outcome.name == "Goal"),
alpha = 0.7
) +
scale_color_manual(values = c("FALSE" = "#75AADB", "TRUE" = "#FFD700")) +
coord_flip(xlim = c(60, 120)) +
theme_pitch() +
theme(legend.position = "none", plot.background = element_rect(fill = "#22312b"),
plot.title = element_text(color = "white", face = "bold", hjust = 0.5)) +
labs(title = "Argentina Shots")
# Panel 2: Shot Map (France)
p2 <- ggplot(filter(shots, team.name == "France")) +
annotate_pitch(dimensions = pitch_statsbomb, colour = "white", fill = "#22312b") + # default is Opta's 100x100 pitch
geom_point(
aes(x = location.x, y = location.y,
size = shot.statsbomb_xg,
color = shot.outcome.name == "Goal"),
alpha = 0.7
) +
scale_color_manual(values = c("FALSE" = "#002654", "TRUE" = "#FFD700")) +
coord_flip(xlim = c(60, 120)) +
theme_pitch() +
theme(legend.position = "none", plot.background = element_rect(fill = "#22312b"),
plot.title = element_text(color = "white", face = "bold", hjust = 0.5)) +
labs(title = "France Shots")
# Panel 3: xG Timeline
xg_data <- shots %>%
arrange(minute) %>%
group_by(team.name) %>%
mutate(cumulative_xG = cumsum(shot.statsbomb_xg)) %>%
ungroup()
p3 <- ggplot(xg_data, aes(x = minute, y = cumulative_xG, color = team.name)) +
geom_step(linewidth = 1.5) +
geom_point(data = filter(xg_data, shot.outcome.name == "Goal"),
aes(shape = team.name), size = 4) +
scale_color_manual(values = c("Argentina" = "#75AADB", "France" = "#002654")) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(title = "xG Timeline", x = "Minute", y = "Cumulative xG", color = "")
# Panel 4: Stats Summary
stats_summary <- events %>%
group_by(team.name) %>%
summarise(
Shots = sum(type.name == "Shot"),
Goals = sum(shot.outcome.name == "Goal", na.rm = TRUE),
xG = round(sum(shot.statsbomb_xg, na.rm = TRUE), 2),
Passes = sum(type.name == "Pass"),
`Pass %` = round(sum(type.name == "Pass" & is.na(pass.outcome.name)) /
sum(type.name == "Pass") * 100, 1)
)
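# Render the summary table as a grob so it can be placed in the layout
p4 <- gridExtra::tableGrob(stats_summary, rows = NULL)
# Combine panels; wrap_elements() lets patchwork lay out the table grob next to the timeline
dashboard <- (p1 | p2) / (p3 | wrap_elements(p4)) + plot_annotation(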
title = "World Cup 2022 Final Dashboard",
subtitle = "Argentina 3-3 France (Argentina wins on penalties)",
theme = theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 12)
)
)
ggsave("match_dashboard.png", dashboard, width = 16, height = 12, dpi = 150)
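The xG timeline panel reduces to a running sum of shot xG ordered by minute. A minimal, library-free sketch of that step follows; the function name `xg_timeline` and the sample shots are hypothetical.

```python
from itertools import accumulate


def xg_timeline(shots):
    """Return (minutes, cumulative_xg) lists for a step plot, starting from (0, 0.0)."""
    ordered = sorted(shots, key=lambda shot: shot[0])  # sort by minute
    minutes = [0] + [minute for minute, _ in ordered]
    cum_xg = [0.0] + list(accumulate(xg for _, xg in ordered))
    return minutes, cum_xg


# Hypothetical (minute, xG) pairs for one team, for illustration only
sample_shots = [(57, 0.25), (12, 0.5), (80, 0.125)]
minutes, cum_xg = xg_timeline(sample_shots)
print(minutes)
print(cum_xg)
```

Prepending the origin point makes the step line start at zero rather than at the first shot.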
1.9 Summary
In this chapter, you learned:
Key Concepts
- What soccer analytics is and its three pillars
- The history and evolution of football analytics
- Types of football data (event, tracking, aggregate)
- How Expected Goals (xG) measures shot quality
Technical Skills
- Setting up Python or R for football analytics
- Loading data from StatsBomb
- Calculating basic shot statistics
- Creating a shot map visualization
What's Next
In Chapter 2: Data Wrangling for Football, we'll dive deeper into working with football data—handling different coordinate systems, dealing with missing data, and transforming raw events into analysis-ready datasets.
Key Takeaways
- Analytics enhances, rather than replaces, traditional football expertise
- Expected Goals (xG) is the foundational metric of modern football analytics
- Event data is the most accessible detailed data source (free via StatsBomb)
- Both R and Python have excellent ecosystems for football analytics
- Start simple - even basic shot analysis can reveal interesting insights