The Metric That Changed Football
Expected Goals (xG) is the single most important innovation in modern football analytics. It answers a simple question: "How likely was that shot to result in a goal?" By assigning probabilities to every shot, xG reveals the true quality of chances created and finished.
Before xG, we judged strikers by goals scored. But goals are noisy—a player might score 20 goals from 15 xG worth of chances (lucky) or 10 goals from 15 xG (unlucky). xG separates skill from variance, revealing who creates quality chances and who converts them efficiently.
Why xG Matters
- Predictive Power: xG predicts future goals better than past goals do
- Process vs. Outcome: Evaluate decision-making independent of finishing luck
- Fair Comparison: Compare players/teams controlling for chance quality
- Tactical Insight: Understand how teams create and concede chances
- Transfer Decisions: Identify undervalued players and avoid overpaying for luck
A Brief History of xG
How xG Works
At its core, xG is a machine learning model trained on historical shots. Given features about a shot (location, body part, assist type, etc.), the model predicts the probability of scoring.
The Basic Concept
xG Definition: The probability that an average player would score from a given shot situation, based on historical conversion rates of similar shots.
If a shot has xG = 0.35, it means that historically, 35% of shots from similar positions with similar characteristics have resulted in goals. An "average" shooter would score this chance about once every three attempts.
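To make the "historical conversion rate of similar shots" idea concrete, here is a minimal sketch that treats distance bands as the only notion of similarity. The shot log is made up for illustration, and the helper names `bucket` and `empirical_xg` are hypothetical; real xG models use many more features and far more data.

```python
from collections import defaultdict

# Toy shot log (made-up values): (distance to goal in yards, scored?)
toy_shots = [
    (4, True), (5, True), (6, False), (7, True),
    (11, False), (12, True), (14, False), (15, False),
    (21, False), (23, False), (25, False), (28, True),
]

def bucket(distance, width=10):
    """Label a shot by its distance band, e.g. 12 -> '10-19'."""
    low = int(distance // width) * width
    return f"{low}-{low + width - 1}"

def empirical_xg(shots):
    """Return {band: goals / shots}, the conversion rate within each band."""
    goals, attempts = defaultdict(int), defaultdict(int)
    for distance, scored in shots:
        band = bucket(distance)
        attempts[band] += 1
        goals[band] += int(scored)
    return {band: goals[band] / attempts[band] for band in attempts}

xg_table = empirical_xg(toy_shots)
for band, rate in xg_table.items():
    print(f"{band} yards: xG ~ {rate:.2f}")
```

A new shot would then be assigned the rate of its band. Production models replace the lookup table with a fitted model so that sparse combinations of features still get sensible estimates.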
Key Features in xG Models
| Feature | Description | Impact on xG |
|---|---|---|
| Distance to Goal | Euclidean distance from shot location to center of goal | Closer = Higher xG (strongest predictor) |
| Angle to Goal | Angle between shot location and goal posts | Wider angle = Higher xG |
| Body Part | Foot, head, or other | Foot > Head typically |
| Shot Type | Open play, set piece, penalty, etc. | Penalties ≈ 0.76 xG |
| Assist Type | Through ball, cross, cutback, etc. | Through balls/cutbacks higher |
| Game State | Score differential at time of shot | Trailing teams shoot from worse positions |
| Defender Positions | Number/location of defenders (advanced models) | Fewer defenders = Higher xG |
| Goalkeeper Position | Distance from goal line (advanced models) | GK off line = Higher xG |
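The two strongest features in the table can be computed directly from shot coordinates. The sketch below assumes StatsBomb's 120 x 80 pitch with the goal centred at (120, 40) and posts at y = 36 and y = 44; the function names are illustrative, not part of any library.

```python
import math

GOAL_X, GOAL_Y = 120.0, 40.0   # centre of the goal (StatsBomb coordinates, assumed)
HALF_GOAL_WIDTH = 4.0          # goal is 8 yards wide: posts at y = 36 and y = 44

def shot_distance(x, y):
    """Euclidean distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_X - x, GOAL_Y - y)

def shot_angle(x, y):
    """Angle in degrees subtended by the two posts as seen from (x, y)."""
    dx = GOAL_X - x
    dy = GOAL_Y - y
    # atan2(cross product, dot product) of the vectors pointing at each post
    cross = 2 * HALF_GOAL_WIDTH * dx
    dot = dx ** 2 + dy ** 2 - HALF_GOAL_WIDTH ** 2
    return math.degrees(math.atan2(cross, dot))

# Penalty spot versus a shot from the same depth but 20 yards wide of centre
for label, (x, y) in {"central": (108, 40), "wide": (108, 20)}.items():
    print(f"{label}: distance={shot_distance(x, y):.1f}, angle={shot_angle(x, y):.1f}")
```

The central shot sees a much wider goal mouth than the wide one even though both are taken 12 yards from the goal line, which is why angle carries information that distance alone does not.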
# Understanding xG features in StatsBomb data
library(StatsBombR)
library(dplyr)
# Load sample match
matches <- FreeMatches(FreeCompetitions() %>%
filter(competition_id == 43, season_id == 106))
events <- get.matchFree(matches[1, ]) %>%
allclean() # allclean() adds the location.x / location.y columns used below
# Examine shot features
shots <- events %>%
filter(type.name == "Shot") %>%
select(player.name, location.x, location.y,
shot.statsbomb_xg, shot.body_part.name,
shot.type.name, shot.technique.name,
shot.outcome.name) %>%
mutate(
# Calculate distance to goal center (120, 40)
distance = sqrt((120 - location.x)^2 + (40 - location.y)^2),
# Angle subtended by the goal mouth (posts at y = 36 and y = 44; goal is 8 yards wide)
angle = atan2(8 * (120 - location.x),
(120 - location.x)^2 + (40 - location.y)^2 - 16) * 180 / pi
)
# View relationship between features and xG
print(shots %>%
select(player.name, distance, angle, shot.statsbomb_xg,
shot.body_part.name, shot.outcome.name) %>%
arrange(desc(shot.statsbomb_xg)))
# Average xG by body part
print(shots %>%
group_by(shot.body_part.name) %>%
summarise(
shots = n(),
avg_xG = mean(shot.statsbomb_xg, na.rm = TRUE),
goals = sum(shot.outcome.name == "Goal")
))
The xG Probability Distribution
Most shots have low xG values. The distribution is heavily right-skewed:
# Visualize xG distribution
library(ggplot2)
# Load more matches for better distribution
events <- free_allevents(MatchesDF = matches[1:10, ])
shots <- events %>% filter(type.name == "Shot")
# xG histogram
ggplot(shots, aes(x = shot.statsbomb_xg)) +
geom_histogram(bins = 50, fill = "#1B5E20", color = "white", alpha = 0.8) +
geom_vline(xintercept = mean(shots$shot.statsbomb_xg, na.rm = TRUE),
color = "red", linetype = "dashed", linewidth = 1) +
annotate("text", x = 0.2, y = 150,
label = paste("Mean xG:", round(mean(shots$shot.statsbomb_xg, na.rm = TRUE), 3)),
color = "red") +
labs(title = "Distribution of Expected Goals",
subtitle = "Most shots have low xG; high-quality chances are rare",
x = "xG Value", y = "Number of Shots") +
theme_minimal() +
scale_x_continuous(breaks = seq(0, 1, 0.1))
# xG categories
shots %>%
mutate(xg_category = case_when(
shot.statsbomb_xg >= 0.5 ~ "Big Chance (0.5+)",
shot.statsbomb_xg >= 0.2 ~ "Good Chance (0.2-0.5)",
shot.statsbomb_xg >= 0.1 ~ "Reasonable (0.1-0.2)",
shot.statsbomb_xg >= 0.05 ~ "Low Quality (0.05-0.1)",
TRUE ~ "Very Low (<0.05)"
)) %>%
group_by(xg_category) %>%
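summarise(
n_shots = n(),
avg_conversion = mean(shot.outcome.name == "Goal") * 100
) %>%
# Share of all shots; computed after summarise so a summary column cannot mask the `shots` data frame
mutate(pct = n_shots / sum(n_shots) * 100) %>%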
arrange(desc(pct))
Using Pre-Built xG Data
Most analysts use xG values provided by data companies rather than building their own models. Here's how to work with xG data from major sources.
StatsBomb xG
StatsBomb provides the most detailed free xG data, including freeze-frame information about player and goalkeeper positions:
# Working with StatsBomb xG
library(StatsBombR)
library(dplyr)
# Load competition data
comps <- FreeCompetitions() %>%
filter(competition_id == 43, season_id == 106) # World Cup 2022
matches <- FreeMatches(comps)
events <- free_allevents(MatchesDF = matches) %>%
allclean() # allclean() adds the location.x / location.y columns used below
# Extract all shots with xG
shots <- events %>%
filter(type.name == "Shot", period < 5) %>% # period 5 is the penalty shootout; exclude it
select(match_id, team.name, player.name, minute,
location.x, location.y,
shot.statsbomb_xg, shot.outcome.name,
shot.body_part.name, shot.type.name,
shot.first_time, shot.one_on_one)
# Team xG totals
team_xg <- shots %>%
group_by(team.name) %>%
summarise(
matches = n_distinct(match_id),
shots = n(),
goals = sum(shot.outcome.name == "Goal"),
xG = sum(shot.statsbomb_xg, na.rm = TRUE),
npxG = sum(shot.statsbomb_xg[shot.type.name != "Penalty"], na.rm = TRUE)
) %>%
mutate(
xG_per_match = round(xG / matches, 2),
goals_minus_xG = goals - xG,
conversion = round(goals / shots * 100, 1)
) %>%
arrange(desc(xG))
print("World Cup 2022 Team xG:")
print(head(team_xg, 10))
# Player xG leaders
player_xg <- shots %>%
group_by(player.name, team.name) %>%
summarise(
shots = n(),
goals = sum(shot.outcome.name == "Goal"),
xG = sum(shot.statsbomb_xg, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(goals_minus_xG = goals - xG) %>%
arrange(desc(xG))
cat("\nTop 10 Players by xG:\n")
print(head(player_xg, 10))
Understat xG
Understat provides free xG data for the top five European leagues and the Russian Premier League:
# Working with Understat xG via understatr
library(understatr)
library(dplyr)
# Get team season data
epl_teams <- get_league_teams_stats(league_name = "EPL", year = 2023)
# View team xG data
team_xg <- epl_teams %>%
select(team_name, matches, scored, missed, xG, xGA, xpts, pts) %>%
mutate(
goals_minus_xG = scored - xG,
conceded_minus_xGA = missed - xGA,
pts_minus_xpts = pts - xpts
) %>%
arrange(desc(xG))
print("EPL 2023-24 Team xG:")
print(team_xg)
# Get player data
player_data <- get_league_players_stats(league_name = "EPL", year = 2023)
# Top scorers by xG
top_xg <- player_data %>%
select(player_name, team_name, games, goals, xG, shots) %>%
mutate(
goals_minus_xG = goals - xG,
xG_per_shot = xG / shots
) %>%
arrange(desc(xG)) %>%
head(15)
cat("\nTop 15 Players by xG:\n")
print(top_xg)
FBref xG (via StatsBomb)
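FBref provides xG data with convenient aggregations. Its xG values were supplied by StatsBomb until late 2022 and have come from Opta since then, so recent seasons carry Opta figures: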
# Scraping FBref xG data with worldfootballR
library(worldfootballR)
library(dplyr)
# Get league-wide player stats
epl_stats <- fb_big5_advanced_season_stats(
season_end_year = 2024,
stat_type = "shooting",
team_or_player = "player"
)
# Filter for EPL and analyze xG
epl_players <- epl_stats %>%
filter(Comp == "Premier League") %>%
select(Player, Squad, Min_Playing, Gls_Standard, xG_Expected,
npxG_Expected, Sh_Standard, SoT_Standard) %>%
mutate(
nineties = Min_Playing / 90,
goals_minus_xG = Gls_Standard - xG_Expected,
xG_per_90 = xG_Expected / nineties,
shots_per_90 = Sh_Standard / nineties
) %>%
filter(Min_Playing >= 900) %>% # Minimum 10 matches
arrange(desc(xG_Expected))
print("EPL Top Scorers by xG (min 900 mins):")
print(head(epl_players, 15))
# Biggest overperformers
cat("\nBiggest Overperformers (Goals - xG):\n")
print(epl_players %>%
arrange(desc(goals_minus_xG)) %>%
select(Player, Squad, Gls_Standard, xG_Expected, goals_minus_xG) %>%
head(10))
Interpreting xG Correctly
xG is powerful but often misunderstood. Here's how to use it properly.
xG Is Probabilistic, Not Deterministic
Common Misunderstanding
"Team A had 2.5 xG so they deserved to win" - Wrong!
xG tells us the probability distribution of outcomes, not what "should" happen. A team with 2.5 xG might score 0, 1, 2, 3, 4, or more goals on any given day.
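As a complement to simulation, the goal distribution can also be computed exactly if each shot is treated as an independent trial with success probability equal to its xG. The sketch below does this by convolving the shots one at a time; the shot values are illustrative and the independence assumption is a simplification (rebounds, for example, are not independent of the shot before them).

```python
def goal_distribution(shot_xgs):
    """Exact P(k goals) for independent shots, built up one shot at a time."""
    probs = [1.0]                      # before any shot: P(0 goals) = 1
    for p in shot_xgs:
        nxt = [0.0] * (len(probs) + 1)
        for k, pk in enumerate(probs):
            nxt[k] += pk * (1 - p)     # this shot is missed
            nxt[k + 1] += pk * p       # this shot is scored
        probs = nxt
    return probs

# Illustrative shot values for one team in one match
shot_xgs = [0.75, 0.35, 0.12, 0.08, 0.05, 0.03, 0.02, 0.02]
dist = goal_distribution(shot_xgs)

print(f"Total xG: {sum(shot_xgs):.2f}")
for k, pk in enumerate(dist[:5]):
    print(f"P({k} goals) = {pk:.3f}")
```

The mean of this distribution equals the total xG, but the spread around it is wide, which is the point: a single match result says little about whether the xG was "right".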
# Simulate goal outcomes from xG
library(dplyr)
simulate_goals <- function(xg_values, n_simulations = 10000) {
# Each shot is a Bernoulli trial with p = xG
goals_per_sim <- sapply(1:n_simulations, function(i) {
sum(runif(length(xg_values)) < xg_values)
})
return(goals_per_sim)
}
# Example: Team had shots with these xG values
shot_xgs <- c(0.75, 0.35, 0.12, 0.08, 0.05, 0.03, 0.02, 0.02)
total_xg <- sum(shot_xgs) # 1.42 xG
# Simulate 10,000 times
simulated_goals <- simulate_goals(shot_xgs)
# What percentage of simulations result in each goal count?
goal_distribution <- table(simulated_goals) / 10000 * 100
cat(sprintf("Total xG: %.2f\n", total_xg))
cat("\nGoal distribution from 10,000 simulations:\n")
print(round(goal_distribution, 1))
cat(sprintf("\nMost likely outcome: %d goals (%.1f%%)\n",
as.numeric(names(which.max(goal_distribution))),
max(goal_distribution)))
cat(sprintf("Probability of 0 goals: %.1f%%\n",
goal_distribution["0"]))
Over/Underperformance and Regression
When a player's goals significantly differ from their xG, we should expect regression to the mean:
Overperformance (goals above xG)
Possible explanations:
- Elite finishing skill (sustained over multiple seasons)
- Luck/variance (likely if short sample)
- Shot selection bias (only shoots when confident)
Expectation: Goals will likely decrease unless the player is a proven elite finisher
Underperformance (goals below xG)
Possible explanations:
- Poor finishing (sustained over multiple seasons)
- Bad luck/variance (likely if short sample)
- Injury affecting shooting
Expectation: Goals will likely increase (bounce-back candidate)
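One hedged way to judge whether a gap between goals and xG is more than noise is to compare it with the spread expected from chance alone. If each shot is an independent trial with probability equal to its xG, the variance of the goal total is the sum of p(1 - p) over the shots. The function name and the example players below are hypothetical.

```python
import math

def finishing_z_score(goals, shot_xgs):
    """Standardised gap between goals and xG, assuming independent shots."""
    xg = sum(shot_xgs)
    variance = sum(p * (1 - p) for p in shot_xgs)
    if variance == 0:
        raise ValueError("need at least one shot with 0 < xG < 1")
    return (goals - xg) / math.sqrt(variance)

# Two hypothetical players who both score at twice their xG
small_sample = finishing_z_score(goals=6, shot_xgs=[0.1] * 30)     # 6 goals from 3.0 xG
large_sample = finishing_z_score(goals=60, shot_xgs=[0.1] * 300)   # 60 goals from 30.0 xG

print(f"30 shots:  z = {small_sample:.2f}")
print(f"300 shots: z = {large_sample:.2f}")
```

The ratio of goals to xG is identical for both players, yet the larger sample sits much further from what variance alone would produce. That is the quantitative version of "luck is likely if the sample is short".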
# Analyze over/underperformance
library(dplyr)
# Calculate player finishing skill
player_finishing <- shots %>%
group_by(player.name, team.name) %>%
summarise(
shots = n(),
goals = sum(shot.outcome.name == "Goal"),
xG = sum(shot.statsbomb_xg, na.rm = TRUE),
.groups = "drop"
) %>%
filter(shots >= 10) %>% # Minimum sample size
mutate(
goals_minus_xG = goals - xG,
finishing_pct = (goals - xG) / xG * 100, # % over/under xG
conversion = goals / shots * 100,
xG_per_shot = xG / shots
)
# Overperformers (potential regression candidates)
cat("\nBiggest Overperformers (may regress):\n")
print(player_finishing %>%
filter(goals_minus_xG > 0) %>%
arrange(desc(goals_minus_xG)) %>%
select(player.name, shots, goals, xG, goals_minus_xG) %>%
head(5))
# Underperformers (potential bounce-back candidates)
cat("\nBiggest Underperformers (may improve):\n")
print(player_finishing %>%
filter(goals_minus_xG < 0) %>%
arrange(goals_minus_xG) %>%
select(player.name, shots, goals, xG, goals_minus_xG) %>%
head(5))
# Note: True finishing skill requires multi-season analysis
cat("\nNote: Single tournament data is noisy.")
cat("\nMulti-season analysis needed for true finishing skill.")
Non-Penalty xG (npxG)
Penalties are converted far more often than open-play shots (about 76% of the time), so a regular penalty taker's xG total is inflated. To compare players fairly regardless of how many penalties they take, use non-penalty xG (npxG):
# Calculate npxG
player_npxg <- shots %>%
group_by(player.name, team.name) %>%
summarise(
total_shots = n(),
penalties = sum(shot.type.name == "Penalty"),
non_pen_shots = sum(shot.type.name != "Penalty"),
goals = sum(shot.outcome.name == "Goal"),
pen_goals = sum(shot.type.name == "Penalty" & shot.outcome.name == "Goal"),
np_goals = goals - pen_goals,
xG = sum(shot.statsbomb_xg, na.rm = TRUE),
npxG = sum(shot.statsbomb_xg[shot.type.name != "Penalty"], na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
np_goals_minus_npxG = np_goals - npxG
) %>%
filter(non_pen_shots >= 5)
# Compare with and without penalties
print("Player xG vs npxG Comparison:")
print(player_npxg %>%
arrange(desc(xG)) %>%
select(player.name, goals, xG, np_goals, npxG, penalties) %>%
head(10))
Team-Level xG Analysis
xG is even more powerful at the team level, where individual variance averages out faster.
xG Difference (xGD)
The difference between xG created and xG conceded is highly predictive of future performance:
# Calculate team xG difference
library(dplyr)
# Get xG for and against per team
team_xg_analysis <- events %>%
filter(type.name == "Shot", period < 5) %>% # period 5 is the penalty shootout; exclude it
group_by(match_id, team.name) %>%
summarise(xG_for = sum(shot.statsbomb_xg, na.rm = TRUE),
goals_for = sum(shot.outcome.name == "Goal"),
.groups = "drop")
# Get opponent xG for each team in each match
match_xg <- team_xg_analysis %>%
group_by(match_id) %>%
mutate(
xG_against = sum(xG_for) - xG_for,
goals_against = sum(goals_for) - goals_for
) %>%
ungroup()
# Aggregate to team level
team_xgd <- match_xg %>%
group_by(team.name) %>%
summarise(
matches = n(),
xG_for = sum(xG_for),
xG_against = sum(xG_against),
goals_for = sum(goals_for),
goals_against = sum(goals_against)
) %>%
mutate(
xGD = xG_for - xG_against,
actual_GD = goals_for - goals_against,
xGD_per_match = round(xGD / matches, 2),
# Performance vs expectation
goals_vs_xG = goals_for - xG_for,
conceded_vs_xGA = goals_against - xG_against
) %>%
arrange(desc(xGD))
print("Team xG Difference Rankings:")
print(team_xgd %>%
select(team.name, matches, xG_for, xG_against, xGD,
actual_GD, xGD_per_match) %>%
head(10))
Expected Points (xPts)
We can simulate match outcomes to calculate expected points:
# Calculate expected points from match xG
calculate_xpts <- function(xg_for, xg_against, n_sims = 10000) {
# Simulate goals using Poisson distribution
goals_for <- rpois(n_sims, xg_for)
goals_against <- rpois(n_sims, xg_against)
# Calculate points: 3 for win, 1 for draw, 0 for loss
points <- ifelse(goals_for > goals_against, 3,
ifelse(goals_for == goals_against, 1, 0))
# Return expected points and win/draw/loss probabilities
return(list(
xPts = mean(points),
win_prob = mean(goals_for > goals_against),
draw_prob = mean(goals_for == goals_against),
loss_prob = mean(goals_for < goals_against)
))
}
# Example: A match where Team A had 2.1 xG and Team B had 0.8 xG
result <- calculate_xpts(2.1, 0.8)
cat(sprintf("Team A (2.1 xG vs 0.8 xG):\n"))
cat(sprintf(" Expected Points: %.2f\n", result$xPts))
cat(sprintf(" Win Probability: %.1f%%\n", result$win_prob * 100))
cat(sprintf(" Draw Probability: %.1f%%\n", result$draw_prob * 100))
cat(sprintf(" Loss Probability: %.1f%%\n", result$loss_prob * 100))
# Calculate xPts for all matches
match_xpts <- match_xg %>%
rowwise() %>%
mutate(
xPts = calculate_xpts(xG_for, xG_against)$xPts,
actual_pts = case_when(
goals_for > goals_against ~ 3,
goals_for == goals_against ~ 1,
TRUE ~ 0
)
) %>%
ungroup()
# Team xPts totals
team_xpts <- match_xpts %>%
group_by(team.name) %>%
summarise(
matches = n(),
xPts = sum(xPts),
actual_pts = sum(actual_pts),
pts_difference = actual_pts - xPts
) %>%
arrange(desc(xPts))
cat("\nTeam Expected Points:\n")
print(team_xpts)
Visualizing xG
Effective xG visualizations communicate chance quality at a glance.
# Create xG shot map with size encoding
library(ggplot2)
library(ggsoccer)
match_shots <- events %>%
filter(type.name == "Shot", match_id == matches$match_id[1])
# xG shot map
ggplot(match_shots) +
annotate_pitch(dimensions = pitch_statsbomb, # match StatsBomb's 120 x 80 coordinates
colour = "white", fill = "#1a472a") +
geom_point(aes(x = location.x, y = location.y,
size = shot.statsbomb_xg,
color = shot.outcome.name == "Goal"),
alpha = 0.8) +
scale_size_continuous(range = c(2, 12), name = "xG") +
scale_color_manual(values = c("FALSE" = "#CCCCCC", "TRUE" = "#FFD700"),
labels = c("No Goal", "Goal"), name = "Result") +
coord_flip(xlim = c(60, 120)) +
theme_pitch() +
facet_wrap(~team.name, ncol = 2) +
labs(title = "Match xG Shot Map",
subtitle = "Point size represents expected goals value") +
theme(legend.position = "bottom",
strip.text = element_text(size = 12, face = "bold"))
Chapter Summary
Key Takeaways
- xG measures chance quality - probability a shot results in a goal
- Key factors: Distance, angle, body part, assist type, game state
- Use pre-built xG - StatsBomb, Understat, FBref provide reliable data
- xG is probabilistic - variance is expected; don't over-interpret single matches
- Regression to the mean - over/underperformers usually revert
- Use npxG - for fair comparison across penalty-takers
- xGD is predictive - team xG difference predicts future performance
xG Quick Reference
| Shot Type | Typical xG Range | Example |
|---|---|---|
| Penalty | 0.76 | Standard penalty kick |
| Open goal (6-yard) | 0.70-0.95 | Tap-in from 3 yards |
| 1v1 with keeper | 0.30-0.50 | Through ball, clear on goal |
| Header from cross | 0.05-0.15 | 6-yard box header |
| Edge of box shot | 0.05-0.10 | 18-yard shot, central |
| Long-range shot | 0.02-0.05 | 25+ yards out |
xG Visualization Tutorials
Effective visualization is crucial for communicating xG insights. Here are the most important xG visualizations you should master.
xG Shot Map with Color Gradient
Shot maps with xG-colored markers show where teams create quality chances:
from statsbombpy import sb
import matplotlib.pyplot as plt
from mplsoccer import VerticalPitch
import numpy as np
# Load World Cup Final
events = sb.events(match_id=3869685)
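# Keep shots from regular and extra time; period 5 is the penalty shootout
shots = events[(events["type"] == "Shot") & (events["period"] < 5)].copy()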
# Extract coordinates
shots["x"] = shots["location"].apply(lambda loc: loc[0])
shots["y"] = shots["location"].apply(lambda loc: loc[1])
shots["is_goal"] = shots["shot_outcome"] == "Goal"
# Create figure with two half-pitches
fig, axes = plt.subplots(1, 2, figsize=(16, 10))
teams = ["Argentina", "France"]
for idx, team in enumerate(teams):
    pitch = VerticalPitch(
        pitch_type="statsbomb", half=True,
        pitch_color="#1a472a", line_color="white", linewidth=1
    )
    pitch.draw(ax=axes[idx])
    team_shots = shots[shots["team"] == team]
    # Create scatter with xG color gradient
    scatter = pitch.scatter(
        team_shots["x"], team_shots["y"],
        s=team_shots["shot_statsbomb_xg"] * 800 + 100,
        c=team_shots["shot_statsbomb_xg"],
        cmap="RdYlBu_r",
        edgecolors="white",
        linewidth=1.5,
        alpha=0.85,
        ax=axes[idx],
        vmin=0, vmax=0.8
    )
    # Mark goals with stars
    goals = team_shots[team_shots["is_goal"]]
    pitch.scatter(
        goals["x"], goals["y"],
        s=300, marker="*", c="gold",
        edgecolors="black", linewidth=1,
        ax=axes[idx], zorder=5
    )
    # Add xG total
    total_xg = team_shots["shot_statsbomb_xg"].sum()
    goals_scored = team_shots["is_goal"].sum()
    axes[idx].set_title(f"{team}\n{goals_scored} Goals | {total_xg:.2f} xG",
                        color="white", fontsize=14, fontweight="bold", pad=10)
# Add colorbar
cbar = fig.colorbar(scatter, ax=axes, orientation="horizontal",
fraction=0.05, pad=0.08, aspect=40)
cbar.set_label("xG Value", color="white", fontsize=12)
cbar.ax.xaxis.set_tick_params(color="white")
plt.setp(plt.getp(cbar.ax.axes, "xticklabels"), color="white")
fig.suptitle("xG Shot Map: World Cup 2022 Final", fontsize=18,
fontweight="bold", color="white", y=0.98)
fig.patch.set_facecolor("#1a472a")
plt.tight_layout(rect=[0, 0.05, 1, 0.95])
plt.savefig("xg_shot_map.png", dpi=150, bbox_inches="tight", facecolor="#1a472a")
plt.show()
library(StatsBombR)
library(tidyverse)
library(ggsoccer)
# Load World Cup Final
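events <- get.matchFree(data.frame(match_id = 3869685)) %>%
allclean() # allclean() adds the location.x / location.y columns used below
# Keep shots from regular and extra time; period 5 is the penalty shootout
shots <- events %>% filter(type.name == "Shot", period < 5)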
# Create xG shot map with color gradient
ggplot(shots) +
annotate_pitch(dimensions = pitch_statsbomb, # match StatsBomb's 120 x 80 coordinates
colour = "white", fill = "#1a472a") +
geom_point(
aes(x = location.x, y = location.y,
size = shot.statsbomb_xg,
fill = shot.statsbomb_xg,
shape = ifelse(shot.outcome.name == "Goal", "Goal", "No Goal")),
color = "white", stroke = 1.2, alpha = 0.85
) +
scale_fill_gradient2(
low = "#2196F3", mid = "#FFC107", high = "#F44336",
midpoint = 0.3, limits = c(0, 1),
name = "xG Value"
) +
scale_size_continuous(range = c(3, 15), name = "xG Value") +
scale_shape_manual(values = c("Goal" = 23, "No Goal" = 21), name = "Outcome") +
coord_flip(xlim = c(60, 122), ylim = c(0, 80)) +
facet_wrap(~team.name, ncol = 2) +
theme_pitch() +
theme(
plot.background = element_rect(fill = "#1a472a"),
strip.text = element_text(color = "white", size = 14, face = "bold"),
legend.position = "bottom",
legend.text = element_text(color = "white"),
legend.title = element_text(color = "white"),
plot.title = element_text(color = "white", size = 16, face = "bold", hjust = 0.5)
) +
guides(size = "none") +
labs(
title = "xG Shot Map: World Cup 2022 Final",
subtitle = "Size and color indicate shot quality (xG)"
)
ggsave("xg_shot_map.png", width = 14, height = 8, dpi = 150)
xG vs Actual Goals Scatter Plot
This visualization reveals over/underperformers relative to their xG:
import matplotlib.pyplot as plt
import numpy as np
# Aggregate player data
player_xg = shots.groupby(["player", "team"]).agg(
shots_count=("shot_statsbomb_xg", "count"),
goals=("shot_outcome", lambda x: (x == "Goal").sum()),
xG=("shot_statsbomb_xg", "sum")
).reset_index()
player_xg = player_xg[player_xg["shots_count"] >= 3]
# Create scatter plot
fig, ax = plt.subplots(figsize=(10, 8))
# Identity line
ax.plot([0, 4], [0, 4], "k--", alpha=0.5, linewidth=2, label="Expected")
# Scatter by team
colors = {"Argentina": "#75AADB", "France": "#002654"}
for team in colors:
    team_data = player_xg[player_xg["team"] == team]
    ax.scatter(
        team_data["xG"], team_data["goals"],
        s=team_data["shots_count"] * 30 + 50,
        c=colors[team], alpha=0.7, edgecolors="white",
        linewidth=1.5, label=team
    )
# Add labels for key players
for _, row in player_xg[player_xg["goals"] >= 2].iterrows():
    last_name = row["player"].split()[-1]
    ax.annotate(last_name, (row["xG"], row["goals"]),
                xytext=(5, 5), textcoords="offset points",
                fontsize=10, fontweight="bold")
# Add region labels
ax.text(0.5, 2.8, "Overperforming", color="#4CAF50",
fontsize=11, fontstyle="italic")
ax.text(2.5, 0.5, "Underperforming", color="#F44336",
fontsize=11, fontstyle="italic")
ax.set_xlabel("Expected Goals (xG)", fontsize=12)
ax.set_ylabel("Actual Goals", fontsize=12)
ax.set_title("Goals vs xG: World Cup 2022 Final\n" +
"Points above dashed line = finishing above expectation",
fontsize=14, fontweight="bold")
ax.legend(loc="upper left")
ax.set_xlim(0, 3.5)
ax.set_ylim(0, 3.5)
ax.set_aspect("equal")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("xg_vs_goals.png", dpi=150, bbox_inches="tight")
plt.show()
library(tidyverse)
# Create player xG vs Goals scatter plot
player_xg_data <- shots %>%
group_by(player.name, team.name) %>%
summarise(
shots = n(),
goals = sum(shot.outcome.name == "Goal"),
xG = sum(shot.statsbomb_xg, na.rm = TRUE),
.groups = "drop"
) %>%
filter(shots >= 3) # Minimum shots filter
# Create scatter plot
ggplot(player_xg_data, aes(x = xG, y = goals)) +
# Identity line (expected performance)
geom_abline(intercept = 0, slope = 1, color = "gray50",
linetype = "dashed", linewidth = 1) +
# Points
geom_point(aes(size = shots, color = team.name),
alpha = 0.7) +
# Labels for top performers
geom_text(
data = filter(player_xg_data, goals >= 2 | xG >= 1),
aes(label = str_extract(player.name, "\\w+$")), # Last name
vjust = -0.8, size = 3.5, fontface = "bold"
) +
# Styling
scale_color_manual(values = c("Argentina" = "#75AADB", "France" = "#002654")) +
scale_size_continuous(range = c(3, 12)) +
annotate("text", x = 0.5, y = 2.8, label = "Overperforming",
color = "#4CAF50", fontface = "italic", size = 4) +
annotate("text", x = 2.5, y = 0.5, label = "Underperforming",
color = "#F44336", fontface = "italic", size = 4) +
labs(
title = "Goals vs xG: World Cup 2022 Final",
subtitle = "Points above the dashed line = finishing above expectation",
x = "Expected Goals (xG)",
y = "Actual Goals",
color = "Team",
size = "Shots"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "right"
) +
coord_equal(xlim = c(0, 3.5), ylim = c(0, 3.5))
ggsave("xg_vs_goals.png", width = 10, height = 8, dpi = 150)
Cumulative xG Over Time
Track how xG accumulates throughout a match or season:
# Simplified version for single match timeline with game phases
import matplotlib.pyplot as plt
import numpy as np
# Calculate cumulative xG for both teams
fig, ax = plt.subplots(figsize=(14, 7))
for team, color in [("Argentina", "#75AADB"), ("France", "#002654")]:
    team_shots = shots[shots["team"] == team].sort_values(["minute", "second"])
    team_shots["cumulative_xG"] = team_shots["shot_statsbomb_xg"].cumsum()
    # Add starting point
    minutes = [0] + team_shots["minute"].tolist()
    cum_xg = [0] + team_shots["cumulative_xG"].tolist()
    ax.step(minutes, cum_xg, where="post", linewidth=2.5,
            color=color, label=f"{team} ({cum_xg[-1]:.2f} xG)", alpha=0.9)
    # Mark goals
    goals = team_shots[team_shots["is_goal"]]
    for _, goal in goals.iterrows():
        ax.scatter(goal["minute"], goal["cumulative_xG"],
                   marker="*", s=400, c=color, edgecolors="gold",
                   linewidth=2, zorder=5)
# Add match phase indicators
ax.axvline(x=45, color="gray", linestyle="--", alpha=0.5, linewidth=1.5)
ax.axvline(x=90, color="gray", linestyle="--", alpha=0.5, linewidth=1.5)
ax.axvline(x=105, color="gray", linestyle=":", alpha=0.5, linewidth=1.5)
ax.text(22.5, ax.get_ylim()[1]*0.95, "1st Half", ha="center",
fontsize=10, color="gray")
ax.text(67.5, ax.get_ylim()[1]*0.95, "2nd Half", ha="center",
fontsize=10, color="gray")
ax.text(112.5, ax.get_ylim()[1]*0.95, "ET", ha="center",
fontsize=10, color="gray")
ax.set_xlabel("Minute", fontsize=12)
ax.set_ylabel("Cumulative xG", fontsize=12)
ax.set_title("Cumulative xG Timeline: World Cup 2022 Final\n" +
"Stars indicate goals scored",
fontsize=14, fontweight="bold")
ax.legend(loc="upper left", fontsize=11)
ax.set_xlim(0, 125)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("cumulative_xg_timeline.png", dpi=150, bbox_inches="tight")
plt.show()
# Cumulative xG over a season (example with multiple matches)
library(tidyverse)
# Load multiple World Cup matches
matches <- FreeMatches(FreeCompetitions() %>%
filter(competition_id == 43, season_id == 106))
# Get Argentina matches
arg_matches <- matches %>%
filter(home_team.home_team_name == "Argentina" |
away_team.away_team_name == "Argentina")
# Load all events
all_events <- free_allevents(MatchesDF = arg_matches)
# Calculate cumulative xG per match
arg_xg_progression <- all_events %>%
filter(type.name == "Shot") %>%
filter(team.name == "Argentina") %>%
arrange(match_id, minute, second) %>%
group_by(match_id) %>%
mutate(
cumulative_xG = cumsum(shot.statsbomb_xg),
shot_number = row_number()
) %>%
ungroup()
# Join with match info
arg_xg_progression <- arg_xg_progression %>%
left_join(
arg_matches %>%
select(match_id, home_team.home_team_name, away_team.away_team_name),
by = "match_id"
) %>%
mutate(
opponent = ifelse(home_team.home_team_name == "Argentina",
away_team.away_team_name,
home_team.home_team_name)
)
# Plot cumulative xG for each match
ggplot(arg_xg_progression, aes(x = minute, y = cumulative_xG, color = opponent)) +
geom_step(linewidth = 1.2, alpha = 0.8) +
geom_point(data = filter(arg_xg_progression, shot.outcome.name == "Goal"),
size = 4, shape = 18) +
scale_color_viridis_d(option = "plasma") +
labs(
title = "Argentina Cumulative xG by Match - World Cup 2022",
subtitle = "Each line represents one match; diamonds indicate goals",
x = "Minute",
y = "Cumulative xG",
color = "Opponent"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "right"
) +
scale_x_continuous(breaks = seq(0, 120, 15))
ggsave("cumulative_xg_season.png", width = 14, height = 8, dpi = 150)
Practice Exercises
Exercise 6.1: Calculate Team xG
Task: Load a different World Cup 2022 match and calculate the total xG for each team. Identify which team "deserved" to win based on xG.
# Exercise 6.1 Solution
from statsbombpy import sb
# Find Brazil vs Croatia match
matches = sb.matches(competition_id=43, season_id=106)
bra_cro = matches[
((matches["home_team"] == "Brazil") | (matches["away_team"] == "Brazil")) &
((matches["home_team"] == "Croatia") | (matches["away_team"] == "Croatia"))
].iloc[0]
events = sb.events(match_id=bra_cro["match_id"])
# This match went to penalties; period 5 is the shootout, so exclude it
shots = events[(events["type"] == "Shot") & (events["period"] < 5)]
# Calculate team xG
team_xg = shots.groupby("team").agg(
shots=("type", "count"),
goals=("shot_outcome", lambda x: (x == "Goal").sum()),
xG=("shot_statsbomb_xg", "sum"),
big_chances=("shot_statsbomb_xg", lambda x: (x > 0.3).sum())
).round(2)
print("Brazil vs Croatia xG Analysis:")
print(team_xg)
xg_winner = team_xg["xG"].idxmax()
print(f"\nBased on xG, {xg_winner} created better chances.")
# Exercise 6.1 Solution
library(StatsBombR)
library(tidyverse)
# Load Brazil vs Croatia quarter-final
matches <- FreeMatches(FreeCompetitions() %>%
filter(competition_id == 43, season_id == 106))
bra_cro <- matches %>%
filter((home_team.home_team_name == "Brazil" |
away_team.away_team_name == "Brazil") &
(home_team.home_team_name == "Croatia" |
away_team.away_team_name == "Croatia"))
events <- get.matchFree(bra_cro)
# Calculate team xG
team_xg <- events %>%
filter(type.name == "Shot", period < 5) %>% # this match went to penalties; period 5 is the shootout
group_by(team.name) %>%
summarise(
shots = n(),
goals = sum(shot.outcome.name == "Goal"),
xG = round(sum(shot.statsbomb_xg, na.rm = TRUE), 2),
big_chances = sum(shot.statsbomb_xg > 0.3)
)
print("Brazil vs Croatia xG Analysis:")
print(team_xg)
# Determine "deserved" winner
xg_winner <- team_xg %>% filter(xG == max(xG)) %>% pull(team.name)
cat(sprintf("\nBased on xG, %s created better chances.\n", xg_winner))
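Picking the side with the higher xG says who created more, not how likely that side was to win. A hedged way to quantify "deserved" is to turn the two xG totals into win, draw and loss probabilities under the same Poisson assumption used in the Expected Points section, computed analytically instead of by simulation. The function names and the `max_goals` cutoff are illustrative choices; the example inputs are the 2.1 and 0.8 xG figures from that section.

```python
import math

def poisson_pmf(k, lam):
    """P(exactly k goals) when goals follow a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def outcome_probs(xg_for, xg_against, max_goals=15):
    """Win/draw/loss probabilities, summing scorelines up to max_goals each."""
    win = draw = loss = 0.0
    for i in range(max_goals + 1):
        for j in range(max_goals + 1):
            p = poisson_pmf(i, xg_for) * poisson_pmf(j, xg_against)
            if i > j:
                win += p
            elif i == j:
                draw += p
            else:
                loss += p
    return win, draw, loss

def expected_points(xg_for, xg_against):
    """3 points for a win, 1 for a draw, weighted by their probabilities."""
    win, draw, _ = outcome_probs(xg_for, xg_against)
    return 3 * win + draw

win, draw, loss = outcome_probs(2.1, 0.8)
print(f"Win {win:.1%} | Draw {draw:.1%} | Loss {loss:.1%}")
print(f"Expected points: {expected_points(2.1, 0.8):.2f}")
```

Even a clear xG advantage leaves a meaningful chance of a draw or defeat, so the team with the higher xG is better described as the more likely winner than as the deserving one.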
Exercise 6.2: Find the Best Finisher
Task: Analyze all World Cup 2022 matches to find the player who most outperformed their xG (minimum 5 shots).
# Exercise 6.2 Solution
from statsbombpy import sb
import pandas as pd
# Load all World Cup matches
matches = sb.matches(competition_id=43, season_id=106)
all_shots = []
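for match_id in matches["match_id"]:
    events = sb.events(match_id=match_id)
    # Keep shots from regular and extra time; period 5 is the penalty shootout
    shots = events[(events["type"] == "Shot") & (events["period"] < 5)]
    all_shots.append(shots)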
shots_df = pd.concat(all_shots, ignore_index=True)
# Calculate player finishing
player_finishing = shots_df.groupby(["player", "team"]).agg(
shots=("type", "count"),
goals=("shot_outcome", lambda x: (x == "Goal").sum()),
xG=("shot_statsbomb_xg", "sum")
).reset_index()
player_finishing = player_finishing[player_finishing["shots"] >= 5].copy()
player_finishing["goals_minus_xG"] = player_finishing["goals"] - player_finishing["xG"]
player_finishing["conversion_rate"] = player_finishing["goals"] / player_finishing["shots"] * 100
player_finishing = player_finishing.sort_values("goals_minus_xG", ascending=False)
print("Top 10 Finishers (Goals - xG):")
print(player_finishing.head(10)[["player", "team", "shots", "goals", "xG", "goals_minus_xG"]])
best = player_finishing.iloc[0]
print(f"\nBest finisher: {best['player']} ({best['team']})")
print(f"Scored {best['goals']} goals from {best['xG']:.2f} xG (+{best['goals_minus_xG']:.2f})")
# Exercise 6.2 Solution
library(StatsBombR)
library(tidyverse)
# Load all World Cup matches
matches <- FreeMatches(FreeCompetitions() %>%
filter(competition_id == 43, season_id == 106))
all_events <- free_allevents(MatchesDF = matches)
# Calculate player finishing
player_finishing <- all_events %>%
filter(type.name == "Shot", period < 5) %>% # period 5 is the penalty shootout; exclude it
group_by(player.name, team.name) %>%
summarise(
shots = n(),
goals = sum(shot.outcome.name == "Goal"),
xG = sum(shot.statsbomb_xg, na.rm = TRUE),
.groups = "drop"
) %>%
filter(shots >= 5) %>%
mutate(
goals_minus_xG = goals - xG,
conversion_rate = goals / shots * 100
) %>%
arrange(desc(goals_minus_xG))
print("Top 10 Finishers (Goals - xG):")
print(head(player_finishing, 10))
# Best finisher
best <- player_finishing %>% slice(1)
cat(sprintf("\nBest finisher: %s (%s)\n", best$player.name, best$team.name))
cat(sprintf("Scored %d goals from %.2f xG (+%.2f)\n",
best$goals, best$xG, best$goals_minus_xG))
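To check the finishing calculation without downloading any data, the same aggregation can be run on a small hand-made table. This is a minimal sketch: the player labels, outcomes, and xG values are invented for illustration, and `finishing_table` is a hypothetical helper name, not part of statsbombpy. Only the column names `shot_outcome` and `shot_statsbomb_xg` follow the solutions above.

```python
import pandas as pd

# Hypothetical shots: players "A" and "B" and all xG values are made up
toy_shots = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "B"],
    "shot_outcome": ["Goal", "Saved", "Goal", "Off T", "Goal", "Blocked"],
    "shot_statsbomb_xg": [0.10, 0.25, 0.40, 0.50, 0.30, 0.20],
})


def finishing_table(shots: pd.DataFrame, min_shots: int = 1) -> pd.DataFrame:
    """Aggregate shots per player and rank by goals minus xG."""
    table = shots.groupby("player").agg(
        shots=("shot_outcome", "count"),
        goals=("shot_outcome", lambda s: (s == "Goal").sum()),
        xG=("shot_statsbomb_xg", "sum"),
    ).reset_index()
    # Apply the minimum-shots threshold before computing the difference
    table = table[table["shots"] >= min_shots].copy()
    table["goals_minus_xG"] = table["goals"] - table["xG"]
    return table.sort_values("goals_minus_xG", ascending=False).reset_index(drop=True)


toy_table = finishing_table(toy_shots)
print(toy_table)
```

Player A scores 2 goals from 0.75 xG and ranks first, while player B scores 1 goal from 1.00 xG and shows no over-performance. With samples this small the difference says little about skill, which is why the exercise sets a minimum shot count.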
Exercise 6.3: Create an xG Race Chart
Task: Create a visualization showing the running xG total for Argentina throughout the entire World Cup 2022 tournament.
# Exercise 6.3 Solution (Python) - Argentina xG Race Chart
from statsbombpy import sb
import matplotlib.pyplot as plt
import pandas as pd
# Load all Argentina matches in chronological order
matches = sb.matches(competition_id=43, season_id=106)
arg_matches = matches[
    (matches["home_team"] == "Argentina") |
    (matches["away_team"] == "Argentina")
].sort_values("match_date")
# Collect all Argentina shots across the tournament
all_shots = []
for _, match in arg_matches.iterrows():
    events = sb.events(match_id=match["match_id"])
    # Exclude period 5 (penalty shootouts) so shootout kicks are not counted as goals;
    # .copy() avoids a SettingWithCopyWarning when adding the date column
    shots = events[
        (events["type"] == "Shot")
        & (events["team"] == "Argentina")
        & (events["period"] < 5)
    ].copy()
    shots["match_date"] = match["match_date"]
    all_shots.append(shots)
shots_df = pd.concat(all_shots, ignore_index=True)
# Sort by period before minute: first-half stoppage time (minute 45+) would
# otherwise be interleaved with the start of the second half
shots_df = shots_df.sort_values(["match_date", "period", "minute", "second"])
shots_df["shot_number"] = range(1, len(shots_df) + 1)
shots_df["cumulative_xG"] = shots_df["shot_statsbomb_xg"].cumsum()
shots_df["cumulative_goals"] = (shots_df["shot_outcome"] == "Goal").cumsum()
# Find match boundaries: the last shot number of each match.
# groupby orders by match_id, not by date, so sort the values to get chronological order
match_boundaries = sorted(shots_df.groupby("match_id")["shot_number"].max())
# Create plot
fig, ax = plt.subplots(figsize=(14, 8))
ax.fill_between(shots_df["shot_number"], shots_df["cumulative_xG"],
                alpha=0.3, color="#75AADB")
ax.plot(shots_df["shot_number"], shots_df["cumulative_xG"],
        linewidth=2.5, color="#75AADB", label=f"xG ({shots_df['cumulative_xG'].iloc[-1]:.1f})")
ax.step(shots_df["shot_number"], shots_df["cumulative_goals"], where="post",
        linewidth=2.5, color="#FFD700", label=f"Goals ({shots_df['cumulative_goals'].iloc[-1]})")
# Add match boundaries (skip the final one, which is the end of the chart)
for boundary in match_boundaries[:-1]:
    ax.axvline(x=boundary, color="gray", linestyle="--", alpha=0.5)
ax.set_xlabel("Shot Number (Tournament Total)", fontsize=12)
ax.set_ylabel("Cumulative Value", fontsize=12)
ax.set_title("Argentina World Cup 2022 - xG Accumulation\n" +
             "Dashed lines indicate match boundaries",
             fontsize=14, fontweight="bold")
ax.legend(loc="upper left", fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("argentina_xg_race.png", dpi=150, bbox_inches="tight")
plt.show()
# Exercise 6.3 Solution (R) - Argentina xG Race Chart
library(StatsBombR)
library(tidyverse)
# Load all Argentina World Cup matches
matches <- FreeMatches(FreeCompetitions() %>%
  filter(competition_id == 43, season_id == 106)) %>%
  filter(home_team.home_team_name == "Argentina" |
           away_team.away_team_name == "Argentina") %>%
  arrange(match_date)
all_events <- free_allevents(MatchesDF = matches)
# Create tournament progression
# match_id is not guaranteed to be chronological, so look up each match's date
# and sort by date, then period, minute, and second. Period 5 (shootouts) is excluded.
arg_progression <- all_events %>%
  filter(type.name == "Shot", team.name == "Argentina", period < 5) %>%
  mutate(match_date = matches$match_date[match(match_id, matches$match_id)]) %>%
  arrange(match_date, period, minute, second) %>%
  mutate(
    cumulative_xG = cumsum(shot.statsbomb_xg),
    cumulative_goals = cumsum(shot.outcome.name == "Goal"),
    shot_number = row_number()
  )
# Add match labels
match_order <- arg_progression %>%
  group_by(match_id) %>%
  summarise(last_shot = max(shot_number)) %>%
  arrange(last_shot) %>%
  mutate(match_num = row_number())
arg_progression <- arg_progression %>%
  left_join(match_order, by = "match_id")
# Create race chart
ggplot(arg_progression, aes(x = shot_number)) +
  geom_area(aes(y = cumulative_xG), fill = "#75AADB", alpha = 0.4) +
  geom_line(aes(y = cumulative_xG, color = "xG"), linewidth = 1.5) +
  geom_step(aes(y = cumulative_goals, color = "Goals"), linewidth = 1.5) +
  # Skip the final boundary, which is the end of the chart
  geom_vline(data = match_order %>% filter(last_shot < max(last_shot)),
             aes(xintercept = last_shot),
             linetype = "dashed", alpha = 0.5) +
  scale_color_manual(values = c("xG" = "#75AADB", "Goals" = "#FFD700")) +
  labs(
    title = "Argentina World Cup 2022 - xG Accumulation",
    subtitle = "Dashed lines indicate match boundaries",
    x = "Shot Number (Tournament Total)",
    y = "Cumulative Value",
    color = ""
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )
ggsave("argentina_xg_race.png", width = 14, height = 8, dpi = 150)
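The race chart depends on two ordering details that are easy to get wrong: shots must be sorted chronologically across matches and periods, and match boundaries must follow that same order. The sketch below illustrates both on a hand-made table. All values are invented, the match IDs are deliberately not chronological, and the column names `xg` and `is_goal` are simplified stand-ins rather than StatsBomb field names.

```python
import pandas as pd

# Hypothetical shots in arbitrary row order. Match 30 is played first,
# then match 10, then match 20, so match_id order differs from date order.
toy = pd.DataFrame({
    "match_id":   [10, 30, 10, 20, 30, 10],
    "match_date": ["2022-01-05", "2022-01-01", "2022-01-05",
                   "2022-01-09", "2022-01-01", "2022-01-05"],
    "period":     [2, 1, 1, 2, 2, 2],
    "minute":     [45, 12, 46, 70, 60, 80],
    "second":     [30, 0, 10, 5, 0, 0],
    "xg":         [0.20, 0.10, 0.05, 0.15, 0.30, 0.60],
    "is_goal":    [False, False, False, True, True, True],
})

# Sorting by period before minute keeps a first-half stoppage-time shot
# (period 1, minute 46) ahead of an early second-half shot (period 2, minute 45).
toy = toy.sort_values(["match_date", "period", "minute", "second"]).reset_index(drop=True)
toy["shot_number"] = range(1, len(toy) + 1)
toy["cumulative_xG"] = toy["xg"].cumsum()
toy["cumulative_goals"] = toy["is_goal"].cumsum()

# groupby returns groups ordered by match_id (10, 20, 30), which is not
# chronological here; sorting the last-shot numbers restores match order.
boundaries = [int(b) for b in sorted(toy.groupby("match_id")["shot_number"].max())]

print(toy[["match_id", "period", "minute", "shot_number", "cumulative_xG", "cumulative_goals"]])
print("Match boundaries:", boundaries)
```

Without the final `sorted`, the boundaries would come back in match-ID order, and dropping "the last one" before plotting would remove the wrong line.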