Chapter 60

Capstone - Complete Analytics System


The Metric That Changed Football

Expected Goals (xG) is the single most important innovation in modern football analytics. It answers a simple question: "How likely was that shot to result in a goal?" By assigning probabilities to every shot, xG reveals the true quality of chances created and finished.

Before xG, we judged strikers by goals scored. But goals are noisy: a player might score 20 goals from 15 xG worth of chances (overperforming, often with some luck) or 10 goals from 15 xG (underperforming). xG helps separate repeatable chance quality from finishing variance, showing who gets quality chances and who converts them efficiently.

Why xG Matters
  • Predictive Power: xG predicts future goals better than past goals do
  • Process vs. Outcome: Evaluate decision-making independent of finishing luck
  • Fair Comparison: Compare players/teams controlling for chance quality
  • Tactical Insight: Understand how teams create and concede chances
  • Transfer Decisions: Identify undervalued players and avoid overpaying for luck

A Brief History of xG

2004: Pollard, Ensum, and Taylor publish an early academic paper on estimating shot probability
2012: Opta develops an internal xG model for clients; analyst Sam Green describes the approach publicly
2014: Michael Caley publishes influential xG work; the term gains traction
2017: StatsBomb releases free xG data; mainstream adoption begins
2020+: xG becomes a standard broadcast metric, shown live during Premier League coverage

How xG Works

At its core, xG is a machine learning model trained on historical shots. Given features about a shot (location, body part, assist type, etc.), the model predicts the probability of scoring.

The Basic Concept

xG Definition: The probability that an average player would score from a given shot situation, based on historical conversion rates of similar shots.

If a shot has xG = 0.35, it means that historically, 35% of shots from similar positions with similar characteristics have resulted in goals. An "average" shooter would score this chance about once every three attempts.
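To make the idea concrete, here is a minimal sketch of how a logistic model could turn shot features into a probability. The coefficients are invented purely for illustration (they are not fitted to any data and are not any provider's model), and the function name `toy_xg` is hypothetical.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not fitted values).
INTERCEPT = -1.0
COEF_DISTANCE = -0.12   # per yard: farther from goal lowers the log-odds
COEF_ANGLE = 0.03       # per degree: a wider view of goal raises the log-odds
COEF_HEADER = -0.8      # headers are assumed to convert less often than shots with the foot


def toy_xg(distance_yards: float, angle_degrees: float, is_header: bool = False) -> float:
    """Return a scoring probability from a toy logistic model."""
    log_odds = (
        INTERCEPT
        + COEF_DISTANCE * distance_yards
        + COEF_ANGLE * angle_degrees
        + (COEF_HEADER if is_header else 0.0)
    )
    # The logistic (sigmoid) function maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-log_odds))


close_shot = toy_xg(distance_yards=8, angle_degrees=50)
long_shot = toy_xg(distance_yards=28, angle_degrees=16)
close_header = toy_xg(distance_yards=8, angle_degrees=50, is_header=True)

print(f"Close-range shot: {close_shot:.2f}")
print(f"Long-range shot:  {long_shot:.2f}")
print(f"Close header:     {close_header:.2f}")
```

A real model would learn its coefficients from many thousands of historical shots and would usually include more features, but the mapping from features to a probability has this general shape.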

Key Features in xG Models

Feature | Description | Impact on xG
Distance to Goal | Euclidean distance from shot location to center of goal | Closer = higher xG (strongest predictor)
Angle to Goal | Angle subtended by the goal posts at the shot location | Wider angle = higher xG
Body Part | Foot, head, or other | Foot > head, typically
Shot Type | Open play, set piece, penalty, etc. | Penalties ≈ 0.76 xG
Assist Type | Through ball, cross, cutback, etc. | Through balls/cutbacks higher
Game State | Score differential at time of shot | Trailing teams shoot from worse positions
Defender Positions | Number/location of defenders (advanced models) | Fewer defenders = higher xG
Goalkeeper Position | Distance from goal line (advanced models) | GK off line = higher xG
# Understanding xG features in StatsBomb data
from statsbombpy import sb
import pandas as pd
import numpy as np

# Load sample match
events = sb.events(match_id=3869685)  # World Cup Final

# Filter shots and extract features
shots = events[events["type"] == "Shot"].copy()

# Extract coordinates
shots["x"] = shots["location"].apply(lambda l: l[0] if isinstance(l, (list, tuple)) else None)
shots["y"] = shots["location"].apply(lambda l: l[1] if isinstance(l, (list, tuple)) else None)

# Calculate distance to goal center (120, 40)
dx = 120 - shots["x"]
dy = 40 - shots["y"]
shots["distance"] = np.sqrt(dx**2 + dy**2)

# Calculate the angle subtended by the goal mouth
# (goal is 8 yards wide, so the posts sit at y = 36 and y = 44)
shots["angle"] = np.degrees(np.arctan2(8 * dx, dx**2 + dy**2 - 16))

# View relationship
print("Shot features and xG:")
print(shots[["player", "x", "y", "distance", "angle",
             "shot_statsbomb_xg", "shot_body_part", "shot_outcome"]].head(10))

# Average xG by body part
print("\nxG by Body Part:")
print(shots.groupby("shot_body_part").agg(
    shots=("type", "count"),
    avg_xG=("shot_statsbomb_xg", "mean"),
    goals=("shot_outcome", lambda x: (x == "Goal").sum())
).round(3))

# Understanding xG features in StatsBomb data
library(StatsBombR)
library(dplyr)

# Load sample match
matches <- FreeMatches(FreeCompetitions() %>%
                        filter(competition_id == 43, season_id == 106))
events <- get.matchFree(matches[1, ]) %>%
  allclean()  # allclean() adds the location.x / location.y columns used below

# Examine shot features
shots <- events %>%
  filter(type.name == "Shot") %>%
  select(player.name, location.x, location.y,
         shot.statsbomb_xg, shot.body_part.name,
         shot.type.name, shot.technique.name,
         shot.outcome.name) %>%
  mutate(
    # Calculate distance to goal center (120, 40)
    distance = sqrt((120 - location.x)^2 + (40 - location.y)^2),
    # Calculate the angle subtended by the goal mouth (goal is 8 yards wide)
    angle = atan2(8 * (120 - location.x),
                  (120 - location.x)^2 + (40 - location.y)^2 - 16) * 180 / pi
  )

# View relationship between features and xG
print(shots %>%
        select(player.name, distance, angle, shot.statsbomb_xg,
               shot.body_part.name, shot.outcome.name) %>%
        arrange(desc(shot.statsbomb_xg)))

# Average xG by body part
print(shots %>%
        group_by(shot.body_part.name) %>%
        summarise(
          shots = n(),
          avg_xG = mean(shot.statsbomb_xg, na.rm = TRUE),
          goals = sum(shot.outcome.name == "Goal")
        ))
Exploring xG features in shot data
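The two geometric features can be checked by hand. The sketch below assumes StatsBomb pitch coordinates (goal center at (120, 40), posts at y = 36 and y = 44) and uses hypothetical helper names `shot_distance` and `shot_angle`. The angle is the one between the two lines from the shot location to each post, computed from their cross and dot products.

```python
import math

# StatsBomb pitch coordinates (yards): goal center at (120, 40), goal 8 yards wide.
GOAL_X, GOAL_Y, GOAL_WIDTH = 120.0, 40.0, 8.0


def shot_distance(x: float, y: float) -> float:
    """Straight-line distance from the shot location to the center of the goal."""
    return math.hypot(GOAL_X - x, GOAL_Y - y)


def shot_angle(x: float, y: float) -> float:
    """Angle in degrees subtended by the two posts at the shot location."""
    dx = GOAL_X - x
    dy = GOAL_Y - y
    half_width = GOAL_WIDTH / 2
    # Vectors to the posts are (dx, dy - half_width) and (dx, dy + half_width):
    # their cross product is GOAL_WIDTH * dx, their dot product is dx^2 + dy^2 - half_width^2.
    angle = math.atan2(GOAL_WIDTH * dx, dx**2 + dy**2 - half_width**2)
    return math.degrees(angle)


# Penalty spot: 12 yards out, dead center.
print(f"Penalty spot: {shot_distance(108, 40):.1f} yd, {shot_angle(108, 40):.1f} deg")
# Similar distance, but from a wide position.
print(f"Wide shot:    {shot_distance(112, 31):.1f} yd, {shot_angle(112, 31):.1f} deg")
```

The two shots are almost the same distance from goal, yet the wide one sees a noticeably narrower goal mouth, which is why models use angle alongside distance.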

The xG Probability Distribution

Most shots have low xG values. The distribution is heavily right-skewed:

# Visualize xG distribution
import matplotlib.pyplot as plt
import pandas as pd
from statsbombpy import sb

# Load multiple matches
matches = sb.matches(competition_id=43, season_id=106)
all_shots = []
for mid in matches["match_id"][:10]:
    ev = sb.events(match_id=mid)
    all_shots.append(ev[ev["type"] == "Shot"])
shots = pd.concat(all_shots, ignore_index=True)

# Create histogram
fig, ax = plt.subplots(figsize=(12, 6))
ax.hist(shots["shot_statsbomb_xg"].dropna(), bins=50,
        color="#1B5E20", edgecolor="white", alpha=0.8)
mean_xg = shots["shot_statsbomb_xg"].mean()
ax.axvline(mean_xg, color="red", linestyle="--", linewidth=2,
           label=f"Mean xG: {mean_xg:.3f}")
ax.set_xlabel("xG Value", fontsize=12)
ax.set_ylabel("Number of Shots", fontsize=12)
ax.set_title("Distribution of Expected Goals\n"
             "Most shots have low xG; high-quality chances are rare", fontsize=14)
ax.legend()
plt.show()


# xG categories
def categorize_xg(xg):
    if xg >= 0.5:
        return "Big Chance (0.5+)"
    elif xg >= 0.2:
        return "Good Chance (0.2-0.5)"
    elif xg >= 0.1:
        return "Reasonable (0.1-0.2)"
    elif xg >= 0.05:
        return "Low Quality (0.05-0.1)"
    else:
        return "Very Low (<0.05)"


shots["xg_category"] = shots["shot_statsbomb_xg"].apply(categorize_xg)

print("\nxG Categories:")
print(shots.groupby("xg_category").agg(
    shots=("type", "count"),
    conversion=("shot_outcome", lambda x: (x == "Goal").mean() * 100)
).round(1))

# Visualize xG distribution
library(ggplot2)

# Load more matches for better distribution
events <- free_allevents(MatchesDF = matches[1:10, ])
shots <- events %>% filter(type.name == "Shot")

# xG histogram
ggplot(shots, aes(x = shot.statsbomb_xg)) +
  geom_histogram(bins = 50, fill = "#1B5E20", color = "white", alpha = 0.8) +
  geom_vline(xintercept = mean(shots$shot.statsbomb_xg, na.rm = TRUE),
             color = "red", linetype = "dashed", size = 1) +
  annotate("text", x = 0.2, y = 150,
           label = paste("Mean xG:", round(mean(shots$shot.statsbomb_xg, na.rm = TRUE), 3)),
           color = "red") +
  labs(title = "Distribution of Expected Goals",
       subtitle = "Most shots have low xG; high-quality chances are rare",
       x = "xG Value", y = "Number of Shots") +
  theme_minimal() +
  scale_x_continuous(breaks = seq(0, 1, 0.1))

# xG categories
shots %>%
  mutate(xg_category = case_when(
    shot.statsbomb_xg >= 0.5 ~ "Big Chance (0.5+)",
    shot.statsbomb_xg >= 0.2 ~ "Good Chance (0.2-0.5)",
    shot.statsbomb_xg >= 0.1 ~ "Reasonable (0.1-0.2)",
    shot.statsbomb_xg >= 0.05 ~ "Low Quality (0.05-0.1)",
    TRUE ~ "Very Low (<0.05)"
  )) %>%
  group_by(xg_category) %>%
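  summarise(
    n_shots = n(),
    avg_conversion = mean(shot.outcome.name == "Goal") * 100
  ) %>%
  # Share of all shots per category (computed after summarise so the
  # new count column does not shadow the `shots` data frame)
  mutate(pct = n_shots / sum(n_shots) * 100) %>%
  arrange(desc(pct))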
Analyzing the xG distribution

Using Pre-Built xG Data

Most analysts use xG values provided by data companies rather than building their own models. Here's how to work with xG data from major sources.

StatsBomb xG

StatsBomb provides the most detailed free xG data, including freeze-frame information about player and goalkeeper positions:

# Working with StatsBomb xG
from statsbombpy import sb
import pandas as pd

# Load all World Cup 2022 matches
matches = sb.matches(competition_id=43, season_id=106)

# Collect all shots
all_shots = []
for mid in matches["match_id"]:
    events = sb.events(match_id=mid)
    # Period 5 is the penalty shootout; exclude it so shootout kicks
    # do not inflate xG and goal totals
    is_shot = (events["type"] == "Shot") & (events["period"] < 5)
    shots = events.loc[is_shot, [
        "match_id", "team", "player", "minute", "shot_statsbomb_xg",
        "shot_outcome", "shot_type", "shot_body_part"
    ]]
    all_shots.append(shots)
shots_df = pd.concat(all_shots, ignore_index=True)

# Team xG totals
team_xg = shots_df.groupby("team").agg(
    matches=("match_id", "nunique"),
    shots=("shot_statsbomb_xg", "count"),
    goals=("shot_outcome", lambda x: (x == "Goal").sum()),
    xG=("shot_statsbomb_xg", "sum"),
    # Non-penalty xG: look up each shot's type by index and drop penalties
    npxG=("shot_statsbomb_xg",
          lambda x: x[shots_df.loc[x.index, "shot_type"] != "Penalty"].sum())
).reset_index()
team_xg["xG_per_match"] = (team_xg["xG"] / team_xg["matches"]).round(2)
team_xg["goals_minus_xG"] = team_xg["goals"] - team_xg["xG"]

print("World Cup 2022 Team xG:")
print(team_xg.sort_values("xG", ascending=False).head(10))

# Player xG leaders
player_xg = shots_df.groupby(["player", "team"]).agg(
    shots=("shot_statsbomb_xg", "count"),
    goals=("shot_outcome", lambda x: (x == "Goal").sum()),
    xG=("shot_statsbomb_xg", "sum")
).reset_index()
player_xg["goals_minus_xG"] = player_xg["goals"] - player_xg["xG"]

print("\nTop 10 Players by xG:")
print(player_xg.sort_values("xG", ascending=False).head(10))

# Working with StatsBomb xG
library(StatsBombR)
library(dplyr)

# Load competition data
comps <- FreeCompetitions() %>%
  filter(competition_id == 43, season_id == 106)  # World Cup 2022
matches <- FreeMatches(comps)
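events <- free_allevents(MatchesDF = matches) %>%
  allclean()  # adds the location.x / location.y columns

# Extract all shots with xG (period 5 is the penalty shootout, so exclude it)
shots <- events %>%
  filter(type.name == "Shot", period < 5) %>%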
  select(match_id, team.name, player.name, minute,
         location.x, location.y,
         shot.statsbomb_xg, shot.outcome.name,
         shot.body_part.name, shot.type.name,
         shot.first_time, shot.one_on_one)

# Team xG totals
team_xg <- shots %>%
  group_by(team.name) %>%
  summarise(
    matches = n_distinct(match_id),
    shots = n(),
    goals = sum(shot.outcome.name == "Goal"),
    xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    npxG = sum(shot.statsbomb_xg[shot.type.name != "Penalty"], na.rm = TRUE)
  ) %>%
  mutate(
    xG_per_match = round(xG / matches, 2),
    goals_minus_xG = goals - xG,
    conversion = round(goals / shots * 100, 1)
  ) %>%
  arrange(desc(xG))

print("World Cup 2022 Team xG:")
print(head(team_xg, 10))

# Player xG leaders
player_xg <- shots %>%
  group_by(player.name, team.name) %>%
  summarise(
    shots = n(),
    goals = sum(shot.outcome.name == "Goal"),
    xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(goals_minus_xG = goals - xG) %>%
  arrange(desc(xG))

print("\nTop 10 Players by xG:")
print(head(player_xg, 10))
Working with StatsBomb xG data

Understat xG

Understat provides free xG data for the top five European leagues plus the Russian Premier League:

# Working with Understat xG
import asyncio

import aiohttp
import pandas as pd
from understat import Understat


async def get_understat_data():
    # The Understat client is constructed around an aiohttp session
    async with aiohttp.ClientSession() as session:
        understat = Understat(session)
        # Get league table with xG (returned as rows, with the header row first)
        teams = await understat.get_league_table("EPL", 2023)
        # Get top players
        players = await understat.get_league_players("EPL", 2023)
    return teams, players


# Run async
teams, players = asyncio.run(get_understat_data())

# Team xG analysis: the first row of the table holds the column names
# (expected to include "Team", "M", "G", "xG", "xGA"; check your version's output)
team_df = pd.DataFrame(teams[1:], columns=teams[0])
team_df["xG"] = team_df["xG"].astype(float)
team_df["xGA"] = team_df["xGA"].astype(float)
team_df["G"] = team_df["G"].astype(int)
team_df["goals_minus_xG"] = team_df["G"] - team_df["xG"]

print("EPL 2023-24 Team xG:")
print(team_df[["Team", "M", "G", "xG", "goals_minus_xG"]].head(10))

# Player xG analysis
player_df = pd.DataFrame(players)
player_df["xG"] = player_df["xG"].astype(float)
player_df["goals"] = player_df["goals"].astype(int)
player_df["goals_minus_xG"] = player_df["goals"] - player_df["xG"]

print("\nTop 15 Players by xG:")
print(player_df.nlargest(15, "xG")[
    ["player_name", "team_title", "games", "goals", "xG", "goals_minus_xG"]])

# Working with Understat xG via understatr
library(understatr)
library(dplyr)

# Get team season data
epl_teams <- get_league_teams_stats(league_name = "EPL", year = 2023)

# View team xG data
team_xg <- epl_teams %>%
  select(team_name, matches, scored, missed, xG, xGA, xpts, pts) %>%
  mutate(
    goals_minus_xG = scored - xG,
    conceded_minus_xGA = missed - xGA,
    pts_minus_xpts = pts - xpts
  ) %>%
  arrange(desc(xG))

print("EPL 2023-24 Team xG:")
print(team_xg)

# Get player data
player_data <- get_league_players_stats(league_name = "EPL", year = 2023)

# Top scorers by xG
top_xg <- player_data %>%
  select(player_name, team_name, games, goals, xG, shots) %>%
  mutate(
    goals_minus_xG = goals - xG,
    xG_per_shot = xG / shots
  ) %>%
  arrange(desc(xG)) %>%
  head(15)

print("\nTop 15 Players by xG:")
print(top_xg)
Working with Understat xG data

FBref xG (via Opta)

FBref publishes xG data with convenient aggregations. Its advanced data has been supplied by Opta since 2022; before that it came from StatsBomb:

# Scraping FBref data would use worldfootballR in R
# Python alternative using soccerdata or manual scraping
# For demonstration, showing the structure:
import pandas as pd

# FBref provides xG data in these formats:
# - Player shooting stats: Goals, xG, npxG, xG/90
# - Team stats: xG For, xG Against, xG Difference
# - Match stats: Team xG for each match

# Example of what FBref data looks like:
fbref_structure = pd.DataFrame({
    "Player": ["Erling Haaland", "Mohamed Salah", "Son Heung-Min"],
    "Squad": ["Manchester City", "Liverpool", "Tottenham"],
    "Min": [2700, 2850, 2600],
    "Gls": [36, 18, 17],
    "xG": [32.5, 17.8, 12.1],
    "npxG": [26.2, 15.3, 11.5],
    "Sh": [120, 95, 88],
    "SoT": [65, 42, 38]
})
fbref_structure["goals_minus_xG"] = fbref_structure["Gls"] - fbref_structure["xG"]
fbref_structure["xG_per_90"] = (
    fbref_structure["xG"] / (fbref_structure["Min"] / 90)
).round(2)

print("FBref-style xG Data:")
print(fbref_structure)
print("\nNote: Use worldfootballR in R or soccerdata library in Python")
print("for actual FBref scraping.")

# Scraping FBref xG data with worldfootballR
library(worldfootballR)
library(dplyr)

# Get league-wide player stats
epl_stats <- fb_big5_advanced_season_stats(
  season_end_year = 2024,
  stat_type = "shooting",
  team_or_player = "player"
)

# Filter for EPL and analyze xG
epl_players <- epl_stats %>%
  filter(Comp == "Premier League") %>%
  select(Player, Squad, Min_Playing, Gls_Standard, xG_Expected,
         npxG_Expected, Sh_Standard, SoT_Standard) %>%
  mutate(
    nineties = Min_Playing / 90,
    goals_minus_xG = Gls_Standard - xG_Expected,
    xG_per_90 = xG_Expected / nineties,
    shots_per_90 = Sh_Standard / nineties
  ) %>%
  filter(Min_Playing >= 900) %>%  # Minimum 10 matches
  arrange(desc(xG_Expected))

print("EPL Top Scorers by xG (min 900 mins):")
print(head(epl_players, 15))

# Biggest overperformers
print("\nBiggest Overperformers (Goals - xG):")
print(epl_players %>%
        arrange(desc(goals_minus_xG)) %>%
        select(Player, Squad, Gls_Standard, xG_Expected, goals_minus_xG) %>%
        head(10))
Working with FBref xG data

Interpreting xG Correctly

xG is powerful but often misunderstood. Here's how to use it properly.

xG Is Probabilistic, Not Deterministic

Common Misunderstanding

"Team A had 2.5 xG so they deserved to win" - Wrong!

xG tells us the probability distribution of outcomes, not what "should" happen. A team with 2.5 xG might score 0, 1, 2, 3, 4, or more goals on any given day.

# Simulate goal outcomes from xG
import numpy as np
from collections import Counter


def simulate_goals(xg_values, n_simulations=10000):
    """Simulate goal outcomes from shot xG values."""
    results = []
    for _ in range(n_simulations):
        # Each shot: score if random < xG
        goals = sum(np.random.random() < xg for xg in xg_values)
        results.append(goals)
    return results


# Example: Team had shots with these xG values
shot_xgs = [0.75, 0.35, 0.12, 0.08, 0.05, 0.03, 0.02, 0.02]
total_xg = sum(shot_xgs)  # 1.42 xG

# Simulate 10,000 times
simulated_goals = simulate_goals(shot_xgs)

# Goal distribution
goal_counts = Counter(simulated_goals)
total = len(simulated_goals)

print(f"Total xG: {total_xg:.2f}")
print("\nGoal distribution from 10,000 simulations:")
for goals in sorted(goal_counts.keys()):
    pct = goal_counts[goals] / total * 100
    print(f"  {goals} goals: {pct:.1f}%")

most_likely = max(goal_counts, key=goal_counts.get)
print(f"\nMost likely outcome: {most_likely} goals "
      f"({goal_counts[most_likely] / total * 100:.1f}%)")
print(f"Probability of 0 goals: {goal_counts.get(0, 0) / total * 100:.1f}%")

# Simulate goal outcomes from xG
library(dplyr)

simulate_goals <- function(xg_values, n_simulations = 10000) {
  # Each shot is a Bernoulli trial with p = xG
  goals_per_sim <- sapply(1:n_simulations, function(i) {
    sum(runif(length(xg_values)) < xg_values)
  })

  return(goals_per_sim)
}

# Example: Team had shots with these xG values
shot_xgs <- c(0.75, 0.35, 0.12, 0.08, 0.05, 0.03, 0.02, 0.02)
total_xg <- sum(shot_xgs)  # 1.42 xG

# Simulate 10,000 times
simulated_goals <- simulate_goals(shot_xgs)

# What percentage of simulations result in each goal count?
goal_distribution <- table(simulated_goals) / 10000 * 100

cat(sprintf("Total xG: %.2f\n", total_xg))
cat("\nGoal distribution from 10,000 simulations:\n")
print(round(goal_distribution, 1))

cat(sprintf("\nMost likely outcome: %d goals (%.1f%%)\n",
            as.numeric(names(which.max(goal_distribution))),
            max(goal_distribution)))
cat(sprintf("Probability of 0 goals: %.1f%%\n",
            goal_distribution["0"]))
Simulating goal outcomes from xG
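Simulation is not strictly necessary here. If each shot is treated as an independent trial with success probability equal to its xG (the same assumption the simulation makes), the exact goal distribution can be built up one shot at a time. The sketch below does that for the same eight shots; the function name `goal_distribution` is hypothetical.

```python
def goal_distribution(xg_values):
    """Exact probabilities of scoring 0..n goals from independent shots."""
    probs = [1.0]  # before any shot is taken: 0 goals with certainty
    for xg in xg_values:
        updated = [0.0] * (len(probs) + 1)
        for goals, p in enumerate(probs):
            updated[goals] += p * (1 - xg)   # this shot is missed
            updated[goals + 1] += p * xg     # this shot is scored
        probs = updated
    return probs


shot_xgs = [0.75, 0.35, 0.12, 0.08, 0.05, 0.03, 0.02, 0.02]
dist = goal_distribution(shot_xgs)

for goals, p in enumerate(dist):
    print(f"{goals} goals: {p * 100:.1f}%")

expected_goals = sum(goals * p for goals, p in enumerate(dist))
print(f"Mean of the distribution: {expected_goals:.2f}")
```

The mean of the distribution equals the total xG of 1.42, yet there is still roughly a 12% chance of failing to score at all, which is the point of reading xG as a distribution rather than a verdict.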

Over/Underperformance and Regression

When a player's goals significantly differ from their xG, we should expect regression to the mean:

Overperformance (Goals > xG)

Possible explanations:

  • Elite finishing skill (sustained over multiple seasons)
  • Luck/variance (likely if short sample)
  • Shot selection bias (only shoots when confident)

Expectation: Goals will likely decrease unless proven elite finisher

Underperformance (Goals < xG)

Possible explanations:

  • Poor finishing (sustained over multiple seasons)
  • Bad luck/variance (likely if short sample)
  • Injury affecting shooting

Expectation: Goals will likely increase (bounce-back candidate)

# Analyze over/underperformance
player_finishing = shots_df.groupby(["player", "team"]).agg(
    shots=("shot_statsbomb_xg", "count"),
    goals=("shot_outcome", lambda x: (x == "Goal").sum()),
    xG=("shot_statsbomb_xg", "sum")
).reset_index()

# Filter minimum sample
player_finishing = player_finishing[player_finishing["shots"] >= 10].copy()
player_finishing["goals_minus_xG"] = player_finishing["goals"] - player_finishing["xG"]
player_finishing["finishing_pct"] = (
    (player_finishing["goals"] - player_finishing["xG"]) / player_finishing["xG"] * 100
)

# Overperformers
print("\nBiggest Overperformers (may regress):")
print(player_finishing[player_finishing["goals_minus_xG"] > 0].nlargest(
    5, "goals_minus_xG")[["player", "shots", "goals", "xG", "goals_minus_xG"]])

# Underperformers
print("\nBiggest Underperformers (may improve):")
print(player_finishing[player_finishing["goals_minus_xG"] < 0].nsmallest(
    5, "goals_minus_xG")[["player", "shots", "goals", "xG", "goals_minus_xG"]])

print("\nNote: Single tournament data is noisy.")
print("Multi-season analysis needed for true finishing skill.")

# Analyze over/underperformance
library(dplyr)

# Calculate player finishing skill
player_finishing <- shots %>%
  group_by(player.name, team.name) %>%
  summarise(
    shots = n(),
    goals = sum(shot.outcome.name == "Goal"),
    xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(shots >= 10) %>%  # Minimum sample size
  mutate(
    goals_minus_xG = goals - xG,
    finishing_pct = (goals - xG) / xG * 100,  # % over/under xG
    conversion = goals / shots * 100,
    xG_per_shot = xG / shots
  )

# Overperformers (potential regression candidates)
cat("\nBiggest Overperformers (may regress):\n")
print(player_finishing %>%
        filter(goals_minus_xG > 0) %>%
        arrange(desc(goals_minus_xG)) %>%
        select(player.name, shots, goals, xG, goals_minus_xG) %>%
        head(5))

# Underperformers (potential bounce-back candidates)
cat("\nBiggest Underperformers (may improve):\n")
print(player_finishing %>%
        filter(goals_minus_xG < 0) %>%
        arrange(goals_minus_xG) %>%
        select(player.name, shots, goals, xG, goals_minus_xG) %>%
        head(5))

# Note: True finishing skill requires multi-season analysis
cat("\nNote: Single tournament data is noisy.")
cat("\nMulti-season analysis needed for true finishing skill.")
Analyzing over/underperformance and regression candidates
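One rough way to judge whether a gap between goals and xG could be plain variance is to ask how often an average finisher would produce it. The sketch below assumes, as a simplification, that a player's goal total follows a Poisson distribution with mean equal to their total xG; this slightly overstates the spread compared with treating each shot separately. The function name `prob_at_least` is hypothetical, and the 20-from-15 and 10-from-15 figures are the chapter's opening example.

```python
import math


def prob_at_least(goals: int, total_xg: float) -> float:
    """P(X >= goals) for X ~ Poisson(total_xg), an approximation to a season total."""
    below = sum(
        math.exp(-total_xg) * total_xg**k / math.factorial(k)
        for k in range(goals)
    )
    return 1.0 - below


# How often would an average finisher score 20+ goals from 15 xG?
p_over = prob_at_least(20, 15.0)
# How often would an average finisher score 10 or fewer from 15 xG?
p_under = 1.0 - prob_at_least(11, 15.0)

print(f"P(20+ goals from 15 xG):  {p_over:.3f}")
print(f"P(<=10 goals from 15 xG): {p_under:.3f}")
```

Under this approximation neither outcome is rare enough to rule out luck on its own, which is why sustained over- or underperformance across several seasons carries far more weight than a single season.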

Non-Penalty xG (npxG)

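Penalties are unusually high-probability chances (about 76% conversion) and are usually given to a designated taker rather than earned by the shooter's own movement. To fairly compare players who take different numbers of penalties, use non-penalty xG (npxG):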

# Calculate npxG
player_npxg = shots_df.groupby(["player", "team"]).agg(
    total_shots=("shot_statsbomb_xg", "count"),
    penalties=("shot_type", lambda x: (x == "Penalty").sum()),
    goals=("shot_outcome", lambda x: (x == "Goal").sum()),
    pen_goals=("shot_outcome", lambda x: (
        (x == "Goal") & (shots_df.loc[x.index, "shot_type"] == "Penalty")).sum()),
    xG=("shot_statsbomb_xg", "sum"),
    npxG=("shot_statsbomb_xg", lambda x: x[
        shots_df.loc[x.index, "shot_type"] != "Penalty"].sum())
).reset_index()

player_npxg["non_pen_shots"] = player_npxg["total_shots"] - player_npxg["penalties"]
player_npxg["np_goals"] = player_npxg["goals"] - player_npxg["pen_goals"]
player_npxg["np_goals_minus_npxG"] = player_npxg["np_goals"] - player_npxg["npxG"]

# Filter and display
player_npxg = player_npxg[player_npxg["non_pen_shots"] >= 5]

print("Player xG vs npxG Comparison:")
print(player_npxg.nlargest(10, "xG")[
    ["player", "goals", "xG", "np_goals", "npxG", "penalties"]])

# Calculate npxG
player_npxg <- shots %>%
  group_by(player.name, team.name) %>%
  summarise(
    total_shots = n(),
    penalties = sum(shot.type.name == "Penalty"),
    non_pen_shots = sum(shot.type.name != "Penalty"),
    goals = sum(shot.outcome.name == "Goal"),
    pen_goals = sum(shot.type.name == "Penalty" & shot.outcome.name == "Goal"),
    np_goals = goals - pen_goals,
    xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    npxG = sum(shot.statsbomb_xg[shot.type.name != "Penalty"], na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    np_goals_minus_npxG = np_goals - npxG
  ) %>%
  filter(non_pen_shots >= 5)

# Compare with and without penalties
print("Player xG vs npxG Comparison:")
print(player_npxg %>%
        arrange(desc(xG)) %>%
        select(player.name, goals, xG, np_goals, npxG, penalties) %>%
        head(10))
Calculating non-penalty xG
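The effect of stripping penalties is easiest to see on a tiny hand-made example. The shot records below are invented for illustration only (the players, xG values, and outcomes are hypothetical), and `summarise_by_player` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Hypothetical shot records, invented for illustration only.
shots = [
    {"player": "Player A", "shot_type": "Penalty", "xg": 0.76, "goal": True},
    {"player": "Player A", "shot_type": "Penalty", "xg": 0.76, "goal": True},
    {"player": "Player A", "shot_type": "Open Play", "xg": 0.10, "goal": False},
    {"player": "Player A", "shot_type": "Open Play", "xg": 0.25, "goal": True},
    {"player": "Player B", "shot_type": "Open Play", "xg": 0.40, "goal": True},
    {"player": "Player B", "shot_type": "Open Play", "xg": 0.30, "goal": False},
    {"player": "Player B", "shot_type": "Open Play", "xg": 0.15, "goal": True},
    {"player": "Player B", "shot_type": "Open Play", "xg": 0.05, "goal": False},
]


def summarise_by_player(shot_records):
    """Total goals/xG and their non-penalty counterparts for each player."""
    totals = defaultdict(lambda: {"goals": 0, "np_goals": 0, "xG": 0.0, "npxG": 0.0})
    for shot in shot_records:
        row = totals[shot["player"]]
        row["goals"] += int(shot["goal"])
        row["xG"] += shot["xg"]
        if shot["shot_type"] != "Penalty":
            # Only non-penalty shots count towards the np_* columns
            row["np_goals"] += int(shot["goal"])
            row["npxG"] += shot["xg"]
    return dict(totals)


summary = summarise_by_player(shots)
for player, row in summary.items():
    print(f"{player}: goals={row['goals']}, xG={row['xG']:.2f}, "
          f"np_goals={row['np_goals']}, npxG={row['npxG']:.2f}")
```

Player A leads on raw goals and xG, but once the two penalties are removed Player B has both more non-penalty goals and more npxG, which is the comparison that reflects chance creation from open play.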

Team-Level xG Analysis

xG is even more powerful at the team level, where individual variance averages out faster.

xG Difference (xGD)

The difference between xG created and xG conceded is highly predictive of future performance:

# Calculate team xG difference
# Get xG for each team in each match
team_match_xg = shots_df.groupby(["match_id", "team"]).agg(
    xG_for=("shot_statsbomb_xg", "sum"),
    goals_for=("shot_outcome", lambda x: (x == "Goal").sum())
).reset_index()

# Calculate xG against (opponent xG in same match)
match_totals = team_match_xg.groupby("match_id").agg({
    "xG_for": "sum",
    "goals_for": "sum"
}).reset_index()
match_totals.columns = ["match_id", "total_xG", "total_goals"]

team_match_xg = team_match_xg.merge(match_totals, on="match_id")
team_match_xg["xG_against"] = team_match_xg["total_xG"] - team_match_xg["xG_for"]
team_match_xg["goals_against"] = team_match_xg["total_goals"] - team_match_xg["goals_for"]

# Aggregate to team level
team_xgd = team_match_xg.groupby("team").agg(
    matches=("match_id", "count"),
    xG_for=("xG_for", "sum"),
    xG_against=("xG_against", "sum"),
    goals_for=("goals_for", "sum"),
    goals_against=("goals_against", "sum")
).reset_index()
team_xgd["xGD"] = team_xgd["xG_for"] - team_xgd["xG_against"]
team_xgd["actual_GD"] = team_xgd["goals_for"] - team_xgd["goals_against"]
team_xgd["xGD_per_match"] = (team_xgd["xGD"] / team_xgd["matches"]).round(2)

print("Team xG Difference Rankings:")
print(team_xgd.sort_values("xGD", ascending=False)[
    ["team", "matches", "xG_for", "xG_against", "xGD", "actual_GD", "xGD_per_match"]
].head(10))

# Calculate team xG difference
library(dplyr)

# Get xG for and against per team
team_xg_analysis <- events %>%
  filter(type.name == "Shot", period < 5) %>%  # period 5 is the penalty shootout
  group_by(match_id, team.name) %>%
  summarise(xG_for = sum(shot.statsbomb_xg, na.rm = TRUE),
            goals_for = sum(shot.outcome.name == "Goal"),
            .groups = "drop")

# Get opponent xG for each team in each match
match_xg <- team_xg_analysis %>%
  group_by(match_id) %>%
  mutate(
    xG_against = sum(xG_for) - xG_for,
    goals_against = sum(goals_for) - goals_for
  ) %>%
  ungroup()

# Aggregate to team level
team_xgd <- match_xg %>%
  group_by(team.name) %>%
  summarise(
    matches = n(),
    xG_for = sum(xG_for),
    xG_against = sum(xG_against),
    goals_for = sum(goals_for),
    goals_against = sum(goals_against)
  ) %>%
  mutate(
    xGD = xG_for - xG_against,
    actual_GD = goals_for - goals_against,
    xGD_per_match = round(xGD / matches, 2),

    # Performance vs expectation
    goals_vs_xG = goals_for - xG_for,
    conceded_vs_xGA = goals_against - xG_against
  ) %>%
  arrange(desc(xGD))

print("Team xG Difference Rankings:")
print(team_xgd %>%
        select(team.name, matches, xG_for, xG_against, xGD,
               actual_GD, xGD_per_match) %>%
        head(10))
Calculating team xG difference

Expected Points (xPts)

We can simulate match outcomes to calculate expected points:

import numpy as np
import pandas as pd


def calculate_xpts(xg_for, xg_against, n_sims=10000):
    """Calculate expected points from match xG using Poisson simulation."""
    # Simulate goals using Poisson distribution
    goals_for = np.random.poisson(xg_for, n_sims)
    goals_against = np.random.poisson(xg_against, n_sims)

    # Calculate points
    wins = (goals_for > goals_against).sum()
    draws = (goals_for == goals_against).sum()
    losses = (goals_for < goals_against).sum()
    xPts = (wins * 3 + draws * 1) / n_sims

    return {
        "xPts": xPts,
        "win_prob": wins / n_sims,
        "draw_prob": draws / n_sims,
        "loss_prob": losses / n_sims
    }


# Example match
result = calculate_xpts(2.1, 0.8)
print("Team A (2.1 xG vs 0.8 xG):")
print(f"  Expected Points: {result['xPts']:.2f}")
print(f"  Win Probability: {result['win_prob'] * 100:.1f}%")
print(f"  Draw Probability: {result['draw_prob'] * 100:.1f}%")
print(f"  Loss Probability: {result['loss_prob'] * 100:.1f}%")

# Calculate xPts for all teams
team_xpts = []
for team in team_match_xg["team"].unique():
    team_matches = team_match_xg[team_match_xg["team"] == team]
    total_xpts = 0
    total_actual = 0
    for _, match in team_matches.iterrows():
        result = calculate_xpts(match["xG_for"], match["xG_against"])
        total_xpts += result["xPts"]
        if match["goals_for"] > match["goals_against"]:
            total_actual += 3
        elif match["goals_for"] == match["goals_against"]:
            total_actual += 1
    team_xpts.append({
        "team": team,
        "matches": len(team_matches),
        "xPts": total_xpts,
        "actual_pts": total_actual,
        "pts_diff": total_actual - total_xpts
    })

xpts_df = pd.DataFrame(team_xpts).sort_values("xPts", ascending=False)
print("\nTeam Expected Points:")
print(xpts_df.head(10))

# Calculate expected points from match xG
calculate_xpts <- function(xg_for, xg_against, n_sims = 10000) {
  # Simulate goals using Poisson distribution
  goals_for <- rpois(n_sims, xg_for)
  goals_against <- rpois(n_sims, xg_against)

  # Calculate points: 3 for win, 1 for draw, 0 for loss
  points <- ifelse(goals_for > goals_against, 3,
                   ifelse(goals_for == goals_against, 1, 0))

  # Return expected points and win/draw/loss probabilities
  return(list(
    xPts = mean(points),
    win_prob = mean(goals_for > goals_against),
    draw_prob = mean(goals_for == goals_against),
    loss_prob = mean(goals_for < goals_against)
  ))
}

# Example: A match where Team A had 2.1 xG and Team B had 0.8 xG
result <- calculate_xpts(2.1, 0.8)
cat(sprintf("Team A (2.1 xG vs 0.8 xG):\n"))
cat(sprintf("  Expected Points: %.2f\n", result$xPts))
cat(sprintf("  Win Probability: %.1f%%\n", result$win_prob * 100))
cat(sprintf("  Draw Probability: %.1f%%\n", result$draw_prob * 100))
cat(sprintf("  Loss Probability: %.1f%%\n", result$loss_prob * 100))

# Calculate xPts for all matches
match_xpts <- match_xg %>%
  rowwise() %>%
  mutate(
    xPts = calculate_xpts(xG_for, xG_against)$xPts,
    actual_pts = case_when(
      goals_for > goals_against ~ 3,
      goals_for == goals_against ~ 1,
      TRUE ~ 0
    )
  ) %>%
  ungroup()

# Team xPts totals
team_xpts <- match_xpts %>%
  group_by(team.name) %>%
  summarise(
    matches = n(),
    xPts = sum(xPts),
    actual_pts = sum(actual_pts),
    pts_difference = actual_pts - xPts
  ) %>%
  arrange(desc(xPts))

print("\nTeam Expected Points:")
print(team_xpts)
Calculating expected points from xG
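Under the same Poisson assumption, expected points can also be computed without random sampling by summing over every plausible scoreline. The sketch below does this with the standard library only; `exact_xpts` is a hypothetical name, and the cut-off of 15 goals per side is an assumption that leaves a negligible amount of probability uncounted for realistic xG totals.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k goals when goals are Poisson with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def exact_xpts(xg_for: float, xg_against: float, max_goals: int = 15) -> dict:
    """Expected points from summing Poisson probabilities over all scorelines."""
    win = draw = loss = 0.0
    for goals_for in range(max_goals + 1):
        for goals_against in range(max_goals + 1):
            p = poisson_pmf(goals_for, xg_for) * poisson_pmf(goals_against, xg_against)
            if goals_for > goals_against:
                win += p
            elif goals_for == goals_against:
                draw += p
            else:
                loss += p
    return {"xPts": 3 * win + draw, "win_prob": win, "draw_prob": draw, "loss_prob": loss}


result = exact_xpts(2.1, 0.8)
print(f"Expected Points: {result['xPts']:.2f}")
print(f"Win / Draw / Loss: {result['win_prob']:.3f} / "
      f"{result['draw_prob']:.3f} / {result['loss_prob']:.3f}")
```

Because nothing is sampled, the result is identical on every run, which makes it convenient for checking that the simulated values land close to it.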

Visualizing xG

Effective xG visualizations communicate chance quality at a glance.

# Create xG shot map with size encoding
from mplsoccer import VerticalPitch
import matplotlib.pyplot as plt
from statsbombpy import sb

# Get shots from one match (reuses the `matches` DataFrame loaded earlier)
match_id = matches["match_id"].iloc[0]
events = sb.events(match_id=match_id)
shots = events[events["type"] == "Shot"].copy()
shots["x"] = shots["location"].apply(lambda l: l[0])
shots["y"] = shots["location"].apply(lambda l: l[1])

teams = shots["team"].unique()
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

for ax, team in zip(axes, teams):
    pitch = VerticalPitch(pitch_type="statsbomb", pitch_color="#1a472a",
                          line_color="white", half=True)
    pitch.draw(ax=ax)
    team_shots = shots[shots["team"] == team]

    # Non-goals (pitch.scatter handles the axis swap for a vertical pitch)
    non_goals = team_shots[team_shots["shot_outcome"] != "Goal"]
    pitch.scatter(non_goals["x"], non_goals["y"],
                  s=non_goals["shot_statsbomb_xg"] * 500,
                  c="#CCCCCC", alpha=0.7, edgecolors="black", ax=ax)

    # Goals
    goals = team_shots[team_shots["shot_outcome"] == "Goal"]
    pitch.scatter(goals["x"], goals["y"],
                  s=goals["shot_statsbomb_xg"] * 500,
                  c="#FFD700", alpha=0.9, edgecolors="black", ax=ax)

    total_xg = team_shots["shot_statsbomb_xg"].sum()
    ax.set_title(f"{team}\nxG: {total_xg:.2f}", fontsize=14)

fig.suptitle("Match xG Shot Map", fontsize=16, fontweight="bold")
fig.patch.set_facecolor("#1a1a2e")
plt.tight_layout()
plt.show()

# Create xG shot map with size encoding
library(ggplot2)
library(ggsoccer)

match_shots <- events %>%
  filter(type.name == "Shot", match_id == matches$match_id[1])

# xG shot map
ggplot(match_shots) +
  annotate_pitch(dimensions = pitch_statsbomb,  # match StatsBomb's 120 x 80 coordinates
                 colour = "white", fill = "#1a472a") +
  geom_point(aes(x = location.x, y = location.y,
                 size = shot.statsbomb_xg,
                 color = shot.outcome.name == "Goal"),
             alpha = 0.8) +
  scale_size_continuous(range = c(2, 12), name = "xG") +
  scale_color_manual(values = c("FALSE" = "#CCCCCC", "TRUE" = "#FFD700"),
                     labels = c("No Goal", "Goal"), name = "Result") +
  coord_flip(xlim = c(60, 120)) +
  theme_pitch() +
  facet_wrap(~team.name, ncol = 2) +
  labs(title = "Match xG Shot Map",
       subtitle = "Point size represents expected goals value") +
  theme(legend.position = "bottom",
        strip.text = element_text(size = 12, face = "bold"))
Creating xG shot maps

Chapter Summary

Key Takeaways
  • xG measures chance quality - probability a shot results in a goal
  • Key factors: Distance, angle, body part, assist type, game state
  • Use pre-built xG - StatsBomb, Understat, FBref provide reliable data
  • xG is probabilistic - variance is expected; don't over-interpret single matches
  • Regression to the mean - over/underperformers usually revert
  • Use npxG - for fair comparison across penalty-takers
  • xGD is predictive - team xG difference predicts future performance
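
The npxG and xGD takeaways come down to simple sums over a shot table. The sketch below uses a hypothetical list of `(team, xg, is_penalty)` records (the team names and values are invented for illustration, not taken from any match) to show how a single penalty can change the picture:

```python
from collections import defaultdict

# Hypothetical shot records: (team, xG, is_penalty)
shots = [
    ("Home", 0.76, True),
    ("Home", 0.12, False),
    ("Home", 0.08, False),
    ("Away", 0.35, False),
    ("Away", 0.22, False),
    ("Away", 0.05, False),
]

def team_totals(shot_records):
    """Sum xG and non-penalty xG (npxG) per team."""
    totals = defaultdict(lambda: {"xG": 0.0, "npxG": 0.0})
    for team, xg, is_penalty in shot_records:
        totals[team]["xG"] += xg
        if not is_penalty:
            totals[team]["npxG"] += xg
    return dict(totals)

totals = team_totals(shots)

# xG difference from the home side's point of view, with and without penalties
xgd = totals["Home"]["xG"] - totals["Away"]["xG"]
npxgd = totals["Home"]["npxG"] - totals["Away"]["npxG"]

for team, values in totals.items():
    print(f"{team}: xG={values['xG']:.2f}, npxG={values['npxG']:.2f}")
print(f"xGD={xgd:+.2f}, npxGD={npxgd:+.2f}")
```

In this invented example the home side leads on xG but trails on npxG, which is the situation npxG is meant to expose. With StatsBomb data the same split could be made by checking the shot type column instead of a hand-written flag.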

xG Quick Reference

Shot Type | Typical xG Range | Example
Penalty | 0.76 | Standard penalty kick
Open goal (6-yard) | 0.70-0.95 | Tap-in from 3 yards
1v1 with keeper | 0.30-0.50 | Through ball, clear on goal
Header from cross | 0.05-0.15 | 6-yard box header
Edge of box shot | 0.05-0.10 | 18-yard shot, central
Long-range shot | 0.02-0.05 | 25+ yards out
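
Because each value in the reference table is a per-shot probability, a set of shots implies a full distribution over goals, not just a total. The following is a minimal sketch that assumes shots are independent (a simplification; rebounds and other linked chances break it) and uses a hypothetical helper, `goal_distribution`. It compares one 1v1 chance with five edge-of-box shots that add up to the same xG:

```python
def goal_distribution(shot_xg):
    """Return [P(0 goals), P(1 goal), ...] for independent shots.

    Each shot is treated as a Bernoulli trial with success probability xG.
    """
    dist = [1.0]  # before any shot, 0 goals is certain
    for p in shot_xg:
        updated = [0.0] * (len(dist) + 1)
        for goals, prob in enumerate(dist):
            updated[goals] += prob * (1.0 - p)   # shot missed
            updated[goals + 1] += prob * p       # shot scored
        dist = updated
    return dist

one_big_chance = [0.40]          # a single 1v1 with the keeper
five_half_chances = [0.08] * 5   # five central shots from the edge of the box

for label, shot_list in [("One 1v1", one_big_chance),
                         ("Five edge-of-box shots", five_half_chances)]:
    dist = goal_distribution(shot_list)
    p_score = 1.0 - dist[0]
    print(f"{label}: total xG={sum(shot_list):.2f}, "
          f"P(at least one goal)={p_score:.3f}")
```

Both scenarios total 0.40 xG, yet the chance of scoring at least once differs, which is one reason single-match xG totals should be read with caution.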

xG Visualization Tutorials

Effective visualization is crucial for communicating xG insights. Here are the most important xG visualizations you should master.

xG Shot Map with Color Gradient

Shot maps with xG-colored markers show where teams create quality chances:

from statsbombpy import sb
import matplotlib.pyplot as plt
from mplsoccer import VerticalPitch
import numpy as np

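# Load World Cup Final
events = sb.events(match_id=3869685)
# Keep regulation and extra-time shots; period 5 is the penalty shootout
shots = events[(events["type"] == "Shot") & (events["period"] <= 4)].copy()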

# Extract coordinates
shots["x"] = shots["location"].apply(lambda loc: loc[0])
shots["y"] = shots["location"].apply(lambda loc: loc[1])
shots["is_goal"] = shots["shot_outcome"] == "Goal"

# Create figure with two half-pitches
fig, axes = plt.subplots(1, 2, figsize=(16, 10))

teams = ["Argentina", "France"]
for idx, team in enumerate(teams):
    pitch = VerticalPitch(
        pitch_type="statsbomb", half=True,
        pitch_color="#1a472a", line_color="white", linewidth=1
    )
    pitch.draw(ax=axes[idx])

    team_shots = shots[shots["team"] == team]

    # Create scatter with xG color gradient
    scatter = pitch.scatter(
        team_shots["x"], team_shots["y"],
        s=team_shots["shot_statsbomb_xg"] * 800 + 100,
        c=team_shots["shot_statsbomb_xg"],
        cmap="RdYlBu_r",
        edgecolors="white",
        linewidth=1.5,
        alpha=0.85,
        ax=axes[idx],
        vmin=0, vmax=0.8
    )

    # Mark goals with stars
    goals = team_shots[team_shots["is_goal"]]
    pitch.scatter(
        goals["x"], goals["y"],
        s=300, marker="*", c="gold",
        edgecolors="black", linewidth=1,
        ax=axes[idx], zorder=5
    )

    # Add xG total
    total_xg = team_shots["shot_statsbomb_xg"].sum()
    goals_scored = team_shots["is_goal"].sum()
    axes[idx].set_title(f"{team}\n{goals_scored} Goals | {total_xg:.2f} xG",
                        color="white", fontsize=14, fontweight="bold", pad=10)

# Add colorbar
cbar = fig.colorbar(scatter, ax=axes, orientation="horizontal",
                    fraction=0.05, pad=0.08, aspect=40)
cbar.set_label("xG Value", color="white", fontsize=12)
cbar.ax.xaxis.set_tick_params(color="white")
plt.setp(plt.getp(cbar.ax.axes, "xticklabels"), color="white")

fig.suptitle("xG Shot Map: World Cup 2022 Final", fontsize=18,
             fontweight="bold", color="white", y=0.98)
fig.patch.set_facecolor("#1a472a")
plt.tight_layout(rect=[0, 0.05, 1, 0.95])
plt.savefig("xg_shot_map.png", dpi=150, bbox_inches="tight", facecolor="#1a472a")
plt.show()

library(StatsBombR)
library(tidyverse)
library(ggsoccer)

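# Load World Cup Final; cleanlocations() adds the location.x / location.y columns
events <- get.matchFree(data.frame(match_id = 3869685)) %>%
  cleanlocations()
# Keep regulation and extra-time shots; period 5 is the penalty shootout
shots <- events %>% filter(type.name == "Shot", period <= 4)

# Create xG shot map with color gradient
ggplot(shots) +
  # StatsBomb coordinates are 120 x 80; ggsoccer defaults to Opta's 100 x 100
  annotate_pitch(dimensions = pitch_statsbomb, colour = "white", fill = "#1a472a") +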
  geom_point(
    aes(x = location.x, y = location.y,
        size = shot.statsbomb_xg,
        fill = shot.statsbomb_xg,
        shape = ifelse(shot.outcome.name == "Goal", "Goal", "No Goal")),
    color = "white", stroke = 1.2, alpha = 0.85
  ) +
  scale_fill_gradient2(
    low = "#2196F3", mid = "#FFC107", high = "#F44336",
    midpoint = 0.3, limits = c(0, 1),
    name = "xG Value"
  ) +
  scale_size_continuous(range = c(3, 15), name = "xG Value") +
  scale_shape_manual(values = c("Goal" = 23, "No Goal" = 21), name = "Outcome") +
  coord_flip(xlim = c(60, 122), ylim = c(0, 80)) +
  facet_wrap(~team.name, ncol = 2) +
  theme_pitch() +
  theme(
    plot.background = element_rect(fill = "#1a472a"),
    strip.text = element_text(color = "white", size = 14, face = "bold"),
    legend.position = "bottom",
    legend.text = element_text(color = "white"),
    legend.title = element_text(color = "white"),
    plot.title = element_text(color = "white", size = 16, face = "bold", hjust = 0.5)
  ) +
  guides(size = "none") +
  labs(
    title = "xG Shot Map: World Cup 2022 Final",
    subtitle = "Size and color indicate shot quality (xG)"
  )

ggsave("xg_shot_map.png", width = 14, height = 8, dpi = 150)

xG vs Actual Goals Scatter Plot

This visualization reveals over/underperformers relative to their xG:

import matplotlib.pyplot as plt
import numpy as np

# Aggregate player data
player_xg = shots.groupby(["player", "team"]).agg(
    shots_count=("shot_statsbomb_xg", "count"),
    goals=("shot_outcome", lambda x: (x == "Goal").sum()),
    xG=("shot_statsbomb_xg", "sum")
).reset_index()

player_xg = player_xg[player_xg["shots_count"] >= 3]

# Create scatter plot
fig, ax = plt.subplots(figsize=(10, 8))

# Identity line
ax.plot([0, 4], [0, 4], "k--", alpha=0.5, linewidth=2, label="Expected")

# Scatter by team
colors = {"Argentina": "#75AADB", "France": "#002654"}
for team in colors:
    team_data = player_xg[player_xg["team"] == team]
    ax.scatter(
        team_data["xG"], team_data["goals"],
        s=team_data["shots_count"] * 30 + 50,
        c=colors[team], alpha=0.7, edgecolors="white",
        linewidth=1.5, label=team
    )

# Add labels for key players
for _, row in player_xg[player_xg["goals"] >= 2].iterrows():
    last_name = row["player"].split()[-1]
    ax.annotate(last_name, (row["xG"], row["goals"]),
                xytext=(5, 5), textcoords="offset points",
                fontsize=10, fontweight="bold")

# Add region labels
ax.text(0.5, 2.8, "Overperforming", color="#4CAF50",
        fontsize=11, fontstyle="italic")
ax.text(2.5, 0.5, "Underperforming", color="#F44336",
        fontsize=11, fontstyle="italic")

ax.set_xlabel("Expected Goals (xG)", fontsize=12)
ax.set_ylabel("Actual Goals", fontsize=12)
ax.set_title("Goals vs xG: World Cup 2022 Final\n" +
             "Points above dashed line = finishing above expectation",
             fontsize=14, fontweight="bold")
ax.legend(loc="upper left")
ax.set_xlim(0, 3.5)
ax.set_ylim(0, 3.5)
ax.set_aspect("equal")
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("xg_vs_goals.png", dpi=150, bbox_inches="tight")
plt.show()

library(tidyverse)

# Create player xG vs Goals scatter plot
player_xg_data <- shots %>%
  group_by(player.name, team.name) %>%
  summarise(
    shots = n(),
    goals = sum(shot.outcome.name == "Goal"),
    xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(shots >= 3)  # Minimum shots filter

# Create scatter plot
ggplot(player_xg_data, aes(x = xG, y = goals)) +
  # Identity line (expected performance)
  geom_abline(intercept = 0, slope = 1, color = "gray50",
              linetype = "dashed", linewidth = 1) +
  # Points
  geom_point(aes(size = shots, color = team.name),
             alpha = 0.7) +
  # Labels for top performers
  geom_text(
    data = filter(player_xg_data, goals >= 2 | xG >= 1),
    aes(label = str_extract(player.name, "\\w+$")),  # Last name
    vjust = -0.8, size = 3.5, fontface = "bold"
  ) +
  # Styling
  scale_color_manual(values = c("Argentina" = "#75AADB", "France" = "#002654")) +
  scale_size_continuous(range = c(3, 12)) +
  annotate("text", x = 0.5, y = 2.8, label = "Overperforming",
           color = "#4CAF50", fontface = "italic", size = 4) +
  annotate("text", x = 2.5, y = 0.5, label = "Underperforming",
           color = "#F44336", fontface = "italic", size = 4) +
  labs(
    title = "Goals vs xG: World Cup 2022 Final",
    subtitle = "Points above the dashed line = finishing above expectation",
    x = "Expected Goals (xG)",
    y = "Actual Goals",
    color = "Team",
    size = "Shots"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  ) +
  coord_equal(xlim = c(0, 3.5), ylim = c(0, 3.5))

ggsave("xg_vs_goals.png", width = 10, height = 8, dpi = 150)

Cumulative xG Over Time

Track how xG accumulates throughout a match or season:

# Simplified version for single match timeline with game phases
import matplotlib.pyplot as plt
import numpy as np

# Calculate cumulative xG for both teams
fig, ax = plt.subplots(figsize=(14, 7))

for team, color in [("Argentina", "#75AADB"), ("France", "#002654")]:
    team_shots = shots[shots["team"] == team].sort_values(["minute", "second"])
    team_shots["cumulative_xG"] = team_shots["shot_statsbomb_xg"].cumsum()

    # Add starting point
    minutes = [0] + team_shots["minute"].tolist()
    cum_xg = [0] + team_shots["cumulative_xG"].tolist()

    ax.step(minutes, cum_xg, where="post", linewidth=2.5,
            color=color, label=f"{team} ({cum_xg[-1]:.2f} xG)", alpha=0.9)

    # Mark goals
    goals = team_shots[team_shots["is_goal"]]
    for _, goal in goals.iterrows():
        ax.scatter(goal["minute"], goal["cumulative_xG"],
                   marker="*", s=400, c=color, edgecolors="gold",
                   linewidth=2, zorder=5)

# Add match phase indicators
ax.axvline(x=45, color="gray", linestyle="--", alpha=0.5, linewidth=1.5)
ax.axvline(x=90, color="gray", linestyle="--", alpha=0.5, linewidth=1.5)
ax.axvline(x=105, color="gray", linestyle=":", alpha=0.5, linewidth=1.5)

ax.text(22.5, ax.get_ylim()[1]*0.95, "1st Half", ha="center",
        fontsize=10, color="gray")
ax.text(67.5, ax.get_ylim()[1]*0.95, "2nd Half", ha="center",
        fontsize=10, color="gray")
ax.text(112.5, ax.get_ylim()[1]*0.95, "ET", ha="center",
        fontsize=10, color="gray")

ax.set_xlabel("Minute", fontsize=12)
ax.set_ylabel("Cumulative xG", fontsize=12)
ax.set_title("Cumulative xG Timeline: World Cup 2022 Final\n" +
             "Stars indicate goals scored",
             fontsize=14, fontweight="bold")
ax.legend(loc="upper left", fontsize=11)
ax.set_xlim(0, 125)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("cumulative_xg_timeline.png", dpi=150, bbox_inches="tight")
plt.show()

# Cumulative xG over a season (example with multiple matches)
library(tidyverse)

# Load multiple World Cup matches
matches <- FreeMatches(FreeCompetitions() %>%
  filter(competition_id == 43, season_id == 106))

# Get Argentina matches
arg_matches <- matches %>%
  filter(home_team.home_team_name == "Argentina" |
         away_team.away_team_name == "Argentina")

# Load all events
all_events <- free_allevents(MatchesDF = arg_matches)

# Calculate cumulative xG per match
arg_xg_progression <- all_events %>%
  # period 5 is the penalty shootout, which should not count toward match xG
  filter(type.name == "Shot", team.name == "Argentina", period <= 4) %>%
  arrange(match_id, minute, second) %>%
  group_by(match_id) %>%
  mutate(
    cumulative_xG = cumsum(shot.statsbomb_xg),
    shot_number = row_number()
  ) %>%
  ungroup()

# Join with match info
arg_xg_progression <- arg_xg_progression %>%
  left_join(
    arg_matches %>%
      select(match_id, home_team.home_team_name, away_team.away_team_name),
    by = "match_id"
  ) %>%
  mutate(
    opponent = ifelse(home_team.home_team_name == "Argentina",
                      away_team.away_team_name,
                      home_team.home_team_name)
  )

# Plot cumulative xG for each match
ggplot(arg_xg_progression, aes(x = minute, y = cumulative_xG, color = opponent)) +
  geom_step(linewidth = 1.2, alpha = 0.8) +
  geom_point(data = filter(arg_xg_progression, shot.outcome.name == "Goal"),
             size = 4, shape = 18) +
  scale_color_viridis_d(option = "plasma") +
  labs(
    title = "Argentina Cumulative xG by Match - World Cup 2022",
    subtitle = "Each line represents one match; diamonds indicate goals",
    x = "Minute",
    y = "Cumulative xG",
    color = "Opponent"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  ) +
  scale_x_continuous(breaks = seq(0, 120, 15))

ggsave("cumulative_xg_season.png", width = 14, height = 8, dpi = 150)

Practice Exercises

Exercise 6.1: Calculate Team xG

Task: Load a different World Cup 2022 match and calculate the total xG for each team. Identify which team "deserved" to win based on xG.

# Exercise 6.1 Solution
from statsbombpy import sb

# Find Brazil vs Croatia match
matches = sb.matches(competition_id=43, season_id=106)
bra_cro = matches[
    ((matches["home_team"] == "Brazil") | (matches["away_team"] == "Brazil")) &
    ((matches["home_team"] == "Croatia") | (matches["away_team"] == "Croatia"))
].iloc[0]

events = sb.events(match_id=bra_cro["match_id"])
# This match went to penalties; drop shootout attempts (period 5)
shots = events[(events["type"] == "Shot") & (events["period"] <= 4)]

# Calculate team xG
team_xg = shots.groupby("team").agg(
    shots=("type", "count"),
    goals=("shot_outcome", lambda x: (x == "Goal").sum()),
    xG=("shot_statsbomb_xg", "sum"),
    big_chances=("shot_statsbomb_xg", lambda x: (x > 0.3).sum())
).round(2)

print("Brazil vs Croatia xG Analysis:")
print(team_xg)

xg_winner = team_xg["xG"].idxmax()
print(f"\nBased on xG, {xg_winner} created better chances.")

# Exercise 6.1 Solution
library(StatsBombR)
library(tidyverse)

# Load Brazil vs Croatia quarter-final
matches <- FreeMatches(FreeCompetitions() %>%
  filter(competition_id == 43, season_id == 106))

bra_cro <- matches %>%
  filter((home_team.home_team_name == "Brazil" |
          away_team.away_team_name == "Brazil") &
         (home_team.home_team_name == "Croatia" |
          away_team.away_team_name == "Croatia"))

events <- get.matchFree(bra_cro)

# Calculate team xG
team_xg <- events %>%
  filter(type.name == "Shot", period <= 4) %>%  # period 5 = penalty shootout
  group_by(team.name) %>%
  summarise(
    shots = n(),
    goals = sum(shot.outcome.name == "Goal"),
    xG = round(sum(shot.statsbomb_xg, na.rm = TRUE), 2),
    big_chances = sum(shot.statsbomb_xg > 0.3)
  )

print("Brazil vs Croatia xG Analysis:")
print(team_xg)

# Determine "deserved" winner
xg_winner <- team_xg %>% filter(xG == max(xG)) %>% pull(team.name)
cat(sprintf("\nBased on xG, %s created better chances.\n", xg_winner))

Exercise 6.2: Find the Best Finisher

Task: Analyze all World Cup 2022 matches to find the player who most outperformed their xG (minimum 5 shots).

# Exercise 6.2 Solution
from statsbombpy import sb
import pandas as pd

# Load all World Cup matches
matches = sb.matches(competition_id=43, season_id=106)

all_shots = []
for match_id in matches["match_id"]:
    events = sb.events(match_id=match_id)
    # Skip penalty-shootout attempts (period 5)
    shots = events[(events["type"] == "Shot") & (events["period"] <= 4)]
    all_shots.append(shots)

shots_df = pd.concat(all_shots, ignore_index=True)

# Calculate player finishing
player_finishing = shots_df.groupby(["player", "team"]).agg(
    shots=("type", "count"),
    goals=("shot_outcome", lambda x: (x == "Goal").sum()),
    xG=("shot_statsbomb_xg", "sum")
).reset_index()

player_finishing = player_finishing[player_finishing["shots"] >= 5].copy()
player_finishing["goals_minus_xG"] = player_finishing["goals"] - player_finishing["xG"]
player_finishing["conversion_rate"] = player_finishing["goals"] / player_finishing["shots"] * 100

player_finishing = player_finishing.sort_values("goals_minus_xG", ascending=False)

print("Top 10 Finishers (Goals - xG):")
print(player_finishing.head(10)[["player", "team", "shots", "goals", "xG", "goals_minus_xG"]])

best = player_finishing.iloc[0]
print(f"\nBest finisher: {best['player']} ({best['team']})")
print(f"Scored {best['goals']} goals from {best['xG']:.2f} xG (+{best['goals_minus_xG']:.2f})")

# Exercise 6.2 Solution
library(StatsBombR)
library(tidyverse)

# Load all World Cup matches
matches <- FreeMatches(FreeCompetitions() %>%
  filter(competition_id == 43, season_id == 106))

all_events <- free_allevents(MatchesDF = matches)

# Calculate player finishing
player_finishing <- all_events %>%
  filter(type.name == "Shot", period <= 4) %>%  # period 5 = penalty shootout
  group_by(player.name, team.name) %>%
  summarise(
    shots = n(),
    goals = sum(shot.outcome.name == "Goal"),
    xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(shots >= 5) %>%
  mutate(
    goals_minus_xG = goals - xG,
    conversion_rate = goals / shots * 100
  ) %>%
  arrange(desc(goals_minus_xG))

print("Top 10 Finishers (Goals - xG):")
print(head(player_finishing, 10))

# Best finisher
best <- player_finishing %>% slice(1)
cat(sprintf("\nBest finisher: %s (%s)\n", best$player.name, best$team.name))
cat(sprintf("Scored %d goals from %.2f xG (+%.2f)\n",
            best$goals, best$xG, best$goals_minus_xG))

Exercise 6.3: Create an xG Race Chart

Task: Create a visualization showing the running xG total for Argentina throughout the entire World Cup 2022 tournament.

# Exercise 6.3 Solution - Argentina xG Race Chart
from statsbombpy import sb
import matplotlib.pyplot as plt
import pandas as pd

# Load all Argentina matches
matches = sb.matches(competition_id=43, season_id=106)
arg_matches = matches[
    (matches["home_team"] == "Argentina") |
    (matches["away_team"] == "Argentina")
].sort_values("match_date")

# Collect all Argentina shots across tournament
all_shots = []
for _, match in arg_matches.iterrows():
    events = sb.events(match_id=match["match_id"])
    # Copy before adding a column; period 5 is the penalty shootout
    shots = events[
        (events["type"] == "Shot")
        & (events["team"] == "Argentina")
        & (events["period"] <= 4)
    ].copy()
    shots["match_date"] = match["match_date"]
    all_shots.append(shots)

shots_df = pd.concat(all_shots, ignore_index=True)
shots_df = shots_df.sort_values(["match_date", "minute", "second"])
shots_df["shot_number"] = range(1, len(shots_df) + 1)
shots_df["cumulative_xG"] = shots_df["shot_statsbomb_xg"].cumsum()
shots_df["cumulative_goals"] = (shots_df["shot_outcome"] == "Goal").cumsum()

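# Find match boundaries; sort so they follow the chronological shot order
# (groupby orders by match_id, which need not match the match date)
match_boundaries = sorted(shots_df.groupby("match_id")["shot_number"].max())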

# Create plot
fig, ax = plt.subplots(figsize=(14, 8))

ax.fill_between(shots_df["shot_number"], shots_df["cumulative_xG"],
                alpha=0.3, color="#75AADB")
ax.plot(shots_df["shot_number"], shots_df["cumulative_xG"],
        linewidth=2.5, color="#75AADB", label=f"xG ({shots_df['cumulative_xG'].iloc[-1]:.1f})")
ax.step(shots_df["shot_number"], shots_df["cumulative_goals"], where="post",
        linewidth=2.5, color="#FFD700", label=f"Goals ({shots_df['cumulative_goals'].iloc[-1]})")

# Add match boundaries
for boundary in match_boundaries[:-1]:
    ax.axvline(x=boundary, color="gray", linestyle="--", alpha=0.5)

ax.set_xlabel("Shot Number (Tournament Total)", fontsize=12)
ax.set_ylabel("Cumulative Value", fontsize=12)
ax.set_title("Argentina World Cup 2022 - xG Accumulation\n" +
             "Dashed lines indicate match boundaries",
             fontsize=14, fontweight="bold")
ax.legend(loc="upper left", fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("argentina_xg_race.png", dpi=150, bbox_inches="tight")
plt.show()

# Exercise 6.3 Solution - Argentina xG Race Chart
library(StatsBombR)
library(tidyverse)

# Load all Argentina World Cup matches
matches <- FreeMatches(FreeCompetitions() %>%
  filter(competition_id == 43, season_id == 106)) %>%
  filter(home_team.home_team_name == "Argentina" |
         away_team.away_team_name == "Argentina") %>%
  arrange(match_date)

all_events <- free_allevents(MatchesDF = matches)

# Create tournament progression
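arg_progression <- all_events %>%
  filter(type.name == "Shot", team.name == "Argentina", period <= 4) %>%
  # Order matches by date; match_id is not guaranteed to be chronological
  left_join(matches %>% select(match_id, match_date), by = "match_id") %>%
  arrange(match_date, minute, second) %>%
  mutate(
    cumulative_xG = cumsum(shot.statsbomb_xg),
    cumulative_goals = cumsum(shot.outcome.name == "Goal"),
    shot_number = row_number()
  )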

# Add match labels
match_order <- arg_progression %>%
  group_by(match_id) %>%
  summarise(last_shot = max(shot_number)) %>%
  arrange(last_shot) %>%
  mutate(match_num = row_number())

arg_progression <- arg_progression %>%
  left_join(match_order, by = "match_id")

# Create race chart
ggplot(arg_progression, aes(x = shot_number)) +
  geom_area(aes(y = cumulative_xG), fill = "#75AADB", alpha = 0.4) +
  geom_line(aes(y = cumulative_xG, color = "xG"), linewidth = 1.5) +
  geom_step(aes(y = cumulative_goals, color = "Goals"), linewidth = 1.5) +
  geom_vline(data = match_order, aes(xintercept = last_shot),
             linetype = "dashed", alpha = 0.5) +
  scale_color_manual(values = c("xG" = "#75AADB", "Goals" = "#FFD700")) +
  labs(
    title = "Argentina World Cup 2022 - xG Accumulation",
    subtitle = "Dashed lines indicate match boundaries",
    x = "Shot Number (Tournament Total)",
    y = "Cumulative Value",
    color = ""
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

ggsave("argentina_xg_race.png", width = 14, height = 8, dpi = 150)

Ready for Advanced xG?

Explore post-shot xG, goalkeeper evaluation, xG models, and finishing skill analysis.

Continue to Advanced xG Concepts