Tutorials

Getting Started with StatsBomb Open Data in Python (statsbombpy)

From pip install to your first leaderboard, in real Python.

StatsBomb publishes a genuinely free, genuinely rich open dataset — 80 competition-seasons covering 24 distinct competitions, including the full 2022 World Cup — and all it takes to unlock it is a single Python package and about fifty lines of code. No account. No API key. No scraping. Here is exactly how to get from nothing to a top-xG leaderboard, with every command copy-pasteable.

Installation

Two libraries carry the load: statsbombpy, StatsBomb's official Python client, and pandas for the usual wrangling. Add matplotlib if you want charts (you do).

pip install statsbombpy matplotlib pandas

That is the entire setup. The package talks directly to StatsBomb's public GitHub-hosted JSON files, so nothing lives on your machine until you ask for it. One import and you're ready.

from statsbombpy import sb

The data model: competitions → matches → events

StatsBomb organises everything in three layers, and understanding them upfront saves a lot of confused index errors.

Competitions. The outermost layer. Calling sb.competitions() returns a DataFrame with one row per competition-season — the 2022 World Cup is one row, the 2018 World Cup is another. As of this writing the free tier has 80 such rows across 24 distinct competition names.

Matches. Once you have a competition-season you can call sb.matches(competition_id=..., season_id=...) to get all matches in that tournament. The 2022 World Cup returns 64 rows, one per match, each carrying kick-off time, final score, home and away team names, and a match_id you'll use for the next step.

Events. The real payload. sb.events(match_id=...) returns every on-ball action in a single match: passes, shots, carries, duels, goalkeeper actions, set pieces. A match might produce a few thousand rows. Each event has a type, a team, a player, a location, a timestamp, and a cluster of type-specific columns — pass_end_location for passes, shot_statsbomb_xg for shots, and so on.

Coordinate convention
StatsBomb maps every event onto a 120×80 pitch, measured in yards. The origin (0, 0) is the bottom-left corner of the pitch as drawn. The attacking direction is always toward x = 120 — StatsBomb flips coordinates so the team in possession is always going right. When you plot data from two different teams in the same match without accounting for this, everything lands in the wrong half.

Loading competitions, then drilling down

The competition call is the starting point for almost every analysis. It is free and fast — it just fetches a small JSON file.

from statsbombpy import sb

# 1 — what's available? competitions() returns one row per competition-season.
comps = sb.competitions()
print(f"{len(comps)} competition-seasons available, "
      f"{comps.competition_name.nunique()} distinct competitions.")

Running this against the current open dataset prints 80 competition-seasons available, 24 distinct competitions. That range includes Champions League campaigns, La Liga seasons, the Women's World Cup, NWSL, and the full men's World Cup from 2018 and 2022. The 2022 tournament is competition_id 43.

# 2 — pick a competition-season (men's 2022 World Cup) and list its matches.
season = comps[(comps.competition_id == 43) & (comps.season_name == "2022")].iloc[0]
matches = sb.matches(competition_id=43, season_id=int(season.season_id))
print(f"{len(matches)} matches at the 2022 World Cup.")

That prints 64 matches at the 2022 World Cup. — every group stage game through the final. Each row contains both team names, the final score, and the match_id needed for the next step.

Pulling shots across a whole tournament

Calling sb.events() once per match and concatenating the results is the standard pattern for tournament-level analysis. It takes a minute or two on a normal connection — each match is a separate HTTP request — but it is straightforward:

# 3 — pull events for every match and keep the shots.
shots = []
for match_id in matches.match_id:
    ev = sb.events(match_id=int(match_id))
    shots.append(ev[ev.type == "Shot"])
shots = pd.concat(shots, ignore_index=True)
shots["xg"] = shots.shot_statsbomb_xg.astype(float)

The explicit astype(float) is worth explaining: shot_statsbomb_xg comes back as a mixed-type column because StatsBomb stores non-shot events in the same DataFrame with nulls in shot-specific columns. Casting it after filtering to shots only avoids the occasional coercion warning downstream.

From here, building the leaderboard is one line of pandas groupby:

# 4 — the ten players with the most expected goals.
top = shots.groupby("player").xg.sum().sort_values(ascending=False).head(10).round(2)
goals = shots[shots.shot_outcome == "Goal"].groupby("player").size()

The goals Series counts actual goals for the same players, which is what makes the output interesting: comparing xG to actual goals tells you immediately who was clinical and who left chances on the pitch.

Building the chart

The companion script produces a horizontal bar chart, which works better than a vertical one here because the player names are long. The top three bars get an accent colour to pull the eye; the rest are the house green. The label on each bar prints both xG and actual goals so readers don't have to cross-reference a table.

import matplotlib.pyplot as plt

PAPER, GREEN, GREEN_DK, ACCENT, INK = "#fbfaf3", "#0a5a3f", "#073e2c", "#f4c20d", "#16231c"

fig, ax = plt.subplots(figsize=(8.8, 5.6))
names = list(top.index)[::-1]   # reverse so highest is at the top
vals  = list(top.values)[::-1]
colors = [ACCENT if v >= sorted(vals)[-3] else GREEN for v in vals]

ax.barh(names, vals, color=colors, edgecolor=GREEN_DK, linewidth=0.5)
for i, (n, v) in enumerate(zip(names, vals)):
    ax.text(v, i, f"  {v:.1f} xG ({int(goals.get(n, 0))}G)",
            va="center", fontsize=9, color=INK)

ax.set_xlabel("Total expected goals (xG)")
ax.set_title("Most xG at the 2022 World Cup", loc="left",
             color=GREEN_DK, fontweight="bold")
ax.spines[["top", "right"]].set_visible(False)
ax.margins(x=0.18)
fig.patch.set_facecolor(PAPER)
ax.set_facecolor(PAPER)
fig.savefig("statsbomb-top-xg-wc2022.png", dpi=150,
            bbox_inches="tight", facecolor=PAPER)
Bar chart of the ten players with the most expected goals at the 2022 World Cup, computed from StatsBomb open data with statsbombpy.
The ten players with the most xG at the 2022 World Cup, built end-to-end with statsbombpy.Data: StatsBomb open data, retrieved June 2026.

Reading the leaderboard

The numbers tell a vivid story. Here is the full top ten, drawn directly from the data:

Top ten players by total xG at the 2022 FIFA World Cup (all rounds, penalty shootouts excluded). Data: StatsBomb open data, retrieved June 2026.
Player xG Goals
Lionel Andrés Messi Cuccittini7.609
Kylian Mbappé Lottin5.029
Robert Lewandowski3.132
Olivier Giroud3.044
Lautaro Javier Martínez2.911
Breel-Donald Embolo2.392
Harry Kane2.332
Nikola Vlašić2.202
Neymar da Silva Santos Junior2.132
Randal Kolo Muani2.052

Messi led the tournament with 7.60 xG and converted 9 goals — heavily outperforming his model, much of it through penalties. Mbappé matched that goal tally from only 5.02 xG, a finishing rate that speaks for itself. Lewandowski sits third on 3.13 xG but scored only 2 goals: Poland's group-stage exit meant he never converted the chances he found. Lautaro Martínez is the most striking entry — 2.91 xG, 1 goal, the clearest case of a striker who was in the right places and made nothing of it.

Attribution — don't skip this

StatsBomb's open data comes with a use requirement. Any public work that uses it — a blog post, a tweet, a dashboard — must carry clear credit to StatsBomb. The exact wording they request is "Data provided by StatsBomb" or "Powered by StatsBomb", plus a link to statsbomb.com. The data is free; the attribution is the price. It is a fair deal and worth honouring.

If you are building something that will be published — even informally — add a small text label to every figure and a note in any accompanying write-up. The companion script already appends Data: StatsBomb open data to each chart's source line; replicate that pattern and you're covered.

Where to go next

Once you are comfortable pulling shots you are a short step from most of the standard analyses in football analytics:

  • Filter events to type == "Pass" and use location and pass_end_location to draw a pass map — the mplsoccer tutorial on this site covers that exactly.
  • Group events by period to study how a team's pressing intensity changes as a match ages.
  • Pull the full freeze_frame column from shot events to see where the defence was standing at the moment of each shot — StatsBomb's data is one of the few free sources with that level of detail.
  • Loop over every competition-season available, not just the World Cup, to build career-level xG totals for any player who appears in the open data.

The full companion script is at scripts/getting-started-with-statsbomb-open-data.py in this site's repository. Change the competition_id filter and it maps any of the 24 competitions StatsBomb has opened up.

Sources & further reading