Build a Poisson Goals Model in Python: From Team Strengths to Scoreline Probabilities

Turn attack and defence strengths into the odds of every scoreline.

Football has a strange property that makes it both maddening and modellable: goals arrive almost at random. Almost. They are not quite a coin flip, because some teams really are better at scoring and others better at preventing it. The classic way to capture exactly that much structure — randomness plus a little team skill — is the Poisson model. In an afternoon, you can build one in Python that turns two teams' strengths into the probability of every scoreline.

Why goals are (nearly) Poisson

The Poisson distribution describes the number of times a rare, independent event happens in a fixed interval when there is some constant underlying rate. A football match fits that picture surprisingly well: goals are relatively rare, roughly independent of one another, and each team has some average scoring rate per game. Empirically, the count of goals a team scores in a match is well approximated by a Poisson distribution, and this has been known since the mid-twentieth century. It is not perfect (we will get to where it breaks), but it is a genuinely good first model, and good first models are worth their weight in gold.

A Poisson distribution has a single parameter, usually written as the Greek letter lambda, which is its mean: the expected number of goals. If we can estimate a sensible lambda for the home team and another for the away team in a given fixture, the distribution hands us the probability of 0 goals, 1 goal, 2 goals, and so on. Combine the two and we have the probability of any exact scoreline, and from there the odds of a home win, a draw, or an away win.
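To make the role of lambda concrete, here is a minimal sketch that evaluates the Poisson formula by hand with only the standard library. The value lam = 1.5 is a hypothetical expected-goals figure chosen for illustration, and poisson_pmf is a helper name introduced here; the main script later uses scipy.stats.poisson for the same calculation.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 1.5  # hypothetical expected goals for one team (illustration only)

# Probability of 0, 1, ..., 5 goals, plus whatever is left in the tail.
probs = [poisson_pmf(k, lam) for k in range(6)]
for k, p in enumerate(probs):
    print(f"P({k} goals) = {p:.3f}")
print(f"P(6 or more) = {1 - sum(probs):.3f}")
```

With an expectation of 1.5 goals, a blank (zero goals) still happens a little over a fifth of the time, which is the kind of intuition the single parameter gives you for free.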

Attack and defence strengths

The trick is estimating each team's lambda for a specific fixture. The standard recipe expresses every team's ability relative to the league average, using four quantities you compute from your own results data:

  • Attack strength. A team's average goals scored, divided by the league average goals scored. Above 1 means a better-than-average attack; below 1, worse.
  • Defence strength. A team's average goals conceded, divided by the league average goals conceded. Here, lower is better — below 1 means a meaner-than-average defence.
  • League baselines. The average goals scored by home teams and by away teams across the league, computed separately because home sides reliably score more.
  • A home-advantage factor. Folded into the model by using those separate home and away baselines, so the home team's expectation is built on the higher home scoring rate.

The expected goals for a fixture then come from multiplying these together. For a match between a home and an away side:

The core formula
home λ = home team attack × away team defence × league average home goals
away λ = away team attack × home team defence × league average away goals

Read it left to right and it is just bookkeeping: the home team's expected goals scale up with its own attacking strength, scale up further if the opponent's defence is leaky, and are anchored to how many goals a typical home team scores. The away expectation mirrors it. Note one important honesty point: the "attack" and "defence" strengths here are derived from your own results data (a season of final scores you have collected), not from any table reproduced in this article. The numbers in the toy example below are clearly labelled and invented only to make the code runnable.
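As a quick worked instance of the core formula, the sketch below plugs in made-up strengths and baselines. Every number in it is hypothetical and chosen only so the arithmetic is easy to follow; none comes from real results.

```python
# All values are hypothetical, invented to illustrate the formula.
home_attack = 1.20        # home side scores 20% more than a typical home team
away_defence = 0.90       # away side concedes 10% fewer than a typical away team
league_home_goals = 1.50  # average goals scored by home teams

away_attack = 0.80        # away side scores 20% fewer than a typical away team
home_defence = 1.10       # home side concedes 10% more than a typical home team
league_away_goals = 1.10  # average goals scored by away teams

# home lambda = home attack x away defence x league average home goals
home_lambda = home_attack * away_defence * league_home_goals
# away lambda = away attack x home defence x league average away goals
away_lambda = away_attack * home_defence * league_away_goals

print(f"home lambda = {home_lambda:.3f}")
print(f"away lambda = {away_lambda:.3f}")
```

Here the strong home attack lifts the baseline of 1.50 while the tight away defence pulls it back down, so the home expectation lands only modestly above the league figure.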

Setting up the data

You need a record of past results: home team, away team, home goals, away goals. In practice you would load a full season; here we use a tiny made-up table of four fictional teams so the script runs end to end. Replace it with your own data and nothing else changes.

import numpy as np
import pandas as pd
from scipy.stats import poisson

# --- TOY DATA (invented, for illustration only) -------------------
# Replace this with a real season of results: one row per match.
results = pd.DataFrame([
    # home,   away,    home_goals, away_goals
    ("Reds",   "Blues",  2, 1),
    ("Reds",   "Greens", 3, 0),
    ("Reds",   "Whites", 1, 1),
    ("Blues",  "Reds",   0, 2),
    ("Blues",  "Greens", 1, 1),
    ("Blues",  "Whites", 2, 0),
    ("Greens", "Reds",   1, 2),
    ("Greens", "Blues",  0, 0),
    ("Greens", "Whites", 1, 2),
    ("Whites", "Reds",   0, 1),
    ("Whites", "Blues",  1, 1),
    ("Whites", "Greens", 3, 1),
], columns=["home", "away", "home_goals", "away_goals"])

If you do not yet have results in this shape, the companion piece on assembling event and results data, getting started with StatsBomb open data, walks through loading matches into a DataFrame like this one.

Estimating the strengths

Now compute the league baselines and each team's attack and defence strength. Everything is an average divided by a league average, exactly as defined above.

# League-wide baselines: average goals by home and away teams.
avg_home_goals = results["home_goals"].mean()
avg_away_goals = results["away_goals"].mean()

# Goals each team scored / conceded at home and away.
home = results.groupby("home").agg(
    home_scored=("home_goals", "mean"),
    home_conceded=("away_goals", "mean"),
)
away = results.groupby("away").agg(
    away_scored=("away_goals", "mean"),
    away_conceded=("home_goals", "mean"),
)
teams = home.join(away)

# Strengths, relative to the league average. ~1.0 is average.
# Attack: higher is better. Defence: lower (concede less) is better.
teams["attack_home"]  = teams["home_scored"]   / avg_home_goals
teams["defence_home"] = teams["home_conceded"] / avg_away_goals
teams["attack_away"]  = teams["away_scored"]   / avg_away_goals
teams["defence_away"] = teams["away_conceded"] / avg_home_goals

print(teams[["attack_home", "defence_home",
             "attack_away", "defence_away"]].round(2))

With more data you would often pool home and away into a single attack and a single defence rating per team and fit them jointly (a Poisson regression), but separate home/away strengths keep this first version transparent and let the home-field effect fall out naturally.
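It can help to check one row of that table by hand. The sketch below recomputes the Reds' home strengths from the same invented toy results using only the standard library, so the numbers can be compared against the pandas output. The variable names are chosen for this illustration.

```python
from statistics import mean

# The same toy results as above (invented data), as plain tuples:
# (home, away, home_goals, away_goals)
matches = [
    ("Reds", "Blues", 2, 1), ("Reds", "Greens", 3, 0), ("Reds", "Whites", 1, 1),
    ("Blues", "Reds", 0, 2), ("Blues", "Greens", 1, 1), ("Blues", "Whites", 2, 0),
    ("Greens", "Reds", 1, 2), ("Greens", "Blues", 0, 0), ("Greens", "Whites", 1, 2),
    ("Whites", "Reds", 0, 1), ("Whites", "Blues", 1, 1), ("Whites", "Greens", 3, 1),
]

# League baselines: average goals by home teams and by away teams.
avg_home_goals = mean(m[2] for m in matches)
avg_away_goals = mean(m[3] for m in matches)

# The Reds' home matches only.
reds_home = [m for m in matches if m[0] == "Reds"]

# Home attack: goals scored at home relative to the home baseline.
reds_attack_home = mean(m[2] for m in reds_home) / avg_home_goals
# Home defence: goals conceded at home relative to the away baseline
# (what a typical away team scores).
reds_defence_home = mean(m[3] for m in reds_home) / avg_away_goals

print(f"League baselines: home {avg_home_goals:.2f}, away {avg_away_goals:.2f}")
print(f"Reds attack_home  = {reds_attack_home:.2f}")
print(f"Reds defence_home = {reds_defence_home:.2f}")
```

With only three home matches per team these ratios are very noisy, which is one reason a full season (or the pooled fit mentioned above) is preferable in practice.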

Predicting a fixture

To predict a specific match, combine the relevant strengths into the two lambdas, then ask scipy.stats.poisson for the probability of each goal count. The outer product of the home and away goal distributions gives a matrix whose entry [i, j] is the probability of the scoreline i–j.

def predict(home_team, away_team, max_goals=10):
    # Expected goals for each side (the two Poisson means).
    home_lambda = (teams.loc[home_team, "attack_home"]
                   * teams.loc[away_team, "defence_away"]
                   * avg_home_goals)
    away_lambda = (teams.loc[away_team, "attack_away"]
                   * teams.loc[home_team, "defence_home"]
                   * avg_away_goals)

    # P(0), P(1), ... P(max_goals) goals for each team.
    goals = np.arange(0, max_goals + 1)
    home_probs = poisson.pmf(goals, home_lambda)
    away_probs = poisson.pmf(goals, away_lambda)

    # Joint scoreline matrix: rows = home goals, cols = away goals.
    # Assumes the two scores are independent (the simple model's key bet).
    score_matrix = np.outer(home_probs, away_probs)

    home_win = np.tril(score_matrix, -1).sum()  # home goals > away goals
    draw     = np.trace(score_matrix)           # equal goals
    away_win = np.triu(score_matrix,  1).sum()  # away goals > home goals

    return home_lambda, away_lambda, home_win, draw, away_win


hl, al, hw, dr, aw = predict("Reds", "Blues")
print(f"Expected goals: Reds {hl:.2f} - {al:.2f} Blues")
print(f"Home win {hw:.1%} | Draw {dr:.1%} | Away win {aw:.1%}")

The three outcome probabilities are read straight off the scoreline matrix: sum everything below the diagonal for a home win, the diagonal itself for a draw, and everything above it for an away win. (They sum to slightly under 1 only because we capped the matrix at ten goals; raise max_goals and the leak vanishes.) Run it on your own season and you have a working match-odds model. To turn the same scoreline matrix into a live, minute-by-minute picture, the principles carry straight over to how win probability models work.
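To see the truncation leak directly, here is a small dependency-free sketch that rebuilds the three outcome probabilities with plain loops and prints their total for several caps. The two lambdas are illustrative inputs, and poisson_pmf and outcome_probs are helper names introduced for this example rather than part of the script above.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def outcome_probs(home_lambda, away_lambda, max_goals):
    """Home win, draw and away win probabilities from a capped score grid."""
    home_win = draw = away_win = 0.0
    for i in range(max_goals + 1):        # i = home goals
        for j in range(max_goals + 1):    # j = away goals
            # Independence assumption: joint probability is the product.
            p = poisson_pmf(i, home_lambda) * poisson_pmf(j, away_lambda)
            if i > j:
                home_win += p
            elif i == j:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win

home_lambda, away_lambda = 1.60, 0.44  # illustrative expected goals

totals = {}
for cap in (3, 5, 10):
    hw, dr, aw = outcome_probs(home_lambda, away_lambda, cap)
    totals[cap] = hw + dr + aw
    print(f"max_goals={cap:>2}: home {hw:.4f} draw {dr:.4f} "
          f"away {aw:.4f} total {totals[cap]:.6f}")
```

A low cap discards a visible share of the probability mass, while at ten goals the shortfall is negligible for lambdas in this range. If you ever model unusually high-scoring matches, it is worth checking the total in the same way.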

Where the simple model breaks — and Dixon-Coles

The basic Poisson model makes one assumption that does not quite hold: that the home and away scores are independent. Real football produces slightly more 0–0 and 1–1 draws, and slightly fewer 1–0 and 0–1 results, than independent Poissons predict. The scores are mildly correlated, especially at low totals.

The standard fix is the Dixon-Coles model, the refinement most working forecasters reach for. It adds two things to the recipe above. First, a low-score correction: a small adjustment factor applied to the 0–0, 1–0, 0–1, and 1–1 cells of the scoreline matrix, nudging those probabilities to match what really happens. Second, time weighting: recent matches are weighted more heavily than old ones when estimating strengths, so the model tracks a team's current form rather than averaging over a stale full season. Those two tweaks turn a good toy into something close to what professional models use.
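Both tweaks can be sketched in a few lines. The adjustment factor below follows the form usually given for the Dixon-Coles low-score correction, with a dependence parameter rho; the time weight is an exponential decay with rate xi. The values rho = -0.1 and xi = 0.005 per day are placeholders assumed for illustration (in practice both are fitted to data), and the function names are introduced here rather than taken from any library.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def tau(i, j, home_lambda, away_lambda, rho):
    """Low-score adjustment factor for the scoreline i-j (home-away)."""
    if i == 0 and j == 0:
        return 1 - home_lambda * away_lambda * rho
    if i == 0 and j == 1:
        return 1 + home_lambda * rho
    if i == 1 and j == 0:
        return 1 + away_lambda * rho
    if i == 1 and j == 1:
        return 1 - rho
    return 1.0  # every other cell is left untouched

def score_matrix(home_lambda, away_lambda, rho=0.0, max_goals=10):
    """Scoreline grid; rho=0 reproduces the independent Poisson model."""
    return [
        [tau(i, j, home_lambda, away_lambda, rho)
         * poisson_pmf(i, home_lambda) * poisson_pmf(j, away_lambda)
         for j in range(max_goals + 1)]
        for i in range(max_goals + 1)
    ]

def time_weight(days_ago, xi):
    """Exponential down-weighting of a match played days_ago days ago."""
    return math.exp(-xi * days_ago)

home_lambda, away_lambda = 1.60, 0.44   # illustrative expected goals
rho = -0.1                              # assumed; negative boosts 0-0 and 1-1
plain = score_matrix(home_lambda, away_lambda)
adjusted = score_matrix(home_lambda, away_lambda, rho)

for i, j in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(f"{i}-{j}: independent {plain[i][j]:.4f} -> adjusted {adjusted[i][j]:.4f}")

xi = 0.005  # assumed decay rate per day
for days in (0, 30, 180, 365):
    print(f"match {days:>3} days ago gets weight {time_weight(days, xi):.3f}")
```

The four adjustments cancel out, so the corrected grid carries the same total probability as the independent one; only the balance between low-scoring draws and one-goal wins shifts. The weights would multiply each match's contribution when the strengths are estimated, which needs a likelihood-based fit rather than the simple averages used earlier.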

Even unrefined, the Poisson model is a foundation worth having — the same attack-and-defence-strength thinking underlies the league table you can build from xG in build an xG-difference league table. And if your numbers ever disagree with a published forecast, the reasons usually trace back to modelling choices exactly like these; we lay them out in why league projection models disagree.

Sources & further reading

  • Free textbook: Chapter 20: Predictive Modeling — the theory behind this, at DataField.dev.
  • StatsBomb open data — match results you can load into the DataFrame above to estimate real attack and defence strengths.
  • FBref — historical results and scorelines for a wide range of competitions.
  • Understat — match and season data for the major European leagues.
  • StatsBomb — background on modelling goals and match outcomes.