
Every Free Public Soccer Data Source, Ranked

Where to actually get free football data, ranked by usefulness.

Before you can analyse anything, you need data. The good news is that football has more free public data than almost any other sport. The bad news is that it comes from a dozen different sources with inconsistent coverage, opaque licensing, and rate-limit tripwires that will get your IP blocked on a Saturday afternoon. Here is every source worth your time, ranked by how useful it actually is for real analysis — plus the catch that each one comes with.

How I ranked these

I scored each source on four dimensions: granularity (event-level beats match totals beats standings-only), breadth (how many competitions and seasons), ease of access (Python client beats documented API beats scraping beats "download this CSV manually"), and licence clarity (explicit open licence beats vague "for educational use" beats nothing stated at all). Where two sources tie on granularity, ease of access breaks the tie. Licence clarity is not a tiebreaker — it is a constraint. Data you are not allowed to republish or build products on is simply less useful, even if it is technically reachable.
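The tie-break rule above can be sketched as a sort key. This is only an illustration of the ordering logic: the enum levels mirror the ladders described in the paragraph, breadth is left out for brevity, the three sources are hypothetical placeholders, and the licence flag is kept out of the sort and applied as a separate check.

```python
from dataclasses import dataclass
from enum import IntEnum


class Granularity(IntEnum):
    STANDINGS = 1
    MATCH_TOTALS = 2
    EVENT_LEVEL = 3


class Access(IntEnum):
    MANUAL_CSV = 1
    SCRAPING = 2
    DOCUMENTED_API = 3
    PYTHON_CLIENT = 4


@dataclass(frozen=True)
class Source:
    name: str
    granularity: Granularity
    access: Access
    licence_allows_use: bool  # a constraint, not part of the sort key


def rank(sources):
    """Best first: granularity decides, ease of access breaks ties."""
    return sorted(sources, key=lambda s: (s.granularity, s.access), reverse=True)


# Hypothetical sources, for illustration only.
candidates = [
    Source("results-only CSV archive", Granularity.MATCH_TOTALS, Access.MANUAL_CSV, True),
    Source("event feed with a client", Granularity.EVENT_LEVEL, Access.PYTHON_CLIENT, True),
    Source("event feed behind scraping", Granularity.EVENT_LEVEL, Access.SCRAPING, False),
]

ranked = rank(candidates)
# The licence check runs after ranking: it narrows what you may build on.
usable = [s for s in ranked if s.licence_allows_use]
```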

The comparison at a glance

Free public soccer data sources compared. "Access" describes how you get the data programmatically.
| Source | What you get | Access | The catch |
| --- | --- | --- | --- |
| StatsBomb open data | Full event streams: every pass, shot, carry and pressure, with freeze frames and pre-computed xG. Dozens of competitions, including multiple full seasons of WSL, La Liga, Champions League and the last several World Cups. | Git clone or statsbombpy Python client; no account needed. | Competition coverage is curated by StatsBomb's commercial interests; the Premier League is almost entirely absent. Non-commercial use only; attribution required. Data can be large: the full repo is several GB. |
| Understat | Match xG, shot-level xG with location, and season xG tables for six European leagues: Premier League, La Liga, Bundesliga, Serie A, Ligue 1, and RFPL (Russian Premier League). Data goes back to the 2014/15 season. | Unofficial Python client (understat) or direct JSON scraping from the site's embedded scripts. | No formal API and no explicit open licence. The underlying xG model is a black box: the exact feature set is undocumented. RFPL coverage can lag during the season. |
| FBref | Broad advanced stats for hundreds of competitions: xG, progressive passes and carries, defensive pressure, at squad and player level. Powered by Opta data. | Web scraping with soccerdata or direct HTML scraping; no official API. | Sports Reference rate-limits aggressively and bans scrapers that ignore pauses. The terms of service restrict systematic scraping; follow their rate-limit guidance (at most one request every three seconds, cache everything). Data is Opta-licensed, so redistribution is off the table. |
| football-data.co.uk | Match results, final scores, half-time scores, referee, and a full suite of historic bookmaker odds for the top European and some lower-division leagues, going back to the early 1990s in places. Plain CSV files. | Direct CSV download; the file structure is consistent enough to automate with a script. | No event data and no player-level data, purely match-level results and odds. The site's primary use case is odds history, which makes it feel a little alien for pure analytics work. |
| OpenFootball / jokecamp | Results, fixtures, and standings for major leagues and international tournaments, formatted as CSV or JSON and hosted on GitHub. | Git clone or raw GitHub download. | Community-maintained with no formal update guarantee. Coverage is uneven: some repositories are years out of date. Good for historical lookups; not a live feed. |
| Kaggle datasets | A long tail of community-uploaded datasets: historical results, player attributes from Football Manager–style databases, pre-built xG tables, and occasional scrapes of the sources above. | Kaggle API or browser download. | Quality ranges from excellent to completely wrong. Always check the source of the source: many Kaggle datasets are re-exports from FBref or Understat, so the same licence restrictions apply. Treat as a discovery layer, not a primary source. |
| Club and league official APIs | A handful of leagues and federations expose public web APIs: the Premier League's own site is backed by one, and some national federations have fixture feeds. Coverage is results and lineups only. | HTTP REST API, no client libraries; usually no authentication required for basic endpoints. | These APIs are mostly undocumented, unsupported for third-party use, and change without warning. The Premier League's API in particular has broken third-party tools several times. Treat as a pleasant surprise rather than a dependency. |

1. StatsBomb open data — the clear winner

Nothing else comes close on granularity. StatsBomb's free dataset uses the same format as their commercial product: every event on the pitch, timestamped and located on a 120×80 coordinate grid, with optional freeze-frame snapshots of every player's position at the moment of each shot. The pre-computed xG values are baked in. With the statsbombpy Python client you are three lines of code from a full match's worth of events, with no account, no key, and no rate limit to worry about.
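Because the event files are plain JSON, a local clone can also be read without any client library. The sketch below sums the pre-computed xG per team for one match. The field names (`type.name`, `team.name`, `shot.statsbomb_xg`, `shot.outcome.name`) reflect my reading of the open-data format and should be checked against the published spec; the sample events are hand-written placeholders, not real match data.

```python
import json
from collections import defaultdict


def xg_by_team(events):
    """Count shots and goals and sum pre-computed xG per team for one match."""
    totals = defaultdict(lambda: {"shots": 0, "xg": 0.0, "goals": 0})
    for event in events:
        if event.get("type", {}).get("name") != "Shot":
            continue  # passes, carries, pressures and so on are skipped
        shot = event.get("shot", {})
        team = event["team"]["name"]
        totals[team]["shots"] += 1
        totals[team]["xg"] += shot.get("statsbomb_xg", 0.0)
        if shot.get("outcome", {}).get("name") == "Goal":
            totals[team]["goals"] += 1
    return dict(totals)


# Hand-written sample in the assumed shape (hypothetical teams and values).
sample = json.loads("""
[
  {"type": {"name": "Pass"}, "team": {"name": "Home FC"}, "location": [60.0, 40.0]},
  {"type": {"name": "Shot"}, "team": {"name": "Home FC"}, "location": [108.0, 38.0],
   "shot": {"statsbomb_xg": 0.25, "outcome": {"name": "Goal"}}},
  {"type": {"name": "Shot"}, "team": {"name": "Away FC"}, "location": [100.0, 50.0],
   "shot": {"statsbomb_xg": 0.05, "outcome": {"name": "Saved"}}},
  {"type": {"name": "Shot"}, "team": {"name": "Home FC"}, "location": [112.0, 42.0],
   "shot": {"statsbomb_xg": 0.5, "outcome": {"name": "Off T"}}}
]
""")

summary = xg_by_team(sample)
```

With a clone of the repository, the same function should work on the list returned by `json.load` for a single match's events file.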

The honest caveat is that StatsBomb's free release is shaped by their marketing. Competitions that their commercial clients care about (the Women's Super League, the Africa Cup of Nations, La Liga with its Messi years) are well represented. The Premier League, where every club already pays for data, barely appears. If your analysis lives and dies in the English top flight, StatsBomb open data is a research tool, not a production one.

Licence summary
Non-commercial use with attribution. Republishing raw data files is not permitted. Building public visualisations and writing articles is fine; selling a product built on the data is not. When in doubt, read the repository's README.

2. Understat — the best season-level xG archive

Understat built its xG model independently and has been running it since 2014. For the six leagues it covers, it provides not just season totals but the individual shot record — location, xG value, minute, and outcome — which means you can reconstruct shot maps and run your own aggregations without relying on the site's tables. The Python understat library wraps the JSON that the site embeds in its HTML pages, which is a fragile mechanism but has been stable for years.
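To make the "JSON embedded in HTML" mechanism concrete, here is a minimal sketch of the extraction step. It assumes the page contains a script of the form `var NAME = JSON.parse('...')` with the payload written as `\xNN` hex escapes, which is how the data has commonly been embedded. The variable name `shotsData` and the `xG` key are assumptions for illustration, and the sample page is synthetic.

```python
import json
import re


def extract_embedded_json(html, var_name):
    """Pull a JSON payload out of `var <var_name> = JSON.parse('...')`."""
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*JSON\.parse\('(.*?)'\)",
        re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        raise ValueError(f"no embedded variable named {var_name!r} found")
    # Turn \xNN escapes back into characters; leave everything else alone.
    decoded = re.sub(
        r"\\x([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        match.group(1),
    )
    return json.loads(decoded)


# Synthetic page in the assumed shape; the payload decodes to [{"xG":"0.5"}].
page = r"<script>var shotsData = JSON.parse('\x5B\x7B\x22xG\x22\x3A\x220.5\x22\x7D\x5D');</script>"

shots = extract_embedded_json(page, "shotsData")
total_xg = sum(float(shot["xG"]) for shot in shots)  # values arrive as strings
```

Raising on a missing variable is deliberate: if the site changes its markup, the scraper should fail loudly rather than return an empty table.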

The weakness is the closed model. You cannot inspect what Understat's xG formula actually weights. For exploratory work that is fine; for a published methodology you want to know whether the model controls for defensive pressure or just uses geometry. It almost certainly uses something close to the latter, but you cannot be certain.

3. FBref — breadth at the cost of friction

FBref is remarkable in scope. Hundreds of competitions, player-level advanced stats, season-on-season trends going back years — powered by Opta, which means the underlying data quality is as good as anything in the commercial world. For competition breadth it has no peer among free sources. The soccerdata library handles the scraping cleanly and respects rate limits out of the box.

The friction is real, though. FBref is a web scraping target, not a data API, and Sports Reference will block you if you ignore their rate-limit guidance. The rule is simple and non-negotiable: at most one request every three seconds (slower if Sports Reference's current bot policy asks for it), every response cached locally, and no scraping from CI. Beyond the technical limits, the Opta licence means you cannot republish the underlying numbers. You can publish derivative analysis, such as a chart or a model output, but not a CSV of the raw stats. For a blog or personal project that is rarely a problem; for a tool that surfaces raw FBref numbers to end users, it is a real constraint.
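If you scrape by hand rather than through soccerdata, the rule can be enforced in one small wrapper. The sketch below is a generic standard-library pattern, not anything FBref provides. The class name, the User-Agent string and the three-second default are illustrative assumptions, and the download, sleep and clock functions are injectable so the behaviour can be checked without touching the network.

```python
import hashlib
import time
import urllib.request
from pathlib import Path


def _download(url):
    """Default downloader: a plain GET with an identifying User-Agent."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "polite-football-scraper/0.1"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


class PoliteFetcher:
    """Fetch each URL at most once, spacing real requests `min_interval` apart."""

    def __init__(self, cache_dir, min_interval=3.0, download=_download,
                 sleep=time.sleep, clock=time.monotonic):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.min_interval = min_interval
        self._download = download
        self._sleep = sleep
        self._clock = clock
        self._last_request = None

    def get(self, url):
        cache_file = self.cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
        if cache_file.exists():
            return cache_file.read_bytes()  # cached: no request, no wait
        if self._last_request is not None:
            wait = self.min_interval - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        body = self._download(url)
        self._last_request = self._clock()
        cache_file.write_bytes(body)
        return body
```

A typical use is `PoliteFetcher("cache").get(url)` inside a loop over pages. Re-running the script then costs the site nothing, because every page already on disk is served from the cache.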

4. football-data.co.uk — the underrated results archive

Joseph Buchdahl's site has been publishing plain CSV files of match results and odds since the late 1990s, and it still works exactly the same way. You curl a URL and you get a spreadsheet. The historical depth is unmatched (some English lower-division series go back thirty years), and the bookmaker odds columns make it essential for any work that needs implied-probability benchmarks or wants to test whether a model beats the betting market.
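As a sketch of the implied-probability benchmark, the snippet below converts decimal odds into probabilities and removes the bookmaker's margin by proportional normalisation, which is the simplest of several possible methods. The column names (`B365H`, `B365D`, `B365A`) follow the site's usual header convention as I understand it; the row itself is a made-up example.

```python
import csv
import io


def implied_probabilities(home_odds, draw_odds, away_odds):
    """Decimal odds -> probabilities that sum to 1 (margin removed pro rata)."""
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    overround = sum(raw)  # exceeds 1 by the bookmaker's margin
    return [p / overround for p in raw]


# Hypothetical row in the assumed column layout.
sample_csv = """Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,B365H,B365D,B365A
E2,05/11/2022,Home FC,Away FC,2,1,H,2.00,3.50,3.75
"""

benchmarks = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    probs = implied_probabilities(
        float(row["B365H"]), float(row["B365D"]), float(row["B365A"])
    )
    benchmarks.append((row["HomeTeam"], row["AwayTeam"], probs))
```

Real files sometimes have blank odds cells, so a production version would need to skip or handle rows where the `float` conversion fails.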

Its ranking here reflects its ceiling, not a flaw: there is no pass-level data, no xG, no player-level anything. It is a results archive, and it is the best free results archive that exists. If your question is "do home teams in League One win more in November than March?", this is your first stop. If your question involves anything that happened during the ninety minutes, you need a different source.
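The League One question maps onto a few lines of standard-library Python. This sketch assumes the `Date` and `FTR` columns (with `H` marking a home win) and tolerates both two-digit and four-digit years, since the date format has varied between files; the rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def parse_date(text):
    """Accept dd/mm/yyyy as well as dd/mm/yy."""
    for fmt in ("%d/%m/%Y", "%d/%m/%y"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def home_win_rate_by_month(rows):
    """Map calendar month (1-12) to the share of matches won by the home side."""
    played = defaultdict(int)
    home_wins = defaultdict(int)
    for row in rows:
        month = parse_date(row["Date"]).month
        played[month] += 1
        if row["FTR"] == "H":
            home_wins[month] += 1
    return {month: home_wins[month] / played[month] for month in played}


# Hypothetical rows in the assumed column layout.
sample_csv = """Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
E2,05/11/2022,Home FC,Away FC,2,1,H
E2,12/11/22,Other FC,Home FC,0,1,A
E2,04/03/2023,Away FC,Other FC,1,1,D
"""

rates = home_win_rate_by_month(csv.DictReader(io.StringIO(sample_csv)))
```

On real data you would concatenate many seasons before reading anything into the gap between two months, since a single season gives only a few dozen matches per month.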

5. OpenFootball and Kaggle — useful but verify everything

The various OpenFootball repositories on GitHub — including the long-running jokecamp collection — are excellent for structured historical results when you want clean JSON or CSV without any scraping at all. The problem is maintenance: open-source sports data projects are notorious for going quiet at exactly the moment you need last season's data. Before building a pipeline on any of them, check the last commit date.
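The last-commit check can be automated. GitHub's REST API has a "list commits" endpoint (`https://api.github.com/repos/{owner}/{repo}/commits?per_page=1`) that returns the newest commit first. The sketch below works on the parsed JSON from that endpoint, so it can be tried offline; the payload is a trimmed, hand-written stand-in for a real response, and any staleness threshold you apply to the result is your own choice.

```python
import json
from datetime import datetime, timezone


def days_since_last_commit(commits, now=None):
    """Age in whole days of the newest commit in a 'list commits' response."""
    if not commits:
        raise ValueError("response contains no commits")
    stamp = commits[0]["commit"]["committer"]["date"]  # ISO 8601, UTC ("Z")
    last = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - last).days


# Trimmed stand-in for a real API response (hypothetical commit).
payload = json.loads("""
[{"sha": "abc123",
  "commit": {"committer": {"name": "someone", "date": "2023-05-01T12:00:00Z"}}}]
""")

age = days_since_last_commit(
    payload, now=datetime(2023, 5, 31, 12, 0, tzinfo=timezone.utc)
)
```

Unauthenticated calls to the GitHub API are rate-limited, so run this once when evaluating a repository rather than on every pipeline execution.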

Kaggle sits above these in discoverability but below them in trustworthiness. It is where you find that someone has already assembled the dataset you need — and also where you find out, three hours later, that they made a systematic error in the season mapping. Use it as a signpost to what is possible and then go upstream to the original source.

A note on licences and rate limits

Two rules that are worth tattooing on your wrist before you start any project that touches free public data.

First, "free to access" is not the same as "free to use". StatsBomb's open data requires attribution and prohibits commercial use. FBref's data is Opta's property, licensed to Sports Reference for display — scraping it does not transfer any licence to you. football-data.co.uk does not publish a formal licence at all, which in practice means you cannot assume you have any rights beyond personal use. Whenever you plan to publish derived work or share data files with others, read the terms. If there are no terms, ask before you ship.

Second, rate limits are not suggestions. A polite scraper caches responses and sleeps between requests. An impolite scraper burns through a site's infrastructure, gets blocked, and then complains on forums that the data "went down". The community's continued access to sources like FBref depends on scrapers behaving themselves. One request every three seconds, always. Set it and forget it.

Sources & further reading

  • Free textbook: Chapter 4: Python Tools for Soccer Analytics — the theory behind this, at DataField.dev.
  • StatsBomb open data — the richest free event-level dataset in football.
  • StatsBomb — documentation on the data spec, event types, and freeze-frame format.
  • Understat — xG and shot-level data for the top six European leagues since 2014/15.
  • FBref — broad advanced stats across hundreds of competitions, powered by Opta.
  • football-data.co.uk — match results and bookmaker odds going back decades, in plain CSV.
  • jokecamp FootballData — community-curated results and fixture data on GitHub.
  • Kaggle datasets — community-uploaded football datasets; useful as a discovery layer.
  • soccerdata — Python library for scraping FBref, Understat and others with sensible rate limiting built in.