Capstone - Complete Analytics System
Introduction to Women's Football Analytics
Women's football has experienced explosive growth over the past decade, with record-breaking attendance figures, increased broadcast coverage, and growing investment from clubs and federations. Analytics has become an essential tool for teams looking to gain competitive advantages in this rapidly professionalizing environment.
The Growth of Women's Football Analytics
While women's football has historically had less data coverage, the landscape is rapidly changing. Major data providers now cover top women's leagues, creating unprecedented opportunities for analysts to contribute to the development of the sport.
# Loading Women's Football Data from StatsBomb
from statsbombpy import sb
import pandas as pd

# Get available competitions
competitions = sb.competitions()

# Filter for women's competitions
womens_comps = competitions[
    competitions["competition_name"].str.contains(
        "Women|WSL|NWSL|FAWSL", case=False, na=False
    )
]
print(womens_comps[["competition_name", "season_name"]])

# Load Women's World Cup 2023 data
wwc_matches = sb.matches(competition_id=72, season_id=107)

# Load event data for matches
all_events = []
for match_id in wwc_matches["match_id"]:
    events = sb.events(match_id=match_id)
    all_events.append(events)
wwc_events = pd.concat(all_events, ignore_index=True)

# Summarize available data
print(f"Total matches: {len(wwc_matches)}")
print(f"Total events: {len(wwc_events)}")

# Event type breakdown
event_summary = wwc_events["type"].value_counts().head(10)
print(event_summary)
# Loading Women's Football Data from StatsBomb
library(StatsBombR)
library(tidyverse)

# StatsBomb provides free women's football data
# Get available competitions
competitions <- FreeCompetitions()

# Filter for women's competitions
womens_comps <- competitions %>%
  filter(str_detect(competition_name, "Women|WSL|NWSL|FAWSL"))
print(womens_comps %>% select(competition_name, season_name))

# Load Women's World Cup 2023 data
# FreeMatches() expects a competitions data frame, not a bare ID
wwc_matches <- FreeMatches(Competitions = filter(competitions, competition_id == 72))

# Load all event data (free_allevents() takes the full matches data frame)
wwc_events <- free_allevents(MatchesDF = wwc_matches)

# Summarize available data
cat("Total matches:", nrow(wwc_matches), "\n")
cat("Total events:", nrow(wwc_events), "\n")

# Event type breakdown
event_summary <- wwc_events %>%
  count(type.name, sort = TRUE) %>%
  head(10)
print(event_summary)
The Women's Football Data Landscape
Understanding the data ecosystem for women's football is crucial for analysts working in this space. While coverage has expanded significantly, there are still important differences compared to men's football data availability.
Leagues and competitions with regular data coverage:
- English WSL (Women's Super League)
- Spanish Liga F
- German Frauen-Bundesliga
- French Division 1 Feminine
- NWSL (USA)
- UEFA Women's Champions League
- FIFA Women's World Cup

Data providers:
- StatsBomb - Event data, free WWC data
- Opta - Event data for top leagues
- Wyscout - Video and event data
- Second Spectrum - Tracking data (limited)
- SkillCorner - Broadcast tracking
- FBref - Free aggregated statistics
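The provider list above can be encoded as a simple lookup table, which is handy when deciding where to source data for a given project. The coverage notes come from the list itself; the dictionary structure and helper function are illustrative conveniences, not any provider's API.

```python
# Illustrative registry of women's football data sources (coverage notes from the list above)
DATA_PROVIDERS = {
    "StatsBomb": {"data_type": "event", "free_tier": True, "notes": "free WWC data"},
    "Opta": {"data_type": "event", "free_tier": False, "notes": "top leagues"},
    "Wyscout": {"data_type": "video+event", "free_tier": False, "notes": ""},
    "Second Spectrum": {"data_type": "tracking", "free_tier": False, "notes": "limited"},
    "SkillCorner": {"data_type": "broadcast tracking", "free_tier": False, "notes": ""},
    "FBref": {"data_type": "aggregated", "free_tier": True, "notes": "free stats"},
}

def free_sources(providers):
    """Return the names of providers offering a free tier, sorted alphabetically."""
    return sorted(name for name, info in providers.items() if info["free_tier"])

print(free_sources(DATA_PROVIDERS))  # ['FBref', 'StatsBomb']
```

A registry like this makes it easy to filter by data type as well, e.g. listing only tracking-data vendors when planning a physical-performance study.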
# Scraping Women's Football Stats from FBref
import pandas as pd

# FBref URLs for women's leagues
WSL_URL = "https://fbref.com/en/comps/189/2023-2024/2023-2024-Womens-Super-League-Stats"
LIGA_F_URL = "https://fbref.com/en/comps/230/Liga-F-Stats"
NWSL_URL = "https://fbref.com/en/comps/182/NWSL-Stats"

def get_league_table(url):
    """Scrape league standings from FBref."""
    tables = pd.read_html(url)
    # League table is typically the first table on the page
    standings = tables[0]
    return standings

# Get WSL standings
wsl_standings = get_league_table(WSL_URL)
print("WSL Standings:")
print(wsl_standings[["Squad", "MP", "W", "D", "L", "GF", "GA", "GD", "Pts"]].head())

# Alternative: Using the soccerdata library
try:
    import soccerdata as sd
    # Initialize FBref reader for women's data
    fbref = sd.FBref(leagues=["ENG-WSL"], seasons=["2023-2024"])
    # Get squad statistics
    squad_stats = fbref.read_team_season_stats()
    print(squad_stats.head())
except ImportError:
    print("Install soccerdata: pip install soccerdata")

# Custom scraping function for player stats
def get_player_stats(url):
    """Extract the first player-level statistics table from an FBref page."""
    tables = pd.read_html(url)
    for table in tables:
        if "Player" in table.columns:
            return table
    return None

# Compare goal-scoring across leagues
def compare_leagues(urls_dict):
    """Compare statistics across multiple women's leagues."""
    comparison = []
    for league, url in urls_dict.items():
        try:
            stats = get_player_stats(url)
            if stats is not None:
                comparison.append({
                    "League": league,
                    "Avg_Goals": stats["Gls"].mean(),
                    "Max_Goals": stats["Gls"].max()
                })
        except Exception:
            # Skip leagues whose pages fail to load or parse
            continue
    return pd.DataFrame(comparison)
# Scraping Women's Football Stats from FBref
library(worldfootballR)
library(tidyverse)

# Get WSL (Women's Super League) standings
wsl_standings <- fb_season_team_stats(
  country = "ENG",
  gender = "F",
  season_end_year = 2024,
  tier = "1st",
  stat_type = "league_table"
)
print(wsl_standings %>% select(Squad, MP, W, D, L, GF, GA, GD, Pts))

# Get player stats from Liga F
# (fb_season_team_stats() returns team-level data; player-level stats
# come from fb_league_stats() with team_or_player = "player")
ligaf_stats <- fb_league_stats(
  country = "ESP",
  gender = "F",
  season_end_year = 2024,
  tier = "1st",
  stat_type = "standard",
  team_or_player = "player"
)

# Top scorers analysis (column names follow the FBref table headers)
top_scorers <- ligaf_stats %>%
  arrange(desc(Gls)) %>%
  select(Squad, Player, Gls, Ast, xG, xAG) %>%
  head(10)
print(top_scorers)

# Get NWSL player data
nwsl_stats <- fb_league_stats(
  country = "USA",
  gender = "F",
  season_end_year = 2024,
  tier = "1st",
  stat_type = "standard",
  team_or_player = "player"
)

# Compare leagues
compare_leagues <- function(data, league_name) {
  data %>%
    summarise(
      League = league_name,
      Avg_Goals = mean(Gls, na.rm = TRUE),
      Avg_xG = mean(xG, na.rm = TRUE),
      Max_Goals = max(Gls, na.rm = TRUE)
    )
}
Analytical Considerations for Women's Football
While the fundamental principles of football analytics apply across both men's and women's football, there are important considerations and nuances that analysts should be aware of when working with women's football data.
Important Considerations
Women's football should be analyzed on its own terms, not simply compared to men's football. Metrics and models should be calibrated specifically for women's football data, and insights should be contextualized within the women's game.
Key Analytical Differences
Physical Metrics
- Different baseline values for speed, distance, acceleration
- Pitch dimensions may vary (some leagues use smaller pitches)
- Ball size and weight standardization differences
- Goalkeeper reach and diving ranges differ
Statistical Baselines
- League-specific xG models needed
- Different goal-scoring rates and patterns
- Set piece conversion rates vary
- Pressing intensity benchmarks differ
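A quick way to see why league-specific xG models matter: compare an existing model's total predicted goals against actual goals on women's data and derive a global scaling factor. This is a toy sketch with invented numbers, not real league data or a real calibration method recommendation; in practice you would refit or recalibrate (e.g. with isotonic or Platt scaling) on the target league's shots.

```python
# Toy calibration check: does a borrowed xG model over- or under-predict goals?
# All numbers below are illustrative, not real data.
predicted_xg = [0.12, 0.05, 0.30, 0.08, 0.45, 0.10, 0.22, 0.18]  # model outputs per shot
outcomes =     [0,    0,    1,    0,    1,    0,    0,    1]      # 1 = goal

total_xg = sum(predicted_xg)   # 1.50 expected goals
total_goals = sum(outcomes)    # 3 actual goals

# Global calibration factor: > 1 means the model under-predicts on this data
calibration_factor = total_goals / total_xg
print(f"Calibration factor: {calibration_factor:.2f}")  # 2.00

# Rescaled (and capped) probabilities for league-specific use
recalibrated = [min(p * calibration_factor, 0.99) for p in predicted_xg]
print(f"Recalibrated total xG: {sum(recalibrated):.2f}")  # 3.00
```

A factor far from 1.0 is a strong signal that the borrowed model's baselines do not transfer and a dedicated model, like the one built below, is warranted.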
# Building Women's Football Specific xG Model
from statsbombpy import sb
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss, roc_auc_score

# Load Women's World Cup data
wwc_matches = sb.matches(competition_id=72, season_id=107)

# Collect all shots
all_shots = []
for match_id in wwc_matches["match_id"]:
    events = sb.events(match_id=match_id)
    shots = events[events["type"] == "Shot"]
    all_shots.append(shots)
shots_df = pd.concat(all_shots, ignore_index=True)

# Extract shot location (guard against missing values, which arrive as NaN floats)
shots_df["x"] = shots_df["location"].apply(lambda loc: loc[0] if isinstance(loc, list) else None)
shots_df["y"] = shots_df["location"].apply(lambda loc: loc[1] if isinstance(loc, list) else None)

# Calculate features (StatsBomb pitch is 120x80, goal centre at (120, 40))
shots_df["distance"] = np.sqrt(
    (120 - shots_df["x"])**2 + (40 - shots_df["y"])**2
)
shots_df["angle"] = np.arctan2(
    np.abs(40 - shots_df["y"]),
    120 - shots_df["x"]
) * 180 / np.pi

# Binary features
shots_df["is_header"] = (shots_df["shot_body_part"] == "Head").astype(int)
shots_df["is_first_time"] = shots_df["shot_first_time"].fillna(False).astype(int)
shots_df["is_open_play"] = (shots_df["shot_type"] == "Open Play").astype(int)
shots_df["is_goal"] = (shots_df["shot_outcome"] == "Goal").astype(int)

# Baseline statistics
print("Women's WWC Shot Statistics:")
print(f"Total shots: {len(shots_df)}")
print(f"Conversion rate: {shots_df['is_goal'].mean():.3f}")
print(f"Header conversion: {shots_df[shots_df['is_header'] == 1]['is_goal'].mean():.3f}")

# Prepare model features
features = ["distance", "angle", "is_header", "is_first_time", "is_open_play"]
X = shots_df[features].dropna()
y = shots_df.loc[X.index, "is_goal"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train xG model
xg_model = LogisticRegression(max_iter=1000)
xg_model.fit(X_train, y_train)

# Predictions
shots_df.loc[X.index, "xG_custom"] = xg_model.predict_proba(X)[:, 1]

# Evaluate
y_pred = xg_model.predict_proba(X_test)[:, 1]
print("\nModel Performance:")
print(f"Brier Score: {brier_score_loss(y_test, y_pred):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_pred):.4f}")

# Compare with StatsBomb xG
correlation = shots_df[["xG_custom", "shot_statsbomb_xg"]].corr()
print(f"\nCorrelation with StatsBomb xG: {correlation.iloc[0, 1]:.3f}")
# Building Women's Football Specific xG Model
library(tidyverse)
library(StatsBombR)

# Load Women's World Cup shot data
competitions <- FreeCompetitions()
wwc_matches <- FreeMatches(Competitions = filter(competitions, competition_id == 72))
wwc_events <- free_allevents(MatchesDF = wwc_matches)
# allclean() splits location into location.x / location.y, among other tidying
wwc_events <- allclean(wwc_events)

# Filter shots and prepare features
shots <- wwc_events %>%
  filter(type.name == "Shot") %>%
  mutate(
    # Calculate distance and angle (120x80 pitch, goal centre at (120, 40))
    distance = sqrt((120 - location.x)^2 + (40 - location.y)^2),
    angle = atan2(abs(40 - location.y), 120 - location.x) * 180 / pi,
    # Shot type features (shot.first_time is NA when not set, so coalesce to FALSE)
    is_header = shot.body_part.name == "Head",
    is_first_time = coalesce(shot.first_time, FALSE),
    # Situation features
    is_open_play = shot.type.name == "Open Play",
    is_penalty = shot.type.name == "Penalty",
    is_freekick = shot.type.name == "Free Kick",
    # Outcome
    is_goal = shot.outcome.name == "Goal"
  )

# Calculate baseline conversion rates
baseline_stats <- shots %>%
  summarise(
    total_shots = n(),
    total_goals = sum(is_goal),
    conversion_rate = mean(is_goal),
    # By type
    header_conversion = mean(is_goal[is_header], na.rm = TRUE),
    open_play_conversion = mean(is_goal[is_open_play], na.rm = TRUE),
    penalty_conversion = mean(is_goal[is_penalty], na.rm = TRUE)
  )
print(baseline_stats)

# Build logistic regression xG model
xg_model <- glm(
  is_goal ~ distance + angle + is_header + is_first_time +
    is_open_play + is_freekick,
  data = shots,
  family = binomial()
)
summary(xg_model)

# Add xG predictions (predict on newdata so rows with missing
# predictors get NA instead of causing a length mismatch)
shots$xG_custom <- predict(xg_model, newdata = shots, type = "response")

# Compare with StatsBomb xG
comparison <- shots %>%
  filter(!is.na(shot.statsbomb_xg), !is.na(xG_custom)) %>%
  summarise(
    correlation = cor(xG_custom, shot.statsbomb_xg),
    mean_difference = mean(xG_custom - shot.statsbomb_xg)
  )
cat("Correlation with StatsBomb xG:", comparison$correlation, "\n")
Player Evaluation in Women's Football
Player evaluation in women's football requires understanding the context of the women's game, including league quality differences, international experience, and a smaller professional talent pool than in men's football.
# Player Evaluation Framework for Women's Football
from statsbombpy import sb
import pandas as pd
import numpy as np

# Load Women's World Cup data
wwc_matches = sb.matches(competition_id=72, season_id=107)

# Collect all events
all_events = []
for match_id in wwc_matches["match_id"]:
    events = sb.events(match_id=match_id)
    events["match_id"] = match_id
    all_events.append(events)
events_df = pd.concat(all_events, ignore_index=True)

# Calculate player statistics
def calculate_player_stats(events):
    """Calculate comprehensive player statistics."""
    # Group by player
    player_stats = events.groupby(["player", "team"]).agg({
        "match_id": "nunique",
        "type": "count"
    }).reset_index()
    player_stats.columns = ["player", "team", "matches", "events"]

    # Shots and goals
    shots = events[events["type"] == "Shot"].groupby("player").agg({
        "shot_statsbomb_xg": ["count", "sum"],
        "shot_outcome": lambda x: (x == "Goal").sum()
    }).reset_index()
    shots.columns = ["player", "shots", "xG", "goals"]

    # Passes
    passes = events[events["type"] == "Pass"].groupby("player").agg({
        "id": "count",
        "pass_outcome": lambda x: x.isna().mean(),  # NaN outcome = completed pass
        "pass_shot_assist": "sum"
    }).reset_index()
    passes.columns = ["player", "passes", "pass_completion", "key_passes"]

    # Defensive actions
    defensive = events[events["type"].isin(["Pressure", "Interception"])].groupby("player").agg({
        "type": [
            lambda x: (x == "Pressure").sum(),
            lambda x: (x == "Interception").sum()
        ]
    }).reset_index()
    defensive.columns = ["player", "pressures", "interceptions"]

    # Merge all stats
    player_stats = player_stats.merge(shots, on="player", how="left")
    player_stats = player_stats.merge(passes, on="player", how="left")
    player_stats = player_stats.merge(defensive, on="player", how="left")
    return player_stats.fillna(0)

player_stats = calculate_player_stats(events_df)

# Calculate per-90 metrics (minutes are estimated, not exact)
player_stats["est_minutes"] = player_stats["matches"] * 75  # Rough per-match average
player_stats["xG_p90"] = player_stats["xG"] / player_stats["est_minutes"] * 90
player_stats["shots_p90"] = player_stats["shots"] / player_stats["est_minutes"] * 90
player_stats["pressures_p90"] = player_stats["pressures"] / player_stats["est_minutes"] * 90

# Filter for players with sufficient playing time
qualified = player_stats[player_stats["matches"] >= 3].copy()

# Percentile rankings
for col in ["xG_p90", "shots_p90", "pressures_p90"]:
    qualified[f"{col}_pct"] = qualified[col].rank(pct=True) * 100

# Top performers
top_by_xg = qualified.nlargest(10, "xG_p90")[
    ["player", "team", "matches", "xG", "xG_p90", "xG_p90_pct"]
]
print("Top Performers by xG per 90:")
print(top_by_xg.to_string(index=False))
# Player Evaluation Framework for Women's Football
library(tidyverse)
library(StatsBombR)

# Load Women's World Cup data as a proxy for league data
competitions <- FreeCompetitions()
wwc_matches <- FreeMatches(Competitions = filter(competitions, competition_id == 72))
wwc_events <- allclean(free_allevents(MatchesDF = wwc_matches))

# Calculate player statistics
player_stats <- wwc_events %>%
  filter(type.name %in% c("Pass", "Shot", "Dribble", "Ball Receipt*",
                          "Carry", "Pressure", "Duel", "Interception")) %>%
  group_by(player.id, player.name, team.name) %>%
  summarise(
    matches = n_distinct(match_id),
    # Attacking
    shots = sum(type.name == "Shot"),
    goals = sum(type.name == "Shot" & shot.outcome.name == "Goal", na.rm = TRUE),
    xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    # Passing
    passes = sum(type.name == "Pass"),
    pass_completion = mean(is.na(pass.outcome.name[type.name == "Pass"])),
    key_passes = sum(pass.shot_assist == TRUE, na.rm = TRUE),
    # Dribbling
    dribbles = sum(type.name == "Dribble"),
    dribble_success = mean(dribble.outcome.name == "Complete", na.rm = TRUE),
    # Defensive
    pressures = sum(type.name == "Pressure"),
    interceptions = sum(type.name == "Interception"),
    .groups = "drop"
  )

# Calculate per-90 stats (minutes estimated from matches played,
# since exact minutes are not derivable from these event types)
player_stats <- player_stats %>%
  mutate(
    est_minutes = matches * 75,  # Rough estimate
    shots_p90 = shots / est_minutes * 90,
    xG_p90 = xG / est_minutes * 90,
    passes_p90 = passes / est_minutes * 90,
    pressures_p90 = pressures / est_minutes * 90
  )

# Percentile ranking within tournament
player_stats <- player_stats %>%
  filter(matches >= 3) %>%  # Minimum appearances
  mutate(
    xG_percentile = percent_rank(xG_p90) * 100,
    passing_percentile = percent_rank(passes_p90) * 100,
    pressing_percentile = percent_rank(pressures_p90) * 100
  )

# Top performers by xG
top_by_xg <- player_stats %>%
  arrange(desc(xG_p90)) %>%
  select(player.name, team.name, matches, xG, xG_p90, xG_percentile) %>%
  head(10)
print(top_by_xg)
Creating Player Comparison Visualizations
# Player Radar Charts for Women's Football
import matplotlib.pyplot as plt
import numpy as np
from math import pi

def create_radar_chart(player_data, player_name, metrics):
    """Create radar chart for a single player."""
    # Get player data
    player = player_data[player_data["player"] == player_name].iloc[0]
    # Get percentile values
    values = [player[f"{m}_p90_pct"] for m in metrics]
    values += values[:1]  # Complete the polygon
    # Set up radar chart
    angles = [n / float(len(metrics)) * 2 * pi for n in range(len(metrics))]
    angles += angles[:1]
    fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
    # Plot
    ax.plot(angles, values, linewidth=2, linestyle="solid", color="#1B5E20")
    ax.fill(angles, values, alpha=0.3, color="#1B5E20")
    # Labels
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(metrics)
    ax.set_ylim(0, 100)
    plt.title(f"Player Radar: {player_name}", size=14, y=1.1)
    plt.tight_layout()
    return fig

# Create radar for top player
metrics = ["xG", "shots", "pressures"]
fig = create_radar_chart(qualified, "A. Bonmati", metrics)
plt.show()

# Multi-player comparison
def compare_players_radar(data, players, metrics):
    """Compare multiple players on one radar chart."""
    angles = [n / float(len(metrics)) * 2 * pi for n in range(len(metrics))]
    angles += angles[:1]
    fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(polar=True))
    colors = ["#1B5E20", "#FF6B35", "#4169E1", "#9932CC"]
    for i, player_name in enumerate(players):
        player = data[data["player"] == player_name]
        if len(player) == 0:
            continue
        values = [player[f"{m}_p90_pct"].values[0] for m in metrics]
        values += values[:1]
        ax.plot(angles, values, linewidth=2, label=player_name, color=colors[i])
        ax.fill(angles, values, alpha=0.1, color=colors[i])
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(metrics)
    ax.set_ylim(0, 100)
    ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1))
    plt.title("Player Comparison", size=14, y=1.1)
    return fig

# Compare selected players
selected_players = ["A. Bonmati", "L. Bronze", "A. Russo"]
fig = compare_players_radar(qualified, selected_players, metrics)
plt.show()
# Player Radar Charts for Women's Football
library(tidyverse)
library(fmsb)

# Prepare data for radar chart
create_radar_data <- function(player_data, metrics, player_name) {
  # Select player
  player <- player_data %>%
    filter(player.name == player_name)
  # Get percentile values (unlist() coerces the one-row data frame to a vector;
  # as.numeric() on a data frame would error)
  values <- player %>%
    select(all_of(paste0(metrics, "_percentile"))) %>%
    unlist() %>%
    as.numeric()
  # fmsb::radarchart() expects max and min rows before the values
  radar_df <- rbind(
    rep(100, length(metrics)),  # Max
    rep(0, length(metrics)),    # Min
    values
  )
  colnames(radar_df) <- metrics
  as.data.frame(radar_df)
}

# Example radar chart
metrics <- c("xG", "passing", "pressing")
radar_data <- create_radar_data(player_stats, metrics, "Aitana Bonmati")

# Plot
radarchart(radar_data,
           pcol = "#1B5E20",
           pfcol = rgb(0.1, 0.4, 0.1, 0.5),
           plwd = 2,
           cglcol = "grey",
           cglty = 1,
           axislabcol = "grey",
           vlcex = 0.8,
           title = "Player Radar: Aitana Bonmati")

# Comparison radar (ggplot2 alternative, using all percentile columns)
compare_players <- function(data, players) {
  comparison <- data %>%
    filter(player.name %in% players) %>%
    select(player.name, ends_with("_percentile")) %>%
    pivot_longer(-player.name,
                 names_to = "metric",
                 values_to = "value")
  ggplot(comparison, aes(x = metric, y = value,
                         group = player.name, color = player.name)) +
    geom_polygon(fill = NA, linewidth = 1) +
    coord_polar() +
    theme_minimal() +
    labs(title = "Player Comparison", color = "Player") +
    theme(axis.text.x = element_text(size = 10))
}
Team Analysis and Tactical Patterns
Understanding team tactics in women's football requires analyzing patterns specific to the women's game, including pressing structures, build-up patterns, and set piece strategies.
# Team Tactical Analysis - Women's Football
from statsbombpy import sb
import pandas as pd
import numpy as np

# Load match data
wwc_matches = sb.matches(competition_id=72, season_id=107)
all_events = []
for match_id in wwc_matches["match_id"]:
    events = sb.events(match_id=match_id)
    all_events.append(events)
events_df = pd.concat(all_events, ignore_index=True)

# Extract locations
events_df["x"] = events_df["location"].apply(lambda loc: loc[0] if isinstance(loc, list) else None)
events_df["y"] = events_df["location"].apply(lambda loc: loc[1] if isinstance(loc, list) else None)

# Pressing analysis
def analyze_pressing(events):
    """Analyze team pressing patterns."""
    pressures = events[events["type"] == "Pressure"].copy()
    # Define zones
    pressures["zone"] = pd.cut(
        pressures["x"],
        bins=[0, 40, 80, 120],
        labels=["Defensive", "Middle", "Attacking"]
    )
    # Team aggregation
    pressing_stats = pressures.groupby("team").agg({
        "id": "count",
        "zone": lambda x: (x == "Attacking").sum()
    }).reset_index()
    pressing_stats.columns = ["team", "total_pressures", "high_pressures"]
    pressing_stats["high_press_pct"] = (
        pressing_stats["high_pressures"] / pressing_stats["total_pressures"] * 100
    )
    return pressing_stats.sort_values("high_press_pct", ascending=False)

pressing_analysis = analyze_pressing(events_df)
print("Top Pressing Teams (High Press %):")
print(pressing_analysis[["team", "total_pressures", "high_press_pct"]].head(10))

# Build-up play analysis
def analyze_buildup(events):
    """Analyze team build-up patterns."""
    # Passes in defensive third
    passes = events[
        (events["type"] == "Pass") &
        (events["x"] < 40)
    ].copy()
    # Pass length (full Euclidean distance, not just the x component)
    passes["end_x"] = passes["pass_end_location"].apply(
        lambda loc: loc[0] if isinstance(loc, list) else None
    )
    passes["end_y"] = passes["pass_end_location"].apply(
        lambda loc: loc[1] if isinstance(loc, list) else None
    )
    passes["pass_length"] = np.sqrt(
        (passes["end_x"] - passes["x"])**2 + (passes["end_y"] - passes["y"])**2
    )
    buildup_stats = passes.groupby("team").agg({
        "id": "count",
        "pass_length": ["mean", lambda x: (x < 15).sum(), lambda x: (x > 35).sum()]
    }).reset_index()
    buildup_stats.columns = ["team", "buildup_passes", "avg_length",
                             "short_passes", "long_passes"]
    buildup_stats["short_pct"] = buildup_stats["short_passes"] / buildup_stats["buildup_passes"] * 100
    buildup_stats["direct_pct"] = buildup_stats["long_passes"] / buildup_stats["buildup_passes"] * 100
    # Classify style
    buildup_stats["style"] = buildup_stats.apply(
        lambda row: "Possession" if row["short_pct"] > 60
        else ("Direct" if row["direct_pct"] > 30 else "Balanced"),
        axis=1
    )
    return buildup_stats

buildup_analysis = analyze_buildup(events_df)
print("\nBuild-up Play Styles:")
print(buildup_analysis[["team", "short_pct", "direct_pct", "style"]])
# Team Tactical Analysis - Women's Football
library(tidyverse)
library(StatsBombR)

# Load match data
competitions <- FreeCompetitions()
wwc_matches <- FreeMatches(Competitions = filter(competitions, competition_id == 72))
wwc_events <- allclean(free_allevents(MatchesDF = wwc_matches))

# Team pressing analysis
pressing_analysis <- wwc_events %>%
  filter(type.name == "Pressure") %>%
  mutate(
    # Pitch zones (120x80)
    zone_x = cut(location.x, breaks = c(0, 40, 80, 120),
                 labels = c("Defensive", "Middle", "Attacking")),
    zone_y = cut(location.y, breaks = c(0, 27, 53, 80),
                 labels = c("Left", "Center", "Right"))
  ) %>%
  group_by(team.name) %>%
  summarise(
    total_pressures = n(),
    high_press = sum(zone_x == "Attacking", na.rm = TRUE),
    mid_press = sum(zone_x == "Middle", na.rm = TRUE),
    low_press = sum(zone_x == "Defensive", na.rm = TRUE),
    high_press_pct = high_press / total_pressures * 100,
    # Pressure events carry no direct success flag; counterpressing
    # (pressure shortly after losing possession) is a useful proxy
    counterpresses = sum(counterpress == TRUE, na.rm = TRUE),
    counterpress_pct = counterpresses / total_pressures * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(high_press_pct))

print("Top Pressing Teams:")
print(pressing_analysis %>%
        select(team.name, total_pressures, high_press_pct, counterpress_pct) %>%
        head(10))

# Build-up play analysis
buildup_analysis <- wwc_events %>%
  filter(type.name == "Pass",
         location.x < 40) %>%  # Defensive third
  group_by(team.name) %>%
  summarise(
    buildup_passes = n(),
    short_passes = sum(pass.length < 15, na.rm = TRUE),
    long_passes = sum(pass.length > 35, na.rm = TRUE),
    # Direction
    forward = sum(pass.end_location.x > location.x, na.rm = TRUE),
    backward = sum(pass.end_location.x < location.x, na.rm = TRUE),
    # Style indicators
    short_pct = short_passes / buildup_passes * 100,
    direct_pct = long_passes / buildup_passes * 100,
    forward_pct = forward / buildup_passes * 100,
    .groups = "drop"
  )

# Classify playing styles
buildup_analysis <- buildup_analysis %>%
  mutate(
    style = case_when(
      short_pct > 60 ~ "Possession-based",
      direct_pct > 30 ~ "Direct",
      TRUE ~ "Balanced"
    )
  )

cat("\nBuild-up Play Styles:\n")
print(buildup_analysis %>% select(team.name, short_pct, direct_pct, style))
Recruitment and Scouting Analytics
With increasing investment in women's football, recruitment analytics has become crucial. The challenge lies in identifying talent across leagues with varying quality levels and limited historical data.
Recruitment Challenges in Women's Football
- League Quality Variation: Performance must be adjusted for league strength
- Limited Data History: Many players have shorter professional careers on record
- International vs. Club: Some players excel more in international tournaments
- Age Considerations: Career trajectories may differ from men's football
# Recruitment Scouting System for Women's Football
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

class WomensFootballScout:
    """Scouting and recruitment analytics for women's football."""

    def __init__(self):
        # Position-specific metric weights
        self.position_weights = {
            "Forward": {
                "xG_p90": 0.30, "shots_p90": 0.20,
                "dribbles_p90": 0.20, "pressures_p90": 0.15,
                "key_passes_p90": 0.15
            },
            "Midfielder": {
                "xG_p90": 0.15, "passes_p90": 0.25,
                "key_passes_p90": 0.20, "pressures_p90": 0.20,
                "interceptions_p90": 0.20
            },
            "Defender": {
                "interceptions_p90": 0.25, "tackles_p90": 0.25,
                "passes_p90": 0.20, "aerials_p90": 0.15,
                "pressures_p90": 0.15
            }
        }
        # League quality factors (subjective priors; refine with results data)
        self.league_strength = {
            "WSL": 1.0,
            "Liga F": 1.0,
            "NWSL": 0.95,
            "D1 Feminine": 0.95,
            "Frauen-Bundesliga": 0.92,
            "Serie A Femminile": 0.88,
            "A-League Women": 0.80
        }

    def calculate_composite_score(self, player_data, position):
        """Add a weighted composite score column for the given position."""
        weights = self.position_weights.get(position, self.position_weights["Midfielder"])
        player_data = player_data.copy()
        score = 0
        for metric, weight in weights.items():
            if metric in player_data.columns:
                # Use percentile ranking
                player_data[f"{metric}_pct"] = player_data[metric].rank(pct=True) * 100
                score += player_data[f"{metric}_pct"] * weight
        player_data["composite_score"] = score
        return player_data

    def adjust_for_league(self, data):
        """Adjust statistics for league quality."""
        data = data.copy()
        data["league_factor"] = data["league"].map(self.league_strength).fillna(0.85)
        data["adjusted_score"] = data["composite_score"] * data["league_factor"]
        return data

    def calculate_value_score(self, data):
        """Calculate player value considering age."""
        data = data.copy()

        # Age factors for women's football
        def age_factor(age):
            if age < 22:
                return 1.3   # High potential
            elif age < 25:
                return 1.2   # Rising
            elif age < 30:
                return 1.0   # Peak
            elif age < 33:
                return 0.8   # Declining
            else:
                return 0.6   # Veterans

        data["age_factor"] = data["age"].apply(age_factor)
        data["value_score"] = data["composite_score"] * data["age_factor"]
        return data

    def find_similar_players(self, target_name, all_players, features, n=10):
        """Find players similar to target using cosine similarity."""
        # Reset the index so label-based lookup matches the array's positions
        all_players = all_players.reset_index(drop=True).copy()
        # Prepare feature matrix
        X = all_players[features].fillna(0)
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        # Find target row
        target_idx = all_players.index[all_players["player"] == target_name][0]
        target_vector = X_scaled[target_idx].reshape(1, -1)
        # Calculate similarities
        similarities = cosine_similarity(target_vector, X_scaled)[0]
        all_players["similarity"] = similarities
        return all_players.nlargest(n + 1, "similarity").iloc[1:]  # Exclude self

# Usage example
scout = WomensFootballScout()

# Calculate scores for forwards
# forward_data = scout.calculate_composite_score(player_stats, "Forward")
# adjusted_data = scout.adjust_for_league(forward_data)
# valued_data = scout.calculate_value_score(adjusted_data)
print("Scouting system initialized")
print(f"Positions: {list(scout.position_weights.keys())}")
print(f"Leagues: {list(scout.league_strength.keys())}")
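To make the interaction of the league and age adjustments concrete, the following self-contained sketch replays the same multiplicative arithmetic on one invented player. The player, the factor values, and the scores are purely illustrative; they are not drawn from real data.

```python
# Minimal sketch of the league + age adjustments used in the scouting framework.
# Player data and factor values are invented for illustration.
LEAGUE_STRENGTH = {"WSL": 1.0, "NWSL": 0.95, "A-League Women": 0.80}

def age_factor(age):
    """Age multipliers mirroring the scouting class above."""
    if age < 22:
        return 1.3   # High potential
    elif age < 25:
        return 1.2   # Rising
    elif age < 30:
        return 1.0   # Peak
    elif age < 33:
        return 0.8   # Declining
    return 0.6       # Veterans

player = {"name": "Hypothetical Forward", "league": "NWSL",
          "age": 21, "composite_score": 70.0}

# League adjustment first, then the age multiplier
adjusted = player["composite_score"] * LEAGUE_STRENGTH.get(player["league"], 0.85)
value = adjusted * age_factor(player["age"])
print(f"Adjusted score: {adjusted:.1f}, value score: {value:.2f}")
# 70 * 0.95 = 66.5, then 66.5 * 1.3 = 86.45
```

Because the factors are multiplicative, a young player in a slightly weaker league can still out-rank an older player with a higher raw composite score, which is exactly the behaviour a potential-oriented recruitment model wants.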
# Recruitment Scouting System for Women's Football
library(tidyverse)

# Create scouting framework
create_scouting_profile <- function(player_stats, position = "Forward") {
  # Define position-specific weights
  weights <- list(
    Forward = c(xG = 0.3, shots = 0.2, dribbles = 0.2,
                pressing = 0.15, key_passes = 0.15),
    Midfielder = c(xG = 0.15, passes = 0.25, key_passes = 0.2,
                   pressing = 0.2, interceptions = 0.2),
    Defender = c(interceptions = 0.25, tackles = 0.25,
                 passes = 0.2, aerials = 0.15, pressing = 0.15)
  )
  w <- weights[[position]]
  # Treat weights missing for this position as zero so every position works
  wt <- function(name) ifelse(is.na(w[name]), 0, w[name])
  # Calculate composite score
  player_stats %>%
    mutate(
      composite_score =
        xG_percentile * wt("xG") +
        passing_percentile * wt("passes") +
        pressing_percentile * wt("pressing")
      # Add other metrics as available
    ) %>%
    arrange(desc(composite_score))
}

# League quality adjustment
adjust_for_league <- function(stats) {
  # League strength factors (1.0 = baseline; higher = stronger league)
  league_strength <- c(
    "WSL" = 1.0,
    "Liga F" = 1.0,
    "NWSL" = 0.95,
    "Division 1 Feminine" = 0.95,
    "Frauen-Bundesliga" = 0.92,
    "Serie A Femminile" = 0.88,
    "A-League Women" = 0.80
  )
  stats %>%
    mutate(
      league_factor = league_strength[league],
      adjusted_xG = xG_p90 * league_factor,
      adjusted_score = composite_score * league_factor
    )
}

# Age-based value assessment
calculate_player_value <- function(stats) {
  stats %>%
    mutate(
      # Peak years typically 25-29 in women's football
      age_factor = case_when(
        age < 22 ~ 1.3,  # High potential
        age < 25 ~ 1.2,  # Rising
        age < 30 ~ 1.0,  # Peak
        age < 33 ~ 0.8,  # Declining
        TRUE ~ 0.6       # Veterans
      ),
      # Combine quality and potential
      value_score = composite_score * age_factor
    )
}

# Similarity search for recruitment
find_similar_players <- function(target_player, all_players, n = 10) {
  # Features for comparison
  features <- c("xG_p90", "passes_p90", "pressures_p90",
                "dribble_success", "pass_completion")
  # Coerce the one-row target to a plain numeric vector
  target_values <- unlist(target_player %>% select(all_of(features)))
  # Calculate Euclidean distance
  all_players %>%
    rowwise() %>%
    mutate(
      distance = sqrt(sum((c_across(all_of(features)) - target_values)^2))
    ) %>%
    ungroup() %>%
    arrange(distance) %>%
    head(n)
}
Physical Performance Analysis
Physical performance analysis in women's football requires understanding the unique physiological characteristics of female athletes. While the principles are similar to men's football, the baseline values and training considerations differ.
Key Physical Differences
- Different baseline values for maximal sprint speed (typically 26-30 km/h vs 32-36 km/h in men's)
- Similar relative distances covered when normalized to physical capacity
- Menstrual cycle considerations for training load management
- Different injury risk profiles (higher ACL injury rates)
- Recovery patterns may differ due to hormonal factors
# Python: Physical performance analysis for women's football
import pandas as pd
import numpy as np
from dataclasses import dataclass
from typing import Dict, List
@dataclass
class PhysicalBenchmarks:
"""Physical performance benchmarks for women's football."""
# Position benchmarks (per 90 minutes)
position_benchmarks = {
"GK": {"total_distance": 5500, "high_speed": 100, "sprint": 30, "accelerations": 15},
"CB": {"total_distance": 9500, "high_speed": 300, "sprint": 80, "accelerations": 35},
"FB": {"total_distance": 10800, "high_speed": 600, "sprint": 150, "accelerations": 50},
"CM": {"total_distance": 11200, "high_speed": 500, "sprint": 120, "accelerations": 45},
"AM": {"total_distance": 10500, "high_speed": 550, "sprint": 140, "accelerations": 55},
"W": {"total_distance": 10200, "high_speed": 650, "sprint": 180, "accelerations": 60},
"ST": {"total_distance": 9800, "high_speed": 500, "sprint": 130, "accelerations": 45}
}
# Speed zones (km/h) - women's specific
speed_zones = {
1: (0, 7, "Walking"),
2: (7, 13, "Jogging"),
3: (13, 18, "Running"),
4: (18, 23, "High-speed running"),
5: (23, 30, "Sprinting")
}
class WomensPhysicalAnalyzer:
"""Analyze physical performance in women's football."""
def __init__(self):
self.benchmarks = PhysicalBenchmarks()
def analyze_match_performance(self, player_data: pd.DataFrame) -> pd.DataFrame:
"""Compare player physical output to benchmarks."""
df = player_data.copy()
# Get benchmarks for each position
df["benchmark_distance"] = df["position"].map(
lambda x: self.benchmarks.position_benchmarks.get(x, {}).get("total_distance", 10000)
)
df["benchmark_hsd"] = df["position"].map(
lambda x: self.benchmarks.position_benchmarks.get(x, {}).get("high_speed", 500)
)
df["benchmark_sprint"] = df["position"].map(
lambda x: self.benchmarks.position_benchmarks.get(x, {}).get("sprint", 100)
)
# Calculate percentages
df["distance_pct"] = df["total_distance"] / df["benchmark_distance"] * 100
df["hsd_pct"] = df["high_speed_distance"] / df["benchmark_hsd"] * 100
df["sprint_pct"] = df["sprint_distance"] / df["benchmark_sprint"] * 100
# Overall physical score
df["physical_score"] = (df["distance_pct"] + df["hsd_pct"] + df["sprint_pct"]) / 3
# Performance classification
df["performance_level"] = pd.cut(
df["physical_score"],
bins=[0, 80, 90, 100, 110, float("inf")],
labels=["Underperforming", "Below Average", "Average", "Above Average", "Exceptional"]
)
return df
def analyze_cycle_impact(self, physical_data: pd.DataFrame,
cycle_data: pd.DataFrame) -> pd.DataFrame:
"""Analyze performance variation across menstrual cycle phases."""
# Merge datasets
combined = physical_data.merge(cycle_data, on=["player_id", "date"], how="left")
        # Define cycle phases (simplified fixed 28-day boundaries; actual phase timing varies by athlete)
def get_phase(day):
if pd.isna(day):
return "Unknown"
if day <= 5:
return "Menstruation"
elif day <= 14:
return "Follicular"
elif day <= 21:
return "Ovulation"
else:
return "Luteal"
combined["cycle_phase"] = combined["day_in_cycle"].apply(get_phase)
# Aggregate by phase
phase_analysis = combined.groupby(["player_id", "cycle_phase"]).agg({
"total_distance": "mean",
"high_speed_distance": "mean",
"sprint_distance": "mean",
"rpe": "mean" # Rating of Perceived Exertion
}).reset_index()
return phase_analysis
def calculate_acl_risk(self, player_data: pd.DataFrame) -> pd.DataFrame:
"""Calculate ACL injury risk factors."""
df = player_data.copy()
# Risk factors
df["fatigue_risk"] = np.clip(
df["acute_load"] / df["chronic_load"] - 1, 0, 1
)
df["asymmetry_risk"] = np.abs(
df["left_leg_load"] - df["right_leg_load"]
) / df["total_load"]
df["deceleration_risk"] = df["high_decelerations"] / 50
# Age factor (higher risk after 25)
df["age_risk"] = np.where(df["age"] > 25, 0.2, 0)
# Composite score
df["acl_risk_score"] = (
0.3 * df["fatigue_risk"] +
0.3 * df["asymmetry_risk"] +
0.2 * np.clip(df["deceleration_risk"], 0, 1) +
0.2 * df["age_risk"]
)
df["risk_level"] = pd.cut(
df["acl_risk_score"],
bins=[0, 0.4, 0.7, 1.0],
labels=["Low", "Moderate", "High"]
)
return df
def training_load_recommendations(self, player_data: pd.DataFrame) -> Dict:
"""Generate training load recommendations."""
recommendations = {}
for _, player in player_data.iterrows():
player_id = player["player_id"]
# ACWR (Acute:Chronic Workload Ratio)
acwr = player["acute_load"] / player["chronic_load"] if player["chronic_load"] > 0 else 0
if acwr > 1.5:
recommendation = "Reduce load - high injury risk zone"
elif acwr > 1.3:
recommendation = "Caution - approaching high risk"
elif acwr < 0.8:
recommendation = "Can increase load - in safe zone"
else:
recommendation = "Maintain current load - optimal zone"
recommendations[player_id] = {
"acwr": acwr,
"recommendation": recommendation,
"cycle_phase": player.get("cycle_phase", "Unknown")
}
return recommendations
# Example usage
analyzer = WomensPhysicalAnalyzer()
print("Physical performance analyzer initialized")
print(f"Position benchmarks available: {list(analyzer.benchmarks.position_benchmarks.keys())}")
# R: Physical performance analysis for women's football
library(tidyverse)
# Create reference benchmarks for women's football
create_physical_benchmarks <- function() {
# Position-based benchmarks (per 90 minutes)
benchmarks <- tribble(
~position, ~total_distance, ~high_speed_distance, ~sprint_distance, ~accelerations,
"GK", 5500, 100, 30, 15,
"CB", 9500, 300, 80, 35,
"FB", 10800, 600, 150, 50,
"CM", 11200, 500, 120, 45,
"AM", 10500, 550, 140, 55,
"W", 10200, 650, 180, 60,
"ST", 9800, 500, 130, 45
)
# Speed zone definitions (women's football specific)
speed_zones <- tribble(
~zone, ~min_speed, ~max_speed, ~description,
1, 0, 7, "Walking",
2, 7, 13, "Jogging",
3, 13, 18, "Running",
4, 18, 23, "High-speed running",
5, 23, 30, "Sprinting"
)
list(
position_benchmarks = benchmarks,
speed_zones = speed_zones
)
}
# Analyze match physical data
analyze_physical_performance <- function(player_data, benchmarks) {
player_data %>%
left_join(benchmarks$position_benchmarks, by = "position") %>%
mutate(
# Calculate percentage of benchmark
distance_pct = total_distance_actual / total_distance * 100,
hsd_pct = high_speed_actual / high_speed_distance * 100,
sprint_pct = sprint_actual / sprint_distance * 100,
# Overall physical score
physical_score = (distance_pct + hsd_pct + sprint_pct) / 3,
# Flag under/over performers
performance_level = case_when(
physical_score > 110 ~ "Exceptional",
physical_score > 100 ~ "Above Average",
physical_score > 90 ~ "Average",
physical_score > 80 ~ "Below Average",
TRUE ~ "Underperforming"
)
)
}
# Menstrual cycle tracking for load management
analyze_cycle_performance <- function(player_data, cycle_data) {
# Join with cycle phase information
combined <- player_data %>%
left_join(cycle_data, by = c("player_id", "date")) %>%
    mutate(
      # Simplified fixed 28-day boundaries; actual phase timing varies by athlete
      cycle_phase = case_when(
day_in_cycle <= 5 ~ "Menstruation",
day_in_cycle <= 14 ~ "Follicular",
day_in_cycle <= 21 ~ "Ovulation",
TRUE ~ "Luteal"
)
)
# Analyze performance by phase
phase_analysis <- combined %>%
group_by(player_id, cycle_phase) %>%
summarise(
avg_distance = mean(total_distance),
avg_sprint = mean(sprint_distance),
avg_hsd = mean(high_speed_distance),
injury_events = sum(injury_flag, na.rm = TRUE),
.groups = "drop"
)
phase_analysis
}
# ACL injury risk assessment
calculate_acl_risk <- function(player_data) {
  player_data %>%
    mutate(
      # Risk factors, each clipped to [0, 1]
      fatigue_risk = pmax(pmin(cumulative_load_7d / baseline_load - 1, 1), 0),
      asymmetry_risk = abs(left_leg_load - right_leg_load) / total_load,
      deceleration_risk = pmin(high_decelerations / 50, 1),  # per-row, not sum()
      # Age factor (higher risk after 25)
      age_risk = if_else(age > 25, 0.2, 0),
      # Composite risk score (same weights as the Python version)
      acl_risk_score =
        0.3 * fatigue_risk +
        0.3 * asymmetry_risk +
        0.2 * deceleration_risk +
        0.2 * age_risk,
      risk_level = case_when(
        acl_risk_score > 0.7 ~ "High",
        acl_risk_score > 0.4 ~ "Moderate",
        TRUE ~ "Low"
      )
    )
}
print("Physical performance analysis system ready!")
Youth Development Analytics
Youth development in women's football presents unique analytical challenges. With the sport's rapid professionalization, identifying and developing talented young players has become increasingly important for clubs and federations.
# Python: Youth development pathway analysis
import pandas as pd
import numpy as np
from dataclasses import dataclass
from typing import Dict, List, Tuple
from sklearn.preprocessing import StandardScaler
@dataclass
class DevelopmentPathway:
"""Define development pathway stages for women's football."""
stages = {
"Foundation": {"age_range": (8, 11), "focus": "Technical fundamentals"},
"Talent": {"age_range": (12, 14), "focus": "Tactical awareness"},
"Youth": {"age_range": (15, 17), "focus": "Position specialization"},
"Senior Transition": {"age_range": (18, 21), "focus": "First team integration"},
"Elite": {"age_range": (22, 35), "focus": "Peak performance"}
}
@staticmethod
def get_stage(age: int) -> str:
if age < 12:
return "Foundation"
elif age < 15:
return "Talent"
elif age < 18:
return "Youth"
elif age < 22:
return "Senior Transition"
else:
return "Elite"
class YouthDevelopmentAnalyzer:
"""Analytics for youth player development."""
def __init__(self):
self.pathway = DevelopmentPathway()
self.scaler = StandardScaler()
def track_development(self, player_history: pd.DataFrame) -> pd.DataFrame:
"""Track player development trajectory over time."""
df = player_history.sort_values("date").copy()
# Technical growth rate
df["technical_growth"] = df["technical_score"].pct_change()
# Physical development tracking
df["height_velocity"] = df["height"].diff()
# Performance trend (rolling average)
df["performance_trend"] = df["match_rating"].rolling(10, min_periods=3).mean()
# Current development phase
df["development_phase"] = df["age"].apply(self.pathway.get_stage)
# Growth spurt detection
df["growth_spurt"] = df["height_velocity"] > 0.5 # >0.5cm per measurement period
return df
def identify_talent(self, youth_data: pd.DataFrame,
age_group: str) -> pd.DataFrame:
"""Identify high-potential players within age group."""
# Filter to age group
group_data = youth_data[youth_data["age_group"] == age_group].copy()
if len(group_data) < 5:
return group_data
# Metrics to evaluate
metrics = ["technical_score", "physical_score", "tactical_score"]
# Standardize within age group
group_data[metrics] = self.scaler.fit_transform(group_data[metrics])
# Composite potential score (weighted average of z-scores)
group_data["potential_score"] = (
group_data["technical_score"] * 0.4 +
group_data["physical_score"] * 0.3 +
group_data["tactical_score"] * 0.3
)
# Classification
group_data["talent_tier"] = pd.cut(
group_data["potential_score"],
bins=[-np.inf, 0, 1, 2, np.inf],
labels=["Needs Development", "Developing", "High Potential", "Elite Prospect"]
)
return group_data.sort_values("potential_score", ascending=False)
def adjust_for_maturity(self, player_data: pd.DataFrame) -> pd.DataFrame:
"""Adjust metrics for biological maturity timing."""
df = player_data.copy()
# Bio-banding: estimate biological vs chronological age
df["bio_age_offset"] = df["estimated_bio_age"] - df["chronological_age"]
# Classify maturity timing
df["maturity_status"] = pd.cut(
df["bio_age_offset"],
bins=[-np.inf, -1, 1, np.inf],
labels=["Late Developer", "On-time", "Early Developer"]
)
# Adjust physical metrics
df["adjusted_speed"] = df["max_speed"] - (df["bio_age_offset"] * 0.5)
df["adjusted_strength"] = df["strength_score"] - (df["bio_age_offset"] * 2)
# Flag late developers (often overlooked but may have higher ceiling)
df["late_developer_flag"] = df["maturity_status"] == "Late Developer"
return df
    def predict_dropout_risk(self, player_data: pd.DataFrame) -> pd.DataFrame:
        """Predict risk of player dropping out of pathway."""
        df = player_data.copy()
        # Engagement score
        df["engagement_score"] = (
            df["attendance_rate"] * 0.3 +
            df["enjoyment_rating"] * 0.3 +
            df["progress_rating"] * 0.4
        )
        # Risk classification
        df["dropout_risk"] = pd.cut(
            df["engagement_score"],
            bins=[0, 0.4, 0.6, 1.0],
            labels=["High", "Moderate", "Low"]
        )
        # Early warning: engagement lower than three observations ago,
        # compared within each player's own date-sorted history
        df = df.sort_values("date")
        df["declining_engagement"] = (
            df["engagement_score"] < df.groupby("player_id")["engagement_score"].shift(3)
        )
        return df
def generate_development_report(self, player_id: str,
history: pd.DataFrame) -> Dict:
"""Generate comprehensive development report for player."""
player_history = history[history["player_id"] == player_id]
if len(player_history) == 0:
return {"error": "Player not found"}
latest = player_history.iloc[-1]
earliest = player_history.iloc[0]
return {
"player_id": player_id,
"current_age": latest["age"],
"current_phase": self.pathway.get_stage(latest["age"]),
"time_in_program": (latest["date"] - earliest["date"]).days / 365,
"technical_improvement": (
latest["technical_score"] - earliest["technical_score"]
) / earliest["technical_score"] * 100,
"physical_improvement": (
latest["physical_score"] - earliest["physical_score"]
) / earliest["physical_score"] * 100,
"current_talent_tier": latest.get("talent_tier", "Unknown"),
"maturity_status": latest.get("maturity_status", "Unknown"),
"dropout_risk": latest.get("dropout_risk", "Unknown"),
"recommendation": self._generate_recommendation(latest)
}
def _generate_recommendation(self, player_data: pd.Series) -> str:
"""Generate development recommendation."""
if player_data.get("talent_tier") == "Elite Prospect":
return "Consider accelerated pathway to senior team"
elif player_data.get("maturity_status") == "Late Developer":
return "Monitor closely - potential for late development surge"
elif player_data.get("dropout_risk") == "High":
return "Intervention needed - focus on engagement and enjoyment"
else:
return "Continue current development plan"
# Example usage
analyzer = YouthDevelopmentAnalyzer()
print("Youth development analyzer initialized")
print(f"Development stages: {list(analyzer.pathway.stages.keys())}")
# R: Youth development pathway analysis
library(tidyverse)
# Define development pathway stages
create_development_framework <- function() {
pathway <- tribble(
~stage, ~age_range, ~focus_areas, ~key_metrics,
"Foundation", "8-11", "Technical fundamentals, coordination", "Ball mastery tests, coordination scores",
"Talent", "12-14", "Tactical awareness, physical development", "Decision making, growth tracking",
"Youth", "15-17", "Position specialization, competition", "Match stats, physical benchmarks",
"Senior Transition", "18-21", "First team integration", "Minutes, performance ratings",
"Elite", "22+", "Peak performance optimization", "Full analytics suite"
)
pathway
}
# Track player development over time
track_player_development <- function(player_history) {
player_history %>%
arrange(date) %>%
mutate(
# Technical development
technical_growth = (technical_score - lag(technical_score)) / lag(technical_score),
      # Physical development
      height_velocity = height - lag(height), # Growth spurt detection
      # estimate_maturity() is a user-supplied helper (e.g. a Mirwald-style
      # maturity-offset calculation); it is not defined in this chapter
      physical_maturity = estimate_maturity(height, weight, age),
# Performance trajectory
performance_trend = zoo::rollmean(match_rating, k = 10, fill = NA),
# Development phase
current_phase = case_when(
age < 12 ~ "Foundation",
age < 15 ~ "Talent",
age < 18 ~ "Youth",
age < 22 ~ "Senior Transition",
TRUE ~ "Elite"
)
)
}
# Identify high-potential players
identify_talent <- function(youth_data, age_group) {
# Get age-appropriate benchmarks
benchmarks <- youth_data %>%
filter(age_group == !!age_group) %>%
summarise(across(where(is.numeric), list(mean = mean, sd = sd)))
# Score players relative to peers
youth_data %>%
filter(age_group == !!age_group) %>%
mutate(
# Standardized scores
technical_z = (technical_score - benchmarks$technical_score_mean) / benchmarks$technical_score_sd,
physical_z = (physical_score - benchmarks$physical_score_mean) / benchmarks$physical_score_sd,
tactical_z = (tactical_score - benchmarks$tactical_score_mean) / benchmarks$tactical_score_sd,
# Composite potential score
potential_score = (technical_z * 0.4 + physical_z * 0.3 + tactical_z * 0.3),
# Classification
talent_tier = case_when(
potential_score > 2 ~ "Elite Prospect",
potential_score > 1 ~ "High Potential",
potential_score > 0 ~ "Developing",
TRUE ~ "Needs Development"
)
) %>%
arrange(desc(potential_score))
}
# Maturity timing adjustment
adjust_for_maturity <- function(player_data) {
# Bio-banding approach
player_data %>%
mutate(
# Estimate biological age vs chronological age
bio_age_offset = estimated_bio_age - chronological_age,
# Adjust physical metrics for maturity
adjusted_speed = max_speed - (bio_age_offset * 0.5),
adjusted_strength = strength - (bio_age_offset * 2),
# Early/late developer classification
maturity_status = case_when(
bio_age_offset > 1 ~ "Early Developer",
bio_age_offset < -1 ~ "Late Developer",
TRUE ~ "On-time"
),
# Flag late developers for special attention
late_developer_flag = maturity_status == "Late Developer"
)
}
# Dropout risk prediction
predict_dropout_risk <- function(player_data) {
  player_data %>%
    group_by(player_id) %>%
    arrange(date, .by_group = TRUE) %>%
    mutate(
      # Risk factors
      engagement_score = attendance_rate * 0.3 + enjoyment_rating * 0.3 + progress_rating * 0.4,
      # Early warning signs (lagged within each player's own history)
      declining_attendance = attendance_rate < lag(attendance_rate, 3),
      stagnating_progress = performance_trend < lag(performance_trend, 5),
      # Dropout risk score
      dropout_risk = case_when(
        engagement_score < 0.4 ~ "High",
        engagement_score < 0.6 ~ "Moderate",
        TRUE ~ "Low"
      )
    ) %>%
    ungroup()
}
print("Youth development framework initialized!")
Key Considerations for Youth Analytics
- Bio-banding: Group players by biological maturity, not just age
- Late developers: Don't overlook late developers who may have higher ceilings
- Relative Age Effect: Track birth month distribution to ensure fair selection
- Holistic development: Technical, tactical, physical, and psychological metrics
- Dropout prevention: Engagement and enjoyment are as important as performance
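The Relative Age Effect bullet above can be checked directly: compare a squad's birth-quarter distribution against a uniform baseline. A minimal sketch in Python, assuming a hypothetical roster table with a `birth_date` column:

```python
# Relative Age Effect check: compare a squad's birth-quarter distribution
# to a uniform baseline with a chi-square test. The "birth_date" column
# name is an assumption about the roster data.
import pandas as pd
from scipy.stats import chisquare

def relative_age_effect_test(roster: pd.DataFrame) -> dict:
    """Test whether births cluster early in the selection year."""
    quarters = pd.to_datetime(roster["birth_date"]).dt.quarter
    observed = quarters.value_counts().reindex([1, 2, 3, 4], fill_value=0)
    stat, p_value = chisquare(observed)  # uniform expectation by default
    return {
        "counts": observed.to_dict(),
        "q1_share": observed[1] / observed.sum(),
        "p_value": p_value,
        # Flag only when the skew is significant AND favours Q1 births
        "possible_rae": bool(p_value < 0.05 and observed[1] == observed.max()),
    }

# Example: an age-group squad skewed toward early-year births
roster = pd.DataFrame({
    "birth_date": ["2008-01-15"] * 12 + ["2008-04-02"] * 5
                  + ["2008-08-20"] * 2 + ["2008-11-30"]
})
result = relative_age_effect_test(roster)
print(result["counts"], round(result["q1_share"], 2))
```

A flagged squad is a prompt for review, not proof of biased selection; small squads will rarely reach significance even when skewed.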
International vs. Club Performance
A unique aspect of women's football is the significant role international football plays in player visibility and recruitment. Many players perform more prominently on the international stage than at club level, and understanding the gap between the two contexts is crucial for analysts.
# Python: International vs. Club performance analysis
import pandas as pd
import numpy as np
from typing import Dict, List
class InternationalAnalyzer:
"""Analyze international vs. club performance."""
def __init__(self, fifa_rankings: pd.DataFrame = None):
self.rankings = fifa_rankings
def compare_contexts(self, player_data: pd.DataFrame) -> pd.DataFrame:
"""Compare player performance in club vs. international."""
# Aggregate by context
context_stats = player_data.groupby(["player_id", "context"]).agg({
"match_id": "count",
"goals": "sum",
"assists": "sum",
"xG": "sum",
"xA": "sum",
"minutes": "sum",
"rating": "mean"
}).reset_index()
context_stats.columns = ["player_id", "context", "matches", "goals",
"assists", "xG", "xA", "minutes", "avg_rating"]
# Per 90 metrics
context_stats["goals_p90"] = context_stats["goals"] / context_stats["minutes"] * 90
context_stats["xG_p90"] = context_stats["xG"] / context_stats["minutes"] * 90
context_stats["xA_p90"] = context_stats["xA"] / context_stats["minutes"] * 90
# Pivot to compare
comparison = context_stats.pivot(
index="player_id",
columns="context",
values=["goals_p90", "xG_p90", "avg_rating"]
)
comparison.columns = ["_".join(col) for col in comparison.columns]
comparison = comparison.reset_index()
# Calculate differences
if "goals_p90_international" in comparison.columns and "goals_p90_club" in comparison.columns:
comparison["goal_difference"] = (
comparison["goals_p90_international"] - comparison["goals_p90_club"]
)
comparison["player_type"] = np.where(
comparison["goal_difference"] > 0.1, "International Performer",
np.where(comparison["goal_difference"] < -0.1, "Club Performer", "Consistent")
)
return comparison
def adjust_for_opponent(self, player_data: pd.DataFrame) -> pd.DataFrame:
"""Adjust statistics for opponent quality."""
if self.rankings is None:
return player_data
df = player_data.merge(
self.rankings[["team", "fifa_rank"]],
left_on="opponent",
right_on="team",
how="left"
)
# Opponent factor based on FIFA ranking
def get_opponent_factor(rank):
if pd.isna(rank):
return 0.9
if rank <= 10:
return 1.5
elif rank <= 25:
return 1.2
elif rank <= 50:
return 1.0
elif rank <= 100:
return 0.8
else:
return 0.6
df["opponent_factor"] = df["fifa_rank"].apply(get_opponent_factor)
df["adjusted_xG"] = df["xG"] * df["opponent_factor"]
df["adjusted_goals"] = df["goals"] * df["opponent_factor"]
return df
def analyze_tournament(self, tournament_data: pd.DataFrame) -> pd.DataFrame:
"""Analyze tournament-specific performance."""
# Define stage importance
stage_map = {
"Group": 1,
"Round of 16": 2,
"Quarter-final": 3,
"Semi-final": 4,
"Final": 5
}
df = tournament_data.copy()
df["stage_weight"] = df["match_type"].map(stage_map).fillna(1)
# Aggregate by player
player_stats = df.groupby(["player_id", "player_name"]).agg({
"goals": "sum",
"xG": "sum",
"rating": "mean"
}).reset_index()
# Knockout stage performance
knockout = df[df["stage_weight"] >= 2].groupby("player_id").agg({
"goals": "sum",
"rating": "mean"
}).reset_index()
knockout.columns = ["player_id", "knockout_goals", "knockout_rating"]
# Group stage performance
group = df[df["stage_weight"] == 1].groupby("player_id").agg({
"rating": "mean"
}).reset_index()
group.columns = ["player_id", "group_rating"]
# Merge
player_stats = player_stats.merge(knockout, on="player_id", how="left")
player_stats = player_stats.merge(group, on="player_id", how="left")
# Clutch factor
player_stats["clutch_factor"] = (
player_stats["knockout_rating"].fillna(0) -
player_stats["group_rating"].fillna(0)
)
return player_stats.sort_values("clutch_factor", ascending=False)
    def analyze_senior_pathway(self, youth_data: pd.DataFrame) -> pd.DataFrame:
        """Analyze pathway from youth to senior international team."""
        # Named aggregation keeps the result flat and avoids fragile
        # dict-returning lambdas inside .agg()
        pathway = youth_data.groupby("player_id").agg(
            first_age=("age", "min"),
            last_age=("age", "max"),
            u17_caps=("team_level", lambda x: (x == "U17").sum()),
            u19_caps=("team_level", lambda x: (x == "U19").sum()),
            u21_caps=("team_level", lambda x: (x == "U21").sum()),
            senior_caps=("team_level", lambda x: (x == "Senior").sum()),
        ).reset_index()
        # Made senior team?
        pathway["made_senior"] = pathway["senior_caps"] > 0
        # Progression rate
        pathway["progression_rate"] = np.where(
            pathway["made_senior"],
            1 / (pathway["last_age"] - pathway["first_age"] + 1),
            0
        )
        return pathway
# Example usage
analyzer = InternationalAnalyzer()
print("International vs. Club analyzer initialized")
# R: International vs. Club performance analysis
library(tidyverse)
# Compare player performance across contexts
analyze_context_performance <- function(player_data) {
player_data %>%
group_by(player_id, context) %>% # context = "club" or "international"
summarise(
matches = n(),
goals = sum(goals),
assists = sum(assists),
xG = sum(xG),
xA = sum(xA),
avg_rating = mean(rating),
# Per 90 metrics
goals_p90 = sum(goals) / sum(minutes) * 90,
xG_p90 = sum(xG) / sum(minutes) * 90,
xA_p90 = sum(xA) / sum(minutes) * 90,
.groups = "drop"
) %>%
pivot_wider(
id_cols = player_id,
names_from = context,
values_from = c(goals_p90, xG_p90, xA_p90, avg_rating)
) %>%
mutate(
# Performance difference
goal_diff = goals_p90_international - goals_p90_club,
xG_diff = xG_p90_international - xG_p90_club,
# Classify player type
player_type = case_when(
goal_diff > 0.1 ~ "International Performer",
goal_diff < -0.1 ~ "Club Performer",
TRUE ~ "Consistent"
)
)
}
# Analyze opponent quality adjustment
adjust_for_opponent_quality <- function(player_data, team_rankings) {
player_data %>%
left_join(team_rankings, by = c("opponent" = "team")) %>%
mutate(
# Opponent strength factor (FIFA ranking-based)
opponent_factor = case_when(
fifa_rank <= 10 ~ 1.5, # Elite opposition
fifa_rank <= 25 ~ 1.2, # Strong opposition
fifa_rank <= 50 ~ 1.0, # Average opposition
fifa_rank <= 100 ~ 0.8, # Weak opposition
TRUE ~ 0.6 # Very weak opposition
),
# Adjusted metrics
adjusted_xG = xG * opponent_factor,
adjusted_goals = goals * opponent_factor
)
}
# Tournament performance analysis
analyze_tournament_performance <- function(tournament_data) {
tournament_data %>%
mutate(
# Tournament stage
stage = case_when(
match_type == "Final" ~ 5,
match_type == "Semi-final" ~ 4,
match_type == "Quarter-final" ~ 3,
match_type == "Round of 16" ~ 2,
TRUE ~ 1 # Group stage
)
) %>%
group_by(player_id, player_name) %>%
summarise(
# Overall stats
total_goals = sum(goals),
total_xG = sum(xG),
# Big game performance
knockout_goals = sum(goals[stage >= 2]),
knockout_xG = sum(xG[stage >= 2]),
# Clutch factor
high_pressure_rating = mean(rating[stage >= 3], na.rm = TRUE),
group_stage_rating = mean(rating[stage == 1], na.rm = TRUE),
clutch_factor = high_pressure_rating - group_stage_rating,
.groups = "drop"
) %>%
arrange(desc(clutch_factor))
}
# National team pipeline analysis
analyze_pathway_to_senior <- function(youth_international_data) {
youth_international_data %>%
arrange(player_id, age) %>%
group_by(player_id) %>%
summarise(
# Youth international history
u17_caps = sum(team_level == "U17"),
u19_caps = sum(team_level == "U19"),
u21_caps = sum(team_level == "U21"),
senior_caps = sum(team_level == "Senior"),
      # Age of first senior cap (Inf, without a warning, when no senior caps yet)
      first_senior_age = suppressWarnings(min(age[team_level == "Senior"], na.rm = TRUE)),
# Made senior team?
made_senior = senior_caps > 0,
.groups = "drop"
) %>%
mutate(
# Pathway analysis
pathway_length = first_senior_age - 17,
pathway_type = case_when(
first_senior_age < 20 ~ "Fast Track",
first_senior_age < 23 ~ "Standard",
first_senior_age < 26 ~ "Late Bloomer",
TRUE ~ "Never Progressed"
)
)
}
print("International performance analysis ready!")
Key Considerations for International Analysis
- Visibility factor: World Cups and continental championships often provide the main exposure for players from smaller leagues
- Opponent quality: International matches against lower-ranked nations may inflate statistics
- Team dynamics: Some players thrive in national team systems that differ from their clubs
- Sample size: International data is limited (10-15 matches per year maximum)
- Tournament performance: Big tournament performers may command premium valuations
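The sample-size caveat above can be handled with simple shrinkage: blend a player's observed international per-90 rate with a squad-wide prior, weighted by minutes played. A sketch with illustrative numbers (the 900-minute prior strength and 0.35 squad rate are assumptions to tune, not established values):

```python
# Small-sample stabilization for international per-90 rates: shrink each
# player's observed rate toward a squad-wide prior, weighted by minutes.
import pandas as pd

def shrunk_rate_p90(goals: float, minutes: float, prior_rate_p90: float,
                    prior_minutes: float = 900.0) -> float:
    """Blend an observed per-90 rate with a prior, weighted by sample size."""
    prior_goals = prior_rate_p90 * prior_minutes / 90
    return (goals + prior_goals) / (minutes + prior_minutes) * 90

players = pd.DataFrame({
    "player": ["A", "B"],
    "goals": [4, 4],
    "minutes": [360, 2700],  # 4 full matches vs. 30
})
squad_rate = 0.35  # hypothetical squad-wide goals per 90

players["raw_p90"] = players["goals"] / players["minutes"] * 90
players["shrunk_p90"] = [
    shrunk_rate_p90(g, m, squad_rate)
    for g, m in zip(players["goals"], players["minutes"])
]
# Player A's 1.0 goals/90 from 360 minutes is pulled strongly toward the
# prior; player B's larger sample barely moves.
print(players[["player", "raw_p90", "shrunk_p90"]].round(3))
```

With only 10-15 internationals per year, the shrunk figures are far safer inputs to recruitment models than raw per-90 rates.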
Growth and Opportunities in Women's Football Analytics
The women's football analytics landscape presents unique opportunities for analysts and data scientists. The rapid professionalization of the game means there's significant room to make an impact.
- Clubs building dedicated women's analytics departments
- Federations investing in national team analytics
- Media companies seeking women's football content
- Data providers expanding coverage
- Academic research gaining momentum
- Women's football-specific xG models
- Physical performance benchmarks
- Youth development pathways
- Cross-league player comparison
- Commercial analytics for growing the game
# Analyzing Growth of Women's Football
import pandas as pd
import matplotlib.pyplot as plt
# Sample data: Growth metrics over time
growth_data = pd.DataFrame({
"year": [2019, 2020, 2021, 2022, 2023] * 2,
"metric": ["WSL Average Attendance"] * 5 + ["NWSL Average Attendance"] * 5,
"value": [3048, 2847, 3523, 6744, 8134, 7337, 0, 7843, 10628, 11276]
})
# Filter out COVID year
growth_data = growth_data[growth_data["value"] > 0]
# Calculate growth rates
def calculate_growth(group):
group = group.sort_values("year")
group["yoy_growth"] = group["value"].pct_change() * 100
group["cumulative_growth"] = (
(group["value"] - group["value"].iloc[0]) / group["value"].iloc[0] * 100
)
return group
growth_analysis = growth_data.groupby("metric").apply(calculate_growth).reset_index(drop=True)
print("Growth Analysis:")
print(growth_analysis)
# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
for metric in growth_data["metric"].unique():
data = growth_data[growth_data["metric"] == metric]
ax.plot(data["year"], data["value"], marker="o",
linewidth=2, markersize=8, label=metric)
ax.set_xlabel("Year")
ax.set_ylabel("Average Attendance")
ax.set_title("Growth of Women's Football Attendance")
ax.legend()
ax.grid(True, alpha=0.3)
# Format y-axis
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: format(int(x), ",")))
plt.tight_layout()
plt.show()
# Investment and coverage growth
investment_data = pd.DataFrame({
"category": ["Broadcast Deals (M)", "Club Budgets (avg M)",
"Data Coverage (leagues)", "Analytics Staff (per club)"],
"y2019": [5, 2, 5, 0.5],
"y2023": [50, 8, 15, 2.5]
})
investment_data["growth_pct"] = (
(investment_data["y2023"] - investment_data["y2019"]) /
investment_data["y2019"] * 100
)
print("\nWomen's Football Growth 2019-2023:")
print(investment_data.to_string(index=False))
# Analyzing Growth of Women's Football
library(tidyverse)
# Sample data: Growth metrics over time
growth_data <- tribble(
~year, ~metric, ~value,
2019, "WSL Average Attendance", 3048,
2020, "WSL Average Attendance", 2847,
2021, "WSL Average Attendance", 3523,
2022, "WSL Average Attendance", 6744,
2023, "WSL Average Attendance", 8134,
2019, "NWSL Average Attendance", 7337,
2020, "NWSL Average Attendance", 0,
2021, "NWSL Average Attendance", 7843,
2022, "NWSL Average Attendance", 10628,
2023, "NWSL Average Attendance", 11276
)
# Calculate growth rates
growth_analysis <- growth_data %>%
filter(value > 0) %>%
group_by(metric) %>%
arrange(year) %>%
mutate(
yoy_growth = (value - lag(value)) / lag(value) * 100,
cumulative_growth = (value - first(value)) / first(value) * 100
)
# Visualization
ggplot(growth_data %>% filter(value > 0),
aes(x = year, y = value, color = metric)) +
geom_line(linewidth = 1.5) +
geom_point(size = 3) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Growth of Women's Football Attendance",
x = "Year",
y = "Average Attendance",
color = "League"
) +
theme_minimal() +
theme(legend.position = "bottom")
# Investment and coverage growth
investment_data <- tribble(
~category, ~y2019, ~y2023, ~growth_pct,
"Broadcast Deals (M)", 5, 50, 900,
"Club Budgets (avg M)", 2, 8, 300,
"Data Coverage (leagues)", 5, 15, 200,
"Analytics Staff (per club)", 0.5, 2.5, 400
)
print("Women's Football Growth 2019-2023:")
print(investment_data)
Resources for Women's Football Analytics
- StatsBomb Open Data: free WWC and select league data
- FBref: comprehensive women's league stats
- Wyscout: video and data platform
- Women's Football Analytics Twitter/X community
- Women in Football organizations
- STATS Perform Women's Football Podcast
- OptaPro Forum sessions on women's football
- Club analyst positions (growing)
- Federation analytics roles
- Media and journalism
- Data provider positions
Practice Exercises
Exercise 44.1: Build a Women's xG Model
Using StatsBomb's free Women's World Cup data, build an expected goals model specifically calibrated for women's football. Compare your model's predictions to StatsBomb's xG values and analyze any systematic differences.
- Consider whether shot distance and angle relationships differ
- Analyze header conversion rates compared to men's football benchmarks
- Test if goalkeeper characteristics affect xG differently
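As a starting point for this exercise, a distance-and-angle logistic regression is the standard baseline xG model. The sketch below trains on synthetic shots so it runs standalone; for the exercise, swap in real StatsBomb shot locations and outcomes (StatsBomb pitches are 120x80, with the goal line at x = 120 and posts at y = 36 and y = 44):

```python
# Baseline xG model: logistic regression on shot distance and goalmouth
# angle. Outcomes here are synthetic, generated from an assumed relationship
# so the sketch is self-contained.
import numpy as np
from sklearn.linear_model import LogisticRegression

def shot_features(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Distance to goal centre and angle subtended by the goalmouth."""
    dist = np.hypot(120 - x, 40 - y)
    angle = np.abs(np.arctan2(44 - y, 120 - x) - np.arctan2(36 - y, 120 - x))
    return np.column_stack([dist, angle])

rng = np.random.default_rng(0)
n = 2000
sx, sy = rng.uniform(90, 120, n), rng.uniform(20, 60, n)
X = shot_features(sx, sy)
# Synthetic outcomes: scoring probability rises with angle, falls with distance
true_logit = -1.0 - 0.12 * X[:, 0] + 2.5 * X[:, 1]
goals = rng.random(n) < 1 / (1 + np.exp(-true_logit))

model = LogisticRegression(max_iter=1000).fit(X, goals)
penalty_spot = shot_features(np.array([108.0]), np.array([40.0]))
print(f"Modelled xG from the penalty spot: {model.predict_proba(penalty_spot)[0, 1]:.2f}")
```

Fitting the same specification separately to women's and men's shot data, then comparing coefficients, is one way to surface the systematic differences the exercise asks about.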
Exercise 44.2: Cross-League Player Comparison
Develop a system to compare players across different women's leagues (e.g., WSL, NWSL, Liga F). Account for league quality differences and create adjusted metrics that allow fair comparison.
- Use international match performance as a common baseline
- Create league strength coefficients based on UEFA/FIFA rankings
- Consider opponent quality in domestic matches
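One simple way to operationalize the league-strength hint: scale per-90 metrics by a league coefficient before ranking players across leagues. The coefficients below are illustrative placeholders, not measured values:

```python
# Cross-league comparison sketch: multiply per-90 output by an assumed
# league strength coefficient. All coefficient values are placeholders.
import pandas as pd

LEAGUE_STRENGTH = {"WSL": 1.00, "NWSL": 1.00, "Liga F": 0.95, "Frauen-Bundesliga": 0.95}

def cross_league_adjust(players: pd.DataFrame,
                        strength: dict = LEAGUE_STRENGTH) -> pd.DataFrame:
    """Scale per-90 metrics by league strength for cross-league comparison."""
    df = players.copy()
    # Leagues missing from the table get a conservative discount
    df["league_coef"] = df["league"].map(strength).fillna(0.85)
    for col in ["goals_p90", "xG_p90"]:
        df[f"adj_{col}"] = df[col] * df["league_coef"]
    return df.sort_values("adj_goals_p90", ascending=False)

players = pd.DataFrame({
    "player": ["P1", "P2", "P3"],
    "league": ["WSL", "Liga F", "Serie A Femminile"],
    "goals_p90": [0.50, 0.55, 0.70],
    "xG_p90": [0.45, 0.50, 0.60],
})
print(cross_league_adjust(players)[["player", "league_coef", "adj_goals_p90"]])
```

For the exercise proper, replace the hand-set coefficients with ones estimated from international results or from players who have moved between leagues.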
Exercise 44.3: Team Pressing Analysis
Analyze pressing patterns for teams in the Women's World Cup. Identify the most effective pressing teams and determine what tactical factors contribute to pressing success.
- Calculate PPDA (Passes Per Defensive Action)
- Measure high press frequency and success rate
- Correlate pressing metrics with match outcomes
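The PPDA hint can be sketched in a few lines: opponent passes in their own build-up area divided by the pressing team's defensive actions in the mirrored zone. This assumes a StatsBomb-style events table with `team`, `type`, and `x` columns, where x runs 0-120 in each acting team's attacking direction:

```python
# Minimal PPDA calculation. The defensive-action set and the 60% zone
# boundary follow common convention; adjust both to taste.
import pandas as pd

DEFENSIVE_ACTIONS = {"Pressure", "Duel", "Interception", "Foul Committed"}

def ppda(events: pd.DataFrame, team: str) -> float:
    """Passes Per Defensive Action; lower means more intense pressing."""
    # Opponent passes in their own defensive 60% (x <= 72 in their frame)
    opp_passes = ((events["team"] != team) &
                  (events["type"] == "Pass") &
                  (events["x"] <= 72)).sum()
    # Our defensive actions high up the pitch (x >= 48 in our frame,
    # which mirrors the opponent's x <= 72)
    def_actions = ((events["team"] == team) &
                   events["type"].isin(DEFENSIVE_ACTIONS) &
                   (events["x"] >= 48)).sum()
    return opp_passes / def_actions if def_actions else float("inf")

# Toy match slice: 20 opponent build-up passes, 4 of our high defensive actions
events = pd.DataFrame({
    "team": ["Them"] * 20 + ["Us"] * 5,
    "type": ["Pass"] * 20 + ["Pressure", "Pressure", "Duel", "Interception", "Pass"],
    "x": [50] * 20 + [60, 70, 55, 80, 90],
})
print(f"PPDA: {ppda(events, 'Us'):.1f}")  # 20 passes / 4 actions
```

Computing this per team per match, then correlating with results, covers the first and third bullets of the exercise.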
Summary
Key Takeaways
- Growing Data Ecosystem: Women's football data coverage has expanded significantly, with major providers now covering top leagues and tournaments
- Unique Considerations: Analytics models should be calibrated specifically for women's football, not simply adapted from men's football
- Recruitment Opportunities: The professionalization of women's football creates demand for sophisticated player evaluation and scouting systems
- Career Growth: Analysts have significant opportunities to make an impact in a rapidly developing field
- Community Building: Contributing to women's football analytics helps build the sport and creates pathways for future analysts
Women's football analytics represents one of the most exciting frontiers in the field. With increasing investment, growing data availability, and passionate communities, analysts have unprecedented opportunities to contribute to the development of the women's game while building rewarding careers.