Capstone - Complete Analytics System
Introduction to Women's Football Analytics
Women's football has experienced explosive growth over the past decade, with record-breaking attendance figures, increased broadcast coverage, and growing investment from clubs and federations. Analytics has become an essential tool for teams looking to gain competitive advantages in this rapidly professionalizing environment.
The Growth of Women's Football Analytics
While women's football has historically had less data coverage, the landscape is rapidly changing. Major data providers now cover top women's leagues, creating unprecedented opportunities for analysts to contribute to the development of the sport.
# Loading Women's Football Data from StatsBomb
from statsbombpy import sb
import pandas as pd

# Get available competitions
competitions = sb.competitions()

# Filter for women's competitions
womens_comps = competitions[
    competitions["competition_name"].str.contains(
        "Women|WSL|NWSL|FAWSL", case=False, na=False
    )
]
print(womens_comps[["competition_name", "season_name"]])

# Load Women's World Cup 2023 data
wwc_matches = sb.matches(competition_id=72, season_id=107)

# Load event data for matches
all_events = []
for match_id in wwc_matches["match_id"]:
    events = sb.events(match_id=match_id)
    all_events.append(events)
wwc_events = pd.concat(all_events, ignore_index=True)

# Summarize available data
print(f"Total matches: {len(wwc_matches)}")
print(f"Total events: {len(wwc_events)}")

# Event type breakdown
event_summary = wwc_events["type"].value_counts().head(10)
print(event_summary)
# Loading Women's Football Data from StatsBomb
library(StatsBombR)
library(tidyverse)

# StatsBomb provides free women's football data
# Get available competitions
competitions <- FreeCompetitions()

# Filter for women's competitions
womens_comps <- competitions %>%
  filter(str_detect(competition_name, "Women|WSL|NWSL|FAWSL"))
print(womens_comps %>% select(competition_name, season_name))

# Load Women's World Cup 2023 data
# FreeMatches() expects a competitions data frame, not a bare ID
wwc_matches <- FreeMatches(Competitions = filter(competitions, competition_id == 72))

# Load all event data (free_allevents() takes the full matches data frame)
wwc_events <- free_allevents(MatchesDF = wwc_matches)

# Summarize available data
cat("Total matches:", nrow(wwc_matches), "\n")
cat("Total events:", nrow(wwc_events), "\n")

# Event type breakdown
event_summary <- wwc_events %>%
  count(type.name, sort = TRUE) %>%
  head(10)
print(event_summary)
The Women's Football Data Landscape
Understanding the data ecosystem for women's football is crucial for analysts working in this space. While coverage has expanded significantly, there are still important differences compared to men's football data availability.
Leagues and competitions with regular data coverage:
- English WSL (Women's Super League)
- Spanish Liga F
- German Frauen-Bundesliga
- French Division 1 Feminine
- NWSL (USA)
- UEFA Women's Champions League
- FIFA Women's World Cup

Data providers:
- StatsBomb - Event data, free WWC data
- Opta - Event data for top leagues
- Wyscout - Video and event data
- Second Spectrum - Tracking data (limited)
- SkillCorner - Broadcast tracking
- FBref - Free aggregated statistics
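The provider list above can be encoded as a simple lookup table, which is handy when deciding where to source data for a given project. The coverage notes come from the list itself; the dictionary structure and helper function are illustrative conveniences, not any provider's API.

```python
# Illustrative registry of women's football data sources (coverage notes from the list above)
DATA_PROVIDERS = {
    "StatsBomb": {"data_type": "event", "free_tier": True, "notes": "free WWC data"},
    "Opta": {"data_type": "event", "free_tier": False, "notes": "top leagues"},
    "Wyscout": {"data_type": "video+event", "free_tier": False, "notes": ""},
    "Second Spectrum": {"data_type": "tracking", "free_tier": False, "notes": "limited"},
    "SkillCorner": {"data_type": "broadcast tracking", "free_tier": False, "notes": ""},
    "FBref": {"data_type": "aggregated", "free_tier": True, "notes": "free stats"},
}

def free_sources(providers):
    """Return the names of providers offering a free tier, sorted alphabetically."""
    return sorted(name for name, info in providers.items() if info["free_tier"])

print(free_sources(DATA_PROVIDERS))  # ['FBref', 'StatsBomb']
```

A registry like this makes it easy to filter by data type as well, e.g. listing only tracking-data vendors when planning a physical-performance study.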
# Scraping Women's Football Stats from FBref
import pandas as pd

# FBref URLs for women's leagues
WSL_URL = "https://fbref.com/en/comps/189/2023-2024/2023-2024-Womens-Super-League-Stats"
LIGA_F_URL = "https://fbref.com/en/comps/230/Liga-F-Stats"
NWSL_URL = "https://fbref.com/en/comps/182/NWSL-Stats"

def get_league_table(url):
    """Scrape league standings from FBref."""
    tables = pd.read_html(url)
    # League table is typically the first table on the page
    standings = tables[0]
    return standings

# Get WSL standings
wsl_standings = get_league_table(WSL_URL)
print("WSL Standings:")
print(wsl_standings[["Squad", "MP", "W", "D", "L", "GF", "GA", "GD", "Pts"]].head())

# Alternative: Using the soccerdata library
try:
    import soccerdata as sd
    # Initialize FBref reader for women's data
    fbref = sd.FBref(leagues=["ENG-WSL"], seasons=["2023-2024"])
    # Get squad statistics
    squad_stats = fbref.read_team_season_stats()
    print(squad_stats.head())
except ImportError:
    print("Install soccerdata: pip install soccerdata")

# Custom scraping function for player stats
def get_player_stats(url):
    """Extract the first player-level statistics table from an FBref page."""
    tables = pd.read_html(url)
    for table in tables:
        if "Player" in table.columns:
            return table
    return None

# Compare goal-scoring across leagues
def compare_leagues(urls_dict):
    """Compare statistics across multiple women's leagues."""
    comparison = []
    for league, url in urls_dict.items():
        try:
            stats = get_player_stats(url)
            if stats is not None:
                comparison.append({
                    "League": league,
                    "Avg_Goals": stats["Gls"].mean(),
                    "Max_Goals": stats["Gls"].max()
                })
        except Exception:
            # Skip leagues whose pages fail to load or parse
            continue
    return pd.DataFrame(comparison)
# Scraping Women's Football Stats from FBref
library(worldfootballR)
library(tidyverse)

# Get WSL (Women's Super League) standings
wsl_standings <- fb_season_team_stats(
  country = "ENG",
  gender = "F",
  season_end_year = 2024,
  tier = "1st",
  stat_type = "league_table"
)
print(wsl_standings %>% select(Squad, MP, W, D, L, GF, GA, GD, Pts))

# Get player stats from Liga F
# (fb_season_team_stats() returns team-level data; player-level stats
# come from fb_league_stats() with team_or_player = "player")
ligaf_stats <- fb_league_stats(
  country = "ESP",
  gender = "F",
  season_end_year = 2024,
  tier = "1st",
  stat_type = "standard",
  team_or_player = "player"
)

# Top scorers analysis (column names follow the FBref table headers)
top_scorers <- ligaf_stats %>%
  arrange(desc(Gls)) %>%
  select(Squad, Player, Gls, Ast, xG, xAG) %>%
  head(10)
print(top_scorers)

# Get NWSL player data
nwsl_stats <- fb_league_stats(
  country = "USA",
  gender = "F",
  season_end_year = 2024,
  tier = "1st",
  stat_type = "standard",
  team_or_player = "player"
)

# Compare leagues
compare_leagues <- function(data, league_name) {
  data %>%
    summarise(
      League = league_name,
      Avg_Goals = mean(Gls, na.rm = TRUE),
      Avg_xG = mean(xG, na.rm = TRUE),
      Max_Goals = max(Gls, na.rm = TRUE)
    )
}
Analytical Considerations for Women's Football
While the fundamental principles of football analytics apply across both men's and women's football, there are important considerations and nuances that analysts should be aware of when working with women's football data.
Important Considerations
Women's football should be analyzed on its own terms, not simply compared to men's football. Metrics and models should be calibrated specifically for women's football data, and insights should be contextualized within the women's game.
Key Analytical Differences
Physical Metrics
- Different baseline values for speed, distance, acceleration
- Pitch dimensions may vary (some leagues use smaller pitches)
- Ball size and weight standardization differences
- Goalkeeper reach and diving ranges differ
Statistical Baselines
- League-specific xG models needed
- Different goal-scoring rates and patterns
- Set piece conversion rates vary
- Pressing intensity benchmarks differ
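A quick way to see why league-specific xG models matter: compare an existing model's total predicted goals against actual goals on women's data and derive a global scaling factor. This is a toy sketch with invented numbers, not real league data or a real calibration method recommendation; in practice you would refit or recalibrate (e.g. with isotonic or Platt scaling) on the target league's shots.

```python
# Toy calibration check: does a borrowed xG model over- or under-predict goals?
# All numbers below are illustrative, not real data.
predicted_xg = [0.12, 0.05, 0.30, 0.08, 0.45, 0.10, 0.22, 0.18]  # model outputs per shot
outcomes =     [0,    0,    1,    0,    1,    0,    0,    1]      # 1 = goal

total_xg = sum(predicted_xg)   # 1.50 expected goals
total_goals = sum(outcomes)    # 3 actual goals

# Global calibration factor: > 1 means the model under-predicts on this data
calibration_factor = total_goals / total_xg
print(f"Calibration factor: {calibration_factor:.2f}")  # 2.00

# Rescaled (and capped) probabilities for league-specific use
recalibrated = [min(p * calibration_factor, 0.99) for p in predicted_xg]
print(f"Recalibrated total xG: {sum(recalibrated):.2f}")  # 3.00
```

A factor far from 1.0 is a strong signal that the borrowed model's baselines do not transfer and a dedicated model, like the one built below, is warranted.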
# Building Women's Football Specific xG Model
from statsbombpy import sb
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss, roc_auc_score

# Load Women's World Cup data
wwc_matches = sb.matches(competition_id=72, season_id=107)

# Collect all shots
all_shots = []
for match_id in wwc_matches["match_id"]:
    events = sb.events(match_id=match_id)
    shots = events[events["type"] == "Shot"]
    all_shots.append(shots)
shots_df = pd.concat(all_shots, ignore_index=True)

# Extract shot location (guard against missing values, which arrive as NaN floats)
shots_df["x"] = shots_df["location"].apply(lambda loc: loc[0] if isinstance(loc, list) else None)
shots_df["y"] = shots_df["location"].apply(lambda loc: loc[1] if isinstance(loc, list) else None)

# Calculate features (StatsBomb pitch is 120x80, goal centre at (120, 40))
shots_df["distance"] = np.sqrt(
    (120 - shots_df["x"])**2 + (40 - shots_df["y"])**2
)
shots_df["angle"] = np.arctan2(
    np.abs(40 - shots_df["y"]),
    120 - shots_df["x"]
) * 180 / np.pi

# Binary features
shots_df["is_header"] = (shots_df["shot_body_part"] == "Head").astype(int)
shots_df["is_first_time"] = shots_df["shot_first_time"].fillna(False).astype(int)
shots_df["is_open_play"] = (shots_df["shot_type"] == "Open Play").astype(int)
shots_df["is_goal"] = (shots_df["shot_outcome"] == "Goal").astype(int)

# Baseline statistics
print("Women's WWC Shot Statistics:")
print(f"Total shots: {len(shots_df)}")
print(f"Conversion rate: {shots_df['is_goal'].mean():.3f}")
print(f"Header conversion: {shots_df[shots_df['is_header'] == 1]['is_goal'].mean():.3f}")

# Prepare model features
features = ["distance", "angle", "is_header", "is_first_time", "is_open_play"]
X = shots_df[features].dropna()
y = shots_df.loc[X.index, "is_goal"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train xG model
xg_model = LogisticRegression(max_iter=1000)
xg_model.fit(X_train, y_train)

# Predictions
shots_df.loc[X.index, "xG_custom"] = xg_model.predict_proba(X)[:, 1]

# Evaluate
y_pred = xg_model.predict_proba(X_test)[:, 1]
print("\nModel Performance:")
print(f"Brier Score: {brier_score_loss(y_test, y_pred):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_pred):.4f}")

# Compare with StatsBomb xG
correlation = shots_df[["xG_custom", "shot_statsbomb_xg"]].corr()
print(f"\nCorrelation with StatsBomb xG: {correlation.iloc[0, 1]:.3f}")
# Building Women's Football Specific xG Model
library(tidyverse)
library(StatsBombR)

# Load Women's World Cup shot data
competitions <- FreeCompetitions()
wwc_matches <- FreeMatches(Competitions = filter(competitions, competition_id == 72))
wwc_events <- free_allevents(MatchesDF = wwc_matches)
# allclean() splits location into location.x / location.y, among other tidying
wwc_events <- allclean(wwc_events)

# Filter shots and prepare features
shots <- wwc_events %>%
  filter(type.name == "Shot") %>%
  mutate(
    # Calculate distance and angle (120x80 pitch, goal centre at (120, 40))
    distance = sqrt((120 - location.x)^2 + (40 - location.y)^2),
    angle = atan2(abs(40 - location.y), 120 - location.x) * 180 / pi,
    # Shot type features (shot.first_time is NA when not set, so coalesce to FALSE)
    is_header = shot.body_part.name == "Head",
    is_first_time = coalesce(shot.first_time, FALSE),
    # Situation features
    is_open_play = shot.type.name == "Open Play",
    is_penalty = shot.type.name == "Penalty",
    is_freekick = shot.type.name == "Free Kick",
    # Outcome
    is_goal = shot.outcome.name == "Goal"
  )

# Calculate baseline conversion rates
baseline_stats <- shots %>%
  summarise(
    total_shots = n(),
    total_goals = sum(is_goal),
    conversion_rate = mean(is_goal),
    # By type
    header_conversion = mean(is_goal[is_header], na.rm = TRUE),
    open_play_conversion = mean(is_goal[is_open_play], na.rm = TRUE),
    penalty_conversion = mean(is_goal[is_penalty], na.rm = TRUE)
  )
print(baseline_stats)

# Build logistic regression xG model
xg_model <- glm(
  is_goal ~ distance + angle + is_header + is_first_time +
    is_open_play + is_freekick,
  data = shots,
  family = binomial()
)
summary(xg_model)

# Add xG predictions (predict on newdata so rows with missing
# predictors get NA instead of causing a length mismatch)
shots$xG_custom <- predict(xg_model, newdata = shots, type = "response")

# Compare with StatsBomb xG
comparison <- shots %>%
  filter(!is.na(shot.statsbomb_xg), !is.na(xG_custom)) %>%
  summarise(
    correlation = cor(xG_custom, shot.statsbomb_xg),
    mean_difference = mean(xG_custom - shot.statsbomb_xg)
  )
cat("Correlation with StatsBomb xG:", comparison$correlation, "\n")
Player Evaluation in Women's Football
Player evaluation in women's football requires understanding the context of the women's game, including league quality differences, international experience, and a smaller professional talent pool than in men's football.
# Player Evaluation Framework for Women's Football
from statsbombpy import sb
import pandas as pd
import numpy as np

# Load Women's World Cup data
wwc_matches = sb.matches(competition_id=72, season_id=107)

# Collect all events
all_events = []
for match_id in wwc_matches["match_id"]:
    events = sb.events(match_id=match_id)
    events["match_id"] = match_id
    all_events.append(events)
events_df = pd.concat(all_events, ignore_index=True)

# Calculate player statistics
def calculate_player_stats(events):
    """Calculate comprehensive player statistics."""
    # Group by player
    player_stats = events.groupby(["player", "team"]).agg({
        "match_id": "nunique",
        "type": "count"
    }).reset_index()
    player_stats.columns = ["player", "team", "matches", "events"]

    # Shots and goals
    shots = events[events["type"] == "Shot"].groupby("player").agg({
        "shot_statsbomb_xg": ["count", "sum"],
        "shot_outcome": lambda x: (x == "Goal").sum()
    }).reset_index()
    shots.columns = ["player", "shots", "xG", "goals"]

    # Passes
    passes = events[events["type"] == "Pass"].groupby("player").agg({
        "id": "count",
        "pass_outcome": lambda x: x.isna().mean(),  # NaN outcome = completed pass
        "pass_shot_assist": "sum"
    }).reset_index()
    passes.columns = ["player", "passes", "pass_completion", "key_passes"]

    # Defensive actions
    defensive = events[events["type"].isin(["Pressure", "Interception"])].groupby("player").agg({
        "type": [
            lambda x: (x == "Pressure").sum(),
            lambda x: (x == "Interception").sum()
        ]
    }).reset_index()
    defensive.columns = ["player", "pressures", "interceptions"]

    # Merge all stats
    player_stats = player_stats.merge(shots, on="player", how="left")
    player_stats = player_stats.merge(passes, on="player", how="left")
    player_stats = player_stats.merge(defensive, on="player", how="left")
    return player_stats.fillna(0)

player_stats = calculate_player_stats(events_df)

# Calculate per-90 metrics (minutes are estimated, not exact)
player_stats["est_minutes"] = player_stats["matches"] * 75  # Rough per-match average
player_stats["xG_p90"] = player_stats["xG"] / player_stats["est_minutes"] * 90
player_stats["shots_p90"] = player_stats["shots"] / player_stats["est_minutes"] * 90
player_stats["pressures_p90"] = player_stats["pressures"] / player_stats["est_minutes"] * 90

# Filter for players with sufficient playing time
qualified = player_stats[player_stats["matches"] >= 3].copy()

# Percentile rankings
for col in ["xG_p90", "shots_p90", "pressures_p90"]:
    qualified[f"{col}_pct"] = qualified[col].rank(pct=True) * 100

# Top performers
top_by_xg = qualified.nlargest(10, "xG_p90")[
    ["player", "team", "matches", "xG", "xG_p90", "xG_p90_pct"]
]
print("Top Performers by xG per 90:")
print(top_by_xg.to_string(index=False))
# Player Evaluation Framework for Women's Football
library(tidyverse)
library(StatsBombR)

# Load Women's World Cup data as a proxy for league data
competitions <- FreeCompetitions()
wwc_matches <- FreeMatches(Competitions = filter(competitions, competition_id == 72))
wwc_events <- allclean(free_allevents(MatchesDF = wwc_matches))

# Calculate player statistics
player_stats <- wwc_events %>%
  filter(type.name %in% c("Pass", "Shot", "Dribble", "Ball Receipt*",
                          "Carry", "Pressure", "Duel", "Interception")) %>%
  group_by(player.id, player.name, team.name) %>%
  summarise(
    matches = n_distinct(match_id),
    # Attacking
    shots = sum(type.name == "Shot"),
    goals = sum(type.name == "Shot" & shot.outcome.name == "Goal", na.rm = TRUE),
    xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    # Passing
    passes = sum(type.name == "Pass"),
    pass_completion = mean(is.na(pass.outcome.name[type.name == "Pass"])),
    key_passes = sum(pass.shot_assist == TRUE, na.rm = TRUE),
    # Dribbling
    dribbles = sum(type.name == "Dribble"),
    dribble_success = mean(dribble.outcome.name == "Complete", na.rm = TRUE),
    # Defensive
    pressures = sum(type.name == "Pressure"),
    interceptions = sum(type.name == "Interception"),
    .groups = "drop"
  )

# Calculate per-90 stats (minutes estimated from matches played,
# since exact minutes are not derivable from these event types)
player_stats <- player_stats %>%
  mutate(
    est_minutes = matches * 75,  # Rough estimate
    shots_p90 = shots / est_minutes * 90,
    xG_p90 = xG / est_minutes * 90,
    passes_p90 = passes / est_minutes * 90,
    pressures_p90 = pressures / est_minutes * 90
  )

# Percentile ranking within tournament
player_stats <- player_stats %>%
  filter(matches >= 3) %>%  # Minimum appearances
  mutate(
    xG_percentile = percent_rank(xG_p90) * 100,
    passing_percentile = percent_rank(passes_p90) * 100,
    pressing_percentile = percent_rank(pressures_p90) * 100
  )

# Top performers by xG
top_by_xg <- player_stats %>%
  arrange(desc(xG_p90)) %>%
  select(player.name, team.name, matches, xG, xG_p90, xG_percentile) %>%
  head(10)
print(top_by_xg)
Creating Player Comparison Visualizations
# Player Radar Charts for Women's Football
import matplotlib.pyplot as plt
import numpy as np
from math import pi

def create_radar_chart(player_data, player_name, metrics):
    """Create radar chart for a single player."""
    # Get player data
    player = player_data[player_data["player"] == player_name].iloc[0]
    # Get percentile values
    values = [player[f"{m}_p90_pct"] for m in metrics]
    values += values[:1]  # Complete the polygon
    # Set up radar chart
    angles = [n / float(len(metrics)) * 2 * pi for n in range(len(metrics))]
    angles += angles[:1]
    fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
    # Plot
    ax.plot(angles, values, linewidth=2, linestyle="solid", color="#1B5E20")
    ax.fill(angles, values, alpha=0.3, color="#1B5E20")
    # Labels
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(metrics)
    ax.set_ylim(0, 100)
    plt.title(f"Player Radar: {player_name}", size=14, y=1.1)
    plt.tight_layout()
    return fig

# Create radar for top player
metrics = ["xG", "shots", "pressures"]
fig = create_radar_chart(qualified, "A. Bonmati", metrics)
plt.show()

# Multi-player comparison
def compare_players_radar(data, players, metrics):
    """Compare multiple players on one radar chart."""
    angles = [n / float(len(metrics)) * 2 * pi for n in range(len(metrics))]
    angles += angles[:1]
    fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(polar=True))
    colors = ["#1B5E20", "#FF6B35", "#4169E1", "#9932CC"]
    for i, player_name in enumerate(players):
        player = data[data["player"] == player_name]
        if len(player) == 0:
            continue
        values = [player[f"{m}_p90_pct"].values[0] for m in metrics]
        values += values[:1]
        ax.plot(angles, values, linewidth=2, label=player_name, color=colors[i])
        ax.fill(angles, values, alpha=0.1, color=colors[i])
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(metrics)
    ax.set_ylim(0, 100)
    ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1))
    plt.title("Player Comparison", size=14, y=1.1)
    return fig

# Compare selected players
selected_players = ["A. Bonmati", "L. Bronze", "A. Russo"]
fig = compare_players_radar(qualified, selected_players, metrics)
plt.show()
# Player Radar Charts for Women's Football
library(tidyverse)
library(fmsb)

# Prepare data for radar chart
create_radar_data <- function(player_data, metrics, player_name) {
  # Select player
  player <- player_data %>%
    filter(player.name == player_name)
  # Get percentile values (unlist() coerces the one-row data frame to a vector;
  # as.numeric() on a data frame would error)
  values <- player %>%
    select(all_of(paste0(metrics, "_percentile"))) %>%
    unlist() %>%
    as.numeric()
  # fmsb::radarchart() expects max and min rows before the values
  radar_df <- rbind(
    rep(100, length(metrics)),  # Max
    rep(0, length(metrics)),    # Min
    values
  )
  colnames(radar_df) <- metrics
  as.data.frame(radar_df)
}

# Example radar chart
metrics <- c("xG", "passing", "pressing")
radar_data <- create_radar_data(player_stats, metrics, "Aitana Bonmati")

# Plot
radarchart(radar_data,
           pcol = "#1B5E20",
           pfcol = rgb(0.1, 0.4, 0.1, 0.5),
           plwd = 2,
           cglcol = "grey",
           cglty = 1,
           axislabcol = "grey",
           vlcex = 0.8,
           title = "Player Radar: Aitana Bonmati")

# Comparison radar (ggplot2 alternative, using all percentile columns)
compare_players <- function(data, players) {
  comparison <- data %>%
    filter(player.name %in% players) %>%
    select(player.name, ends_with("_percentile")) %>%
    pivot_longer(-player.name,
                 names_to = "metric",
                 values_to = "value")
  ggplot(comparison, aes(x = metric, y = value,
                         group = player.name, color = player.name)) +
    geom_polygon(fill = NA, linewidth = 1) +
    coord_polar() +
    theme_minimal() +
    labs(title = "Player Comparison", color = "Player") +
    theme(axis.text.x = element_text(size = 10))
}
Team Analysis and Tactical Patterns
Understanding team tactics in women's football requires analyzing patterns specific to the women's game, including pressing structures, build-up patterns, and set piece strategies.
# Team Tactical Analysis - Women's Football
from statsbombpy import sb
import pandas as pd
import numpy as np

# Load match data
wwc_matches = sb.matches(competition_id=72, season_id=107)
all_events = []
for match_id in wwc_matches["match_id"]:
    events = sb.events(match_id=match_id)
    all_events.append(events)
events_df = pd.concat(all_events, ignore_index=True)

# Extract locations
events_df["x"] = events_df["location"].apply(lambda loc: loc[0] if isinstance(loc, list) else None)
events_df["y"] = events_df["location"].apply(lambda loc: loc[1] if isinstance(loc, list) else None)

# Pressing analysis
def analyze_pressing(events):
    """Analyze team pressing patterns."""
    pressures = events[events["type"] == "Pressure"].copy()
    # Define zones
    pressures["zone"] = pd.cut(
        pressures["x"],
        bins=[0, 40, 80, 120],
        labels=["Defensive", "Middle", "Attacking"]
    )
    # Team aggregation
    pressing_stats = pressures.groupby("team").agg({
        "id": "count",
        "zone": lambda x: (x == "Attacking").sum()
    }).reset_index()
    pressing_stats.columns = ["team", "total_pressures", "high_pressures"]
    pressing_stats["high_press_pct"] = (
        pressing_stats["high_pressures"] / pressing_stats["total_pressures"] * 100
    )
    return pressing_stats.sort_values("high_press_pct", ascending=False)

pressing_analysis = analyze_pressing(events_df)
print("Top Pressing Teams (High Press %):")
print(pressing_analysis[["team", "total_pressures", "high_press_pct"]].head(10))

# Build-up play analysis
def analyze_buildup(events):
    """Analyze team build-up patterns."""
    # Passes in defensive third
    passes = events[
        (events["type"] == "Pass") &
        (events["x"] < 40)
    ].copy()
    # Pass length (full Euclidean distance, not just the x component)
    passes["end_x"] = passes["pass_end_location"].apply(
        lambda loc: loc[0] if isinstance(loc, list) else None
    )
    passes["end_y"] = passes["pass_end_location"].apply(
        lambda loc: loc[1] if isinstance(loc, list) else None
    )
    passes["pass_length"] = np.sqrt(
        (passes["end_x"] - passes["x"])**2 + (passes["end_y"] - passes["y"])**2
    )
    buildup_stats = passes.groupby("team").agg({
        "id": "count",
        "pass_length": ["mean", lambda x: (x < 15).sum(), lambda x: (x > 35).sum()]
    }).reset_index()
    buildup_stats.columns = ["team", "buildup_passes", "avg_length",
                             "short_passes", "long_passes"]
    buildup_stats["short_pct"] = buildup_stats["short_passes"] / buildup_stats["buildup_passes"] * 100
    buildup_stats["direct_pct"] = buildup_stats["long_passes"] / buildup_stats["buildup_passes"] * 100
    # Classify style
    buildup_stats["style"] = buildup_stats.apply(
        lambda row: "Possession" if row["short_pct"] > 60
        else ("Direct" if row["direct_pct"] > 30 else "Balanced"),
        axis=1
    )
    return buildup_stats

buildup_analysis = analyze_buildup(events_df)
print("\nBuild-up Play Styles:")
print(buildup_analysis[["team", "short_pct", "direct_pct", "style"]])
# Team Tactical Analysis - Women's Football
library(tidyverse)
library(StatsBombR)

# Load match data
competitions <- FreeCompetitions()
wwc_matches <- FreeMatches(Competitions = filter(competitions, competition_id == 72))
wwc_events <- allclean(free_allevents(MatchesDF = wwc_matches))

# Team pressing analysis
pressing_analysis <- wwc_events %>%
  filter(type.name == "Pressure") %>%
  mutate(
    # Pitch zones (120x80)
    zone_x = cut(location.x, breaks = c(0, 40, 80, 120),
                 labels = c("Defensive", "Middle", "Attacking")),
    zone_y = cut(location.y, breaks = c(0, 27, 53, 80),
                 labels = c("Left", "Center", "Right"))
  ) %>%
  group_by(team.name) %>%
  summarise(
    total_pressures = n(),
    high_press = sum(zone_x == "Attacking", na.rm = TRUE),
    mid_press = sum(zone_x == "Middle", na.rm = TRUE),
    low_press = sum(zone_x == "Defensive", na.rm = TRUE),
    high_press_pct = high_press / total_pressures * 100,
    # Pressure events carry no direct success flag; counterpressing
    # (pressure shortly after losing possession) is a useful proxy
    counterpresses = sum(counterpress == TRUE, na.rm = TRUE),
    counterpress_pct = counterpresses / total_pressures * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(high_press_pct))

print("Top Pressing Teams:")
print(pressing_analysis %>%
        select(team.name, total_pressures, high_press_pct, counterpress_pct) %>%
        head(10))

# Build-up play analysis
buildup_analysis <- wwc_events %>%
  filter(type.name == "Pass",
         location.x < 40) %>%  # Defensive third
  group_by(team.name) %>%
  summarise(
    buildup_passes = n(),
    short_passes = sum(pass.length < 15, na.rm = TRUE),
    long_passes = sum(pass.length > 35, na.rm = TRUE),
    # Direction
    forward = sum(pass.end_location.x > location.x, na.rm = TRUE),
    backward = sum(pass.end_location.x < location.x, na.rm = TRUE),
    # Style indicators
    short_pct = short_passes / buildup_passes * 100,
    direct_pct = long_passes / buildup_passes * 100,
    forward_pct = forward / buildup_passes * 100,
    .groups = "drop"
  )

# Classify playing styles
buildup_analysis <- buildup_analysis %>%
  mutate(
    style = case_when(
      short_pct > 60 ~ "Possession-based",
      direct_pct > 30 ~ "Direct",
      TRUE ~ "Balanced"
    )
  )

cat("\nBuild-up Play Styles:\n")
print(buildup_analysis %>% select(team.name, short_pct, direct_pct, style))
Recruitment and Scouting Analytics
With increasing investment in women's football, recruitment analytics has become crucial. The challenge lies in identifying talent across leagues with varying quality levels and limited historical data.
Recruitment Challenges in Women's Football
- League Quality Variation: Performance must be adjusted for league strength
- Limited Data History: Many players have shorter professional careers on record
- International vs. Club: Some players excel more in international tournaments
- Age Considerations: Career trajectories may differ from men's football
# Recruitment Scouting System for Women's Football
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

class WomensFootballScout:
    """Scouting and recruitment analytics for women's football."""

    def __init__(self):
        # Position-specific metric weights
        self.position_weights = {
            "Forward": {
                "xG_p90": 0.30, "shots_p90": 0.20,
                "dribbles_p90": 0.20, "pressures_p90": 0.15,
                "key_passes_p90": 0.15
            },
            "Midfielder": {
                "xG_p90": 0.15, "passes_p90": 0.25,
                "key_passes_p90": 0.20, "pressures_p90": 0.20,
                "interceptions_p90": 0.20
            },
            "Defender": {
                "interceptions_p90": 0.25, "tackles_p90": 0.25,
                "passes_p90": 0.20, "aerials_p90": 0.15,
                "pressures_p90": 0.15
            }
        }
        # League quality factors (subjective priors; refine with results data)
        self.league_strength = {
            "WSL": 1.0,
            "Liga F": 1.0,
            "NWSL": 0.95,
            "D1 Feminine": 0.95,
            "Frauen-Bundesliga": 0.92,
            "Serie A Femminile": 0.88,
            "A-League Women": 0.80
        }

    def calculate_composite_score(self, player_data, position):
        """Add a weighted composite score column for the given position."""
        weights = self.position_weights.get(position, self.position_weights["Midfielder"])
        player_data = player_data.copy()
        score = 0
        for metric, weight in weights.items():
            if metric in player_data.columns:
                # Use percentile ranking
                player_data[f"{metric}_pct"] = player_data[metric].rank(pct=True) * 100
                score += player_data[f"{metric}_pct"] * weight
        player_data["composite_score"] = score
        return player_data

    def adjust_for_league(self, data):
        """Adjust statistics for league quality."""
        data = data.copy()
        data["league_factor"] = data["league"].map(self.league_strength).fillna(0.85)
        data["adjusted_score"] = data["composite_score"] * data["league_factor"]
        return data

    def calculate_value_score(self, data):
        """Calculate player value considering age."""
        data = data.copy()

        # Age factors for women's football
        def age_factor(age):
            if age < 22:
                return 1.3   # High potential
            elif age < 25:
                return 1.2   # Rising
            elif age < 30:
                return 1.0   # Peak
            elif age < 33:
                return 0.8   # Declining
            else:
                return 0.6   # Veterans

        data["age_factor"] = data["age"].apply(age_factor)
        data["value_score"] = data["composite_score"] * data["age_factor"]
        return data

    def find_similar_players(self, target_name, all_players, features, n=10):
        """Find players similar to target using cosine similarity."""
        # Reset the index so label-based lookup matches the array's positions
        all_players = all_players.reset_index(drop=True).copy()
        # Prepare feature matrix
        X = all_players[features].fillna(0)
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        # Find target row
        target_idx = all_players.index[all_players["player"] == target_name][0]
        target_vector = X_scaled[target_idx].reshape(1, -1)
        # Calculate similarities
        similarities = cosine_similarity(target_vector, X_scaled)[0]
        all_players["similarity"] = similarities
        return all_players.nlargest(n + 1, "similarity").iloc[1:]  # Exclude self

# Usage example
scout = WomensFootballScout()

# Calculate scores for forwards
# forward_data = scout.calculate_composite_score(player_stats, "Forward")
# adjusted_data = scout.adjust_for_league(forward_data)
# valued_data = scout.calculate_value_score(adjusted_data)
print("Scouting system initialized")
print(f"Positions: {list(scout.position_weights.keys())}")
print(f"Leagues: {list(scout.league_strength.keys())}")
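To make the interaction of the league and age adjustments concrete, the following self-contained sketch replays the same multiplicative arithmetic on one invented player. The player, the factor values, and the scores are purely illustrative; they are not drawn from real data.

```python
# Minimal sketch of the league + age adjustments used in the scouting framework.
# Player data and factor values are invented for illustration.
LEAGUE_STRENGTH = {"WSL": 1.0, "NWSL": 0.95, "A-League Women": 0.80}

def age_factor(age):
    """Age multipliers mirroring the scouting class above."""
    if age < 22:
        return 1.3   # High potential
    elif age < 25:
        return 1.2   # Rising
    elif age < 30:
        return 1.0   # Peak
    elif age < 33:
        return 0.8   # Declining
    return 0.6       # Veterans

player = {"name": "Hypothetical Forward", "league": "NWSL",
          "age": 21, "composite_score": 70.0}

# League adjustment first, then the age multiplier
adjusted = player["composite_score"] * LEAGUE_STRENGTH.get(player["league"], 0.85)
value = adjusted * age_factor(player["age"])
print(f"Adjusted score: {adjusted:.1f}, value score: {value:.2f}")
# 70 * 0.95 = 66.5, then 66.5 * 1.3 = 86.45
```

Because the factors are multiplicative, a young player in a slightly weaker league can still out-rank an older player with a higher raw composite score, which is exactly the behaviour a potential-oriented recruitment model wants.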
# Recruitment Scouting System for Women's Football
library(tidyverse)

# Create scouting framework
create_scouting_profile <- function(player_stats, position = "Forward") {
  # Define position-specific weights
  weights <- list(
    Forward = c(xG = 0.3, shots = 0.2, dribbles = 0.2,
                pressing = 0.15, key_passes = 0.15),
    Midfielder = c(xG = 0.15, passes = 0.25, key_passes = 0.2,
                   pressing = 0.2, interceptions = 0.2),
    Defender = c(interceptions = 0.25, tackles = 0.25,
                 passes = 0.2, aerials = 0.15, pressing = 0.15)
  )
  w <- weights[[position]]
  # Treat weights missing for this position as zero so every position works
  wt <- function(name) ifelse(is.na(w[name]), 0, w[name])
  # Calculate composite score
  player_stats %>%
    mutate(
      composite_score =
        xG_percentile * wt("xG") +
        passing_percentile * wt("passes") +
        pressing_percentile * wt("pressing")
      # Add other metrics as available
    ) %>%
    arrange(desc(composite_score))
}

# League quality adjustment
adjust_for_league <- function(stats) {
  # League strength factors (1.0 = baseline; higher = stronger league)
  league_strength <- c(
    "WSL" = 1.0,
    "Liga F" = 1.0,
    "NWSL" = 0.95,
    "Division 1 Feminine" = 0.95,
    "Frauen-Bundesliga" = 0.92,
    "Serie A Femminile" = 0.88,
    "A-League Women" = 0.80
  )
  stats %>%
    mutate(
      league_factor = league_strength[league],
      adjusted_xG = xG_p90 * league_factor,
      adjusted_score = composite_score * league_factor
    )
}

# Age-based value assessment
calculate_player_value <- function(stats) {
  stats %>%
    mutate(
      # Peak years typically 25-29 in women's football
      age_factor = case_when(
        age < 22 ~ 1.3,  # High potential
        age < 25 ~ 1.2,  # Rising
        age < 30 ~ 1.0,  # Peak
        age < 33 ~ 0.8,  # Declining
        TRUE ~ 0.6       # Veterans
      ),
      # Combine quality and potential
      value_score = composite_score * age_factor
    )
}

# Similarity search for recruitment
find_similar_players <- function(target_player, all_players, n = 10) {
  # Features for comparison
  features <- c("xG_p90", "passes_p90", "pressures_p90",
                "dribble_success", "pass_completion")
  # Coerce the one-row target to a plain numeric vector
  target_values <- unlist(target_player %>% select(all_of(features)))
  # Calculate Euclidean distance
  all_players %>%
    rowwise() %>%
    mutate(
      distance = sqrt(sum((c_across(all_of(features)) - target_values)^2))
    ) %>%
    ungroup() %>%
    arrange(distance) %>%
    head(n)
}
Physical Performance Analysis
Physical performance analysis in women's football requires understanding the unique physiological characteristics of female athletes. While the principles are similar to men's football, the baseline values and training considerations differ.
Key Physical Differences
- Different baseline values for maximal sprint speed (typically 26-30 km/h vs 32-36 km/h in men's)
- Similar relative distances covered when normalized to physical capacity
- Menstrual cycle considerations for training load management
- Different injury risk profiles (higher ACL injury rates)
- Recovery patterns may differ due to hormonal factors
# Python: Physical performance analysis for women's football
import pandas as pd
import numpy as np
from dataclasses import dataclass
from typing import Dict, List
@dataclass
class PhysicalBenchmarks:
"""Physical performance benchmarks for women's football."""
# Position benchmarks (per 90 minutes)
position_benchmarks = {
"GK": {"total_distance": 5500, "high_speed": 100, "sprint": 30, "accelerations": 15},
"CB": {"total_distance": 9500, "high_speed": 300, "sprint": 80, "accelerations": 35},
"FB": {"total_distance": 10800, "high_speed": 600, "sprint": 150, "accelerations": 50},
"CM": {"total_distance": 11200, "high_speed": 500, "sprint": 120, "accelerations": 45},
"AM": {"total_distance": 10500, "high_speed": 550, "sprint": 140, "accelerations": 55},
"W": {"total_distance": 10200, "high_speed": 650, "sprint": 180, "accelerations": 60},
"ST": {"total_distance": 9800, "high_speed": 500, "sprint": 130, "accelerations": 45}
}
# Speed zones (km/h) - women's specific
speed_zones = {
1: (0, 7, "Walking"),
2: (7, 13, "Jogging"),
3: (13, 18, "Running"),
4: (18, 23, "High-speed running"),
5: (23, 30, "Sprinting")
}
class WomensPhysicalAnalyzer:
"""Analyze physical performance in women's football."""
def __init__(self):
self.benchmarks = PhysicalBenchmarks()
def analyze_match_performance(self, player_data: pd.DataFrame) -> pd.DataFrame:
"""Compare player physical output to benchmarks."""
df = player_data.copy()
# Get benchmarks for each position
df["benchmark_distance"] = df["position"].map(
lambda x: self.benchmarks.position_benchmarks.get(x, {}).get("total_distance", 10000)
)
df["benchmark_hsd"] = df["position"].map(
lambda x: self.benchmarks.position_benchmarks.get(x, {}).get("high_speed", 500)
)
df["benchmark_sprint"] = df["position"].map(
lambda x: self.benchmarks.position_benchmarks.get(x, {}).get("sprint", 100)
)
# Calculate percentages
df["distance_pct"] = df["total_distance"] / df["benchmark_distance"] * 100
df["hsd_pct"] = df["high_speed_distance"] / df["benchmark_hsd"] * 100
df["sprint_pct"] = df["sprint_distance"] / df["benchmark_sprint"] * 100
# Overall physical score
df["physical_score"] = (df["distance_pct"] + df["hsd_pct"] + df["sprint_pct"]) / 3
# Performance classification
df["performance_level"] = pd.cut(
df["physical_score"],
bins=[0, 80, 90, 100, 110, float("inf")],
labels=["Underperforming", "Below Average", "Average", "Above Average", "Exceptional"]
)
return df
def analyze_cycle_impact(self, physical_data: pd.DataFrame,
cycle_data: pd.DataFrame) -> pd.DataFrame:
"""Analyze performance variation across menstrual cycle phases."""
# Merge datasets
combined = physical_data.merge(cycle_data, on=["player_id", "date"], how="left")
        # Define cycle phases (simplified fixed 28-day boundaries; actual phase timing varies by athlete)
def get_phase(day):
if pd.isna(day):
return "Unknown"
if day <= 5:
return "Menstruation"
elif day <= 14:
return "Follicular"
elif day <= 21:
return "Ovulation"
else:
return "Luteal"
combined["cycle_phase"] = combined["day_in_cycle"].apply(get_phase)
# Aggregate by phase
phase_analysis = combined.groupby(["player_id", "cycle_phase"]).agg({
"total_distance": "mean",
"high_speed_distance": "mean",
"sprint_distance": "mean",
"rpe": "mean" # Rating of Perceived Exertion
}).reset_index()
return phase_analysis
def calculate_acl_risk(self, player_data: pd.DataFrame) -> pd.DataFrame:
"""Calculate ACL injury risk factors."""
df = player_data.copy()
# Risk factors
df["fatigue_risk"] = np.clip(
df["acute_load"] / df["chronic_load"] - 1, 0, 1
)
df["asymmetry_risk"] = np.abs(
df["left_leg_load"] - df["right_leg_load"]
) / df["total_load"]
df["deceleration_risk"] = df["high_decelerations"] / 50
# Age factor (higher risk after 25)
df["age_risk"] = np.where(df["age"] > 25, 0.2, 0)
# Composite score
df["acl_risk_score"] = (
0.3 * df["fatigue_risk"] +
0.3 * df["asymmetry_risk"] +
0.2 * np.clip(df["deceleration_risk"], 0, 1) +
0.2 * df["age_risk"]
)
df["risk_level"] = pd.cut(
df["acl_risk_score"],
bins=[0, 0.4, 0.7, 1.0],
labels=["Low", "Moderate", "High"]
)
return df
def training_load_recommendations(self, player_data: pd.DataFrame) -> Dict:
"""Generate training load recommendations."""
recommendations = {}
for _, player in player_data.iterrows():
player_id = player["player_id"]
# ACWR (Acute:Chronic Workload Ratio)
acwr = player["acute_load"] / player["chronic_load"] if player["chronic_load"] > 0 else 0
if acwr > 1.5:
recommendation = "Reduce load - high injury risk zone"
elif acwr > 1.3:
recommendation = "Caution - approaching high risk"
elif acwr < 0.8:
recommendation = "Can increase load - in safe zone"
else:
recommendation = "Maintain current load - optimal zone"
recommendations[player_id] = {
"acwr": acwr,
"recommendation": recommendation,
"cycle_phase": player.get("cycle_phase", "Unknown")
}
return recommendations
# Example usage
analyzer = WomensPhysicalAnalyzer()
print("Physical performance analyzer initialized")
print(f"Position benchmarks available: {list(analyzer.benchmarks.position_benchmarks.keys())}")
# R: Physical performance analysis for women's football
library(tidyverse)
# Create reference benchmarks for women's football
create_physical_benchmarks <- function() {
# Position-based benchmarks (per 90 minutes)
benchmarks <- tribble(
~position, ~total_distance, ~high_speed_distance, ~sprint_distance, ~accelerations,
"GK", 5500, 100, 30, 15,
"CB", 9500, 300, 80, 35,
"FB", 10800, 600, 150, 50,
"CM", 11200, 500, 120, 45,
"AM", 10500, 550, 140, 55,
"W", 10200, 650, 180, 60,
"ST", 9800, 500, 130, 45
)
# Speed zone definitions (women's football specific)
speed_zones <- tribble(
~zone, ~min_speed, ~max_speed, ~description,
1, 0, 7, "Walking",
2, 7, 13, "Jogging",
3, 13, 18, "Running",
4, 18, 23, "High-speed running",
5, 23, 30, "Sprinting"
)
list(
position_benchmarks = benchmarks,
speed_zones = speed_zones
)
}
# Analyze match physical data
analyze_physical_performance <- function(player_data, benchmarks) {
player_data %>%
left_join(benchmarks$position_benchmarks, by = "position") %>%
mutate(
# Calculate percentage of benchmark
distance_pct = total_distance_actual / total_distance * 100,
hsd_pct = high_speed_actual / high_speed_distance * 100,
sprint_pct = sprint_actual / sprint_distance * 100,
# Overall physical score
physical_score = (distance_pct + hsd_pct + sprint_pct) / 3,
# Flag under/over performers
performance_level = case_when(
physical_score > 110 ~ "Exceptional",
physical_score > 100 ~ "Above Average",
physical_score > 90 ~ "Average",
physical_score > 80 ~ "Below Average",
TRUE ~ "Underperforming"
)
)
}
# Menstrual cycle tracking for load management
analyze_cycle_performance <- function(player_data, cycle_data) {
# Join with cycle phase information
combined <- player_data %>%
left_join(cycle_data, by = c("player_id", "date")) %>%
    mutate(
      # Simplified fixed 28-day boundaries; actual phase timing varies by athlete
      cycle_phase = case_when(
day_in_cycle <= 5 ~ "Menstruation",
day_in_cycle <= 14 ~ "Follicular",
day_in_cycle <= 21 ~ "Ovulation",
TRUE ~ "Luteal"
)
)
# Analyze performance by phase
phase_analysis <- combined %>%
group_by(player_id, cycle_phase) %>%
summarise(
avg_distance = mean(total_distance),
avg_sprint = mean(sprint_distance),
avg_hsd = mean(high_speed_distance),
injury_events = sum(injury_flag, na.rm = TRUE),
.groups = "drop"
)
phase_analysis
}
# ACL injury risk assessment
calculate_acl_risk <- function(player_data) {
  player_data %>%
    mutate(
      # Risk factors, each clipped to [0, 1]
      fatigue_risk = pmax(pmin(cumulative_load_7d / baseline_load - 1, 1), 0),
      asymmetry_risk = abs(left_leg_load - right_leg_load) / total_load,
      deceleration_risk = pmin(high_decelerations / 50, 1),  # per-row, not sum()
      # Age factor (higher risk after 25)
      age_risk = if_else(age > 25, 0.2, 0),
      # Composite risk score (same weights as the Python version)
      acl_risk_score =
        0.3 * fatigue_risk +
        0.3 * asymmetry_risk +
        0.2 * deceleration_risk +
        0.2 * age_risk,
      risk_level = case_when(
        acl_risk_score > 0.7 ~ "High",
        acl_risk_score > 0.4 ~ "Moderate",
        TRUE ~ "Low"
      )
    )
}
print("Physical performance analysis system ready!")
Youth Development Analytics
Youth development in women's football presents unique analytical challenges. With the sport's rapid professionalization, identifying and developing talented young players has become increasingly important for clubs and federations.
# Python: Youth development pathway analysis
import pandas as pd
import numpy as np
from dataclasses import dataclass
from typing import Dict, List, Tuple
from sklearn.preprocessing import StandardScaler
@dataclass
class DevelopmentPathway:
"""Define development pathway stages for women's football."""
stages = {
"Foundation": {"age_range": (8, 11), "focus": "Technical fundamentals"},
"Talent": {"age_range": (12, 14), "focus": "Tactical awareness"},
"Youth": {"age_range": (15, 17), "focus": "Position specialization"},
"Senior Transition": {"age_range": (18, 21), "focus": "First team integration"},
"Elite": {"age_range": (22, 35), "focus": "Peak performance"}
}
@staticmethod
def get_stage(age: int) -> str:
if age < 12:
return "Foundation"
elif age < 15:
return "Talent"
elif age < 18:
return "Youth"
elif age < 22:
return "Senior Transition"
else:
return "Elite"
class YouthDevelopmentAnalyzer:
"""Analytics for youth player development."""
def __init__(self):
self.pathway = DevelopmentPathway()
self.scaler = StandardScaler()
def track_development(self, player_history: pd.DataFrame) -> pd.DataFrame:
"""Track player development trajectory over time."""
df = player_history.sort_values("date").copy()
# Technical growth rate
df["technical_growth"] = df["technical_score"].pct_change()
# Physical development tracking
df["height_velocity"] = df["height"].diff()
# Performance trend (rolling average)
df["performance_trend"] = df["match_rating"].rolling(10, min_periods=3).mean()
# Current development phase
df["development_phase"] = df["age"].apply(self.pathway.get_stage)
# Growth spurt detection
df["growth_spurt"] = df["height_velocity"] > 0.5 # >0.5cm per measurement period
return df
def identify_talent(self, youth_data: pd.DataFrame,
age_group: str) -> pd.DataFrame:
"""Identify high-potential players within age group."""
# Filter to age group
group_data = youth_data[youth_data["age_group"] == age_group].copy()
if len(group_data) < 5:
return group_data
# Metrics to evaluate
metrics = ["technical_score", "physical_score", "tactical_score"]
# Standardize within age group
group_data[metrics] = self.scaler.fit_transform(group_data[metrics])
# Composite potential score (weighted average of z-scores)
group_data["potential_score"] = (
group_data["technical_score"] * 0.4 +
group_data["physical_score"] * 0.3 +
group_data["tactical_score"] * 0.3
)
# Classification
group_data["talent_tier"] = pd.cut(
group_data["potential_score"],
bins=[-np.inf, 0, 1, 2, np.inf],
labels=["Needs Development", "Developing", "High Potential", "Elite Prospect"]
)
return group_data.sort_values("potential_score", ascending=False)
def adjust_for_maturity(self, player_data: pd.DataFrame) -> pd.DataFrame:
"""Adjust metrics for biological maturity timing."""
df = player_data.copy()
# Bio-banding: estimate biological vs chronological age
df["bio_age_offset"] = df["estimated_bio_age"] - df["chronological_age"]
# Classify maturity timing
df["maturity_status"] = pd.cut(
df["bio_age_offset"],
bins=[-np.inf, -1, 1, np.inf],
labels=["Late Developer", "On-time", "Early Developer"]
)
# Adjust physical metrics
df["adjusted_speed"] = df["max_speed"] - (df["bio_age_offset"] * 0.5)
df["adjusted_strength"] = df["strength_score"] - (df["bio_age_offset"] * 2)
# Flag late developers (often overlooked but may have higher ceiling)
df["late_developer_flag"] = df["maturity_status"] == "Late Developer"
return df
    def predict_dropout_risk(self, player_data: pd.DataFrame) -> pd.DataFrame:
        """Predict risk of player dropping out of pathway."""
        df = player_data.copy()
        # Engagement score
        df["engagement_score"] = (
            df["attendance_rate"] * 0.3 +
            df["enjoyment_rating"] * 0.3 +
            df["progress_rating"] * 0.4
        )
        # Risk classification
        df["dropout_risk"] = pd.cut(
            df["engagement_score"],
            bins=[0, 0.4, 0.6, 1.0],
            labels=["High", "Moderate", "Low"]
        )
        # Early warning: engagement lower than three observations ago,
        # compared within each player's own date-sorted history
        df = df.sort_values("date")
        df["declining_engagement"] = (
            df["engagement_score"] < df.groupby("player_id")["engagement_score"].shift(3)
        )
        return df
def generate_development_report(self, player_id: str,
history: pd.DataFrame) -> Dict:
"""Generate comprehensive development report for player."""
player_history = history[history["player_id"] == player_id]
if len(player_history) == 0:
return {"error": "Player not found"}
latest = player_history.iloc[-1]
earliest = player_history.iloc[0]
return {
"player_id": player_id,
"current_age": latest["age"],
"current_phase": self.pathway.get_stage(latest["age"]),
"time_in_program": (latest["date"] - earliest["date"]).days / 365,
"technical_improvement": (
latest["technical_score"] - earliest["technical_score"]
) / earliest["technical_score"] * 100,
"physical_improvement": (
latest["physical_score"] - earliest["physical_score"]
) / earliest["physical_score"] * 100,
"current_talent_tier": latest.get("talent_tier", "Unknown"),
"maturity_status": latest.get("maturity_status", "Unknown"),
"dropout_risk": latest.get("dropout_risk", "Unknown"),
"recommendation": self._generate_recommendation(latest)
}
def _generate_recommendation(self, player_data: pd.Series) -> str:
"""Generate development recommendation."""
if player_data.get("talent_tier") == "Elite Prospect":
return "Consider accelerated pathway to senior team"
elif player_data.get("maturity_status") == "Late Developer":
return "Monitor closely - potential for late development surge"
elif player_data.get("dropout_risk") == "High":
return "Intervention needed - focus on engagement and enjoyment"
else:
return "Continue current development plan"
# Example usage
analyzer = YouthDevelopmentAnalyzer()
print("Youth development analyzer initialized")
print(f"Development stages: {list(analyzer.pathway.stages.keys())}")
# R: Youth development pathway analysis
library(tidyverse)
# Define development pathway stages
create_development_framework <- function() {
pathway <- tribble(
~stage, ~age_range, ~focus_areas, ~key_metrics,
"Foundation", "8-11", "Technical fundamentals, coordination", "Ball mastery tests, coordination scores",
"Talent", "12-14", "Tactical awareness, physical development", "Decision making, growth tracking",
"Youth", "15-17", "Position specialization, competition", "Match stats, physical benchmarks",
"Senior Transition", "18-21", "First team integration", "Minutes, performance ratings",
"Elite", "22+", "Peak performance optimization", "Full analytics suite"
)
pathway
}
# Track player development over time
track_player_development <- function(player_history) {
player_history %>%
arrange(date) %>%
mutate(
# Technical development
technical_growth = (technical_score - lag(technical_score)) / lag(technical_score),
      # Physical development
      height_velocity = height - lag(height), # Growth spurt detection
      # estimate_maturity() is a user-supplied helper (e.g. a Mirwald-style
      # maturity-offset calculation); it is not defined in this chapter
      physical_maturity = estimate_maturity(height, weight, age),
# Performance trajectory
performance_trend = zoo::rollmean(match_rating, k = 10, fill = NA),
# Development phase
current_phase = case_when(
age < 12 ~ "Foundation",
age < 15 ~ "Talent",
age < 18 ~ "Youth",
age < 22 ~ "Senior Transition",
TRUE ~ "Elite"
)
)
}
# Identify high-potential players
identify_talent <- function(youth_data, age_group) {
# Get age-appropriate benchmarks
benchmarks <- youth_data %>%
filter(age_group == !!age_group) %>%
summarise(across(where(is.numeric), list(mean = mean, sd = sd)))
# Score players relative to peers
youth_data %>%
filter(age_group == !!age_group) %>%
mutate(
# Standardized scores
technical_z = (technical_score - benchmarks$technical_score_mean) / benchmarks$technical_score_sd,
physical_z = (physical_score - benchmarks$physical_score_mean) / benchmarks$physical_score_sd,
tactical_z = (tactical_score - benchmarks$tactical_score_mean) / benchmarks$tactical_score_sd,
# Composite potential score
potential_score = (technical_z * 0.4 + physical_z * 0.3 + tactical_z * 0.3),
# Classification
talent_tier = case_when(
potential_score > 2 ~ "Elite Prospect",
potential_score > 1 ~ "High Potential",
potential_score > 0 ~ "Developing",
TRUE ~ "Needs Development"
)
) %>%
arrange(desc(potential_score))
}
# Maturity timing adjustment
adjust_for_maturity <- function(player_data) {
# Bio-banding approach
player_data %>%
mutate(
# Estimate biological age vs chronological age
bio_age_offset = estimated_bio_age - chronological_age,
# Adjust physical metrics for maturity
adjusted_speed = max_speed - (bio_age_offset * 0.5),
adjusted_strength = strength - (bio_age_offset * 2),
# Early/late developer classification
maturity_status = case_when(
bio_age_offset > 1 ~ "Early Developer",
bio_age_offset < -1 ~ "Late Developer",
TRUE ~ "On-time"
),
# Flag late developers for special attention
late_developer_flag = maturity_status == "Late Developer"
)
}
# Dropout risk prediction
predict_dropout_risk <- function(player_data) {
  player_data %>%
    group_by(player_id) %>%
    arrange(date, .by_group = TRUE) %>%
    mutate(
      # Risk factors
      engagement_score = attendance_rate * 0.3 + enjoyment_rating * 0.3 + progress_rating * 0.4,
      # Early warning signs (lagged within each player's own history)
      declining_attendance = attendance_rate < lag(attendance_rate, 3),
      stagnating_progress = performance_trend < lag(performance_trend, 5),
      # Dropout risk score
      dropout_risk = case_when(
        engagement_score < 0.4 ~ "High",
        engagement_score < 0.6 ~ "Moderate",
        TRUE ~ "Low"
      )
    ) %>%
    ungroup()
}
print("Youth development framework initialized!")
Key Considerations for Youth Analytics
- Bio-banding: Group players by biological maturity, not just age
- Late developers: Don't overlook late developers who may have higher ceilings
- Relative Age Effect: Track birth month distribution to ensure fair selection
- Holistic development: Technical, tactical, physical, and psychological metrics
- Dropout prevention: Engagement and enjoyment are as important as performance
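The Relative Age Effect bullet above can be checked directly: compare a squad's birth-quarter distribution against a uniform baseline. A minimal sketch in Python, assuming a hypothetical roster table with a `birth_date` column:

```python
# Relative Age Effect check: compare a squad's birth-quarter distribution
# to a uniform baseline with a chi-square test. The "birth_date" column
# name is an assumption about the roster data.
import pandas as pd
from scipy.stats import chisquare

def relative_age_effect_test(roster: pd.DataFrame) -> dict:
    """Test whether births cluster early in the selection year."""
    quarters = pd.to_datetime(roster["birth_date"]).dt.quarter
    observed = quarters.value_counts().reindex([1, 2, 3, 4], fill_value=0)
    stat, p_value = chisquare(observed)  # uniform expectation by default
    return {
        "counts": observed.to_dict(),
        "q1_share": observed[1] / observed.sum(),
        "p_value": p_value,
        # Flag only when the skew is significant AND favours Q1 births
        "possible_rae": bool(p_value < 0.05 and observed[1] == observed.max()),
    }

# Example: an age-group squad skewed toward early-year births
roster = pd.DataFrame({
    "birth_date": ["2008-01-15"] * 12 + ["2008-04-02"] * 5
                  + ["2008-08-20"] * 2 + ["2008-11-30"]
})
result = relative_age_effect_test(roster)
print(result["counts"], round(result["q1_share"], 2))
```

A flagged squad is a prompt for review, not proof of biased selection; small squads will rarely reach significance even when skewed.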
International vs. Club Performance
A unique aspect of women's football is the significant role international football plays in player visibility and recruitment. Many players perform more prominently on the international stage than at club level, and understanding the gap between the two contexts is crucial for analysts.
# Python: International vs. Club performance analysis
import pandas as pd
import numpy as np
from typing import Dict, List
class InternationalAnalyzer:
"""Analyze international vs. club performance."""
def __init__(self, fifa_rankings: pd.DataFrame = None):
self.rankings = fifa_rankings
def compare_contexts(self, player_data: pd.DataFrame) -> pd.DataFrame:
"""Compare player performance in club vs. international."""
# Aggregate by context
context_stats = player_data.groupby(["player_id", "context"]).agg({
"match_id": "count",
"goals": "sum",
"assists": "sum",
"xG": "sum",
"xA": "sum",
"minutes": "sum",
"rating": "mean"
}).reset_index()
context_stats.columns = ["player_id", "context", "matches", "goals",
"assists", "xG", "xA", "minutes", "avg_rating"]
# Per 90 metrics
context_stats["goals_p90"] = context_stats["goals"] / context_stats["minutes"] * 90
context_stats["xG_p90"] = context_stats["xG"] / context_stats["minutes"] * 90
context_stats["xA_p90"] = context_stats["xA"] / context_stats["minutes"] * 90
# Pivot to compare
comparison = context_stats.pivot(
index="player_id",
columns="context",
values=["goals_p90", "xG_p90", "avg_rating"]
)
comparison.columns = ["_".join(col) for col in comparison.columns]
comparison = comparison.reset_index()
# Calculate differences
if "goals_p90_international" in comparison.columns and "goals_p90_club" in comparison.columns:
comparison["goal_difference"] = (
comparison["goals_p90_international"] - comparison["goals_p90_club"]
)
comparison["player_type"] = np.where(
comparison["goal_difference"] > 0.1, "International Performer",
np.where(comparison["goal_difference"] < -0.1, "Club Performer", "Consistent")
)
return comparison
def adjust_for_opponent(self, player_data: pd.DataFrame) -> pd.DataFrame:
"""Adjust statistics for opponent quality."""
if self.rankings is None:
return player_data
df = player_data.merge(
self.rankings[["team", "fifa_rank"]],
left_on="opponent",
right_on="team",
how="left"
)
# Opponent factor based on FIFA ranking
def get_opponent_factor(rank):
if pd.isna(rank):
return 0.9
if rank <= 10:
return 1.5
elif rank <= 25:
return 1.2
elif rank <= 50:
return 1.0
elif rank <= 100:
return 0.8
else:
return 0.6
df["opponent_factor"] = df["fifa_rank"].apply(get_opponent_factor)
df["adjusted_xG"] = df["xG"] * df["opponent_factor"]
df["adjusted_goals"] = df["goals"] * df["opponent_factor"]
return df
def analyze_tournament(self, tournament_data: pd.DataFrame) -> pd.DataFrame:
"""Analyze tournament-specific performance."""
# Define stage importance
stage_map = {
"Group": 1,
"Round of 16": 2,
"Quarter-final": 3,
"Semi-final": 4,
"Final": 5
}
df = tournament_data.copy()
df["stage_weight"] = df["match_type"].map(stage_map).fillna(1)
# Aggregate by player
player_stats = df.groupby(["player_id", "player_name"]).agg({
"goals": "sum",
"xG": "sum",
"rating": "mean"
}).reset_index()
# Knockout stage performance
knockout = df[df["stage_weight"] >= 2].groupby("player_id").agg({
"goals": "sum",
"rating": "mean"
}).reset_index()
knockout.columns = ["player_id", "knockout_goals", "knockout_rating"]
# Group stage performance
group = df[df["stage_weight"] == 1].groupby("player_id").agg({
"rating": "mean"
}).reset_index()
group.columns = ["player_id", "group_rating"]
# Merge
player_stats = player_stats.merge(knockout, on="player_id", how="left")
player_stats = player_stats.merge(group, on="player_id", how="left")
# Clutch factor
player_stats["clutch_factor"] = (
player_stats["knockout_rating"].fillna(0) -
player_stats["group_rating"].fillna(0)
)
return player_stats.sort_values("clutch_factor", ascending=False)
    def analyze_senior_pathway(self, youth_data: pd.DataFrame) -> pd.DataFrame:
        """Analyze pathway from youth to senior international team."""
        # Named aggregation keeps the result flat and avoids fragile
        # dict-returning lambdas inside .agg()
        pathway = youth_data.groupby("player_id").agg(
            first_age=("age", "min"),
            last_age=("age", "max"),
            u17_caps=("team_level", lambda x: (x == "U17").sum()),
            u19_caps=("team_level", lambda x: (x == "U19").sum()),
            u21_caps=("team_level", lambda x: (x == "U21").sum()),
            senior_caps=("team_level", lambda x: (x == "Senior").sum()),
        ).reset_index()
        # Made senior team?
        pathway["made_senior"] = pathway["senior_caps"] > 0
        # Progression rate
        pathway["progression_rate"] = np.where(
            pathway["made_senior"],
            1 / (pathway["last_age"] - pathway["first_age"] + 1),
            0
        )
        return pathway
# Example usage
analyzer = InternationalAnalyzer()
print("International vs. Club analyzer initialized")
# R: International vs. Club performance analysis
library(tidyverse)
# Compare player performance across contexts
analyze_context_performance <- function(player_data) {
player_data %>%
group_by(player_id, context) %>% # context = "club" or "international"
summarise(
matches = n(),
goals = sum(goals),
assists = sum(assists),
xG = sum(xG),
xA = sum(xA),
avg_rating = mean(rating),
# Per 90 metrics
goals_p90 = sum(goals) / sum(minutes) * 90,
xG_p90 = sum(xG) / sum(minutes) * 90,
xA_p90 = sum(xA) / sum(minutes) * 90,
.groups = "drop"
) %>%
pivot_wider(
id_cols = player_id,
names_from = context,
values_from = c(goals_p90, xG_p90, xA_p90, avg_rating)
) %>%
mutate(
# Performance difference
goal_diff = goals_p90_international - goals_p90_club,
xG_diff = xG_p90_international - xG_p90_club,
# Classify player type
player_type = case_when(
goal_diff > 0.1 ~ "International Performer",
goal_diff < -0.1 ~ "Club Performer",
TRUE ~ "Consistent"
)
)
}
# Analyze opponent quality adjustment
adjust_for_opponent_quality <- function(player_data, team_rankings) {
player_data %>%
left_join(team_rankings, by = c("opponent" = "team")) %>%
mutate(
# Opponent strength factor (FIFA ranking-based)
opponent_factor = case_when(
fifa_rank <= 10 ~ 1.5, # Elite opposition
fifa_rank <= 25 ~ 1.2, # Strong opposition
fifa_rank <= 50 ~ 1.0, # Average opposition
fifa_rank <= 100 ~ 0.8, # Weak opposition
TRUE ~ 0.6 # Very weak opposition
),
# Adjusted metrics
adjusted_xG = xG * opponent_factor,
adjusted_goals = goals * opponent_factor
)
}
# Tournament performance analysis
analyze_tournament_performance <- function(tournament_data) {
tournament_data %>%
mutate(
# Tournament stage
stage = case_when(
match_type == "Final" ~ 5,
match_type == "Semi-final" ~ 4,
match_type == "Quarter-final" ~ 3,
match_type == "Round of 16" ~ 2,
TRUE ~ 1 # Group stage
)
) %>%
group_by(player_id, player_name) %>%
summarise(
# Overall stats
total_goals = sum(goals),
total_xG = sum(xG),
# Big game performance
knockout_goals = sum(goals[stage >= 2]),
knockout_xG = sum(xG[stage >= 2]),
# Clutch factor
high_pressure_rating = mean(rating[stage >= 3], na.rm = TRUE),
group_stage_rating = mean(rating[stage == 1], na.rm = TRUE),
clutch_factor = high_pressure_rating - group_stage_rating,
.groups = "drop"
) %>%
arrange(desc(clutch_factor))
}
# National team pipeline analysis
analyze_pathway_to_senior <- function(youth_international_data) {
youth_international_data %>%
arrange(player_id, age) %>%
group_by(player_id) %>%
summarise(
# Youth international history
u17_caps = sum(team_level == "U17"),
u19_caps = sum(team_level == "U19"),
u21_caps = sum(team_level == "U21"),
senior_caps = sum(team_level == "Senior"),
      # Age of first senior cap (Inf, without a warning, when no senior caps yet)
      first_senior_age = suppressWarnings(min(age[team_level == "Senior"], na.rm = TRUE)),
# Made senior team?
made_senior = senior_caps > 0,
.groups = "drop"
) %>%
mutate(
# Pathway analysis
pathway_length = first_senior_age - 17,
pathway_type = case_when(
first_senior_age < 20 ~ "Fast Track",
first_senior_age < 23 ~ "Standard",
first_senior_age < 26 ~ "Late Bloomer",
TRUE ~ "Never Progressed"
)
)
}
print("International performance analysis ready!")
Key Considerations for International Analysis
- Visibility factor: World Cups and continental championships often provide the main exposure for players from smaller leagues
- Opponent quality: International matches against lower-ranked nations may inflate statistics
- Team dynamics: Some players thrive in national team systems that differ from their clubs
- Sample size: International data is limited (10-15 matches per year maximum)
- Tournament performance: Big tournament performers may command premium valuations
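The sample-size caveat above can be handled with simple shrinkage: blend a player's observed international per-90 rate with a squad-wide prior, weighted by minutes played. A sketch with illustrative numbers (the 900-minute prior strength and 0.35 squad rate are assumptions to tune, not established values):

```python
# Small-sample stabilization for international per-90 rates: shrink each
# player's observed rate toward a squad-wide prior, weighted by minutes.
import pandas as pd

def shrunk_rate_p90(goals: float, minutes: float, prior_rate_p90: float,
                    prior_minutes: float = 900.0) -> float:
    """Blend an observed per-90 rate with a prior, weighted by sample size."""
    prior_goals = prior_rate_p90 * prior_minutes / 90
    return (goals + prior_goals) / (minutes + prior_minutes) * 90

players = pd.DataFrame({
    "player": ["A", "B"],
    "goals": [4, 4],
    "minutes": [360, 2700],  # 4 full matches vs. 30
})
squad_rate = 0.35  # hypothetical squad-wide goals per 90

players["raw_p90"] = players["goals"] / players["minutes"] * 90
players["shrunk_p90"] = [
    shrunk_rate_p90(g, m, squad_rate)
    for g, m in zip(players["goals"], players["minutes"])
]
# Player A's 1.0 goals/90 from 360 minutes is pulled strongly toward the
# prior; player B's larger sample barely moves.
print(players[["player", "raw_p90", "shrunk_p90"]].round(3))
```

With only 10-15 internationals per year, the shrunk figures are far safer inputs to recruitment models than raw per-90 rates.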
Growth and Opportunities in Women's Football Analytics
The women's football analytics landscape presents unique opportunities for analysts and data scientists. The rapid professionalization of the game means there's significant room to make an impact.
- Clubs building dedicated women's analytics departments
- Federations investing in national team analytics
- Media companies seeking women's football content
- Data providers expanding coverage
- Academic research gaining momentum
- Women's football-specific xG models
- Physical performance benchmarks
- Youth development pathways
- Cross-league player comparison
- Commercial analytics for growing the game
# Analyzing Growth of Women's Football
import pandas as pd
import matplotlib.pyplot as plt
# Sample data: Growth metrics over time
growth_data = pd.DataFrame({
"year": [2019, 2020, 2021, 2022, 2023] * 2,
"metric": ["WSL Average Attendance"] * 5 + ["NWSL Average Attendance"] * 5,
"value": [3048, 2847, 3523, 6744, 8134, 7337, 0, 7843, 10628, 11276]
})
# Filter out COVID year
growth_data = growth_data[growth_data["value"] > 0]
# Calculate growth rates
def calculate_growth(group):
group = group.sort_values("year")
group["yoy_growth"] = group["value"].pct_change() * 100
group["cumulative_growth"] = (
(group["value"] - group["value"].iloc[0]) / group["value"].iloc[0] * 100
)
return group
growth_analysis = growth_data.groupby("metric").apply(calculate_growth).reset_index(drop=True)
print("Growth Analysis:")
print(growth_analysis)
# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
for metric in growth_data["metric"].unique():
data = growth_data[growth_data["metric"] == metric]
ax.plot(data["year"], data["value"], marker="o",
linewidth=2, markersize=8, label=metric)
ax.set_xlabel("Year")
ax.set_ylabel("Average Attendance")
ax.set_title("Growth of Women's Football Attendance")
ax.legend()
ax.grid(True, alpha=0.3)
# Format y-axis
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: format(int(x), ",")))
plt.tight_layout()
plt.show()
# Investment and coverage growth
investment_data = pd.DataFrame({
"category": ["Broadcast Deals (M)", "Club Budgets (avg M)",
"Data Coverage (leagues)", "Analytics Staff (per club)"],
"y2019": [5, 2, 5, 0.5],
"y2023": [50, 8, 15, 2.5]
})
investment_data["growth_pct"] = (
(investment_data["y2023"] - investment_data["y2019"]) /
investment_data["y2019"] * 100
)
print("\nWomen's Football Growth 2019-2023:")
print(investment_data.to_string(index=False))
# Analyzing Growth of Women's Football
library(tidyverse)
# Sample data: Growth metrics over time
growth_data <- tribble(
~year, ~metric, ~value,
2019, "WSL Average Attendance", 3048,
2020, "WSL Average Attendance", 2847,
2021, "WSL Average Attendance", 3523,
2022, "WSL Average Attendance", 6744,
2023, "WSL Average Attendance", 8134,
2019, "NWSL Average Attendance", 7337,
2020, "NWSL Average Attendance", 0,
2021, "NWSL Average Attendance", 7843,
2022, "NWSL Average Attendance", 10628,
2023, "NWSL Average Attendance", 11276
)
# Calculate growth rates
growth_analysis <- growth_data %>%
filter(value > 0) %>%
group_by(metric) %>%
arrange(year) %>%
mutate(
yoy_growth = (value - lag(value)) / lag(value) * 100,
cumulative_growth = (value - first(value)) / first(value) * 100
)
# Visualization
ggplot(growth_data %>% filter(value > 0),
aes(x = year, y = value, color = metric)) +
geom_line(linewidth = 1.5) +
geom_point(size = 3) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Growth of Women's Football Attendance",
x = "Year",
y = "Average Attendance",
color = "League"
) +
theme_minimal() +
theme(legend.position = "bottom")
# Investment and coverage growth
investment_data <- tribble(
~category, ~y2019, ~y2023, ~growth_pct,
"Broadcast Deals (M)", 5, 50, 900,
"Club Budgets (avg M)", 2, 8, 300,
"Data Coverage (leagues)", 5, 15, 200,
"Analytics Staff (per club)", 0.5, 2.5, 400
)
print("Women's Football Growth 2019-2023:")
print(investment_data)
Resources for Women's Football Analytics
- StatsBomb Open Data: free WWC and select league data
- FBref: comprehensive women's league stats
- Wyscout: video and data platform
- Women's Football Analytics Twitter/X community
- Women in Football organizations
- STATS Perform Women's Football Podcast
- OptaPro Forum sessions on women's football
- Club analyst positions (growing)
- Federation analytics roles
- Media and journalism
- Data provider positions
Practice Exercises
Exercise 44.1: Build a Women's xG Model
Using StatsBomb's free Women's World Cup data, build an expected goals model specifically calibrated for women's football. Compare your model's predictions to StatsBomb's xG values and analyze any systematic differences.
- Consider whether shot distance and angle relationships differ
- Analyze header conversion rates compared to men's football benchmarks
- Test if goalkeeper characteristics affect xG differently
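As a starting point for this exercise, a distance-and-angle logistic regression is the standard baseline xG model. The sketch below trains on synthetic shots so it runs standalone; for the exercise, swap in real StatsBomb shot locations and outcomes (StatsBomb pitches are 120x80, with the goal line at x = 120 and posts at y = 36 and y = 44):

```python
# Baseline xG model: logistic regression on shot distance and goalmouth
# angle. Outcomes here are synthetic, generated from an assumed relationship
# so the sketch is self-contained.
import numpy as np
from sklearn.linear_model import LogisticRegression

def shot_features(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Distance to goal centre and angle subtended by the goalmouth."""
    dist = np.hypot(120 - x, 40 - y)
    angle = np.abs(np.arctan2(44 - y, 120 - x) - np.arctan2(36 - y, 120 - x))
    return np.column_stack([dist, angle])

rng = np.random.default_rng(0)
n = 2000
sx, sy = rng.uniform(90, 120, n), rng.uniform(20, 60, n)
X = shot_features(sx, sy)
# Synthetic outcomes: scoring probability rises with angle, falls with distance
true_logit = -1.0 - 0.12 * X[:, 0] + 2.5 * X[:, 1]
goals = rng.random(n) < 1 / (1 + np.exp(-true_logit))

model = LogisticRegression(max_iter=1000).fit(X, goals)
penalty_spot = shot_features(np.array([108.0]), np.array([40.0]))
print(f"Modelled xG from the penalty spot: {model.predict_proba(penalty_spot)[0, 1]:.2f}")
```

Fitting the same specification separately to women's and men's shot data, then comparing coefficients, is one way to surface the systematic differences the exercise asks about.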
Exercise 44.2: Cross-League Player Comparison
Develop a system to compare players across different women's leagues (e.g., WSL, NWSL, Liga F). Account for league quality differences and create adjusted metrics that allow fair comparison.
- Use international match performance as a common baseline
- Create league strength coefficients based on UEFA/FIFA rankings
- Consider opponent quality in domestic matches
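One simple way to operationalize the league-strength hint: scale per-90 metrics by a league coefficient before ranking players across leagues. The coefficients below are illustrative placeholders, not measured values:

```python
# Cross-league comparison sketch: multiply per-90 output by an assumed
# league strength coefficient. All coefficient values are placeholders.
import pandas as pd

LEAGUE_STRENGTH = {"WSL": 1.00, "NWSL": 1.00, "Liga F": 0.95, "Frauen-Bundesliga": 0.95}

def cross_league_adjust(players: pd.DataFrame,
                        strength: dict = LEAGUE_STRENGTH) -> pd.DataFrame:
    """Scale per-90 metrics by league strength for cross-league comparison."""
    df = players.copy()
    # Leagues missing from the table get a conservative discount
    df["league_coef"] = df["league"].map(strength).fillna(0.85)
    for col in ["goals_p90", "xG_p90"]:
        df[f"adj_{col}"] = df[col] * df["league_coef"]
    return df.sort_values("adj_goals_p90", ascending=False)

players = pd.DataFrame({
    "player": ["P1", "P2", "P3"],
    "league": ["WSL", "Liga F", "Serie A Femminile"],
    "goals_p90": [0.50, 0.55, 0.70],
    "xG_p90": [0.45, 0.50, 0.60],
})
print(cross_league_adjust(players)[["player", "league_coef", "adj_goals_p90"]])
```

For the exercise proper, replace the hand-set coefficients with ones estimated from international results or from players who have moved between leagues.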
Exercise 44.3: Team Pressing Analysis
Analyze pressing patterns for teams in the Women's World Cup. Identify the most effective pressing teams and determine what tactical factors contribute to pressing success.
- Calculate PPDA (Passes Per Defensive Action)
- Measure high press frequency and success rate
- Correlate pressing metrics with match outcomes
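The PPDA hint can be sketched in a few lines: opponent passes in their own build-up area divided by the pressing team's defensive actions in the mirrored zone. This assumes a StatsBomb-style events table with `team`, `type`, and `x` columns, where x runs 0-120 in each acting team's attacking direction:

```python
# Minimal PPDA calculation. The defensive-action set and the 60% zone
# boundary follow common convention; adjust both to taste.
import pandas as pd

DEFENSIVE_ACTIONS = {"Pressure", "Duel", "Interception", "Foul Committed"}

def ppda(events: pd.DataFrame, team: str) -> float:
    """Passes Per Defensive Action; lower means more intense pressing."""
    # Opponent passes in their own defensive 60% (x <= 72 in their frame)
    opp_passes = ((events["team"] != team) &
                  (events["type"] == "Pass") &
                  (events["x"] <= 72)).sum()
    # Our defensive actions high up the pitch (x >= 48 in our frame,
    # which mirrors the opponent's x <= 72)
    def_actions = ((events["team"] == team) &
                   events["type"].isin(DEFENSIVE_ACTIONS) &
                   (events["x"] >= 48)).sum()
    return opp_passes / def_actions if def_actions else float("inf")

# Toy match slice: 20 opponent build-up passes, 4 of our high defensive actions
events = pd.DataFrame({
    "team": ["Them"] * 20 + ["Us"] * 5,
    "type": ["Pass"] * 20 + ["Pressure", "Pressure", "Duel", "Interception", "Pass"],
    "x": [50] * 20 + [60, 70, 55, 80, 90],
})
print(f"PPDA: {ppda(events, 'Us'):.1f}")  # 20 passes / 4 actions
```

Computing this per team per match, then correlating with results, covers the first and third bullets of the exercise.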
Summary
Key Takeaways
- Growing Data Ecosystem: Women's football data coverage has expanded significantly, with major providers now covering top leagues and tournaments
- Unique Considerations: Analytics models should be calibrated specifically for women's football, not simply adapted from men's football
- Recruitment Opportunities: The professionalization of women's football creates demand for sophisticated player evaluation and scouting systems
- Career Growth: Analysts have significant opportunities to make an impact in a rapidly developing field
- Community Building: Contributing to women's football analytics helps build the sport and creates pathways for future analysts
Women's football analytics represents one of the most exciting frontiers in the field. With increasing investment, growing data availability, and passionate communities, analysts have unprecedented opportunities to contribute to the development of the women's game while building rewarding careers.