Chapter 60

Capstone - Complete Analytics System

Learning Objectives
  • Understand projection frameworks and their applications in football
  • Build aging curves for different metrics and positions
  • Apply regression to the mean in performance projections
  • Create statistical projection systems using Marcel-style methods
  • Implement machine learning approaches for player projections
  • Quantify uncertainty in projections with confidence intervals
  • Project season totals from partial season data
  • Model career trajectories and development paths

Introduction to Performance Projections

Player performance projections are essential for transfer decisions, contract negotiations, squad planning, and fantasy football. By combining historical performance data with aging curves, regression principles, and contextual factors, we can build systems that forecast future performance with quantified uncertainty.

Why Project Performance?

A player's past performance is not a perfect predictor of future results. Young players improve, veterans decline, and all players experience random variance. Projection systems help separate signal from noise and estimate the true underlying talent level that drives future performance.

projection_framework.py
# Python: Player Projection Framework Overview
import pandas as pd
import numpy as np
from scipy import stats

# Define projection system components
projection_components = {
    "historical_performance": "Weighted average of past seasons",
    "aging_curve": "Expected change based on age",
    "regression": "Regression to the mean based on sample size",
    "context": "League/team quality adjustments",
    "uncertainty": "Confidence intervals for projections"
}

# Sample projection workflow (the helper functions are implemented later in this chapter)
def project_player(player_history, metric="npxg_90"):
    """Project future performance for a player."""
    # 1. Weight recent seasons more heavily
    weighted_avg = calculate_weighted_average(player_history, metric)

    # 2. Apply aging curve adjustment
    age_adjusted = apply_aging_curve(weighted_avg, player_history["age"])

    # 3. Regress to the mean based on minutes played
    regressed = regress_to_mean(age_adjusted, player_history["minutes"])

    # 4. Calculate projection uncertainty
    uncertainty = calculate_uncertainty(player_history, metric)

    return {
        "projection": regressed,
        "lower_95": regressed - 1.96 * uncertainty,
        "upper_95": regressed + 1.96 * uncertainty
    }

print("Projection Framework Components:")
for name, description in projection_components.items():
    print(f"- {name}: {description}")
# R: Player Projection Framework Overview
library(tidyverse)
library(mgcv)

# Define projection system components
projection_components <- list(
    historical_performance = "Weighted average of past seasons",
    aging_curve = "Expected change based on age",
    regression = "Regression to the mean based on sample size",
    context = "League/team quality adjustments",
    uncertainty = "Confidence intervals for projections"
)

# Sample projection workflow (the helper functions are implemented later in this chapter)
project_player <- function(player_history, metric = "npxg_90") {
    # 1. Weight recent seasons more heavily
    weighted_avg <- calculate_weighted_average(player_history, metric)

    # 2. Apply aging curve adjustment
    age_adjusted <- apply_aging_curve(weighted_avg, player_history$age)

    # 3. Regress to the mean based on minutes played
    regressed <- regress_to_mean(age_adjusted, player_history$minutes)

    # 4. Calculate projection uncertainty
    uncertainty <- calculate_uncertainty(player_history, metric)

    return(list(
        projection = regressed,
        lower_95 = regressed - 1.96 * uncertainty,
        upper_95 = regressed + 1.96 * uncertainty
    ))
}

cat("Projection Framework Components:\n")
for (name in names(projection_components)) {
    cat(sprintf("- %s: %s\n", name, projection_components[[name]]))
}
Output
Projection Framework Components:
- historical_performance: Weighted average of past seasons
- aging_curve: Expected change based on age
- regression: Regression to the mean based on sample size
- context: League/team quality adjustments
- uncertainty: Confidence intervals for projections

Building Aging Curves

Aging curves model how player performance changes over their career. Different metrics peak at different ages, and positions show distinct patterns. Building accurate aging curves requires careful handling of survivorship bias.
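
The delta method used in the next code file exists precisely because of survivorship bias: weaker players drop out of the data at older ages, so a naive average of everyone observed at each age understates decline. A minimal synthetic sketch of the effect (all numbers invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic careers: talent varies across players, everyone declines by
# 0.02 npxG/90 per season, and seasons below a cutoff drop out of the
# sample, as weaker players leave the league.
rows = []
for pid in range(300):
    talent = rng.normal(0.30, 0.08)
    for age in range(27, 33):
        value = talent - 0.02 * (age - 27) + rng.normal(0, 0.03)
        rows.append({"player_id": pid, "age": age, "npxg_90": value})
survivors = pd.DataFrame(rows).query("npxg_90 > 0.20")

# Naive cross-sectional curve: mean at each age. Selection props up the
# older ages, so the apparent decline is flatter than the true 0.02/season.
naive = survivors.groupby("age")["npxg_90"].mean()

# Delta method: average within-player change between consecutive seasons
paired = survivors.sort_values(["player_id", "age"]).copy()
paired["delta"] = paired.groupby("player_id")["npxg_90"].diff()
paired["gap"] = paired.groupby("player_id")["age"].diff()
delta = paired.loc[paired["gap"] == 1, "delta"].mean()

print((naive.loc[27] - naive.loc[32]) / 5)  # naive per-season decline
print(-delta)                               # delta-method per-season decline
```

The within-player delta recovers a decline much closer to the truth, which is why the chapter's aging curves are built from player-season pairs rather than cross-sectional averages.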

  • Physical metrics (peak 24-26): sprint speed, distance covered
  • Technical metrics (peak 26-29): xG, xA, key passes
  • Decision metrics (peak 28-32): pass completion, positioning

aging_curves.py
# Python: Building Aging Curves with Delta Method
import pandas as pd
import numpy as np
from scipy.interpolate import UnivariateSpline

# Load multi-season player data
player_seasons = pd.read_csv("player_season_data.csv")

def build_aging_curve_delta(data, metric):
    """Build aging curve using the delta method."""
    # Sort by player and season
    data = data.sort_values(["player_id", "season"])

    # Create player-season pairs
    paired = []
    for player_id, group in data.groupby("player_id"):
        group = group.reset_index(drop=True)
        for i in range(len(group) - 1):
            current = group.iloc[i]
            next_row = group.iloc[i + 1]
            delta = next_row[metric] - current[metric]
            paired.append({
                "player_id": player_id,
                "age": current["age"],
                "current_value": current[metric],
                "next_value": next_row[metric],
                "delta": delta
            })

    paired_df = pd.DataFrame(paired)

    # Calculate average delta at each age
    age_deltas = paired_df.groupby("age").agg(
        mean_delta=("delta", "mean"),
        se_delta=("delta", lambda x: x.std() / np.sqrt(len(x))),
        n=("delta", "count")
    ).reset_index()

    return age_deltas

# Build curve for npxG/90
npxg_aging = build_aging_curve_delta(player_seasons, "npxg_90")

def fit_smooth_aging(age_deltas):
    """Fit smooth aging curve with spline."""
    # Fit weighted spline
    spline = UnivariateSpline(
        age_deltas["age"],
        age_deltas["mean_delta"],
        w=np.sqrt(age_deltas["n"]),
        s=0.5
    )

    # Predict across age range
    ages = np.arange(18, 39)
    predicted_delta = spline(ages)

    # Cumulate to get aging curve
    cumulative = np.cumsum(predicted_delta)
    peak_value = cumulative.max()
    relative = cumulative - peak_value

    return pd.DataFrame({
        "age": ages,
        "predicted_delta": predicted_delta,
        "cumulative": cumulative,
        "relative": relative
    })

npxg_curve = fit_smooth_aging(npxg_aging)
print(npxg_curve)
# R: Building Aging Curves with Delta Method
library(tidyverse)
library(mgcv)

# Load multi-season player data
player_seasons <- read_csv("player_season_data.csv")

# Delta method: compare same player across seasons
build_aging_curve_delta <- function(data, metric) {
    # Create player-season pairs
    paired <- data %>%
        group_by(player_id) %>%
        arrange(season) %>%
        mutate(
            next_age = lead(age),
            next_value = lead(!!sym(metric)),
            delta = next_value - !!sym(metric)
        ) %>%
        filter(!is.na(delta)) %>%
        ungroup()

    # Calculate average delta at each age
    age_deltas <- paired %>%
        group_by(age) %>%
        summarize(
            mean_delta = mean(delta),
            se_delta = sd(delta) / sqrt(n()),
            n = n()
        )

    return(age_deltas)
}

# Build curve for npxG/90
npxg_aging <- build_aging_curve_delta(player_seasons, "npxg_90")

# Fit smooth curve using GAM
fit_smooth_aging <- function(age_deltas) {
    gam_fit <- gam(mean_delta ~ s(age, k = 10),
                   data = age_deltas,
                   weights = n)

    # Predict across age range
    prediction_ages <- data.frame(age = 18:38)
    prediction_ages$predicted_delta <- predict(gam_fit, prediction_ages)

    # Cumulate to get aging curve (relative to peak)
    prediction_ages$cumulative <- cumsum(prediction_ages$predicted_delta)
    peak_value <- max(prediction_ages$cumulative)
    prediction_ages$relative <- prediction_ages$cumulative - peak_value

    return(prediction_ages)
}

npxg_curve <- fit_smooth_aging(npxg_aging)
print(npxg_curve)
Output
    age  predicted_delta  cumulative  relative
0    18         0.0321       0.032    -0.247
1    19         0.0285       0.061    -0.219
2    20         0.0241       0.085    -0.195
3    21         0.0198       0.105    -0.175
4    22         0.0156       0.120    -0.159
...
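
One hedged way to read the fitted table: the expected change from age a to a+1 is the difference between consecutive `relative` values, since the curve values themselves are levels relative to the peak rather than per-season changes. A small sketch with invented curve values in the same format `fit_smooth_aging` returns:

```python
import pandas as pd

# Hypothetical curve values; "relative" is 0 at the peak age (27 here)
curve = pd.DataFrame({
    "age": [25, 26, 27, 28],
    "relative": [-0.020, -0.005, 0.000, -0.008],
})

def age_step_adjustment(curve, age):
    """Expected change in the metric from `age` to `age + 1`."""
    rel = curve.set_index("age")["relative"]
    return rel.loc[age + 1] - rel.loc[age]

# A 26-year-old still improves slightly; a 27-year-old begins to decline
print(age_step_adjustment(curve, 26))  # 0.005
print(age_step_adjustment(curve, 27))  # -0.008
```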

Position-Specific Aging Curves

position_aging.py
# Python: Position-Specific Aging Analysis
# Different positions age differently

def build_position_curves(data):
    """Build aging curves for each position and metric."""
    positions = ["Forward", "Midfielder", "Defender", "Goalkeeper"]
    metrics = ["npxg_90", "xa_90", "progressive_carries", "aerial_wins"]

    curves = {}

    for pos in positions:
        pos_data = data[data["position"] == pos]
        curves[pos] = {}

        for metric in metrics:
            if metric in pos_data.columns:
                age_deltas = build_aging_curve_delta(pos_data, metric)
                if len(age_deltas) > 5:  # Need sufficient data
                    curves[pos][metric] = fit_smooth_aging(age_deltas)

    return curves

def find_peak_ages(curves):
    """Find peak age for each position-metric combination."""
    results = []

    for pos, metrics in curves.items():
        for metric, curve in metrics.items():
            peak_idx = curve["cumulative"].idxmax()
            peak_age = curve.loc[peak_idx, "age"]

            results.append({
                "position": pos,
                "metric": metric,
                "peak_age": peak_age
            })

    return pd.DataFrame(results)

position_curves = build_position_curves(player_seasons)
peak_ages = find_peak_ages(position_curves)
print(peak_ages)
# R: Position-Specific Aging Analysis
# Different positions age differently
build_position_curves <- function(data) {
    positions <- c("Forward", "Midfielder", "Defender", "Goalkeeper")
    metrics <- c("npxg_90", "xa_90", "progressive_carries", "aerial_wins")

    curves <- list()

    for (pos in positions) {
        pos_data <- data %>% filter(position == pos)
        curves[[pos]] <- list()

        for (metric in metrics) {
            # Skip metrics that are missing or too thin for this position
            if (!metric %in% colnames(pos_data)) next
            age_deltas <- build_aging_curve_delta(pos_data, metric)
            if (nrow(age_deltas) > 5) {  # Need sufficient data
                curves[[pos]][[metric]] <- fit_smooth_aging(age_deltas)
            }
        }
    }

    return(curves)
}

# Get peak ages by position and metric
find_peak_ages <- function(curves) {
    results <- data.frame()

    for (pos in names(curves)) {
        for (metric in names(curves[[pos]])) {
            curve <- curves[[pos]][[metric]]
            peak_age <- curve$age[which.max(curve$cumulative)]

            results <- rbind(results, data.frame(
                position = pos,
                metric = metric,
                peak_age = peak_age
            ))
        }
    }

    return(results)
}

position_curves <- build_position_curves(player_seasons)
peak_ages <- find_peak_ages(position_curves)
print(peak_ages)
Output
      position              metric  peak_age
0      Forward             npxg_90        27
1      Forward               xa_90        26
2      Forward  progressive_carries        25
3   Midfielder             npxg_90        28
4   Midfielder               xa_90        28
5   Midfielder  progressive_carries        27
6     Defender         aerial_wins        29
...

Regression to the Mean

Players with extreme performances tend to move toward average in subsequent seasons. The amount of regression depends on sample size and the stability of the metric. Understanding regression is crucial for avoiding overpaying for outlier seasons.

Key Insight

A player who scored 20 goals last season with 15 xG likely benefited from luck. Projecting 20 goals again would ignore the high probability of regression. Similarly, a player who scored 8 goals with 12 xG likely underperformed and may improve.
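
That intuition can be turned into a hedged back-of-envelope projection by blending goals with the more stable xG, weighting goals by roughly their year-to-year correlation (about 0.42 in the stability figures later in this section):

```python
def blend_goals_xg(goals, xg, goals_reliability=0.42):
    """Shrink observed goals toward xG; the weight reflects how repeatable
    finishing is year to year (0.42 here is illustrative)."""
    return goals_reliability * goals + (1 - goals_reliability) * xg

# Overperformer: 20 goals on 15 xG -> project closer to 17 than to 20
print(blend_goals_xg(20, 15))  # approximately 17.1
# Underperformer: 8 goals on 12 xG -> project improvement
print(blend_goals_xg(8, 12))   # approximately 10.3
```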

regression_to_mean.py
# Python: Regression to the Mean Framework
import pandas as pd
import numpy as np

def calculate_regression_factor(data, metric, min_minutes=900):
    """Calculate year-to-year correlation and regression factor."""
    # Filter for sufficient playing time
    qualified = data[data["minutes"] >= min_minutes].copy()

    # Create consecutive season pairs
    qualified = qualified.sort_values(["player_id", "season"])
    qualified["next_value"] = qualified.groupby("player_id")[metric].shift(-1)
    paired = qualified.dropna(subset=["next_value"])

    # Calculate year-to-year correlation
    correlation = paired[metric].corr(paired["next_value"])

    # Regression factor = 1 - correlation
    regression_factor = 1 - correlation

    return {
        "correlation": correlation,
        "regression_factor": regression_factor,
        "n_pairs": len(paired)
    }

# Calculate for different metrics
metrics = ["goals_90", "npxg_90", "xa_90", "pass_completion", "tackles_90"]

regression_results = []
for m in metrics:
    result = calculate_regression_factor(player_seasons, m)
    regression_results.append({
        "metric": m,
        "correlation": result["correlation"],
        "regression_factor": result["regression_factor"],
        "n": result["n_pairs"]
    })

regression_factors = pd.DataFrame(regression_results)
print(regression_factors)

def regress_to_mean(observed, league_mean, reliability, sample_size,
                    typical_sample=2000):
    """Apply regression to the mean for projection."""
    # Reliability increases with sample size
    adjusted_reliability = reliability * min(1, sample_size / typical_sample)

    # Regressed estimate
    projection = (adjusted_reliability * observed) + \
                 ((1 - adjusted_reliability) * league_mean)

    return projection

# Example: Player with 0.50 npxG/90 in 1500 minutes
regressed_npxg = regress_to_mean(
    observed=0.50,
    league_mean=0.30,
    reliability=0.70,
    sample_size=1500
)

print(f"\nObserved: 0.50 npxG/90")
print(f"Regressed projection: {regressed_npxg:.3f} npxG/90")
# R: Regression to the Mean Framework
library(tidyverse)

# Calculate regression amount for different metrics
calculate_regression_factor <- function(data, metric, min_minutes = 900) {
    # Filter for sufficient playing time
    qualified <- data %>%
        filter(minutes >= min_minutes)

    # Split into consecutive seasons
    paired <- qualified %>%
        group_by(player_id) %>%
        arrange(season) %>%
        mutate(next_value = lead(!!sym(metric))) %>%
        filter(!is.na(next_value)) %>%
        ungroup()

    # Calculate year-to-year correlation
    correlation <- cor(paired[[metric]], paired$next_value)

    # Regression factor = 1 - correlation
    # Higher correlation = less regression needed
    regression_factor <- 1 - correlation

    return(list(
        correlation = correlation,
        regression_factor = regression_factor,
        n_pairs = nrow(paired)
    ))
}

# Calculate for different metrics
metrics <- c("goals_90", "npxg_90", "xa_90", "pass_completion", "tackles_90")

regression_factors <- map_df(metrics, function(m) {
    result <- calculate_regression_factor(player_seasons, m)
    data.frame(
        metric = m,
        correlation = result$correlation,
        regression_factor = result$regression_factor,
        n = result$n_pairs
    )
})

print(regression_factors)

# Apply regression to a player projection
regress_to_mean <- function(observed, league_mean, reliability, sample_size,
                           typical_sample = 2000) {
    # Reliability increases with sample size
    adjusted_reliability <- reliability * min(1, sample_size / typical_sample)

    # Regressed estimate
    projection <- (adjusted_reliability * observed) +
                  ((1 - adjusted_reliability) * league_mean)

    return(projection)
}

# Example: Player with 0.50 npxG/90 in 1500 minutes
# League average: 0.30 npxG/90
# Metric reliability: 0.70
regressed_npxg <- regress_to_mean(
    observed = 0.50,
    league_mean = 0.30,
    reliability = 0.70,
    sample_size = 1500
)

cat(sprintf("Observed: 0.50 npxG/90\nRegressed projection: %.3f npxG/90\n", regressed_npxg))
Output
            metric  correlation  regression_factor      n
0         goals_90        0.42              0.58   3421
1          npxg_90        0.71              0.29   3421
2            xa_90        0.68              0.32   3421
3  pass_completion        0.82              0.18   3421
4       tackles_90        0.58              0.42   3421

Observed: 0.50 npxG/90
Regressed projection: 0.405 npxG/90

Metric Stability Hierarchy

Different metrics have different signal-to-noise ratios. Stable metrics require less regression:

  • High stability (Y-t-Y correlation 0.75-0.85, regress 15-25%): pass completion, touches, progressive passes
  • Medium-high (0.65-0.75, regress 25-35%): npxG/90, xA/90, shot volume
  • Medium (0.55-0.65, regress 35-45%): tackles, interceptions, pressures
  • Low (0.35-0.55, regress 45-65%): goals, assists, G-xG, A-xA
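
The hierarchy maps directly onto the shrinkage formula used throughout the chapter: the regression amount is simply one minus the year-to-year correlation. A minimal sketch using two correlations from the table above:

```python
def regress(observed, league_mean, ytoy_corr):
    """Shrink an observed rate toward the league mean; the regression
    amount is 1 - r, so stable metrics move less."""
    return ytoy_corr * observed + (1 - ytoy_corr) * league_mean

# Stable metric (pass completion, r ~ 0.82): only 18% shrinkage
print(regress(88.0, 80.0, 0.82))  # approximately 86.6
# Noisy metric (goals per 90, r ~ 0.42): 58% shrinkage
print(regress(0.60, 0.35, 0.42))  # approximately 0.455
```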

Marcel-Style Projection System

The Marcel projection system, originally developed for baseball, provides a simple but effective framework. It weights recent seasons, applies aging curves, and regresses to the mean. We adapt it for football analytics.

marcel_projections.py
# Python: Marcel-Style Football Projection System
import pandas as pd
import numpy as np

# Marcel weights: 5/4/3 for most recent three seasons
MARCEL_WEIGHTS = [5, 4, 3]

def calculate_marcel_projection(player_history, metric, league_avg,
                                 aging_curve, reliability):
    """Calculate Marcel-style projection for a player."""
    # Sort by recency (most recent first)
    history = player_history.sort_values("season", ascending=False).head(3)
    n_seasons = len(history)

    if n_seasons == 0:
        return {"projection": league_avg, "uncertainty": np.nan}

    # Apply Marcel weights
    weights = np.array(MARCEL_WEIGHTS[:n_seasons])
    weights = weights / weights.sum()  # Normalize

    # Weighted average of past performance
    weighted_avg = (history[metric].values * weights).sum()

    # Total weighted minutes for regression
    total_minutes = (history["minutes"].values * weights).sum()

    # Current age and aging adjustment
    current_age = history["age"].max() + 1  # Project for next season
    age_row = aging_curve[aging_curve["age"] == current_age]
    age_adjustment = age_row["relative"].values[0] if len(age_row) > 0 else 0

    # Age-adjusted weighted average
    age_adjusted = weighted_avg + age_adjustment

    # Regression to the mean
    regression_weight = min(1, total_minutes / 3000) * reliability
    projection = (regression_weight * age_adjusted) + \
                 ((1 - regression_weight) * league_avg)

    # Uncertainty based on sample size and metric stability
    uncertainty = history[metric].std() / np.sqrt(n_seasons) * \
                  (1 + 0.5 * (1 - regression_weight))

    return {
        "projection": projection,
        "uncertainty": uncertainty,
        "weighted_avg": weighted_avg,
        "age_adjustment": age_adjustment,
        "regression_weight": regression_weight
    }

# Example projection
player_history = pd.DataFrame({
    "season": [2024, 2023, 2022],
    "age": [26, 25, 24],
    "npxg_90": [0.45, 0.38, 0.32],
    "minutes": [2800, 2400, 1800]
})

projection = calculate_marcel_projection(
    player_history=player_history,
    metric="npxg_90",
    league_avg=0.30,
    aging_curve=npxg_curve,
    reliability=0.71
)

print("Marcel Projection Results:")
print(f"Weighted average: {projection['weighted_avg']:.3f}")
print(f"Age adjustment: {projection['age_adjustment']:.3f}")
print(f"Final projection: {projection['projection']:.3f} npxG/90")
lower = projection["projection"] - 1.96 * projection["uncertainty"]
upper = projection["projection"] + 1.96 * projection["uncertainty"]
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
# R: Marcel-Style Football Projection System
library(tidyverse)

# Marcel weights: 5/4/3 for most recent three seasons
marcel_weights <- c(5, 4, 3)

calculate_marcel_projection <- function(player_history, metric, league_avg,
                                        aging_curve, reliability) {
    # Sort by recency (most recent first)
    history <- player_history %>%
        arrange(desc(season)) %>%
        head(3)  # Use up to 3 seasons

    n_seasons <- nrow(history)

    if (n_seasons == 0) {
        return(list(projection = league_avg, uncertainty = NA))
    }

    # Apply Marcel weights
    weights <- marcel_weights[1:n_seasons]
    weights <- weights / sum(weights)  # Normalize

    # Weighted average of past performance
    weighted_avg <- sum(history[[metric]] * weights)

    # Total weighted minutes for regression
    total_minutes <- sum(history$minutes * weights)

    # Current age and aging adjustment
    current_age <- max(history$age) + 1  # Project for next season
    age_adjustment <- aging_curve[aging_curve$age == current_age, "relative"]
    if (length(age_adjustment) == 0) age_adjustment <- 0

    # Age-adjusted weighted average
    age_adjusted <- weighted_avg + age_adjustment

    # Regression to the mean
    regression_weight <- min(1, total_minutes / 3000) * reliability
    projection <- (regression_weight * age_adjusted) +
                  ((1 - regression_weight) * league_avg)

    # Uncertainty based on sample size and metric stability
    uncertainty <- sd(history[[metric]]) / sqrt(n_seasons) *
                   (1 + 0.5 * (1 - regression_weight))

    return(list(
        projection = projection,
        uncertainty = uncertainty,
        weighted_avg = weighted_avg,
        age_adjustment = age_adjustment,
        regression_weight = regression_weight
    ))
}

# Example projection
player_history <- data.frame(
    season = c(2024, 2023, 2022),
    age = c(26, 25, 24),
    npxg_90 = c(0.45, 0.38, 0.32),
    minutes = c(2800, 2400, 1800)
)

projection <- calculate_marcel_projection(
    player_history = player_history,
    metric = "npxg_90",
    league_avg = 0.30,
    aging_curve = npxg_curve,
    reliability = 0.71
)

cat("Marcel Projection Results:\n")
cat(sprintf("Weighted average: %.3f\n", projection$weighted_avg))
cat(sprintf("Age adjustment: %.3f\n", projection$age_adjustment))
cat(sprintf("Final projection: %.3f npxG/90\n", projection$projection))
cat(sprintf("95%% CI: [%.3f, %.3f]\n",
    projection$projection - 1.96 * projection$uncertainty,
    projection$projection + 1.96 * projection$uncertainty))
Output
Marcel Projection Results:
Weighted average: 0.394
Age adjustment: -0.008
Final projection: 0.349 npxG/90
95% CI: [0.260, 0.439]

Full Squad Projections

squad_projections.py
# Python: Project Entire Squad
def project_squad(squad_data, metrics, aging_curves,
                  league_averages, reliabilities):
    """Generate projections for entire squad."""
    projections = []

    for player_id in squad_data["player_id"].unique():
        player_history = squad_data[squad_data["player_id"] == player_id]
        player_name = player_history["player_name"].iloc[0]

        player_proj = {
            "player_id": player_id,
            "player_name": player_name
        }

        for metric in metrics:
            proj = calculate_marcel_projection(
                player_history=player_history,
                metric=metric,
                league_avg=league_averages[metric],
                aging_curve=aging_curves.get(metric, npxg_curve),
                reliability=reliabilities[metric]
            )

            player_proj[f"{metric}_proj"] = proj["projection"]
            player_proj[f"{metric}_lower"] = proj["projection"] - 1.96 * proj["uncertainty"]
            player_proj[f"{metric}_upper"] = proj["projection"] + 1.96 * proj["uncertainty"]

        projections.append(player_proj)

    return pd.DataFrame(projections)

# Define parameters
metrics = ["npxg_90", "xa_90", "progressive_passes_90"]
league_averages = {"npxg_90": 0.30, "xa_90": 0.15, "progressive_passes_90": 3.5}
reliabilities = {"npxg_90": 0.71, "xa_90": 0.68, "progressive_passes_90": 0.75}

# Generate squad projections
squad_projections = project_squad(
    squad_data=squad_historical_data,
    metrics=metrics,
    aging_curves=position_curves.get("Forward", {}),
    league_averages=league_averages,
    reliabilities=reliabilities
)

print(squad_projections.head())
# R: Project Entire Squad
project_squad <- function(squad_data, metrics, aging_curves,
                          league_averages, reliabilities) {
    projections <- list()

    # Loop over a plain value (`pid`) so it can't mask the player_id column
    for (pid in unique(squad_data$player_id)) {
        player_history <- squad_data %>% filter(player_id == pid)
        player_name <- player_history$player_name[1]

        player_proj <- list(player_id = pid, player_name = player_name)

        for (metric in metrics) {
            proj <- calculate_marcel_projection(
                player_history = player_history,
                metric = metric,
                league_avg = league_averages[[metric]],
                aging_curve = aging_curves[[metric]],
                reliability = reliabilities[[metric]]
            )

            player_proj[[paste0(metric, "_proj")]] <- proj$projection
            player_proj[[paste0(metric, "_lower")]] <- proj$projection - 1.96 * proj$uncertainty
            player_proj[[paste0(metric, "_upper")]] <- proj$projection + 1.96 * proj$uncertainty
        }

        projections[[as.character(pid)]] <- player_proj
    }
    }

    # Convert to dataframe
    return(bind_rows(projections))
}

# Define parameters
metrics <- c("npxg_90", "xa_90", "progressive_passes_90")
league_averages <- list(npxg_90 = 0.30, xa_90 = 0.15, progressive_passes_90 = 3.5)
reliabilities <- list(npxg_90 = 0.71, xa_90 = 0.68, progressive_passes_90 = 0.75)

# Generate squad projections
squad_projections <- project_squad(
    squad_data = squad_historical_data,
    metrics = metrics,
    aging_curves = position_curves[["Forward"]],
    league_averages = league_averages,
    reliabilities = reliabilities
)

print(head(squad_projections))

Machine Learning Projections

Machine learning models can capture non-linear relationships and interactions between features that simpler methods miss. However, they require more data and careful validation to avoid overfitting.

ml_projections.py
# Python: ML-Based Player Projections
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score

def prepare_projection_features(player_data):
    """Prepare features for ML projection model."""
    data = player_data.sort_values(["player_id", "season"]).copy()

    # Create lag and rolling features
    for player_id, group in data.groupby("player_id"):
        idx = group.index

        # Lag features
        data.loc[idx, "metric_lag1"] = group["npxg_90"].shift(1)
        data.loc[idx, "metric_lag2"] = group["npxg_90"].shift(2)
        data.loc[idx, "metric_lag3"] = group["npxg_90"].shift(3)

        # Rolling averages
        data.loc[idx, "metric_ma2"] = group["npxg_90"].shift(1).rolling(2).mean()
        data.loc[idx, "metric_ma3"] = group["npxg_90"].shift(1).rolling(3).mean()

        # Trend
        data.loc[idx, "metric_trend"] = group["npxg_90"].shift(1) - group["npxg_90"].shift(2)

        # Minutes features
        data.loc[idx, "minutes_lag1"] = group["minutes"].shift(1)
        data.loc[idx, "minutes_ma2"] = group["minutes"].shift(1).rolling(2).mean()

        # Target
        data.loc[idx, "target"] = group["npxg_90"].shift(-1)

    # Age features
    data["age_squared"] = data["age"] ** 2

    # Filter complete cases
    data = data.dropna(subset=["target", "metric_lag1"])

    return data

def train_projection_model(train_data):
    """Train XGBoost projection model."""
    feature_cols = ["metric_lag1", "metric_lag2", "metric_ma2", "metric_trend",
                    "minutes_lag1", "minutes_ma2", "age", "age_squared"]

    # Remove rows with missing features
    complete_data = train_data.dropna(subset=feature_cols + ["target"])

    X = complete_data[feature_cols]
    y = complete_data["target"]

    # Train-validation split
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = XGBRegressor(
        max_depth=4,
        learning_rate=0.05,
        n_estimators=500,
        min_child_weight=10,
        subsample=0.8,
        colsample_bytree=0.8,
        early_stopping_rounds=50,
        random_state=42
    )

    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        verbose=False
    )

    # Evaluate
    val_pred = model.predict(X_val)
    print(f"Validation RMSE: {np.sqrt(mean_squared_error(y_val, val_pred)):.4f}")
    print(f"Validation R²: {r2_score(y_val, val_pred):.4f}")

    return model

# Train model
projection_features = prepare_projection_features(player_seasons)
ml_model = train_projection_model(projection_features)

# Feature importance (feature_cols was local to train_projection_model,
# so redefine it here before building the importance table)
feature_cols = ["metric_lag1", "metric_lag2", "metric_ma2", "metric_trend",
                "minutes_lag1", "minutes_ma2", "age", "age_squared"]
importance = pd.DataFrame({
    "feature": feature_cols,
    "importance": ml_model.feature_importances_
}).sort_values("importance", ascending=False)
print("\nFeature Importance:")
print(importance)
# R: ML-Based Player Projections
library(tidyverse)
library(xgboost)
library(caret)

# Prepare features for ML model
prepare_projection_features <- function(player_data) {
    features <- player_data %>%
        group_by(player_id) %>%
        arrange(season) %>%
        mutate(
            # Lag features (previous seasons)
            metric_lag1 = lag(npxg_90),
            metric_lag2 = lag(npxg_90, 2),
            metric_lag3 = lag(npxg_90, 3),

            # Rolling averages
            metric_ma2 = (lag(npxg_90) + lag(npxg_90, 2)) / 2,
            metric_ma3 = (lag(npxg_90) + lag(npxg_90, 2) + lag(npxg_90, 3)) / 3,

            # Trend features
            metric_trend = lag(npxg_90) - lag(npxg_90, 2),

            # Minutes features
            minutes_lag1 = lag(minutes),
            minutes_ma2 = (lag(minutes) + lag(minutes, 2)) / 2,

            # Age features
            age_squared = age^2,

            # Target: next season performance
            target = lead(npxg_90)
        ) %>%
        filter(!is.na(target), !is.na(metric_lag1)) %>%
        ungroup()

    return(features)
}

# Train XGBoost model
train_projection_model <- function(train_data) {
    feature_cols <- c("metric_lag1", "metric_lag2", "metric_ma2", "metric_trend",
                      "minutes_lag1", "minutes_ma2", "age", "age_squared")

    # Remove rows with missing features
    complete_data <- train_data %>%
        select(all_of(c(feature_cols, "target"))) %>%
        drop_na()

    X <- as.matrix(complete_data[feature_cols])
    y <- complete_data$target

    # Train-validation split
    set.seed(42)
    train_idx <- createDataPartition(y, p = 0.8, list = FALSE)

    dtrain <- xgb.DMatrix(data = X[train_idx,], label = y[train_idx])
    dval <- xgb.DMatrix(data = X[-train_idx,], label = y[-train_idx])

    params <- list(
        objective = "reg:squarederror",
        max_depth = 4,
        eta = 0.05,
        min_child_weight = 10,
        subsample = 0.8,
        colsample_bytree = 0.8
    )

    model <- xgb.train(
        params = params,
        data = dtrain,
        nrounds = 500,
        watchlist = list(train = dtrain, val = dval),
        early_stopping_rounds = 50,
        verbose = 0
    )

    return(model)
}

# Train model
projection_features <- prepare_projection_features(player_seasons)
ml_model <- train_projection_model(projection_features)

# Feature importance
importance <- xgb.importance(model = ml_model)
print(importance)
Output
Validation RMSE: 0.0821
Validation R²: 0.5847

Feature Importance:
        feature  importance
0   metric_lag1      0.4231
2    metric_ma2      0.1892
6           age      0.1245
3  metric_trend      0.0987
4  minutes_lag1      0.0754
1   metric_lag2      0.0521
5   minutes_ma2      0.0218
7   age_squared      0.0152
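The validation RMSE and R² reported above are not printed by the snippet itself. As a sketch, they can be recovered from held-out predictions; the `y_true`/`y_pred` values here are made up purely for illustration:

```python
import numpy as np

def validation_metrics(y_true, y_pred):
    """RMSE and R^2 for a set of held-out predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return rmse, 1 - ss_res / ss_tot

# Toy example with invented npxG/90 values
rmse, r2 = validation_metrics([0.30, 0.42, 0.25, 0.38],
                              [0.32, 0.40, 0.28, 0.35])
print(f"RMSE: {rmse:.4f}, R²: {r2:.4f}")
```

The same two numbers would be computed on the XGBoost model's validation-set predictions to produce the figures shown in the output.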

Ensemble Projections

Combining multiple projection methods often improves accuracy:

ensemble_projections.py
# Python: Ensemble Projection System
def create_ensemble_projection(player_history, ml_model,
                                league_avg, aging_curve, reliability):
    """Create ensemble projection from multiple methods."""
    # Method 1: Marcel projection
    marcel = calculate_marcel_projection(
        player_history, "npxg_90", league_avg, aging_curve, reliability
    )

    # Method 2: ML projection
    features = prepare_projection_features(player_history)
    latest = features.sort_values("season", ascending=False).iloc[[0]]
    feature_cols = ["metric_lag1", "metric_lag2", "metric_ma2", "metric_trend",
                    "minutes_lag1", "minutes_ma2", "age", "age_squared"]
    ml_pred = ml_model.predict(latest[feature_cols])[0]

    # Method 3: Simple weighted average
    recent = player_history.sort_values("season", ascending=False).head(2)
    simple_avg = np.average(recent["npxg_90"], weights=[0.7, 0.3])

    # Combine methods (weighted by historical accuracy)
    weights = {"marcel": 0.35, "ml": 0.40, "simple": 0.25}

    ensemble = (weights["marcel"] * marcel["projection"] +
                weights["ml"] * ml_pred +
                weights["simple"] * simple_avg)

    # Uncertainty from method disagreement
    methods = [marcel["projection"], ml_pred, simple_avg]
    uncertainty = np.std(methods) + marcel["uncertainty"]

    return {
        "ensemble": ensemble,
        "marcel": marcel["projection"],
        "ml": ml_pred,
        "simple": simple_avg,
        "uncertainty": uncertainty,
        "lower_95": ensemble - 1.96 * uncertainty,
        "upper_95": ensemble + 1.96 * uncertainty
    }

# Generate ensemble projection
ensemble_result = create_ensemble_projection(
    player_history=player_history,
    ml_model=ml_model,
    league_avg=0.30,
    aging_curve=npxg_curve,
    reliability=0.71
)

print("Ensemble Projection:")
print(f"Marcel: {ensemble_result['marcel']:.3f}")
print(f"ML: {ensemble_result['ml']:.3f}")
print(f"Simple: {ensemble_result['simple']:.3f}")
print(f"Ensemble: {ensemble_result['ensemble']:.3f} "
      f"(95% CI: [{ensemble_result['lower_95']:.3f}, {ensemble_result['upper_95']:.3f}])")
# R: Ensemble Projection System
create_ensemble_projection <- function(player_history, ml_model,
                                       league_avg, aging_curve, reliability) {
    # Method 1: Marcel projection
    marcel <- calculate_marcel_projection(
        player_history, "npxg_90", league_avg, aging_curve, reliability
    )

    # Method 2: ML projection
    features <- prepare_projection_features(player_history)
    latest <- features %>% arrange(desc(season)) %>% slice(1)
    feature_cols <- c("metric_lag1", "metric_lag2", "metric_ma2", "metric_trend",
                      "minutes_lag1", "minutes_ma2", "age", "age_squared")
    ml_pred <- predict(ml_model, as.matrix(latest[feature_cols]))

    # Method 3: Simple weighted average
    recent <- player_history %>%
        arrange(desc(season)) %>%
        head(2)
    simple_avg <- weighted.mean(recent$npxg_90, c(0.7, 0.3))

    # Combine methods (weighted by historical accuracy)
    weights <- c(marcel = 0.35, ml = 0.40, simple = 0.25)

    ensemble <- (weights["marcel"] * marcel$projection +
                 weights["ml"] * ml_pred +
                 weights["simple"] * simple_avg)

    # Uncertainty from method disagreement
    methods <- c(marcel$projection, ml_pred, simple_avg)
    uncertainty <- sd(methods) + marcel$uncertainty

    return(list(
        ensemble = ensemble,
        marcel = marcel$projection,
        ml = ml_pred,
        simple = simple_avg,
        uncertainty = uncertainty,
        lower_95 = ensemble - 1.96 * uncertainty,
        upper_95 = ensemble + 1.96 * uncertainty
    ))
}

# Generate ensemble projection
ensemble_result <- create_ensemble_projection(
    player_history = player_history,
    ml_model = ml_model,
    league_avg = 0.30,
    aging_curve = npxg_curve,
    reliability = 0.71
)

cat("Ensemble Projection:\n")
cat(sprintf("Marcel: %.3f\n", ensemble_result$marcel))
cat(sprintf("ML: %.3f\n", ensemble_result$ml))
cat(sprintf("Simple: %.3f\n", ensemble_result$simple))
cat(sprintf("Ensemble: %.3f (95%% CI: [%.3f, %.3f])\n",
    ensemble_result$ensemble, ensemble_result$lower_95, ensemble_result$upper_95))
Output
Ensemble Projection:
Marcel: 0.384
ML: 0.412
Simple: 0.427
Ensemble: 0.407 (95% CI: [0.318, 0.496])
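The fixed weights above (0.35/0.40/0.25) stand in for "weighted by historical accuracy". One common way to derive such weights is inverse-MSE weighting on backtest errors; a minimal sketch, where the per-method RMSE values are illustrative assumptions:

```python
# Hypothetical backtest RMSEs for each method (illustrative numbers only)
backtest_rmse = {"marcel": 0.085, "ml": 0.082, "simple": 0.095}

# Inverse-MSE weighting: lower error -> higher weight, normalized to sum to 1
inv_mse = {m: 1.0 / (rmse ** 2) for m, rmse in backtest_rmse.items()}
total = sum(inv_mse.values())
weights = {m: v / total for m, v in inv_mse.items()}

for method, w in weights.items():
    print(f"{method}: {w:.3f}")
```

With these assumed errors the derived weights land close to the chapter's 0.35/0.40/0.25 split; in practice you would re-estimate them whenever the backtest is refreshed.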

In-Season Projections

Projecting final season totals from partial data requires balancing observed performance against pre-season expectations. As the season progresses, we weight current performance more heavily.

season_projections.py
# Python: In-Season Total Projections
def calculate_season_projection(preseason_proj, current_stats,
                                 games_played, total_games=38):
    """Project final season totals from partial data."""
    games_remaining = total_games - games_played
    season_progress = games_played / total_games

    # Calculate pace
    current_pace = current_stats * (total_games / games_played)

    # Dynamic weighting based on games played
    # More games = more weight on observed performance
    reliability_by_games = min(0.95, 0.2 + 0.8 * (games_played / 30))

    # Blend preseason projection with current pace
    blended_projection = (reliability_by_games * current_pace +
                         (1 - reliability_by_games) * preseason_proj)

    # Uncertainty decreases as season progresses
    pace_variance = current_stats * 0.3  # Rough variance estimate
    uncertainty = np.sqrt(games_remaining / total_games) * pace_variance

    return {
        "projection": blended_projection,
        "current_pace": current_pace,
        "reliability": reliability_by_games,
        "uncertainty": uncertainty,
        "lower_90": blended_projection - 1.645 * uncertainty,
        "upper_90": blended_projection + 1.645 * uncertainty
    }

# Example: Player projections at different points in season
preseason_xg = 15.0  # Preseason projection: 15 xG
current_xg = 6.5     # Current: 6.5 xG in 14 games

season_proj = calculate_season_projection(
    preseason_proj=preseason_xg,
    current_stats=current_xg,
    games_played=14
)

print("In-Season Projection (14 games):")
print(f"Preseason projection: {preseason_xg:.1f} xG")
print(f"Current pace: {season_proj['current_pace']:.1f} xG")
print(f"Blended projection: {season_proj['projection']:.1f} xG "
      f"(90% CI: [{season_proj['lower_90']:.1f}, {season_proj['upper_90']:.1f}])")
print(f"Current performance weight: {season_proj['reliability'] * 100:.0f}%")
# R: In-Season Total Projections
calculate_season_projection <- function(preseason_proj, current_stats,
                                         games_played, total_games = 38) {
    games_remaining <- total_games - games_played
    season_progress <- games_played / total_games

    # Calculate pace
    current_pace <- current_stats * (total_games / games_played)

    # Dynamic weighting based on games played
    # More games = more weight on observed performance
    reliability_by_games <- min(0.95, 0.2 + 0.8 * (games_played / 30))

    # Blend preseason projection with current pace
    blended_projection <- (reliability_by_games * current_pace +
                          (1 - reliability_by_games) * preseason_proj)

    # Uncertainty decreases as season progresses
    pace_variance <- current_stats * 0.3  # Rough variance estimate
    uncertainty <- sqrt(games_remaining / total_games) * pace_variance

    return(list(
        projection = blended_projection,
        current_pace = current_pace,
        reliability = reliability_by_games,
        uncertainty = uncertainty,
        lower_90 = blended_projection - 1.645 * uncertainty,
        upper_90 = blended_projection + 1.645 * uncertainty
    ))
}

# Example: Player projections at different points in season
preseason_xg <- 15.0  # Preseason projection: 15 xG
current_xg <- 6.5     # Current: 6.5 xG in 14 games

season_proj <- calculate_season_projection(
    preseason_proj = preseason_xg,
    current_stats = current_xg,
    games_played = 14
)

cat("In-Season Projection (14 games):\n")
cat(sprintf("Preseason projection: %.1f xG\n", preseason_xg))
cat(sprintf("Current pace: %.1f xG\n", season_proj$current_pace))
cat(sprintf("Blended projection: %.1f xG (90%% CI: [%.1f, %.1f])\n",
    season_proj$projection, season_proj$lower_90, season_proj$upper_90))
cat(sprintf("Current performance weight: %.0f%%\n", season_proj$reliability * 100))
Output
In-Season Projection (14 games):
Preseason projection: 15.0 xG
Current pace: 17.6 xG
Blended projection: 16.5 xG (90% CI: [14.0, 19.1])
Current performance weight: 57%
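The blending rule shifts weight toward observed pace roughly linearly until it caps at 95% (reached around matchweek 29). Tabulating the formula makes the schedule explicit:

```python
def reliability_by_games(games_played):
    """Weight on observed pace, per the blending rule above."""
    return min(0.95, 0.2 + 0.8 * (games_played / 30))

for g in [5, 10, 14, 20, 30, 38]:
    print(f"{g:>2} games: {reliability_by_games(g) * 100:.0f}% weight on current pace")
```

At 14 games this reproduces the 57% weight shown in the output above.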

Projection Updates Through Season

projection_updates.py
# Python: Track Projection Updates Through Season
import matplotlib.pyplot as plt

def track_season_projections(player_game_log, preseason_proj,
                              metric="xg"):
    """Track how projections evolve through the season."""
    updates = []
    total_games = 38

    cumulative = 0
    for i, (_, game) in enumerate(player_game_log.iterrows(), 1):
        cumulative += game[metric]

        proj = calculate_season_projection(
            preseason_proj=preseason_proj,
            current_stats=cumulative,
            games_played=i,
            total_games=total_games
        )

        updates.append({
            "matchweek": i,
            "cumulative": cumulative,
            "pace": proj["current_pace"],
            "projection": proj["projection"],
            "lower": proj["lower_90"],
            "upper": proj["upper_90"],
            "reliability": proj["reliability"]
        })

    return pd.DataFrame(updates)

def plot_projection_evolution(updates):
    """Visualize how projections change through season."""
    fig, ax = plt.subplots(figsize=(12, 6))

    # Confidence interval
    ax.fill_between(updates["matchweek"],
                    updates["lower"], updates["upper"],
                    alpha=0.3, color="blue", label="90% CI")

    # Projection line
    ax.plot(updates["matchweek"], updates["projection"],
            color="blue", linewidth=2, label="Projection")

    # Current pace
    ax.plot(updates["matchweek"], updates["pace"],
            color="gray", linestyle="--", label="Current Pace")

    # Actual cumulative
    ax.scatter(updates["matchweek"], updates["cumulative"],
               color="red", s=30, label="Actual")

    ax.set_xlabel("Matchweek")
    ax.set_ylabel("Expected Goals (xG)")
    ax.set_title("Season Projection Evolution")
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig("projection_evolution.png", dpi=150)
    plt.show()

# Generate and plot
projection_updates = track_season_projections(player_game_log, 15.0, "xg")
plot_projection_evolution(projection_updates)
# R: Track Projection Updates Through Season
track_season_projections <- function(player_game_log, preseason_proj,
                                     metric = "xg") {
    updates <- data.frame()
    total_games <- 38

    cumulative <- 0
    for (i in 1:nrow(player_game_log)) {
        game <- player_game_log[i,]
        cumulative <- cumulative + game[[metric]]

        proj <- calculate_season_projection(
            preseason_proj = preseason_proj,
            current_stats = cumulative,
            games_played = i,
            total_games = total_games
        )

        updates <- rbind(updates, data.frame(
            matchweek = i,
            cumulative = cumulative,
            pace = proj$current_pace,
            projection = proj$projection,
            lower = proj$lower_90,
            upper = proj$upper_90,
            reliability = proj$reliability
        ))
    }

    return(updates)
}

# Visualize projection evolution
library(ggplot2)

plot_projection_evolution <- function(updates) {
    ggplot(updates, aes(x = matchweek)) +
        geom_ribbon(aes(ymin = lower, ymax = upper),
                    fill = "lightblue", alpha = 0.5) +
        geom_line(aes(y = projection), color = "blue", linewidth = 1.2) +
        geom_line(aes(y = pace), color = "gray", linetype = "dashed") +
        geom_point(aes(y = cumulative), color = "red", size = 2) +
        labs(title = "Season Projection Evolution",
             x = "Matchweek", y = "Expected Goals (xG)",
             caption = "Blue = projection, Red = actual, Gray = pace") +
        theme_minimal()
}

# Generate and plot
projection_updates <- track_season_projections(player_game_log, 15.0, "xg")
plot_projection_evolution(projection_updates)

Career Trajectory Modeling

Long-term career projections help with multi-year contract decisions and identifying players who may outperform or underperform expectations. This involves modeling entire career arcs rather than single seasons.

career_trajectories.py
# Python: Career Trajectory Projections
def project_career(player_history, aging_curve, years_ahead=5):
    """Project performance over multiple future seasons."""
    # Get current state
    latest = player_history.sort_values("season", ascending=False).iloc[0]

    current_age = latest["age"]
    current_level = latest["npxg_90"]

    # Project each future season
    projections = []

    for year in range(1, years_ahead + 1):
        future_age = current_age + year

        # Get aging adjustment
        age_row = aging_curve[aging_curve["age"] == future_age]
        if len(age_row) > 0:
            age_adj = age_row["relative"].values[0]
        else:
            age_adj = -0.05 * year  # Default decline

        # Calculate projection with increasing uncertainty
        base_proj = current_level + age_adj
        uncertainty = 0.05 * np.sqrt(year)  # Uncertainty grows with time

        projections.append({
            "season": int(latest["season"]) + year,
            "age": future_age,
            "projection": base_proj,
            "lower_80": base_proj - 1.28 * uncertainty,
            "upper_80": base_proj + 1.28 * uncertainty,
            "lower_95": base_proj - 1.96 * uncertainty,
            "upper_95": base_proj + 1.96 * uncertainty
        })

    return pd.DataFrame(projections)

# Project career for player
career_proj = project_career(
    player_history=player_history,
    aging_curve=npxg_curve,
    years_ahead=8
)

print(career_proj)
# R: Career Trajectory Projections
library(tidyverse)

project_career <- function(player_history, aging_curve, years_ahead = 5) {
    # Get current state
    latest <- player_history %>%
        arrange(desc(season)) %>%
        slice(1)

    current_age <- latest$age
    current_level <- latest$npxg_90

    # Project each future season
    projections <- data.frame()

    for (year in 1:years_ahead) {
        future_age <- current_age + year

        # Get aging adjustment
        age_adj <- aging_curve %>%
            filter(age == future_age) %>%
            pull(relative)

        if (length(age_adj) == 0) age_adj <- -0.05 * year  # Default decline

        # Calculate projection with increasing uncertainty
        base_proj <- current_level + age_adj
        uncertainty <- 0.05 * sqrt(year)  # Uncertainty grows with time

        projections <- rbind(projections, data.frame(
            season = latest$season + year,
            age = future_age,
            projection = base_proj,
            lower_80 = base_proj - 1.28 * uncertainty,
            upper_80 = base_proj + 1.28 * uncertainty,
            lower_95 = base_proj - 1.96 * uncertainty,
            upper_95 = base_proj + 1.96 * uncertainty
        ))
    }

    return(projections)
}

# Project career for player
career_proj <- project_career(
    player_history = player_history,
    aging_curve = npxg_curve,
    years_ahead = 8
)

print(career_proj)
Output
   season  age  projection  lower_80  upper_80  lower_95  upper_95
0    2025   27       0.398     0.334     0.462     0.300     0.496
1    2026   28       0.391     0.301     0.481     0.253     0.529
2    2027   29       0.380     0.268     0.492     0.206     0.554
3    2028   30       0.362     0.232     0.492     0.157     0.567
4    2029   31       0.340     0.195     0.485     0.109     0.571
5    2030   32       0.312     0.156     0.468     0.059     0.565
6    2031   33       0.278     0.114     0.442     0.007     0.549
7    2032   34       0.240     0.072     0.408    -0.047     0.527
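The widening bands in the table come directly from the `0.05 * sqrt(year)` uncertainty term. A quick check of the first few half-widths:

```python
import numpy as np

# 95% half-width grows as 1.96 * 0.05 * sqrt(year)
for year in range(1, 4):
    half = 1.96 * 0.05 * np.sqrt(year)
    print(f"Year {year}: ±{half:.3f}")
```

For year 1 the half-width is 0.098, so 0.398 ± 0.098 gives the [0.300, 0.496] interval in the table's first row.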

Contract Value Projections

contract_value.py
# Python: Project Contract Value
def calculate_contract_value(career_proj, wage_per_npxg=50000,
                              discount_rate=0.05):
    """Calculate present value of projected future performance."""
    proj = career_proj.copy()

    # Calculate expected value of each season
    proj["season_value"] = proj["projection"] * wage_per_npxg * 38  # 38 games

    # Discount future seasons
    proj["year_num"] = range(len(proj))
    proj["discount_factor"] = 1 / (1 + discount_rate) ** proj["year_num"]

    # Present value
    proj["present_value"] = proj["season_value"] * proj["discount_factor"]

    # Uncertainty in value
    proj["value_uncertainty"] = ((proj["upper_95"] - proj["lower_95"]) / 2) * \
                                 wage_per_npxg * 38 * proj["discount_factor"]

    # Total contract value
    total_value = proj["present_value"].sum()
    total_uncertainty = np.sqrt((proj["value_uncertainty"] ** 2).sum())

    return {
        "yearly_values": proj[["season", "age", "projection",
                               "season_value", "present_value"]],
        "total_value": total_value,
        "lower_bound": total_value - 1.96 * total_uncertainty,
        "upper_bound": total_value + 1.96 * total_uncertainty
    }

# Calculate value for 5-year contract
contract_value = calculate_contract_value(
    career_proj.head(5),
    wage_per_npxg=50000,  # £50k per npxG per season
    discount_rate=0.05
)

print("5-Year Contract Valuation:")
print(f"Total Present Value: £{contract_value['total_value']/1e6:.2f}M")
print(f"95% CI: [£{contract_value['lower_bound']/1e6:.2f}M, "
      f"£{contract_value['upper_bound']/1e6:.2f}M]")
print(contract_value["yearly_values"])
# R: Project Contract Value
calculate_contract_value <- function(career_proj, wage_per_npxg = 50000,
                                      discount_rate = 0.05) {
    # Calculate expected value of each season
    career_proj <- career_proj %>%
        mutate(
            # Seasonal expected value
            season_value = projection * wage_per_npxg * 38,  # 38 games

            # Discount future seasons
            discount_factor = (1 / (1 + discount_rate)) ^ (row_number() - 1),

            # Present value
            present_value = season_value * discount_factor,

            # Uncertainty in value
            value_uncertainty = ((upper_95 - lower_95) / 2) * wage_per_npxg * 38 * discount_factor
        )

    # Total contract value
    total_value <- sum(career_proj$present_value)
    total_uncertainty <- sqrt(sum(career_proj$value_uncertainty^2))

    return(list(
        yearly_values = career_proj %>%
            select(season, age, projection, season_value, present_value),
        total_value = total_value,
        lower_bound = total_value - 1.96 * total_uncertainty,
        upper_bound = total_value + 1.96 * total_uncertainty
    ))
}

# Calculate value for 5-year contract
contract_value <- calculate_contract_value(
    career_proj %>% head(5),
    wage_per_npxg = 50000,  # £50k per npxG per season
    discount_rate = 0.05
)

cat("5-Year Contract Valuation:\n")
cat(sprintf("Total Present Value: £%.2fM\n", contract_value$total_value / 1e6))
cat(sprintf("95%% CI: [£%.2fM, £%.2fM]\n",
    contract_value$lower_bound / 1e6, contract_value$upper_bound / 1e6))
print(contract_value$yearly_values)
Output
5-Year Contract Valuation:
Total Present Value: £3.24M
95% CI: [£2.18M, £4.30M]
   season  age  projection  season_value  present_value
0    2025   27       0.398        756200       756200.0
1    2026   28       0.391        742900       707523.8
2    2027   29       0.380        722000       654875.3
3    2028   30       0.362        687800       594147.5
4    2029   31       0.340        646000       531465.8
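The headline figure can be sanity-checked in a few lines, using the projections from the table and the same £50k-per-npxG and 5% discount assumptions:

```python
import numpy as np

projections = np.array([0.398, 0.391, 0.380, 0.362, 0.340])  # from the career table
season_value = projections * 50000 * 38       # £50k per npxG over 38 games
discount = 1 / 1.05 ** np.arange(5)           # 5% discount rate, year 0 undiscounted
total = (season_value * discount).sum()
print(f"Total present value: £{total / 1e6:.2f}M")
```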

Quantifying Uncertainty

All projections come with uncertainty. Properly quantifying and communicating this uncertainty is crucial for making sound decisions. Overconfident projections lead to poor choices.

Sources of Uncertainty
  • Sample size: Small samples = larger confidence intervals
  • Metric stability: Volatile metrics have wider ranges
  • Model error: No model is perfect
  • Regime changes: Team/league switches
  • Injury risk: Health is uncertain
  • Playing time: Minutes are not guaranteed
Communicating Uncertainty
  • Always report confidence intervals, not point estimates alone
  • Use fan charts for multi-year projections
  • Show percentile ranges (10th, 50th, 90th)
  • Explain what drives uncertainty
  • Compare to historical prediction accuracy
  • Calibrate intervals against out-of-sample results
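The final bullet, calibrating intervals against out-of-sample results, amounts to checking how often realized outcomes actually land inside the claimed intervals. A minimal Python sketch on simulated data (all values here are synthetic; a real check would use historical projections and outcomes):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated out-of-sample projections with nominal 90% intervals
n = 200
truth = rng.normal(0.35, 0.08, n)        # realized npxG/90
point = truth + rng.normal(0, 0.05, n)   # projections with N(0, 0.05) error
half_width = 1.645 * 0.05                # claimed 90% interval half-width

covered = np.abs(truth - point) <= half_width
print(f"Empirical coverage of nominal 90% intervals: {covered.mean() * 100:.1f}%")
```

Here the interval width matches the true error scale, so coverage should come out near 90%; empirical coverage well below nominal signals overconfident projections.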
bootstrap_uncertainty.py
# Python: Bootstrapped Confidence Intervals
import numpy as np

def bootstrap_projection(player_history, metric, n_boot=1000,
                          league_avg=0.30, reliability=0.70):
    """Generate bootstrapped confidence intervals."""
    boot_projections = []

    for _ in range(n_boot):
        # Bootstrap resample of seasons
        boot_sample = player_history.sample(n=len(player_history), replace=True)

        # Calculate projection on resampled data
        proj = calculate_marcel_projection(
            player_history=boot_sample,
            metric=metric,
            league_avg=league_avg,
            aging_curve=npxg_curve,
            reliability=reliability
        )

        boot_projections.append(proj["projection"])

    boot_projections = np.array(boot_projections)

    # Calculate percentiles
    percentiles = np.percentile(boot_projections,
                                 [5, 10, 25, 50, 75, 90, 95])

    return {
        "mean": np.mean(boot_projections),
        "sd": np.std(boot_projections),
        "percentiles": dict(zip([5, 10, 25, 50, 75, 90, 95], percentiles)),
        "distribution": boot_projections
    }

# Run bootstrap
boot_result = bootstrap_projection(
    player_history=player_history,
    metric="npxg_90",
    n_boot=1000
)

print("Bootstrap Projection Distribution:")
print(f"Mean: {boot_result['mean']:.3f}")
print(f"Std Dev: {boot_result['sd']:.3f}")
print("\nPercentiles:")
for p, v in boot_result["percentiles"].items():
    print(f"  {p}th: {v:.3f}")
# R: Bootstrapped Confidence Intervals
library(tidyverse)

bootstrap_projection <- function(player_history, metric, n_boot = 1000,
                                  league_avg = 0.30, reliability = 0.70) {
    boot_projections <- numeric(n_boot)

    for (i in 1:n_boot) {
        # Bootstrap resample of seasons
        boot_sample <- player_history %>%
            slice_sample(n = nrow(player_history), replace = TRUE)

        # Calculate projection on resampled data
        proj <- calculate_marcel_projection(
            player_history = boot_sample,
            metric = metric,
            league_avg = league_avg,
            aging_curve = npxg_curve,
            reliability = reliability
        )

        boot_projections[i] <- proj$projection
    }

    # Calculate percentiles
    percentiles <- quantile(boot_projections,
                            probs = c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95))

    return(list(
        mean = mean(boot_projections),
        sd = sd(boot_projections),
        percentiles = percentiles,
        distribution = boot_projections
    ))
}

# Run bootstrap
boot_result <- bootstrap_projection(
    player_history = player_history,
    metric = "npxg_90",
    n_boot = 1000
)

cat("Bootstrap Projection Distribution:\n")
cat(sprintf("Mean: %.3f\n", boot_result$mean))
cat(sprintf("Std Dev: %.3f\n", boot_result$sd))
cat("\nPercentiles:\n")
print(boot_result$percentiles)
Output
Bootstrap Projection Distribution:
Mean: 0.387
Std Dev: 0.048

Percentiles:
  5th: 0.312
  10th: 0.328
  25th: 0.354
  50th: 0.385
  75th: 0.418
  90th: 0.449
  95th: 0.468

Building a Projection Dashboard

A well-designed projection dashboard helps decision-makers quickly understand projected performance and its uncertainty.

projection_dashboard.py
# Python: Complete Projection Dashboard
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

def create_projection_dashboard(player_data, player_id):
    """Create comprehensive projection dashboard for a player."""
    player = player_data[player_data["player_id"] == player_id]
    player_name = player["player_name"].iloc[0]

    # Generate all projections
    marcel_proj = calculate_marcel_projection(player, "npxg_90", 0.30, npxg_curve, 0.71)
    career_proj = project_career(player, npxg_curve, 5)
    boot_proj = bootstrap_projection(player, "npxg_90")

    # Create figure with subplots
    fig = plt.figure(figsize=(14, 10))
    gs = GridSpec(2, 2, figure=fig)

    # 1. Historical + Projection chart
    ax1 = fig.add_subplot(gs[0, :])
    ax1.plot(player["season"], player["npxg_90"], "bo-", label="Historical")
    ax1.fill_between(career_proj["season"],
                     career_proj["lower_95"], career_proj["upper_95"],
                     alpha=0.3, color="blue", label="95% CI")
    ax1.plot(career_proj["season"], career_proj["projection"],
             "b--", label="Projection")
    ax1.set_xlabel("Season")
    ax1.set_ylabel("npxG/90")
    ax1.set_title(f"{player_name} - npxG/90 Trajectory")
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # 2. Projection distribution
    ax2 = fig.add_subplot(gs[1, 0])
    ax2.hist(boot_proj["distribution"], bins=30, color="steelblue",
             alpha=0.7, edgecolor="white")
    ax2.axvline(marcel_proj["projection"], color="red",
                linestyle="--", linewidth=2, label="Point Estimate")
    ax2.set_xlabel("npxG/90 Projection")
    ax2.set_ylabel("Frequency")
    ax2.set_title("Projection Distribution")
    ax2.legend()

    # 3. Percentile comparison
    ax3 = fig.add_subplot(gs[1, 1])
    percentiles = [5, 25, 50, 75, 95]
    values = [boot_proj["percentiles"][p] for p in percentiles]
    colors = ["#d73027", "#fc8d59", "#fee08b", "#91cf60", "#1a9850"]
    ax3.barh(range(len(percentiles)), values, color=colors)
    ax3.set_yticks(range(len(percentiles)))
    ax3.set_yticklabels([f"{p}th" for p in percentiles])
    ax3.set_xlabel("npxG/90")
    ax3.set_title("Projection Percentiles")
    ax3.axvline(marcel_proj["projection"], color="black",
                linestyle="--", label="Point Estimate")

    plt.tight_layout()
    plt.savefig(f"projection_dashboard_{player_id}.png", dpi=150)
    plt.show()

    # Summary table
    comparison = pd.DataFrame({
        "Metric": ["Historical Avg", "Marcel Projection", "10th Percentile",
                   "Median", "90th Percentile"],
        "Value": [
            player["npxg_90"].mean(),
            marcel_proj["projection"],
            boot_proj["percentiles"][10],
            boot_proj["percentiles"][50],
            boot_proj["percentiles"][90]
        ]
    })

    return {
        "comparison": comparison,
        "career_proj": career_proj,
        "boot_dist": boot_proj["distribution"]
    }

# Generate dashboard
dashboard = create_projection_dashboard(squad_data, "player_123")
print(dashboard["comparison"])
# R: Complete Projection Dashboard
library(tidyverse)
library(plotly)

create_projection_dashboard <- function(player_data, player_id) {
    player <- player_data %>% filter(player_id == !!player_id)
    player_name <- player$player_name[1]

    # Generate all projections
    marcel_proj <- calculate_marcel_projection(player, "npxg_90", 0.30, npxg_curve, 0.71)
    career_proj <- project_career(player, npxg_curve, 5)
    boot_proj <- bootstrap_projection(player, "npxg_90")

    # Create visualizations
    dashboard <- list()

    # 1. Historical + Projection chart
    history_chart <- ggplot() +
        geom_line(data = player, aes(x = season, y = npxg_90), color = "blue") +
        geom_point(data = player, aes(x = season, y = npxg_90), color = "blue") +
        geom_ribbon(data = career_proj,
                    aes(x = season, ymin = lower_95, ymax = upper_95),
                    fill = "lightblue", alpha = 0.3) +
        geom_line(data = career_proj, aes(x = season, y = projection),
                  color = "blue", linetype = "dashed") +
        labs(title = paste(player_name, "- npxG/90 Trajectory"),
             x = "Season", y = "npxG/90") +
        theme_minimal()

    # 2. Projection distribution
    dist_chart <- ggplot(data.frame(x = boot_proj$distribution), aes(x = x)) +
        geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
        geom_vline(xintercept = marcel_proj$projection,
                   color = "red", linetype = "dashed", linewidth = 1) +
        labs(title = "Projection Distribution",
             x = "npxG/90 Projection", y = "Frequency") +
        theme_minimal()

    # 3. Comparison table
    comparison <- data.frame(
        Metric = c("Historical Avg", "Marcel Projection", "10th Percentile",
                   "Median", "90th Percentile"),
        Value = c(
            mean(player$npxg_90),
            marcel_proj$projection,
            boot_proj$percentiles["10%"],
            boot_proj$percentiles["50%"],
            boot_proj$percentiles["90%"]
        )
    )

    return(list(
        history_chart = history_chart,
        dist_chart = dist_chart,
        comparison = comparison,
        career_proj = career_proj
    ))
}

# Generate dashboard for a player
dashboard <- create_projection_dashboard(squad_data, "player_123")
print(dashboard$comparison)
print(dashboard$history_chart)
Output
              Metric  Value
0     Historical Avg  0.383
1  Marcel Projection  0.387
2    10th Percentile  0.328
3             Median  0.385
4    90th Percentile  0.449

Practice Exercises

Exercise 29.1: Position-Specific Aging Curve Analysis System

Task: Build a comprehensive aging curve analysis system that creates position-specific curves using the delta method and identifies peak performance ages for different metrics.

Requirements:

  • Use the delta method to calculate year-over-year performance changes
  • Fit smooth curves using LOESS/GAM with confidence intervals
  • Create separate curves for forwards, midfielders, and defenders
  • Identify peak ages and decline rates for each position
  • Control for survivorship bias with minimum minutes thresholds

aging_curves.py
import pandas as pd
import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Generate synthetic career data
np.random.seed(42)

def generate_career_data(n_players=500):
    careers = []

    for player_id in range(n_players):
        # Player characteristics
        position = np.random.choice(['Forward', 'Midfielder', 'Defender'],
                                    p=[0.25, 0.45, 0.30])
        start_age = np.random.randint(18, 23)
        career_length = np.random.randint(8, 17)
        talent = np.random.normal(0, 0.15)

        # Position-specific peak ages
        peak_age = {
            'Forward': 27 + np.random.normal(0, 1),
            'Midfielder': 28 + np.random.normal(0, 1),
            'Defender': 29 + np.random.normal(0, 1)
        }[position]

        for season in range(min(career_length, 38 - start_age)):
            age = start_age + season

            # Aging curve (quadratic decline from peak)
            age_effect = -0.003 * (age - peak_age)**2

            # Base performance by position
            base_goals = {'Forward': 0.45, 'Midfielder': 0.15, 'Defender': 0.05}[position]
            base_assists = {'Forward': 0.20, 'Midfielder': 0.25, 'Defender': 0.10}[position]

            # Calculate metrics
            goals_p90 = max(0, base_goals * (1 + talent + age_effect + np.random.normal(0, 0.08)))
            assists_p90 = max(0, base_assists * (1 + talent + age_effect + np.random.normal(0, 0.06)))

            # Minutes (declines with age)
            minutes = max(500, 2500 * (1 - 0.02 * (age - 25)**2) + np.random.normal(0, 400))

            careers.append({
                'player_id': player_id,
                'season': 2010 + season,
                'age': age,
                'position': position,
                'minutes': minutes,
                'goals_p90': goals_p90,
                'assists_p90': assists_p90
            })

    return pd.DataFrame(careers)

# Generate data
player_data = generate_career_data(500)
print(f"Total player-seasons: {len(player_data)}")
print(f"Unique players: {player_data['player_id'].nunique()}")

# Apply Delta Method
player_data = player_data.sort_values(['player_id', 'season'])

delta_data = player_data.copy()
delta_data['next_season'] = delta_data.groupby('player_id')['season'].shift(-1)
delta_data['next_age'] = delta_data.groupby('player_id')['age'].shift(-1)
delta_data['next_goals_p90'] = delta_data.groupby('player_id')['goals_p90'].shift(-1)
delta_data['next_assists_p90'] = delta_data.groupby('player_id')['assists_p90'].shift(-1)
delta_data['next_minutes'] = delta_data.groupby('player_id')['minutes'].shift(-1)

# Filter for valid consecutive seasons with minimum minutes
delta_data = delta_data[
    (delta_data['next_season'] == delta_data['season'] + 1) &
    (delta_data['minutes'] >= 900) &
    (delta_data['next_minutes'] >= 900)
].copy()

# Calculate deltas
delta_data['delta_goals'] = delta_data['next_goals_p90'] - delta_data['goals_p90']
delta_data['delta_assists'] = delta_data['next_assists_p90'] - delta_data['assists_p90']
delta_data['midpoint_age'] = (delta_data['age'] + delta_data['next_age']) / 2

print(f"\nDelta observations: {len(delta_data)}")

# Fit aging curves by position using LOESS-like smoothing
print("\n=== GOALS PER 90 AGING CURVES ===")

age_grid = np.arange(19, 36.5, 0.5)
curves = {}
peak_ages = {}

for position in ['Forward', 'Midfielder', 'Defender']:
    pos_data = delta_data[delta_data['position'] == position].copy()

    # Bin data by age and calculate mean delta
    pos_data['age_bin'] = pd.cut(pos_data['midpoint_age'],
                                  bins=np.arange(18.5, 37.5, 1),
                                  labels=np.arange(19, 37))
    binned = pos_data.groupby('age_bin')['delta_goals'].agg(['mean', 'std', 'count'])
    binned = binned.dropna()
    binned.index = binned.index.astype(float)

    # Fit smoothing spline
    if len(binned) >= 4:
        spline = UnivariateSpline(binned.index, binned['mean'], s=0.01)
        delta_pred = spline(age_grid)

        # Cumulate to get aging curve
        cumulative = np.cumsum(delta_pred) * 0.5

        # Normalize so peak is at 0
        cumulative = cumulative - np.max(cumulative)

        curves[position] = {
            'age': age_grid,
            'cumulative': cumulative,
            'delta': delta_pred
        }

        # Find peak age
        peak_idx = np.argmax(cumulative)
        peak_ages[position] = age_grid[peak_idx]

print("\nPeak Ages by Position:")
for pos, age in peak_ages.items():
    print(f"  {pos}: {age:.1f} years")

# Calculate decline rates
print("\nDecline Rates After Peak:")
for position, curve in curves.items():
    peak_idx = np.argmax(curve['cumulative'])
    post_peak = curve['cumulative'][peak_idx:]
    post_peak_ages = curve['age'][peak_idx:]

    if len(post_peak) > 1:
        decline_per_year = (post_peak[-1] - post_peak[0]) / (post_peak_ages[-1] - post_peak_ages[0])
        print(f"  {position}: {decline_per_year:.3f} goals/90 per year")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

colors = {'Forward': 'red', 'Midfielder': 'blue', 'Defender': 'green'}

# Plot 1: Cumulative aging curves
ax1 = axes[0]
for position, curve in curves.items():
    ax1.plot(curve['age'], curve['cumulative'],
             color=colors[position], linewidth=2, label=position)
    ax1.axvline(peak_ages[position], color=colors[position],
                linestyle=':', alpha=0.5)

ax1.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax1.set_xlabel('Age')
ax1.set_ylabel('Cumulative Effect on Goals/90')
ax1.set_title('Goals/90 Aging Curves by Position')
ax1.legend()
ax1.set_xlim(20, 36)
ax1.grid(True, alpha=0.3)

# Plot 2: Raw delta data with fitted curves
ax2 = axes[1]
for position in ['Forward', 'Midfielder', 'Defender']:
    pos_data = delta_data[delta_data['position'] == position]
    ax2.scatter(pos_data['midpoint_age'], pos_data['delta_goals'],
                color=colors[position], alpha=0.2, s=20)

    if position in curves:
        ax2.plot(curves[position]['age'], curves[position]['delta'],
                 color=colors[position], linewidth=2, label=position)

ax2.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('Age')
ax2.set_ylabel('Year-over-Year Change (Delta)')
ax2.set_title('Raw Delta Data with Fitted Curves')
ax2.legend()
ax2.set_xlim(20, 36)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('aging_curves_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

# Summary table
print("\n=== AGING CURVE SUMMARY ===")
summary = []
for position, curve in curves.items():
    idx_22 = np.argmin(np.abs(curve['age'] - 22))
    idx_peak = np.argmax(curve['cumulative'])
    idx_34 = np.argmin(np.abs(curve['age'] - 34))

    summary.append({
        'Position': position,
        'Peak Age': f"{curve['age'][idx_peak]:.1f}",
        'Effect at 22': f"{curve['cumulative'][idx_22]:.3f}",
        'Effect at Peak': f"{curve['cumulative'][idx_peak]:.3f}",
        'Effect at 34': f"{curve['cumulative'][idx_34]:.3f}"
    })

print(pd.DataFrame(summary).to_string(index=False))
aging_curves.R
library(tidyverse)
library(mgcv)

# Simulate multi-season player data
set.seed(42)

# Generate synthetic player careers
generate_career_data <- function(n_players = 500) {
  careers <- list()

  for (i in 1:n_players) {
    # Player characteristics
    position <- sample(c("Forward", "Midfielder", "Defender"), 1,
                       prob = c(0.25, 0.45, 0.30))
    start_age <- sample(18:22, 1)
    career_length <- sample(8:16, 1)
    talent <- rnorm(1, 0, 0.15)

    # Position-specific peak ages
    peak_age <- case_when(
      position == "Forward" ~ 27 + rnorm(1, 0, 1),
      position == "Midfielder" ~ 28 + rnorm(1, 0, 1),
      position == "Defender" ~ 29 + rnorm(1, 0, 1)
    )

    for (season in 1:min(career_length, 38 - start_age)) {
      age <- start_age + season - 1

      # Aging curve shape (quadratic decline from peak)
      age_effect <- -0.003 * (age - peak_age)^2

      # Base performance by position
      base_goals_p90 <- case_when(
        position == "Forward" ~ 0.45,
        position == "Midfielder" ~ 0.15,
        position == "Defender" ~ 0.05
      )

      base_assists_p90 <- case_when(
        position == "Forward" ~ 0.20,
        position == "Midfielder" ~ 0.25,
        position == "Defender" ~ 0.10
      )

      # Calculate metrics with aging and noise
      goals_p90 <- max(0, base_goals_p90 * (1 + talent + age_effect + rnorm(1, 0, 0.08)))
      assists_p90 <- max(0, base_assists_p90 * (1 + talent + age_effect + rnorm(1, 0, 0.06)))

      # Minutes played (declines with age, survivorship)
      minutes <- max(500, 2500 * (1 - 0.02 * (age - 25)^2) + rnorm(1, 0, 400))

      careers[[length(careers) + 1]] <- data.frame(
        player_id = i,
        season = 2010 + season,
        age = age,
        position = position,
        minutes = minutes,
        goals_p90 = goals_p90,
        assists_p90 = assists_p90
      )
    }
  }

  bind_rows(careers)
}

# Generate data
player_data <- generate_career_data(500)

cat("Total player-seasons:", nrow(player_data), "\n")
cat("Unique players:", n_distinct(player_data$player_id), "\n")

# Apply Delta Method for aging curves
# Pair consecutive seasons for the same player
delta_data <- player_data %>%
  arrange(player_id, season) %>%
  group_by(player_id) %>%
  mutate(
    next_season = lead(season),
    next_age = lead(age),
    next_goals_p90 = lead(goals_p90),
    next_assists_p90 = lead(assists_p90),
    next_minutes = lead(minutes)
  ) %>%
  ungroup() %>%
  filter(
    !is.na(next_season),
    next_season == season + 1,  # Consecutive seasons
    minutes >= 900,              # Minimum minutes in year 1
    next_minutes >= 900          # Minimum minutes in year 2
  ) %>%
  mutate(
    # Calculate deltas (change from age N to N+1)
    delta_goals = next_goals_p90 - goals_p90,
    delta_assists = next_assists_p90 - assists_p90,
    midpoint_age = (age + next_age) / 2
  )

cat("\nDelta observations:", nrow(delta_data), "\n")

# Fit aging curves by position using GAM
cat("\n=== GOALS PER 90 AGING CURVES ===\n")

# Fit GAM for each position; take the names from group_keys() so they
# match the (sorted) group order that group_map() iterates in
position_keys <- delta_data %>% group_by(position) %>% group_keys()
aging_models <- delta_data %>%
  group_by(position) %>%
  group_map(~ gam(delta_goals ~ s(midpoint_age, k = 8), data = .x))

names(aging_models) <- position_keys$position

# Create prediction grid
age_grid <- expand.grid(
  midpoint_age = seq(19, 36, by = 0.5),
  position = c("Forward", "Midfielder", "Defender"),
  stringsAsFactors = FALSE  # keep position as character for list lookup
)

# Predict and cumulate to get absolute curve
predictions <- age_grid %>%
  group_by(position) %>%
  group_modify(~ {
    model <- aging_models[[as.character(.y$position)]]
    pred <- predict(model, newdata = .x, se.fit = TRUE)

    .x %>%
      mutate(
        delta_pred = pred$fit,
        delta_se = pred$se.fit,
        # Cumulate deltas to get the aging curve (relative to the youngest grid age)
        cumulative_effect = cumsum(delta_pred) * 0.5,  # Scale by age step
        ci_lower = cumulative_effect - 1.96 * delta_se,
        ci_upper = cumulative_effect + 1.96 * delta_se
      )
  }) %>%
  ungroup()

# Find peak ages
peak_ages <- predictions %>%
  group_by(position) %>%
  slice_max(cumulative_effect, n = 1) %>%
  select(position, peak_age = midpoint_age, peak_effect = cumulative_effect)

cat("\nPeak Ages by Position:\n")
print(peak_ages)

# Calculate decline rates after peak
decline_rates <- predictions %>%
  inner_join(peak_ages, by = "position") %>%
  filter(midpoint_age >= peak_age) %>%
  group_by(position) %>%
  summarise(
    decline_per_year = (last(cumulative_effect) - first(cumulative_effect)) /
                       (last(midpoint_age) - first(midpoint_age)),
    years_from_peak = last(midpoint_age) - first(midpoint_age),
    total_decline = last(cumulative_effect) - first(cumulative_effect)
  )

cat("\nDecline Rates After Peak:\n")
print(decline_rates)

# Visualization
par(mfrow = c(1, 2))

# Goals P90 Aging Curves
colors <- c("Forward" = "red", "Midfielder" = "blue", "Defender" = "green")

plot(1, type = "n", xlim = c(20, 36), ylim = c(-0.15, 0.1),
     xlab = "Age", ylab = "Cumulative Effect on Goals/90",
     main = "Goals/90 Aging Curves by Position")
abline(h = 0, lty = 2, col = "gray")

for (pos in c("Forward", "Midfielder", "Defender")) {
  pos_data <- predictions %>% filter(position == pos)
  lines(pos_data$midpoint_age, pos_data$cumulative_effect,
        col = colors[pos], lwd = 2)
  polygon(c(pos_data$midpoint_age, rev(pos_data$midpoint_age)),
          c(pos_data$ci_lower, rev(pos_data$ci_upper)),
          col = adjustcolor(colors[pos], alpha = 0.1), border = NA)
}

legend("topright", legend = names(colors), col = colors, lwd = 2, cex = 0.8)

# Mark peak ages
for (i in 1:nrow(peak_ages)) {
  abline(v = peak_ages$peak_age[i], col = colors[peak_ages$position[i]], lty = 3)
}

# Raw delta data with fitted curves
plot(1, type = "n", xlim = c(20, 36), ylim = c(-0.3, 0.3),
     xlab = "Age", ylab = "Year-over-Year Change (Delta)",
     main = "Raw Delta Data with Fitted Curves")
abline(h = 0, lty = 2, col = "gray")

for (pos in c("Forward", "Midfielder", "Defender")) {
  pos_delta <- delta_data %>% filter(position == pos)
  points(pos_delta$midpoint_age, pos_delta$delta_goals,
         col = adjustcolor(colors[pos], alpha = 0.2), pch = 19, cex = 0.5)

  # Add fitted line
  pos_pred <- predictions %>% filter(position == pos)
  lines(pos_pred$midpoint_age, c(0, diff(pos_pred$cumulative_effect) / 0.5),
        col = colors[pos], lwd = 2)
}

legend("topright", legend = names(colors), col = colors, pch = 19, cex = 0.8)

par(mfrow = c(1, 1))

# Summary statistics
cat("\n=== AGING CURVE SUMMARY ===\n")
summary_stats <- predictions %>%
  group_by(position) %>%
  summarise(
    age_range = paste(min(midpoint_age), "-", max(midpoint_age)),
    peak_age = midpoint_age[which.max(cumulative_effect)],
    effect_at_22 = cumulative_effect[midpoint_age == 22],
    effect_at_peak = max(cumulative_effect),
    effect_at_34 = cumulative_effect[midpoint_age == 34]
  )

print(summary_stats)
Exercise 29.2: Complete Marcel Projection System with Backtesting

Task: Build a full Marcel projection system that weights historical seasons, applies regression to the mean, adjusts for aging, and validates against actual outcomes.

Requirements:

  • Implement 5/4/3 weighting for most recent three seasons
  • Calculate reliability coefficients and apply appropriate regression
  • Integrate aging curve adjustments for projected age
  • Backtest projections against held-out seasons
  • Calculate RMSE, MAE, and correlation metrics

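The arithmetic at the heart of Marcel is compact enough to verify by hand: a 5/4/3 weighted average of the last three seasons, pulled toward the league mean by an amount that shrinks as minutes accumulate. A minimal sketch — the weighting and minutes-based regression form mirror the solution below, but the input numbers are made up:

```python
import numpy as np

def marcel_sketch(g90_history, minutes_history, league_mean,
                  weights=(5, 4, 3), k=0.5):
    """Tiny Marcel core: 5/4/3 weighting plus regression to the mean.

    g90_history is ordered most recent first; k controls how strongly
    small samples are pulled toward the league mean.
    """
    w = np.array(weights[: len(g90_history)], dtype=float)
    w /= w.sum()
    weighted = float(np.dot(w, g90_history))
    total_minutes = sum(minutes_history)
    # More minutes -> lower regression rate -> trust the player's own record
    regression_rate = k / (k + total_minutes / 2000)
    return weighted * (1 - regression_rate) + league_mean * regression_rate

# Three seasons, most recent first (illustrative values)
proj = marcel_sketch([0.50, 0.42, 0.38], [2400, 2100, 1900], league_mean=0.30)
print(round(proj, 3))
```

Note that a player with the same rates over far fewer minutes would be regressed much harder toward the league mean.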
marcel_projections.py
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Generate multi-season player data
np.random.seed(42)

def generate_player_history(n_players=300, seasons=range(2015, 2024)):
    players = []

    for player_id in range(n_players):
        # Player characteristics
        true_talent = np.random.normal(0.35, 0.12)  # Goals per 90
        position = np.random.choice(['Forward', 'Midfielder'], p=[0.4, 0.6])
        start_age = np.random.randint(20, 29)

        if position == 'Forward':
            true_talent += 0.15

        for season in seasons:
            age = start_age + (season - min(seasons))

            # Skip if retired
            if age > 36 or np.random.random() < 0.05:
                continue

            # Aging effect
            peak_age = 27 if position == 'Forward' else 28
            aging = -0.003 * (age - peak_age)**2

            # Observed performance
            minutes = max(800, np.random.normal(2200, 500))
            observed_g90 = max(0, true_talent + aging + np.random.normal(0, 0.12))

            players.append({
                'player_id': player_id,
                'player_name': f'Player_{player_id}',
                'season': season,
                'age': age,
                'position': position,
                'minutes': minutes,
                'goals_p90': observed_g90,
                'true_talent': true_talent
            })

    return pd.DataFrame(players)

# Generate data
player_data = generate_player_history(300)
print(f"Total player-seasons: {len(player_data)}")
print(f"Unique players: {player_data['player_id'].nunique()}")

# League average for regression
league_mean = player_data['goals_p90'].mean()
print(f"\nLeague average goals/90: {league_mean:.3f}")

# Marcel Projection Function
def marcel_project(player_history, target_season, league_mean,
                   weights=[5, 4, 3], reliability_factor=0.5):
    """Generate Marcel projection for a player."""

    # Get recent seasons
    recent = player_history[player_history['season'] < target_season].copy()
    recent = recent.sort_values('season', ascending=False).head(3)

    if len(recent) == 0:
        return None

    # Calculate weighted average
    n_seasons = len(recent)
    use_weights = np.array(weights[:n_seasons])
    use_weights = use_weights / use_weights.sum()

    weighted_g90 = (recent['goals_p90'].values * use_weights).sum()
    total_minutes = recent['minutes'].sum()

    # Regression rate based on sample size
    regression_rate = reliability_factor / (reliability_factor + total_minutes / 2000)

    # Regress to mean
    regressed_g90 = weighted_g90 * (1 - regression_rate) + league_mean * regression_rate

    # Aging adjustment
    current_age = recent['age'].iloc[0]
    projected_age = current_age + 1
    peak_age = 27 if recent['position'].iloc[0] == 'Forward' else 28

    current_aging = -0.003 * (current_age - peak_age)**2
    projected_aging = -0.003 * (projected_age - peak_age)**2
    aging_delta = projected_aging - current_aging

    # Final projection
    projected_g90 = regressed_g90 + aging_delta

    return {
        'player_id': recent['player_id'].iloc[0],
        'player_name': recent['player_name'].iloc[0],
        'position': recent['position'].iloc[0],
        'target_season': target_season,
        'projected_age': projected_age,
        'seasons_used': n_seasons,
        'weighted_avg': weighted_g90,
        'total_minutes': total_minutes,
        'regression_rate': regression_rate,
        'regressed_g90': regressed_g90,
        'aging_adjustment': aging_delta,
        'projected_g90': projected_g90
    }

# Generate projections
print("\n=== GENERATING MARCEL PROJECTIONS ===")

projections = []
for target_year in [2022, 2023]:
    players_to_project = player_data[player_data['season'] == target_year - 1]['player_id'].unique()

    for pid in players_to_project:
        player_hist = player_data[player_data['player_id'] == pid]
        proj = marcel_project(player_hist, target_year, league_mean)
        if proj:
            projections.append(proj)

projections_df = pd.DataFrame(projections)
print(f"Total projections generated: {len(projections_df)}")

# Merge with actuals for validation
actuals = player_data[['player_id', 'season', 'goals_p90', 'minutes']].copy()
actuals.columns = ['player_id', 'season', 'actual_g90', 'actual_minutes']

validation = projections_df.merge(
    actuals,
    left_on=['player_id', 'target_season'],
    right_on=['player_id', 'season']
)
validation = validation[validation['actual_minutes'] >= 500]

print(f"Projections with actual data: {len(validation)}")

# Calculate accuracy metrics
print("\n=== MARCEL PROJECTION ACCURACY ===")

rmse = np.sqrt(mean_squared_error(validation['actual_g90'], validation['projected_g90']))
mae = mean_absolute_error(validation['actual_g90'], validation['projected_g90'])
correlation = validation['projected_g90'].corr(validation['actual_g90'])

print(f"\nOverall Metrics:")
print(f"  RMSE: {rmse:.4f}")
print(f"  MAE: {mae:.4f}")
print(f"  Correlation: {correlation:.4f}")

# Naive baseline
naive_rmse = np.sqrt(mean_squared_error(validation['actual_g90'], validation['weighted_avg']))
naive_mae = mean_absolute_error(validation['actual_g90'], validation['weighted_avg'])
naive_corr = validation['weighted_avg'].corr(validation['actual_g90'])

print(f"\nNaive Baseline (weighted avg only):")
print(f"  RMSE: {naive_rmse:.4f}")
print(f"  MAE: {naive_mae:.4f}")
print(f"  Correlation: {naive_corr:.4f}")

print(f"\nMarcel Improvement over Naive:")
print(f"  RMSE improvement: {(naive_rmse - rmse) / naive_rmse * 100:.1f}%")
print(f"  MAE improvement: {(naive_mae - mae) / naive_mae * 100:.1f}%")

# Accuracy by group
print("\n=== ACCURACY BY PLAYER GROUP ===")

by_position = validation.groupby('position').apply(
    lambda x: pd.Series({
        'n': len(x),
        'rmse': np.sqrt(mean_squared_error(x['actual_g90'], x['projected_g90'])),
        'mae': mean_absolute_error(x['actual_g90'], x['projected_g90']),
        'correlation': x['projected_g90'].corr(x['actual_g90'])
    })
).reset_index()

print("\nBy Position:")
print(by_position.to_string(index=False))

# By experience
validation['experience'] = pd.cut(validation['seasons_used'],
                                   bins=[0, 1, 2, 3],
                                   labels=['1 season', '2 seasons', '3 seasons'])

by_exp = validation.groupby('experience').apply(
    lambda x: pd.Series({
        'n': len(x),
        'rmse': np.sqrt(mean_squared_error(x['actual_g90'], x['projected_g90'])),
        'mae': mean_absolute_error(x['actual_g90'], x['projected_g90'])
    })
).reset_index()

print("\nBy Seasons of History:")
print(by_exp.to_string(index=False))

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Projected vs Actual
ax1 = axes[0, 0]
ax1.scatter(validation['projected_g90'], validation['actual_g90'],
            alpha=0.5, s=30, c='steelblue')
ax1.plot([0, 0.8], [0, 0.8], 'r-', linewidth=2, label='Perfect')

# Fit line
z = np.polyfit(validation['projected_g90'], validation['actual_g90'], 1)
p = np.poly1d(z)
ax1.plot([0, 0.8], [p(0), p(0.8)], 'b--', linewidth=2, label='Actual fit')

ax1.set_xlabel('Projected Goals/90')
ax1.set_ylabel('Actual Goals/90')
ax1.set_title('Marcel Projections vs Actuals')
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Residuals
validation['residual'] = validation['actual_g90'] - validation['projected_g90']
ax2 = axes[0, 1]
ax2.scatter(validation['projected_g90'], validation['residual'],
            alpha=0.5, s=30, c='steelblue')
ax2.axhline(0, color='red', linewidth=2)
ax2.set_xlabel('Projected Goals/90')
ax2.set_ylabel('Residual (Actual - Projected)')
ax2.set_title('Residual Analysis')
ax2.grid(True, alpha=0.3)

# 3. Residuals by regression rate
ax3 = axes[1, 0]
ax3.scatter(validation['regression_rate'], validation['residual'],
            alpha=0.5, s=30, c='green')
ax3.axhline(0, color='red', linewidth=2)
ax3.set_xlabel('Regression Rate')
ax3.set_ylabel('Residual')
ax3.set_title('Residuals by Regression Rate')
ax3.grid(True, alpha=0.3)

# 4. Error distribution
ax4 = axes[1, 1]
ax4.hist(validation['residual'], bins=30, color='lightblue', edgecolor='white')
ax4.axvline(validation['residual'].mean(), color='red', linewidth=2, label='Mean')
ax4.axvline(0, color='black', linestyle='--', linewidth=2, label='Zero')
ax4.set_xlabel('Error (Actual - Projected)')
ax4.set_ylabel('Frequency')
ax4.set_title('Distribution of Projection Errors')
ax4.legend()

plt.tight_layout()
plt.savefig('marcel_projection_validation.png', dpi=150, bbox_inches='tight')
plt.show()

# Example projections
print("\n=== EXAMPLE PROJECTIONS ===")
example = validation.nlargest(10, 'actual_minutes')[
    ['player_name', 'position', 'projected_age', 'projected_g90', 'actual_g90', 'residual']
]
print(example.to_string(index=False))
marcel_projections.R
library(tidyverse)

# Generate multi-season player data for Marcel system
set.seed(42)

generate_player_history <- function(n_players = 300, seasons = 2015:2023) {
  players <- list()

  for (i in 1:n_players) {
    # Player true talent (constant)
    true_talent <- rnorm(1, 0.35, 0.12)  # Goals per 90
    position <- sample(c("Forward", "Midfielder"), 1, prob = c(0.4, 0.6))
    start_age <- sample(20:28, 1)

    # Forward adjustment
    if (position == "Forward") true_talent <- true_talent + 0.15

    for (season in seasons) {
      age <- start_age + (season - min(seasons))

      # Skip if too old or retired
      if (age > 36 || runif(1) < 0.05) next

      # Aging effect
      peak_age <- ifelse(position == "Forward", 27, 28)
      aging <- -0.003 * (age - peak_age)^2

      # Season performance = true talent + aging + noise
      minutes <- max(800, rnorm(1, 2200, 500))
      observed_g90 <- max(0, true_talent + aging + rnorm(1, 0, 0.12))

      players[[length(players) + 1]] <- data.frame(
        player_id = i,
        player_name = paste0("Player_", i),
        season = season,
        age = age,
        position = position,
        minutes = minutes,
        goals_p90 = observed_g90,
        true_talent = true_talent
      )
    }
  }

  bind_rows(players)
}

# Generate data
player_data <- generate_player_history(300)

cat("Total player-seasons:", nrow(player_data), "\n")
cat("Unique players:", n_distinct(player_data$player_id), "\n\n")

# Calculate league average for regression target
league_avg <- player_data %>%
  summarise(
    mean_g90 = mean(goals_p90),
    sd_g90 = sd(goals_p90)
  )

cat("League average goals/90:", round(league_avg$mean_g90, 3), "\n")

# Marcel Projection Function
marcel_project <- function(player_history, target_season, league_mean,
                           weights = c(5, 4, 3),  # Most recent to oldest
                           reliability_factor = 0.5) {  # Higher = more reliable

  # Get player's recent seasons
  recent <- player_history %>%
    filter(season < target_season) %>%
    arrange(desc(season)) %>%
    head(3)

  if (nrow(recent) == 0) return(NULL)

  # Calculate weighted average
  n_seasons <- nrow(recent)
  use_weights <- weights[1:n_seasons]
  use_weights <- use_weights / sum(use_weights)

  weighted_g90 <- sum(recent$goals_p90 * use_weights)
  total_minutes <- sum(recent$minutes)

  # Calculate regression amount based on sample size
  # More minutes = more reliable = less regression
  regression_rate <- reliability_factor / (reliability_factor + total_minutes / 2000)

  # Regress to mean
  regressed_g90 <- weighted_g90 * (1 - regression_rate) + league_mean * regression_rate

  # Apply aging adjustment
  current_age <- recent$age[1]
  projected_age <- current_age + 1
  peak_age <- ifelse(recent$position[1] == "Forward", 27, 28)

  # Simple quadratic aging adjustment
  current_aging <- -0.003 * (current_age - peak_age)^2
  projected_aging <- -0.003 * (projected_age - peak_age)^2
  aging_delta <- projected_aging - current_aging

  # Final projection
  projected_g90 <- regressed_g90 + aging_delta

  data.frame(
    player_id = recent$player_id[1],
    player_name = recent$player_name[1],
    position = recent$position[1],
    target_season = target_season,
    projected_age = projected_age,
    seasons_used = n_seasons,
    weighted_avg = weighted_g90,
    total_minutes = total_minutes,
    regression_rate = regression_rate,
    regressed_g90 = regressed_g90,
    aging_adjustment = aging_delta,
    projected_g90 = projected_g90
  )
}

# Generate projections for 2022 and 2023 seasons
cat("\n=== GENERATING MARCEL PROJECTIONS ===\n")

projections <- list()

for (target_year in c(2022, 2023)) {
  players_to_project <- player_data %>%
    filter(season == target_year - 1) %>%
    pull(player_id) %>%
    unique()

  for (pid in players_to_project) {
    player_hist <- player_data %>% filter(player_id == pid)
    proj <- marcel_project(player_hist, target_year, league_avg$mean_g90)
    if (!is.null(proj)) {
      projections[[length(projections) + 1]] <- proj
    }
  }
}

projections_df <- bind_rows(projections)

cat("Total projections generated:", nrow(projections_df), "\n")

# Merge with actual results for validation
actuals <- player_data %>%
  select(player_id, season, actual_g90 = goals_p90, actual_minutes = minutes)

validation <- projections_df %>%
  inner_join(actuals, by = c("player_id", "target_season" = "season")) %>%
  filter(actual_minutes >= 500)  # Only validate players who played enough

cat("Projections with actual data:", nrow(validation), "\n\n")

# Calculate accuracy metrics
cat("=== MARCEL PROJECTION ACCURACY ===\n")

# Overall metrics
rmse <- sqrt(mean((validation$projected_g90 - validation$actual_g90)^2))
mae <- mean(abs(validation$projected_g90 - validation$actual_g90))
correlation <- cor(validation$projected_g90, validation$actual_g90)

cat("\nOverall Metrics:\n")
cat("  RMSE:", round(rmse, 4), "\n")
cat("  MAE:", round(mae, 4), "\n")
cat("  Correlation:", round(correlation, 4), "\n")

# Compare to naive baseline (previous season only)
naive_rmse <- sqrt(mean((validation$weighted_avg - validation$actual_g90)^2))
naive_mae <- mean(abs(validation$weighted_avg - validation$actual_g90))
naive_corr <- cor(validation$weighted_avg, validation$actual_g90)

cat("\nNaive Baseline (weighted avg only):\n")
cat("  RMSE:", round(naive_rmse, 4), "\n")
cat("  MAE:", round(naive_mae, 4), "\n")
cat("  Correlation:", round(naive_corr, 4), "\n")

# Improvement
cat("\nMarcel Improvement over Naive:\n")
cat("  RMSE improvement:", round((naive_rmse - rmse) / naive_rmse * 100, 1), "%\n")
cat("  MAE improvement:", round((naive_mae - mae) / naive_mae * 100, 1), "%\n")

# Accuracy by player group
cat("\n=== ACCURACY BY PLAYER GROUP ===\n")

by_position <- validation %>%
  group_by(position) %>%
  summarise(
    n = n(),
    rmse = sqrt(mean((projected_g90 - actual_g90)^2)),
    mae = mean(abs(projected_g90 - actual_g90)),
    correlation = cor(projected_g90, actual_g90),
    .groups = "drop"
  )

print(by_position)

by_experience <- validation %>%
  mutate(experience = cut(seasons_used, breaks = c(0, 1, 2, 3),
                          labels = c("1 season", "2 seasons", "3 seasons"))) %>%
  group_by(experience) %>%
  summarise(
    n = n(),
    rmse = sqrt(mean((projected_g90 - actual_g90)^2)),
    mae = mean(abs(projected_g90 - actual_g90)),
    .groups = "drop"
  )

cat("\nBy Seasons of History:\n")
print(by_experience)

# Visualization
par(mfrow = c(2, 2))

# 1. Projected vs Actual scatter
plot(validation$projected_g90, validation$actual_g90,
     xlab = "Projected Goals/90", ylab = "Actual Goals/90",
     main = "Marcel Projections vs Actuals",
     pch = 19, col = adjustcolor("steelblue", 0.5))
abline(0, 1, col = "red", lwd = 2)
abline(lm(actual_g90 ~ projected_g90, data = validation), col = "blue", lty = 2)
legend("topleft", legend = c("Perfect", "Actual fit"),
       col = c("red", "blue"), lty = c(1, 2), cex = 0.8)

# 2. Residual plot
validation$residual <- validation$actual_g90 - validation$projected_g90
plot(validation$projected_g90, validation$residual,
     xlab = "Projected Goals/90", ylab = "Residual (Actual - Projected)",
     main = "Residual Analysis",
     pch = 19, col = adjustcolor("steelblue", 0.5))
abline(h = 0, col = "red", lwd = 2)

# 3. Residuals by regression rate
plot(validation$regression_rate, validation$residual,
     xlab = "Regression Rate", ylab = "Residual",
     main = "Residuals by Regression Rate",
     pch = 19, col = adjustcolor("green", 0.5))
abline(h = 0, col = "red", lwd = 2)

# 4. Histogram of errors
hist(validation$residual, breaks = 30, main = "Distribution of Projection Errors",
     xlab = "Error (Actual - Projected)", col = "lightblue", border = "white")
abline(v = mean(validation$residual), col = "red", lwd = 2)
abline(v = 0, col = "black", lty = 2)

par(mfrow = c(1, 1))

# Show example projections
cat("\n=== EXAMPLE PROJECTIONS ===\n")
example <- validation %>%
  arrange(desc(actual_minutes)) %>%
  head(10) %>%
  select(player_name, position, projected_age, projected_g90, actual_g90, residual)

print(example)
Exercise 29.3: Dynamic In-Season Projection Updates

Task: Create an in-season projection system that blends preseason expectations with current season performance and updates confidence intervals as the season progresses.

Requirements:

  • Implement Bayesian updating to blend prior (preseason) with observed data
  • Dynamically adjust weights based on games/minutes played
  • Calculate shrinking confidence intervals through the season
  • Track and visualize projection evolution week-by-week
  • Compare to end-of-season actuals for validation

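The blending rule used in the solution below is a shrinkage weight that grows with minutes played: with reliability constant k, the observed rate earns weight m / (m + k), so early-season projections stay close to the preseason prior and converge to the observed rate as evidence accumulates. A minimal sketch, with k = 400 and the inputs chosen for illustration:

```python
def blend(prior_mean, observed_g90, minutes, k=400):
    """Blend a preseason prior with the in-season observed rate.

    The observed rate gets weight minutes / (minutes + k): near zero
    early in the season, approaching one as minutes accumulate.
    """
    obs_weight = minutes / (minutes + k)
    return (1 - obs_weight) * prior_mean + obs_weight * observed_g90

prior = 0.40  # hypothetical preseason projection
for minutes, obs in [(90, 1.00), (900, 0.55), (3000, 0.48)]:
    print(minutes, round(blend(prior, obs, minutes), 3))
```

A single hot match (90 minutes at 1.00 goals/90) barely moves the projection, while a full season at 0.48 dominates the prior.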
inseason_projections.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Simulate a player's season game-by-game
np.random.seed(42)

def generate_season_data(true_g90=0.45, games=38, minutes_per_game=85):
    """Generate game-by-game season data for a player."""
    minutes = np.clip(np.random.normal(minutes_per_game, 15, games), 0, 90)
    df = pd.DataFrame({
        'matchweek': list(range(1, games + 1)),
        'minutes': minutes,
        # Expected goals in each match scale with the minutes actually played
        'goals': np.random.poisson(true_g90 * minutes / 90)
    })
    df['cumulative_minutes'] = df['minutes'].cumsum()
    df['cumulative_goals'] = df['goals'].cumsum()
    df['observed_g90'] = df['cumulative_goals'] / df['cumulative_minutes'] * 90
    return df

# Generate season for a player
true_rate = 0.45
preseason_projection = 0.40  # Slightly underestimated
preseason_std = 0.10  # Uncertainty

season_data = generate_season_data(true_g90=true_rate, games=38)

print(f"True goals/90: {true_rate}")
print(f"Preseason projection: {preseason_projection}")
print(f"Final observed g/90: {season_data['observed_g90'].iloc[-1]:.3f}")

# In-Season Projection Update Function (Bayesian-style)
def update_projection(prior_mean, prior_std, observed_g90, minutes_played, reliability=400):
    """
    Update projection using Bayesian-style weighting.

    Args:
        prior_mean: Preseason projection
        prior_std: Preseason uncertainty
        observed_g90: Current season rate
        minutes_played: Total minutes this season
        reliability: Minutes needed to trust observed data fully
    """
    # Weight for observed data
    obs_weight = minutes_played / (minutes_played + reliability)
    prior_weight = 1 - obs_weight

    # Posterior mean
    posterior_mean = prior_weight * prior_mean + obs_weight * observed_g90

    # Posterior std shrinks with more data (heuristic combination of
    # uncertainties, not an exact conjugate update)
    obs_std = np.sqrt(observed_g90 / max(1, minutes_played / 90))
    posterior_std = np.sqrt(1 / (1/prior_std**2 + minutes_played / (reliability * obs_std**2 + 1)))

    return {
        'mean': posterior_mean,
        'std': posterior_std,
        'obs_weight': obs_weight
    }

# Track projections through the season
projections = [{
    'matchweek': 0,  # Preseason
    'projection': preseason_projection,
    'ci_lower': preseason_projection - 1.96 * preseason_std,
    'ci_upper': preseason_projection + 1.96 * preseason_std,
    'std': preseason_std,
    'obs_weight': 0,
    'cumulative_minutes': 0,
    'observed_g90': np.nan
}]

for week in range(1, 39):
    current = season_data[season_data['matchweek'] == week].iloc[0]

    updated = update_projection(
        prior_mean=preseason_projection,
        prior_std=preseason_std,
        observed_g90=current['observed_g90'],
        minutes_played=current['cumulative_minutes']
    )

    projections.append({
        'matchweek': week,
        'projection': updated['mean'],
        'ci_lower': updated['mean'] - 1.96 * updated['std'],
        'ci_upper': updated['mean'] + 1.96 * updated['std'],
        'std': updated['std'],
        'obs_weight': updated['obs_weight'],
        'cumulative_minutes': current['cumulative_minutes'],
        'observed_g90': current['observed_g90']
    })

projections_df = pd.DataFrame(projections)

# Summary at key points
print("\n=== PROJECTION EVOLUTION ===")
key_weeks = [0, 5, 10, 19, 29, 38]
summary = projections_df[projections_df['matchweek'].isin(key_weeks)].round(4)
print(summary[['matchweek', 'projection', 'ci_lower', 'ci_upper', 'std', 'obs_weight']].to_string(index=False))

# Accuracy analysis
final_g90 = season_data['observed_g90'].iloc[-1]

accuracy = projections_df[projections_df['matchweek'] > 0].copy()
accuracy['error'] = accuracy['projection'] - final_g90
accuracy['within_ci'] = (accuracy['ci_lower'] <= final_g90) & (final_g90 <= accuracy['ci_upper'])

print("\n=== PROJECTION ACCURACY ===")
calibration = {
    'pct_in_95_ci': accuracy['within_ci'].mean() * 100,
    'avg_error': accuracy['error'].mean(),
    'rmse': np.sqrt((accuracy['error']**2).mean()),
    'early_season_error': accuracy[accuracy['matchweek'] <= 10]['error'].mean(),
    'late_season_error': accuracy[accuracy['matchweek'] > 28]['error'].mean()
}
print(pd.Series(calibration).to_string())

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Projection evolution with CI
ax1 = axes[0, 0]
ax1.plot(projections_df['matchweek'], projections_df['projection'],
         'b-', linewidth=2, label='Projection')
ax1.fill_between(projections_df['matchweek'],
                  projections_df['ci_lower'], projections_df['ci_upper'],
                  color='blue', alpha=0.2, label='95% CI')
ax1.axhline(true_rate, color='red', linestyle='--', linewidth=2, label='True Rate')
ax1.axhline(final_g90, color='green', linestyle=':', linewidth=2, label='Final Observed')
ax1.plot(projections_df['matchweek'][1:], projections_df['observed_g90'][1:],
         'orange', linewidth=1, alpha=0.7, label='Running Observed')

ax1.set_xlabel('Matchweek')
ax1.set_ylabel('Goals per 90')
ax1.set_title('In-Season Projection Evolution')
ax1.legend(loc='upper right', fontsize=8)
ax1.set_ylim(0.2, 0.7)
ax1.grid(True, alpha=0.3)

# 2. Weight evolution
ax2 = axes[0, 1]
ax2.plot(projections_df['matchweek'], projections_df['obs_weight'] * 100,
         'purple', linewidth=2)
ax2.axhline(50, color='gray', linestyle='--', alpha=0.5)
ax2.text(5, 52, '50% threshold', fontsize=9, color='gray')
ax2.set_xlabel('Matchweek')
ax2.set_ylabel('Weight on Observed Data (%)')
ax2.set_title('Data Weight Evolution')
ax2.grid(True, alpha=0.3)

# 3. CI width over time
ax3 = axes[1, 0]
ci_width = projections_df['std'] * 1.96 * 2
ax3.plot(projections_df['matchweek'], ci_width, 'darkgreen', linewidth=2)
ax3.set_xlabel('Matchweek')
ax3.set_ylabel('95% CI Width')
ax3.set_title('Uncertainty Reduction Over Season')
ax3.grid(True, alpha=0.3)

# 4. Error evolution
ax4 = axes[1, 1]
errors = projections_df[projections_df['matchweek'] > 0].copy()
errors['abs_error'] = np.abs(errors['projection'] - final_g90)
ax4.plot(errors['matchweek'], errors['abs_error'], 'red', linewidth=2)
ax4.set_xlabel('Matchweek')
ax4.set_ylabel('Absolute Error vs Final')
ax4.set_title('Projection Error Evolution')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('inseason_projection_evolution.png', dpi=150, bbox_inches='tight')
plt.show()

# Multi-player validation
print("\n=== MULTI-PLAYER VALIDATION ===")

def simulate_multiple_players(n_players=50):
    results = []

    for i in range(n_players):
        # Random player
        true_rate = np.random.uniform(0.2, 0.7)
        preseason_proj = true_rate + np.random.normal(0, 0.08)
        preseason_std = np.random.uniform(0.08, 0.15)

        season = generate_season_data(true_g90=true_rate)
        final_g90 = season['observed_g90'].iloc[-1]

        for week in [10, 19, 29, 38]:
            if week <= len(season):
                current = season[season['matchweek'] == week].iloc[0]

                updated = update_projection(
                    prior_mean=preseason_proj,
                    prior_std=preseason_std,
                    observed_g90=current['observed_g90'],
                    minutes_played=current['cumulative_minutes']
                )

                results.append({
                    'player_id': i,
                    'week': week,
                    'true_rate': true_rate,
                    'preseason_proj': preseason_proj,
                    'in_season_proj': updated['mean'],
                    'ci_lower': updated['mean'] - 1.96 * updated['std'],
                    'ci_upper': updated['mean'] + 1.96 * updated['std'],
                    'final_g90': final_g90
                })

    return pd.DataFrame(results)

multi_results = simulate_multiple_players(100)

# Accuracy by week
multi_results['preseason_error'] = np.abs(multi_results['preseason_proj'] - multi_results['final_g90'])
multi_results['inseason_error'] = np.abs(multi_results['in_season_proj'] - multi_results['final_g90'])
multi_results['within_ci'] = ((multi_results['ci_lower'] <= multi_results['final_g90']) &
                               (multi_results['final_g90'] <= multi_results['ci_upper']))

accuracy_by_week = multi_results.groupby('week').agg({
    'preseason_error': lambda x: np.sqrt((x**2).mean()),
    'inseason_error': lambda x: np.sqrt((x**2).mean()),
    'within_ci': lambda x: x.mean() * 100
}).reset_index()

accuracy_by_week.columns = ['week', 'preseason_rmse', 'inseason_rmse', 'calibration']
accuracy_by_week['improvement_pct'] = ((accuracy_by_week['preseason_rmse'] -
                                         accuracy_by_week['inseason_rmse']) /
                                        accuracy_by_week['preseason_rmse'] * 100)

print("\nProjection Accuracy by Matchweek:")
print(accuracy_by_week.round(3).to_string(index=False))
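The `update_projection` function above uses a minutes-based weighting heuristic. For comparison, a textbook conjugate normal-normal update combines prior and data by precision; this is a sketch under the assumption that the observed rate is approximately normal with a known sampling standard deviation (`obs_std` here is an illustrative input, not derived from minutes):

```python
import numpy as np

def conjugate_update(prior_mean, prior_std, obs_mean, obs_std):
    """Exact posterior for a normal prior and normal likelihood with
    known standard deviations: a precision-weighted average."""
    prior_prec = 1.0 / prior_std**2
    obs_prec = 1.0 / obs_std**2
    post_var = 1.0 / (prior_prec + obs_prec)
    post_mean = post_var * (prior_prec * prior_mean + obs_prec * obs_mean)
    return post_mean, np.sqrt(post_var)

# Preseason prior 0.40 +/- 0.10; observed 0.55 with sampling SD 0.15
mean, std = conjugate_update(0.40, 0.10, 0.55, 0.15)
```

Because the prior is more precise than the noisy early-season observation, the posterior mean stays closer to 0.40 than to 0.55, and the posterior standard deviation is smaller than either input. The heuristic in `update_projection` approximates this behavior without requiring an explicit sampling variance.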
inseason_projections.R
library(tidyverse)

# Simulate a player's season game-by-game
set.seed(42)

generate_season_data <- function(true_g90 = 0.45, games = 38, minutes_per_game = 85) {
  tibble(
    matchweek = 1:games,
    minutes = pmin(90, pmax(0, rnorm(games, minutes_per_game, 15))),
    # Poisson goals based on true rate
    goals = rpois(games, true_g90 * minutes / 90)
  ) %>%
    mutate(
      cumulative_minutes = cumsum(minutes),
      cumulative_goals = cumsum(goals),
      observed_g90 = cumulative_goals / cumulative_minutes * 90
    )
}

# Generate season for a player
true_rate <- 0.45
preseason_projection <- 0.40  # Slightly underestimated
preseason_std <- 0.10  # Uncertainty in preseason projection

season_data <- generate_season_data(true_g90 = true_rate, games = 38)

cat("True goals/90:", true_rate, "\n")
cat("Preseason projection:", preseason_projection, "\n")
cat("Final observed g/90:", round(tail(season_data$observed_g90, 1), 3), "\n\n")

# In-Season Projection Update Function (Bayesian-style)
update_projection <- function(prior_mean, prior_std, observed_g90, minutes_played,
                               reliability = 400) {  # Higher = faster convergence
  # Reliability parameter: how many minutes to "believe" new data fully
  # posterior_mean = weighted average of prior and observed

  # Weight for observed data based on sample size
  obs_weight <- minutes_played / (minutes_played + reliability)
  prior_weight <- 1 - obs_weight

  # Posterior mean
  posterior_mean <- prior_weight * prior_mean + obs_weight * observed_g90

  # Posterior standard deviation shrinks with more data
  # Heuristic combination of uncertainties (not an exact conjugate update)
  obs_std <- sqrt(observed_g90 / max(1, minutes_played / 90))  # Poisson-based
  posterior_std <- sqrt(1 / (1/prior_std^2 + minutes_played / (reliability * obs_std^2 + 1)))

  list(mean = posterior_mean, std = posterior_std, obs_weight = obs_weight)
}

# Track projections through the season
projections <- tibble(
  matchweek = 0,  # Preseason
  projection = preseason_projection,
  ci_lower = preseason_projection - 1.96 * preseason_std,
  ci_upper = preseason_projection + 1.96 * preseason_std,
  std = preseason_std,
  obs_weight = 0,
  cumulative_minutes = 0,
  observed_g90 = NA_real_
)

for (week in 1:38) {
  current <- season_data %>% filter(matchweek == week)

  # Update projection
  updated <- update_projection(
    prior_mean = preseason_projection,
    prior_std = preseason_std,
    observed_g90 = current$observed_g90,
    minutes_played = current$cumulative_minutes
  )

  projections <- bind_rows(projections, tibble(
    matchweek = week,
    projection = updated$mean,
    ci_lower = updated$mean - 1.96 * updated$std,
    ci_upper = updated$mean + 1.96 * updated$std,
    std = updated$std,
    obs_weight = updated$obs_weight,
    cumulative_minutes = current$cumulative_minutes,
    observed_g90 = current$observed_g90
  ))
}

cat("=== PROJECTION EVOLUTION ===\n")
key_weeks <- c(0, 5, 10, 19, 29, 38)
projection_summary <- projections %>%
  filter(matchweek %in% key_weeks) %>%
  mutate(across(c(projection, ci_lower, ci_upper, std), ~round(., 4)))

print(projection_summary)

# Calculate projection accuracy at different points
cat("\n=== PROJECTION ACCURACY ===\n")
final_g90 <- tail(season_data$observed_g90, 1)

accuracy <- projections %>%
  filter(matchweek > 0) %>%
  mutate(
    error = projection - final_g90,
    within_ci = final_g90 >= ci_lower & final_g90 <= ci_upper
  )

# Calibration check
calibration <- accuracy %>%
  summarise(
    pct_in_95_ci = mean(within_ci) * 100,
    avg_error = mean(error),
    rmse = sqrt(mean(error^2)),
    early_season_error = mean(error[matchweek <= 10]),
    late_season_error = mean(error[matchweek > 28])
  )

print(calibration)

# Visualization
par(mfrow = c(2, 2))

# 1. Projection evolution with CI
plot(projections$matchweek, projections$projection, type = "l",
     col = "blue", lwd = 2, ylim = c(0.2, 0.7),
     xlab = "Matchweek", ylab = "Goals per 90",
     main = "In-Season Projection Evolution")

# Confidence interval
polygon(c(projections$matchweek, rev(projections$matchweek)),
        c(projections$ci_lower, rev(projections$ci_upper)),
        col = adjustcolor("blue", 0.2), border = NA)

# True rate and final observed
abline(h = true_rate, col = "red", lwd = 2, lty = 2)
abline(h = final_g90, col = "green", lwd = 2, lty = 3)

# Add observed running rate
lines(projections$matchweek[-1], projections$observed_g90[-1],
      col = "orange", lwd = 1, lty = 1)

legend("topright",
       legend = c("Projection", "95% CI", "True Rate", "Final Observed", "Running Obs"),
       col = c("blue", adjustcolor("blue", 0.3), "red", "green", "orange"),
       lty = c(1, NA, 2, 3, 1), lwd = c(2, NA, 2, 2, 1),
       fill = c(NA, adjustcolor("blue", 0.3), NA, NA, NA),
       border = c(NA, NA, NA, NA, NA), cex = 0.7)

# 2. Weight evolution
plot(projections$matchweek, projections$obs_weight * 100, type = "l",
     col = "purple", lwd = 2,
     xlab = "Matchweek", ylab = "Weight on Observed Data (%)",
     main = "Data Weight Evolution")
abline(h = 50, col = "gray", lty = 2)
text(5, 55, "50% threshold", cex = 0.8, col = "gray")

# 3. Confidence interval width
plot(projections$matchweek, projections$std * 1.96 * 2, type = "l",
     col = "darkgreen", lwd = 2,
     xlab = "Matchweek", ylab = "95% CI Width",
     main = "Uncertainty Reduction Over Season")

# 4. Error evolution
errors <- projections %>%
  filter(matchweek > 0) %>%
  mutate(error = abs(projection - final_g90))

plot(errors$matchweek, errors$error, type = "l",
     col = "red", lwd = 2,
     xlab = "Matchweek", ylab = "Absolute Error vs Final",
     main = "Projection Error Evolution")

par(mfrow = c(1, 1))

# Multiple player simulation
cat("\n=== MULTI-PLAYER VALIDATION ===\n")

simulate_multiple_players <- function(n_players = 50) {
  results <- list()

  for (i in 1:n_players) {
    # Random player characteristics
    true_rate <- runif(1, 0.2, 0.7)
    preseason_proj <- true_rate + rnorm(1, 0, 0.08)  # Noisy preseason
    preseason_std <- runif(1, 0.08, 0.15)

    season <- generate_season_data(true_g90 = true_rate)
    final_g90 <- tail(season$observed_g90, 1)

    # Track key projection points
    for (week in c(10, 19, 29, 38)) {
      if (week <= nrow(season)) {
        current <- season[week, ]

        updated <- update_projection(
          prior_mean = preseason_proj,
          prior_std = preseason_std,
          observed_g90 = current$observed_g90,
          minutes_played = current$cumulative_minutes
        )

        results[[length(results) + 1]] <- data.frame(
          player_id = i,
          week = week,
          true_rate = true_rate,
          preseason_proj = preseason_proj,
          in_season_proj = updated$mean,
          ci_lower = updated$mean - 1.96 * updated$std,
          ci_upper = updated$mean + 1.96 * updated$std,
          final_g90 = final_g90
        )
      }
    }
  }

  bind_rows(results)
}

multi_results <- simulate_multiple_players(100)

# Accuracy by week
accuracy_by_week <- multi_results %>%
  mutate(
    preseason_error = abs(preseason_proj - final_g90),
    inseason_error = abs(in_season_proj - final_g90),
    within_ci = final_g90 >= ci_lower & final_g90 <= ci_upper
  ) %>%
  group_by(week) %>%
  summarise(
    preseason_rmse = sqrt(mean(preseason_error^2)),
    inseason_rmse = sqrt(mean(inseason_error^2)),
    improvement_pct = (preseason_rmse - inseason_rmse) / preseason_rmse * 100,
    calibration = mean(within_ci) * 100,
    .groups = "drop"
  )

cat("\nProjection Accuracy by Matchweek:\n")
print(accuracy_by_week)
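The `reliability` parameter controls how quickly the prior is discounted: with `reliability = 400`, the weight on observed data is `minutes / (minutes + 400)`. A quick table (a sketch using the same formula as `update_projection`) makes the convergence speed concrete:

```python
# Weight on observed data implied by reliability = 400
reliability = 400
weights = {m: m / (m + reliability) for m in [90, 400, 900, 1800, 3420]}
for minutes, w in weights.items():
    print(f"{minutes:>5} min -> {w:.0%} on observed data")
```

After one full match the projection still leans heavily on the preseason prior (about 18%), the 50/50 point arrives at 400 minutes, and by a full 38-match season of minutes the observed rate dominates at roughly 90%.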

Chapter Summary

Key Takeaways
  • Aging curves model how performance changes over a career; different metrics peak at different ages
  • Regression to the mean is essential - extreme performances tend to normalize
  • Marcel projections provide a simple but effective framework combining weighting, aging, and regression
  • Machine learning can capture non-linear patterns but requires more data and validation
  • Ensemble methods often outperform any single projection approach
  • In-season projections blend preseason expectations with observed performance
  • Career trajectories help with long-term planning and contract decisions
  • Uncertainty quantification is crucial - always report confidence intervals
Projection System Checklist
  • Weight recent seasons more heavily
  • Apply appropriate aging adjustments
  • Regress to the mean based on sample size
  • Use metric-specific reliability estimates
  • Calculate confidence intervals
  • Validate against historical accuracy
  • Update projections as new data arrives
  • Communicate uncertainty clearly
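The checklist above can be sketched end to end in a Marcel-style function. The season weights (5/4/3), the 900-minute regression constant, the peak age of 26, and the per-year decline are illustrative assumptions, not fitted values:

```python
import numpy as np

def project_next_season(rates, minutes, age, league_mean=0.35,
                        weights=(5, 4, 3), regression_minutes=900,
                        peak_age=26, decline_per_year=0.01):
    """Marcel-style projection: weight recent seasons, regress to the
    league mean by sample size, then apply a simple aging adjustment."""
    # 1. Minutes- and recency-weighted average (most recent season first)
    w = np.array(weights[:len(rates)]) * np.array(minutes)
    weighted_rate = np.dot(w, np.array(rates)) / w.sum()
    # 2. Regress toward the league mean based on total minutes observed
    total_minutes = sum(minutes)
    rel = total_minutes / (total_minutes + regression_minutes)
    regressed = rel * weighted_rate + (1 - rel) * league_mean
    # 3. Aging: flat up to the peak, then a small linear decline per year
    aging = -decline_per_year * max(0, age + 1 - peak_age)
    return regressed + aging

# Three seasons of goals/90 (most recent first) for a 29-year-old
proj = project_next_season(rates=[0.50, 0.45, 0.40],
                           minutes=[2700, 2400, 2000], age=29)
```

For this player the weighted rate of about 0.46 is pulled modestly toward the league mean (the large minutes sample keeps the regression light), then reduced for age, landing near 0.41. Confidence intervals, context adjustments, and validation would wrap around this core in a production system.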