Capstone - Complete Analytics System
- Understand projection frameworks and their applications in football
- Build aging curves for different metrics and positions
- Apply regression to the mean in performance projections
- Create statistical projection systems using Marcel-style methods
- Implement machine learning approaches for player projections
- Quantify uncertainty in projections with confidence intervals
- Project season totals from partial season data
- Model career trajectories and development paths
Introduction to Performance Projections
Player performance projections are essential for transfer decisions, contract negotiations, squad planning, and fantasy football. By combining historical performance data with aging curves, regression principles, and contextual factors, we can build systems that forecast future performance with quantified uncertainty.
Why Project Performance?
A player's past performance is not a perfect predictor of future results. Young players improve, veterans decline, and all players experience random variance. Projection systems help separate signal from noise and estimate the true underlying talent level that drives future performance.
# Python: Player Projection Framework Overview
import pandas as pd
import numpy as np
from scipy import stats
# Define projection system components
projection_components = {
    "historical_performance": "Weighted average of past seasons",
    "aging_curve": "Expected change based on age",
    "regression": "Regression to the mean based on sample size",
    "context": "League/team quality adjustments",
    "uncertainty": "Confidence intervals for projections"
}
# Sample projection workflow (illustrative pseudocode; the helper
# functions are stand-ins for the methods developed below)
def project_player(player_history, metric="npxg_90"):
    """Project future performance for a player."""
    # 1. Weight recent seasons more heavily
    weighted_avg = calculate_weighted_average(player_history, metric)
    # 2. Apply aging curve adjustment
    age_adjusted = apply_aging_curve(weighted_avg, player_history["age"])
    # 3. Regress to the mean based on minutes played
    regressed = regress_to_mean(age_adjusted, player_history["minutes"])
    # 4. Calculate projection uncertainty
    uncertainty = calculate_uncertainty(player_history, metric)
    return {
        "projection": regressed,
        "lower_95": regressed - 1.96 * uncertainty,
        "upper_95": regressed + 1.96 * uncertainty
    }

print("Projection Framework Components:")
for name, description in projection_components.items():
    print(f"- {name}: {description}")
# R: Player Projection Framework Overview
library(tidyverse)
library(mgcv)
# Define projection system components
projection_components <- list(
  historical_performance = "Weighted average of past seasons",
  aging_curve = "Expected change based on age",
  regression = "Regression to the mean based on sample size",
  context = "League/team quality adjustments",
  uncertainty = "Confidence intervals for projections"
)
# Sample projection workflow (illustrative pseudocode; the helper
# functions are stand-ins for the methods developed below)
project_player <- function(player_history, metric = "npxg_90") {
  # 1. Weight recent seasons more heavily
  weighted_avg <- calculate_weighted_average(player_history, metric)
  # 2. Apply aging curve adjustment
  age_adjusted <- apply_aging_curve(weighted_avg, player_history$age)
  # 3. Regress to the mean based on minutes played
  regressed <- regress_to_mean(age_adjusted, player_history$minutes)
  # 4. Calculate projection uncertainty
  uncertainty <- calculate_uncertainty(player_history, metric)
  return(list(
    projection = regressed,
    lower_95 = regressed - 1.96 * uncertainty,
    upper_95 = regressed + 1.96 * uncertainty
  ))
}
cat("Projection Framework Components:\n")
for (name in names(projection_components)) {
  cat(sprintf("- %s: %s\n", name, projection_components[[name]]))
}
Projection Framework Components:
- historical_performance: Weighted average of past seasons
- aging_curve: Expected change based on age
- regression: Regression to the mean based on sample size
- context: League/team quality adjustments
- uncertainty: Confidence intervals for projections
Building Aging Curves
Aging curves model how player performance changes over their career. Different metrics peak at different ages, and positions show distinct patterns. Building accurate aging curves requires careful handling of survivorship bias.
- Physical Metrics (peak 24-26): sprint speed, distance covered
- Technical Metrics (peak 26-29): xG, xA, key passes
- Decision Metrics (peak 28-32): pass completion, positioning
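One caveat worth checking before trusting any aging curve: the delta method that follows only sees players who appear in consecutive seasons, so players who drop out of the league never contribute a decline delta. A minimal sketch of a survivorship check, on a hypothetical toy table (the real analysis would use the full multi-season dataset):

```python
import pandas as pd

# Toy player-season table (hypothetical data for illustration)
df = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2, 3],
    "season":    [2021, 2022, 2023, 2021, 2022, 2021],
    "age":       [24, 25, 26, 33, 34, 35],
})

# Fraction of player-seasons at each age that have a following season:
# low survival at high ages means those ages are under-sampled by the
# delta method, biasing the curve optimistic for old players
df = df.sort_values(["player_id", "season"])
df["has_next"] = df.groupby("player_id")["season"].shift(-1).notna()
survival = df.groupby("age")["has_next"].mean()
print(survival)
```

If survival drops sharply past a given age, the curve beyond that age rests on a self-selected sample of players good enough to keep their jobs.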
# Python: Building Aging Curves with Delta Method
import pandas as pd
import numpy as np
from scipy.interpolate import UnivariateSpline
# Load multi-season player data
player_seasons = pd.read_csv("player_season_data.csv")
def build_aging_curve_delta(data, metric):
    """Build aging curve using the delta method."""
    # Sort by player and season
    data = data.sort_values(["player_id", "season"])
    # Create consecutive-season pairs for each player
    paired = []
    for player_id, group in data.groupby("player_id"):
        group = group.reset_index(drop=True)
        for i in range(len(group) - 1):
            current = group.iloc[i]
            next_row = group.iloc[i + 1]
            delta = next_row[metric] - current[metric]
            paired.append({
                "player_id": player_id,
                "age": current["age"],
                "current_value": current[metric],
                "next_value": next_row[metric],
                "delta": delta
            })
    paired_df = pd.DataFrame(paired)
    # Calculate average delta at each age
    age_deltas = paired_df.groupby("age").agg(
        mean_delta=("delta", "mean"),
        se_delta=("delta", lambda x: x.std() / np.sqrt(len(x))),
        n=("delta", "count")
    ).reset_index()
    return age_deltas
# Build curve for npxG/90
npxg_aging = build_aging_curve_delta(player_seasons, "npxg_90")
def fit_smooth_aging(age_deltas):
    """Fit smooth aging curve with spline."""
    # Fit spline weighted by sample size at each age
    spline = UnivariateSpline(
        age_deltas["age"],
        age_deltas["mean_delta"],
        w=np.sqrt(age_deltas["n"]),
        s=0.5
    )
    # Predict across age range
    ages = np.arange(18, 39)
    predicted_delta = spline(ages)
    # Cumulate deltas to get the aging curve (relative to peak)
    cumulative = np.cumsum(predicted_delta)
    peak_value = cumulative.max()
    relative = cumulative - peak_value
    return pd.DataFrame({
        "age": ages,
        "predicted_delta": predicted_delta,
        "cumulative": cumulative,
        "relative": relative
    })

npxg_curve = fit_smooth_aging(npxg_aging)
print(npxg_curve)
# R: Building Aging Curves with Delta Method
library(tidyverse)
library(mgcv)
# Load multi-season player data
player_seasons <- read_csv("player_season_data.csv")
# Delta method: compare same player across seasons
build_aging_curve_delta <- function(data, metric) {
  # Create consecutive-season pairs per player
  paired <- data %>%
    group_by(player_id) %>%
    arrange(season, .by_group = TRUE) %>%
    mutate(
      next_age = lead(age),
      next_value = lead(!!sym(metric)),
      delta = next_value - !!sym(metric)
    ) %>%
    filter(!is.na(delta)) %>%
    ungroup()
  # Calculate average delta at each age
  age_deltas <- paired %>%
    group_by(age) %>%
    summarize(
      mean_delta = mean(delta),
      se_delta = sd(delta) / sqrt(n()),
      n = n()
    )
  return(age_deltas)
}
# Build curve for npxG/90
npxg_aging <- build_aging_curve_delta(player_seasons, "npxg_90")
# Fit smooth curve using GAM
fit_smooth_aging <- function(age_deltas) {
  # GAM smooth weighted by sample size at each age
  gam_fit <- gam(mean_delta ~ s(age, k = 10),
                 data = age_deltas,
                 weights = n)
  # Predict across age range
  prediction_ages <- data.frame(age = 18:38)
  prediction_ages$predicted_delta <- predict(gam_fit, prediction_ages)
  # Cumulate deltas to get the aging curve (relative to peak)
  prediction_ages$cumulative <- cumsum(prediction_ages$predicted_delta)
  peak_value <- max(prediction_ages$cumulative)
  prediction_ages$relative <- prediction_ages$cumulative - peak_value
  return(prediction_ages)
}

npxg_curve <- fit_smooth_aging(npxg_aging)
print(npxg_curve)
 age predicted_delta cumulative relative
0 18 0.0321 0.032 -0.247
1 19 0.0285 0.061 -0.219
2 20 0.0241 0.085 -0.195
3 21 0.0198 0.105 -0.175
4 22 0.0156 0.120 -0.159
...
Position-Specific Aging Curves
# Python: Position-Specific Aging Analysis
# Different positions age differently
def build_position_curves(data):
    """Build aging curves for each position and metric."""
    positions = ["Forward", "Midfielder", "Defender", "Goalkeeper"]
    metrics = ["npxg_90", "xa_90", "progressive_carries", "aerial_wins"]
    curves = {}
    for pos in positions:
        pos_data = data[data["position"] == pos]
        curves[pos] = {}
        for metric in metrics:
            if metric in pos_data.columns:
                age_deltas = build_aging_curve_delta(pos_data, metric)
                if len(age_deltas) > 5:  # Need sufficient data
                    curves[pos][metric] = fit_smooth_aging(age_deltas)
    return curves

def find_peak_ages(curves):
    """Find peak age for each position-metric combination."""
    results = []
    for pos, metrics in curves.items():
        for metric, curve in metrics.items():
            peak_idx = curve["cumulative"].idxmax()
            peak_age = curve.loc[peak_idx, "age"]
            results.append({
                "position": pos,
                "metric": metric,
                "peak_age": peak_age
            })
    return pd.DataFrame(results)

position_curves = build_position_curves(player_seasons)
peak_ages = find_peak_ages(position_curves)
print(peak_ages)
# R: Position-Specific Aging Analysis
# Different positions age differently
build_position_curves <- function(data) {
  positions <- c("Forward", "Midfielder", "Defender", "Goalkeeper")
  metrics <- c("npxg_90", "xa_90", "progressive_carries", "aerial_wins")
  curves <- list()
  for (pos in positions) {
    pos_data <- data %>% filter(position == pos)
    curves[[pos]] <- list()
    for (metric in metrics) {
      if (metric %in% names(pos_data)) {
        age_deltas <- build_aging_curve_delta(pos_data, metric)
        if (nrow(age_deltas) > 5) {  # Need sufficient data
          curves[[pos]][[metric]] <- fit_smooth_aging(age_deltas)
        }
      }
    }
  }
  return(curves)
}
# Get peak ages by position and metric
find_peak_ages <- function(curves) {
  results <- data.frame()
  for (pos in names(curves)) {
    for (metric in names(curves[[pos]])) {
      curve <- curves[[pos]][[metric]]
      peak_age <- curve$age[which.max(curve$cumulative)]
      results <- rbind(results, data.frame(
        position = pos,
        metric = metric,
        peak_age = peak_age
      ))
    }
  }
  return(results)
}

position_curves <- build_position_curves(player_seasons)
peak_ages <- find_peak_ages(position_curves)
print(peak_ages)
 position metric peak_age
0 Forward npxg_90 27
1 Forward xa_90 26
2 Forward progressive_carries 25
3 Midfielder npxg_90 28
4 Midfielder xa_90 28
5 Midfielder progressive_carries 27
6 Defender aerial_wins 29
7 Goalkeeper save_pct 30
Regression to the Mean
Players with extreme performances tend to move toward average in subsequent seasons. The amount of regression depends on sample size and the stability of the metric. Understanding regression is crucial for avoiding overpaying for outlier seasons.
Key Insight
A player who scored 20 goals last season with 15 xG likely benefited from luck. Projecting 20 goals again would ignore the high probability of regression. Similarly, a player who scored 8 goals with 12 xG likely underperformed and may improve.
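As a toy illustration of that insight (weights hypothetical, not estimated from data): treating xG as the stabler signal, a simple blend shrinks noisy goal totals toward it.

```python
def blended_goal_estimate(goals, xg, w_goals=0.4):
    """Blend noisy goal totals with the more stable xG signal.

    The 0.4/0.6 weights are illustrative only; a real system would
    estimate them from historical year-to-year data.
    """
    return w_goals * goals + (1 - w_goals) * xg

print(f"{blended_goal_estimate(20, 15):.1f}")  # → 17.0, hot finisher pulled down
print(f"{blended_goal_estimate(8, 12):.1f}")   # → 10.4, cold finisher pulled up
```

Both players end up projected between their goals and their xG, with the more reliable component dominating.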
# Python: Regression to the Mean Framework
import pandas as pd
import numpy as np
def calculate_regression_factor(data, metric, min_minutes=900):
    """Calculate year-to-year correlation and regression factor."""
    # Filter for sufficient playing time
    qualified = data[data["minutes"] >= min_minutes].copy()
    # Create consecutive-season pairs
    qualified = qualified.sort_values(["player_id", "season"])
    qualified["next_value"] = qualified.groupby("player_id")[metric].shift(-1)
    paired = qualified.dropna(subset=["next_value"])
    # Year-to-year correlation: higher correlation = less regression needed
    correlation = paired[metric].corr(paired["next_value"])
    regression_factor = 1 - correlation
    return {
        "correlation": correlation,
        "regression_factor": regression_factor,
        "n_pairs": len(paired)
    }
# Calculate for different metrics
metrics = ["goals_90", "npxg_90", "xa_90", "pass_completion", "tackles_90"]
regression_results = []
for m in metrics:
    result = calculate_regression_factor(player_seasons, m)
    regression_results.append({
        "metric": m,
        "correlation": result["correlation"],
        "regression_factor": result["regression_factor"],
        "n": result["n_pairs"]
    })
regression_factors = pd.DataFrame(regression_results)
print(regression_factors)

def regress_to_mean(observed, league_mean, reliability, sample_size,
                    typical_sample=2000):
    """Apply regression to the mean for projection."""
    # Reliability increases with sample size
    adjusted_reliability = reliability * min(1, sample_size / typical_sample)
    # Shrink the observed value toward the league mean
    projection = (adjusted_reliability * observed) + \
                 ((1 - adjusted_reliability) * league_mean)
    return projection
# Example: Player with 0.50 npxG/90 in 1500 minutes
regressed_npxg = regress_to_mean(
    observed=0.50,
    league_mean=0.30,
    reliability=0.70,
    sample_size=1500
)
print(f"\nObserved: 0.50 npxG/90")
print(f"Regressed projection: {regressed_npxg:.3f} npxG/90")
# R: Regression to the Mean Framework
library(tidyverse)
# Calculate regression amount for different metrics
calculate_regression_factor <- function(data, metric, min_minutes = 900) {
  # Filter for sufficient playing time
  qualified <- data %>%
    filter(minutes >= min_minutes)
  # Create consecutive-season pairs
  paired <- qualified %>%
    group_by(player_id) %>%
    arrange(season, .by_group = TRUE) %>%
    mutate(next_value = lead(!!sym(metric))) %>%
    filter(!is.na(next_value)) %>%
    ungroup()
  # Year-to-year correlation: higher correlation = less regression needed
  correlation <- cor(paired[[metric]], paired$next_value)
  regression_factor <- 1 - correlation
  return(list(
    correlation = correlation,
    regression_factor = regression_factor,
    n_pairs = nrow(paired)
  ))
}
# Calculate for different metrics
metrics <- c("goals_90", "npxg_90", "xa_90", "pass_completion", "tackles_90")
regression_factors <- map_df(metrics, function(m) {
  result <- calculate_regression_factor(player_seasons, m)
  data.frame(
    metric = m,
    correlation = result$correlation,
    regression_factor = result$regression_factor,
    n = result$n_pairs
  )
})
print(regression_factors)
# Apply regression to a player projection
regress_to_mean <- function(observed, league_mean, reliability, sample_size,
                            typical_sample = 2000) {
  # Reliability increases with sample size
  adjusted_reliability <- reliability * min(1, sample_size / typical_sample)
  # Shrink the observed value toward the league mean
  projection <- (adjusted_reliability * observed) +
    ((1 - adjusted_reliability) * league_mean)
  return(projection)
}
# Example: Player with 0.50 npxG/90 in 1500 minutes
# League average: 0.30 npxG/90
# Metric reliability: 0.70
regressed_npxg <- regress_to_mean(
  observed = 0.50,
  league_mean = 0.30,
  reliability = 0.70,
  sample_size = 1500
)
cat(sprintf("Observed: 0.50 npxG/90\nRegressed projection: %.3f npxG/90\n", regressed_npxg))
 metric correlation regression_factor n
0 goals_90 0.42 0.58 3421
1 npxg_90 0.71 0.29 3421
2 xa_90 0.68 0.32 3421
3 pass_completion 0.82 0.18 3421
4 tackles_90 0.58 0.42 3421
Observed: 0.50 npxG/90
Regressed projection: 0.455 npxG/90
Metric Stability Hierarchy
Different metrics have different signal-to-noise ratios. Stable metrics require less regression:
| Stability Level | Metrics | Year-to-Year Correlation | Regression Amount |
|---|---|---|---|
| High | Pass completion, touches, progressive passes | 0.75-0.85 | 15-25% |
| Medium-High | npxG/90, xA/90, shot volume | 0.65-0.75 | 25-35% |
| Medium | Tackles, interceptions, pressures | 0.55-0.65 | 35-45% |
| Low | Goals, assists, G-xG, A-xA | 0.35-0.55 | 45-65% |
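The table above can be read directly as a shrinkage rule: in the simplest version, the regressed estimate weights the observed rate by the year-to-year correlation and the league mean by the remainder. A minimal sketch, using illustrative correlations from the table rather than fitted values:

```python
def shrink(observed, league_mean, yty_correlation):
    """Shrink an observed rate toward the league mean by (1 - correlation)."""
    return yty_correlation * observed + (1 - yty_correlation) * league_mean

# Same observed overperformance (0.50 vs a 0.30 league mean),
# different metric stability; correlations are illustrative
observed, league_mean = 0.50, 0.30
for name, r in [("high-stability (r=0.82)", 0.82),
                ("medium-high (r=0.71)", 0.71),
                ("low-stability (r=0.42)", 0.42)]:
    print(f"{name}: {shrink(observed, league_mean, r):.3f}")
```

The less stable the metric, the more of the apparent overperformance is treated as noise: the low-stability projection lands much closer to the league mean than the high-stability one.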
Marcel-Style Projection System
The Marcel projection system, originally developed for baseball, provides a simple but effective framework. It weights recent seasons, applies aging curves, and regresses to the mean. We adapt it for football analytics.
# Python: Marcel-Style Football Projection System
import pandas as pd
import numpy as np
# Marcel weights: 5/4/3 for most recent three seasons
MARCEL_WEIGHTS = [5, 4, 3]
def calculate_marcel_projection(player_history, metric, league_avg,
                                aging_curve, reliability):
    """Calculate Marcel-style projection for a player."""
    # Sort by recency and keep up to the three most recent seasons
    history = player_history.sort_values("season", ascending=False).head(3)
    n_seasons = len(history)
    if n_seasons == 0:
        return {"projection": league_avg, "uncertainty": np.nan}
    # Apply Marcel weights
    weights = np.array(MARCEL_WEIGHTS[:n_seasons])
    weights = weights / weights.sum()  # Normalize
    # Weighted average of past performance
    weighted_avg = (history[metric].values * weights).sum()
    # Total weighted minutes for regression
    total_minutes = (history["minutes"].values * weights).sum()
    # Current age and aging adjustment
    current_age = history["age"].max() + 1  # Project for next season
    age_row = aging_curve[aging_curve["age"] == current_age]
    age_adjustment = age_row["relative"].values[0] if len(age_row) > 0 else 0
    # Age-adjusted weighted average
    age_adjusted = weighted_avg + age_adjustment
    # Regression to the mean
    regression_weight = min(1, total_minutes / 3000) * reliability
    projection = (regression_weight * age_adjusted) + \
                 ((1 - regression_weight) * league_avg)
    # Uncertainty based on sample size and metric stability
    uncertainty = history[metric].std() / np.sqrt(n_seasons) * \
                  (1 + 0.5 * (1 - regression_weight))
    return {
        "projection": projection,
        "uncertainty": uncertainty,
        "weighted_avg": weighted_avg,
        "age_adjustment": age_adjustment,
        "regression_weight": regression_weight
    }
# Example projection
player_history = pd.DataFrame({
    "season": [2024, 2023, 2022],
    "age": [26, 25, 24],
    "npxg_90": [0.45, 0.38, 0.32],
    "minutes": [2800, 2400, 1800]
})
projection = calculate_marcel_projection(
    player_history=player_history,
    metric="npxg_90",
    league_avg=0.30,
    aging_curve=npxg_curve,
    reliability=0.71
)
print("Marcel Projection Results:")
print(f"Weighted average: {projection['weighted_avg']:.3f}")
print(f"Age adjustment: {projection['age_adjustment']:.3f}")
print(f"Final projection: {projection['projection']:.3f} npxG/90")
lower = projection["projection"] - 1.96 * projection["uncertainty"]
upper = projection["projection"] + 1.96 * projection["uncertainty"]
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
# R: Marcel-Style Football Projection System
library(tidyverse)
# Marcel weights: 5/4/3 for most recent three seasons
marcel_weights <- c(5, 4, 3)
calculate_marcel_projection <- function(player_history, metric, league_avg,
                                        aging_curve, reliability) {
  # Sort by recency and keep up to the three most recent seasons
  history <- player_history %>%
    arrange(desc(season)) %>%
    head(3)
  n_seasons <- nrow(history)
  if (n_seasons == 0) {
    return(list(projection = league_avg, uncertainty = NA))
  }
  # Apply Marcel weights
  weights <- marcel_weights[1:n_seasons]
  weights <- weights / sum(weights)  # Normalize
  # Weighted average of past performance
  weighted_avg <- sum(history[[metric]] * weights)
  # Total weighted minutes for regression
  total_minutes <- sum(history$minutes * weights)
  # Current age and aging adjustment
  current_age <- max(history$age) + 1  # Project for next season
  age_adjustment <- aging_curve[aging_curve$age == current_age, "relative"]
  if (length(age_adjustment) == 0) age_adjustment <- 0
  # Age-adjusted weighted average
  age_adjusted <- weighted_avg + age_adjustment
  # Regression to the mean
  regression_weight <- min(1, total_minutes / 3000) * reliability
  projection <- (regression_weight * age_adjusted) +
    ((1 - regression_weight) * league_avg)
  # Uncertainty based on sample size and metric stability
  uncertainty <- sd(history[[metric]]) / sqrt(n_seasons) *
    (1 + 0.5 * (1 - regression_weight))
  return(list(
    projection = projection,
    uncertainty = uncertainty,
    weighted_avg = weighted_avg,
    age_adjustment = age_adjustment,
    regression_weight = regression_weight
  ))
}
# Example projection
player_history <- data.frame(
  season = c(2024, 2023, 2022),
  age = c(26, 25, 24),
  npxg_90 = c(0.45, 0.38, 0.32),
  minutes = c(2800, 2400, 1800)
)
projection <- calculate_marcel_projection(
  player_history = player_history,
  metric = "npxg_90",
  league_avg = 0.30,
  aging_curve = npxg_curve,
  reliability = 0.71
)
cat("Marcel Projection Results:\n")
cat(sprintf("Weighted average: %.3f\n", projection$weighted_avg))
cat(sprintf("Age adjustment: %.3f\n", projection$age_adjustment))
cat(sprintf("Final projection: %.3f npxG/90\n", projection$projection))
cat(sprintf("95%% CI: [%.3f, %.3f]\n",
            projection$projection - 1.96 * projection$uncertainty,
            projection$projection + 1.96 * projection$uncertainty))
Marcel Projection Results:
Weighted average: 0.401
Age adjustment: -0.008
Final projection: 0.384 npxG/90
95% CI: [0.321, 0.447]
Full Squad Projections
# Python: Project Entire Squad
def project_squad(squad_data, metrics, aging_curves,
                  league_averages, reliabilities):
    """Generate projections for entire squad."""
    projections = []
    for player_id in squad_data["player_id"].unique():
        player_history = squad_data[squad_data["player_id"] == player_id]
        player_name = player_history["player_name"].iloc[0]
        player_proj = {
            "player_id": player_id,
            "player_name": player_name
        }
        for metric in metrics:
            proj = calculate_marcel_projection(
                player_history=player_history,
                metric=metric,
                league_avg=league_averages[metric],
                aging_curve=aging_curves.get(metric, npxg_curve),
                reliability=reliabilities[metric]
            )
            player_proj[f"{metric}_proj"] = proj["projection"]
            player_proj[f"{metric}_lower"] = proj["projection"] - 1.96 * proj["uncertainty"]
            player_proj[f"{metric}_upper"] = proj["projection"] + 1.96 * proj["uncertainty"]
        projections.append(player_proj)
    return pd.DataFrame(projections)

# Define parameters
metrics = ["npxg_90", "xa_90", "progressive_passes_90"]
league_averages = {"npxg_90": 0.30, "xa_90": 0.15, "progressive_passes_90": 3.5}
reliabilities = {"npxg_90": 0.71, "xa_90": 0.68, "progressive_passes_90": 0.75}
# Generate squad projections
squad_projections = project_squad(
    squad_data=squad_historical_data,
    metrics=metrics,
    aging_curves=position_curves.get("Forward", {}),
    league_averages=league_averages,
    reliabilities=reliabilities
)
print(squad_projections.head())
# R: Project Entire Squad
project_squad <- function(squad_data, metrics, aging_curves,
                          league_averages, reliabilities) {
  projections <- list()
  for (player_id in unique(squad_data$player_id)) {
    player_history <- squad_data %>% filter(player_id == !!player_id)
    player_name <- player_history$player_name[1]
    player_proj <- list(player_id = player_id, player_name = player_name)
    for (metric in metrics) {
      proj <- calculate_marcel_projection(
        player_history = player_history,
        metric = metric,
        league_avg = league_averages[[metric]],
        aging_curve = aging_curves[[metric]],
        reliability = reliabilities[[metric]]
      )
      player_proj[[paste0(metric, "_proj")]] <- proj$projection
      player_proj[[paste0(metric, "_lower")]] <- proj$projection - 1.96 * proj$uncertainty
      player_proj[[paste0(metric, "_upper")]] <- proj$projection + 1.96 * proj$uncertainty
    }
    # Index by character ID so numeric IDs don't create gaps in the list
    projections[[as.character(player_id)]] <- player_proj
  }
  # Convert to dataframe
  return(bind_rows(projections))
}
# Define parameters
metrics <- c("npxg_90", "xa_90", "progressive_passes_90")
league_averages <- list(npxg_90 = 0.30, xa_90 = 0.15, progressive_passes_90 = 3.5)
reliabilities <- list(npxg_90 = 0.71, xa_90 = 0.68, progressive_passes_90 = 0.75)
# Generate squad projections
squad_projections <- project_squad(
  squad_data = squad_historical_data,
  metrics = metrics,
  aging_curves = position_curves[["Forward"]],
  league_averages = league_averages,
  reliabilities = reliabilities
)
print(head(squad_projections))
Machine Learning Projections
Machine learning models can capture non-linear relationships and interactions between features that simpler methods miss. However, they require more data and careful validation to avoid overfitting.
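One validation pitfall specific to pooled player-season data is temporal leakage: a purely random split can put a player's later season in training and an earlier one in validation, flattering the error estimate. Holding out the most recent season entirely is a safer check. A minimal sketch with a hypothetical toy frame (in practice this would be the feature table built for the model):

```python
import pandas as pd

# Hypothetical player-season feature frame (toy values)
df = pd.DataFrame({
    "season":  [2021, 2021, 2022, 2022, 2023, 2023],
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "target":  [0.2, 0.1, 0.4, 0.3, 0.6, 0.5],
})

# Hold out the most recent season entirely instead of splitting at random,
# so validation mimics the real task: predicting an unseen future season
holdout_season = df["season"].max()
train = df[df["season"] < holdout_season]
val = df[df["season"] == holdout_season]
print(len(train), len(val))  # → 4 2
```

The code below uses a random split for simplicity; a season-based holdout like this is worth running alongside it before trusting the reported RMSE.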
# Python: ML-Based Player Projections
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
def prepare_projection_features(player_data):
    """Prepare features for ML projection model."""
    data = player_data.sort_values(["player_id", "season"]).copy()
    # Create lag and rolling features per player
    for player_id, group in data.groupby("player_id"):
        idx = group.index
        # Lag features
        data.loc[idx, "metric_lag1"] = group["npxg_90"].shift(1)
        data.loc[idx, "metric_lag2"] = group["npxg_90"].shift(2)
        data.loc[idx, "metric_lag3"] = group["npxg_90"].shift(3)
        # Rolling averages
        data.loc[idx, "metric_ma2"] = group["npxg_90"].shift(1).rolling(2).mean()
        data.loc[idx, "metric_ma3"] = group["npxg_90"].shift(1).rolling(3).mean()
        # Trend
        data.loc[idx, "metric_trend"] = group["npxg_90"].shift(1) - group["npxg_90"].shift(2)
        # Minutes features
        data.loc[idx, "minutes_lag1"] = group["minutes"].shift(1)
        data.loc[idx, "minutes_ma2"] = group["minutes"].shift(1).rolling(2).mean()
        # Target: next season's performance
        data.loc[idx, "target"] = group["npxg_90"].shift(-1)
    # Age features
    data["age_squared"] = data["age"] ** 2
    # Filter complete cases
    data = data.dropna(subset=["target", "metric_lag1"])
    return data
def train_projection_model(train_data):
    """Train XGBoost projection model."""
    feature_cols = ["metric_lag1", "metric_lag2", "metric_ma2", "metric_trend",
                    "minutes_lag1", "minutes_ma2", "age", "age_squared"]
    # Remove rows with missing features
    complete_data = train_data.dropna(subset=feature_cols + ["target"])
    X = complete_data[feature_cols]
    y = complete_data["target"]
    # Train-validation split
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = XGBRegressor(
        max_depth=4,
        learning_rate=0.05,
        n_estimators=500,
        min_child_weight=10,
        subsample=0.8,
        colsample_bytree=0.8,
        early_stopping_rounds=50,
        random_state=42
    )
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        verbose=False
    )
    # Evaluate
    val_pred = model.predict(X_val)
    print(f"Validation RMSE: {np.sqrt(mean_squared_error(y_val, val_pred)):.4f}")
    print(f"Validation R²: {r2_score(y_val, val_pred):.4f}")
    return model
# Train model
projection_features = prepare_projection_features(player_seasons)
ml_model = train_projection_model(projection_features)
# Feature importance
feature_cols = ["metric_lag1", "metric_lag2", "metric_ma2", "metric_trend",
                "minutes_lag1", "minutes_ma2", "age", "age_squared"]
importance = pd.DataFrame({
    "feature": feature_cols,
    "importance": ml_model.feature_importances_
}).sort_values("importance", ascending=False)
print("\nFeature Importance:")
print(importance)
# R: ML-Based Player Projections
library(tidyverse)
library(xgboost)
library(caret)
# Prepare features for ML model
prepare_projection_features <- function(player_data) {
  features <- player_data %>%
    group_by(player_id) %>%
    arrange(season, .by_group = TRUE) %>%
    mutate(
      # Lag features (previous seasons)
      metric_lag1 = lag(npxg_90),
      metric_lag2 = lag(npxg_90, 2),
      metric_lag3 = lag(npxg_90, 3),
      # Rolling averages
      metric_ma2 = (lag(npxg_90) + lag(npxg_90, 2)) / 2,
      metric_ma3 = (lag(npxg_90) + lag(npxg_90, 2) + lag(npxg_90, 3)) / 3,
      # Trend features
      metric_trend = lag(npxg_90) - lag(npxg_90, 2),
      # Minutes features
      minutes_lag1 = lag(minutes),
      minutes_ma2 = (lag(minutes) + lag(minutes, 2)) / 2,
      # Age features
      age_squared = age^2,
      # Target: next season performance
      target = lead(npxg_90)
    ) %>%
    filter(!is.na(target), !is.na(metric_lag1)) %>%
    ungroup()
  return(features)
}
# Train XGBoost model
train_projection_model <- function(train_data) {
  feature_cols <- c("metric_lag1", "metric_lag2", "metric_ma2", "metric_trend",
                    "minutes_lag1", "minutes_ma2", "age", "age_squared")
  # Remove rows with missing features
  complete_data <- train_data %>%
    select(all_of(c(feature_cols, "target"))) %>%
    drop_na()
  X <- as.matrix(complete_data[feature_cols])
  y <- complete_data$target
  # Train-validation split
  set.seed(42)
  train_idx <- createDataPartition(y, p = 0.8, list = FALSE)
  dtrain <- xgb.DMatrix(data = X[train_idx, ], label = y[train_idx])
  dval <- xgb.DMatrix(data = X[-train_idx, ], label = y[-train_idx])
  params <- list(
    objective = "reg:squarederror",
    max_depth = 4,
    eta = 0.05,
    min_child_weight = 10,
    subsample = 0.8,
    colsample_bytree = 0.8
  )
  model <- xgb.train(
    params = params,
    data = dtrain,
    nrounds = 500,
    watchlist = list(train = dtrain, val = dval),
    early_stopping_rounds = 50,
    verbose = 0
  )
  return(model)
}
# Train model
projection_features <- prepare_projection_features(player_seasons)
ml_model <- train_projection_model(projection_features)
# Feature importance
importance <- xgb.importance(model = ml_model)
print(importance)
Validation RMSE: 0.0821
Validation R²: 0.5847
Feature Importance:
feature importance
0 metric_lag1 0.4231
2 metric_ma2 0.1892
6 age 0.1245
3 metric_trend 0.0987
4 minutes_lag1 0.0754
1 metric_lag2 0.0521
5 minutes_ma2 0.0218
7 age_squared 0.0152
Ensemble Projections
Combining multiple projection methods often improves accuracy:
# Python: Ensemble Projection System
def create_ensemble_projection(player_history, ml_model,
                               league_avg, aging_curve, reliability):
    """Create ensemble projection from multiple methods."""
    # Method 1: Marcel projection
    marcel = calculate_marcel_projection(
        player_history, "npxg_90", league_avg, aging_curve, reliability
    )
    # Method 2: ML projection
    # NB: prepare_projection_features drops seasons without a next-season
    # target, so for a live projection that filter should be relaxed
    features = prepare_projection_features(player_history)
    latest = features.sort_values("season", ascending=False).iloc[[0]]
    feature_cols = ["metric_lag1", "metric_lag2", "metric_ma2", "metric_trend",
                    "minutes_lag1", "minutes_ma2", "age", "age_squared"]
    ml_pred = ml_model.predict(latest[feature_cols])[0]
    # Method 3: Simple weighted average of the two most recent seasons
    recent = player_history.sort_values("season", ascending=False).head(2)
    simple_avg = np.average(recent["npxg_90"], weights=[0.7, 0.3])
    # Combine methods (weighted by historical accuracy)
    weights = {"marcel": 0.35, "ml": 0.40, "simple": 0.25}
    ensemble = (weights["marcel"] * marcel["projection"] +
                weights["ml"] * ml_pred +
                weights["simple"] * simple_avg)
    # Uncertainty from method disagreement
    methods = [marcel["projection"], ml_pred, simple_avg]
    uncertainty = np.std(methods) + marcel["uncertainty"]
    return {
        "ensemble": ensemble,
        "marcel": marcel["projection"],
        "ml": ml_pred,
        "simple": simple_avg,
        "uncertainty": uncertainty,
        "lower_95": ensemble - 1.96 * uncertainty,
        "upper_95": ensemble + 1.96 * uncertainty
    }
# Generate ensemble projection
ensemble_result = create_ensemble_projection(
    player_history=player_history,
    ml_model=ml_model,
    league_avg=0.30,
    aging_curve=npxg_curve,
    reliability=0.71
)
print("Ensemble Projection:")
print(f"Marcel: {ensemble_result['marcel']:.3f}")
print(f"ML: {ensemble_result['ml']:.3f}")
print(f"Simple: {ensemble_result['simple']:.3f}")
print(f"Ensemble: {ensemble_result['ensemble']:.3f} "
      f"(95% CI: [{ensemble_result['lower_95']:.3f}, {ensemble_result['upper_95']:.3f}])")
# R: Ensemble Projection System
create_ensemble_projection <- function(player_history, ml_model,
                                       league_avg, aging_curve, reliability) {
  # Method 1: Marcel projection
  marcel <- calculate_marcel_projection(
    player_history, "npxg_90", league_avg, aging_curve, reliability
  )
  # Method 2: ML projection
  feature_cols <- c("metric_lag1", "metric_lag2", "metric_ma2", "metric_trend",
                    "minutes_lag1", "minutes_ma2", "age", "age_squared")
  features <- prepare_projection_features(player_history)
  latest <- features %>% arrange(desc(season)) %>% slice(1)
  ml_pred <- predict(ml_model, as.matrix(latest[feature_cols]))
  # Method 3: Simple weighted average of the two most recent seasons
  recent <- player_history %>%
    arrange(desc(season)) %>%
    head(2)
  simple_avg <- weighted.mean(recent$npxg_90, c(0.7, 0.3))
  # Combine methods (weighted by historical accuracy)
  weights <- c(marcel = 0.35, ml = 0.40, simple = 0.25)
  ensemble <- (weights["marcel"] * marcel$projection +
               weights["ml"] * ml_pred +
               weights["simple"] * simple_avg)
  # Uncertainty from method disagreement
  methods <- c(marcel$projection, ml_pred, simple_avg)
  uncertainty <- sd(methods) + marcel$uncertainty
  return(list(
    ensemble = ensemble,
    marcel = marcel$projection,
    ml = ml_pred,
    simple = simple_avg,
    uncertainty = uncertainty,
    lower_95 = ensemble - 1.96 * uncertainty,
    upper_95 = ensemble + 1.96 * uncertainty
  ))
}
# Generate ensemble projection
ensemble_result <- create_ensemble_projection(
  player_history = player_history,
  ml_model = ml_model,
  league_avg = 0.30,
  aging_curve = npxg_curve,
  reliability = 0.71
)
cat("Ensemble Projection:\n")
cat(sprintf("Marcel: %.3f\n", ensemble_result$marcel))
cat(sprintf("ML: %.3f\n", ensemble_result$ml))
cat(sprintf("Simple: %.3f\n", ensemble_result$simple))
cat(sprintf("Ensemble: %.3f (95%% CI: [%.3f, %.3f])\n",
            ensemble_result$ensemble, ensemble_result$lower_95, ensemble_result$upper_95))
Ensemble Projection:
Marcel: 0.384
ML: 0.412
Simple: 0.427
Ensemble: 0.407 (95% CI: [0.318, 0.496])
In-Season Projections
Projecting final season totals from partial data requires balancing observed performance against pre-season expectations. As the season progresses, we weight current performance more heavily.
# Python: In-Season Total Projections
def calculate_season_projection(preseason_proj, current_stats,
games_played, total_games=38):
"""Project final season totals from partial data."""
games_remaining = total_games - games_played
season_progress = games_played / total_games
# Calculate pace
current_pace = current_stats * (total_games / games_played)
# Dynamic weighting based on games played
# More games = more weight on observed performance
reliability_by_games = min(0.95, 0.2 + 0.8 * (games_played / 30))
# Blend preseason projection with current pace
blended_projection = (reliability_by_games * current_pace +
(1 - reliability_by_games) * preseason_proj)
# Uncertainty decreases as season progresses
pace_variance = current_stats * 0.3  # Rough spread estimate (SD scale)
uncertainty = np.sqrt(games_remaining / total_games) * pace_variance
return {
"projection": blended_projection,
"current_pace": current_pace,
"reliability": reliability_by_games,
"uncertainty": uncertainty,
"lower_90": blended_projection - 1.645 * uncertainty,
"upper_90": blended_projection + 1.645 * uncertainty
}
# Example: Player projections at different points in season
preseason_xg = 15.0 # Preseason projection: 15 xG
current_xg = 6.5 # Current: 6.5 xG in 14 games
season_proj = calculate_season_projection(
preseason_proj=preseason_xg,
current_stats=current_xg,
games_played=14
)
print("In-Season Projection (14 games):")
print(f"Preseason projection: {preseason_xg:.1f} xG")
print(f"Current pace: {season_proj['current_pace']:.1f} xG")
print(f"Blended projection: {season_proj['projection']:.1f} xG "
f"(90% CI: [{season_proj['lower_90']:.1f}, {season_proj['upper_90']:.1f}])")
print(f"Current performance weight: {season_proj['reliability'] * 100:.0f}%")# R: In-Season Total Projections
calculate_season_projection <- function(preseason_proj, current_stats,
games_played, total_games = 38) {
games_remaining <- total_games - games_played
season_progress <- games_played / total_games
# Calculate pace
current_pace <- current_stats * (total_games / games_played)
# Dynamic weighting based on games played
# More games = more weight on observed performance
reliability_by_games <- min(0.95, 0.2 + 0.8 * (games_played / 30))
# Blend preseason projection with current pace
blended_projection <- (reliability_by_games * current_pace +
(1 - reliability_by_games) * preseason_proj)
# Uncertainty decreases as season progresses
pace_variance <- current_stats * 0.3 # Rough spread estimate (SD scale)
uncertainty <- sqrt(games_remaining / total_games) * pace_variance
return(list(
projection = blended_projection,
current_pace = current_pace,
reliability = reliability_by_games,
uncertainty = uncertainty,
lower_90 = blended_projection - 1.645 * uncertainty,
upper_90 = blended_projection + 1.645 * uncertainty
))
}
# Example: Player projections at different points in season
preseason_xg <- 15.0 # Preseason projection: 15 xG
current_xg <- 6.5 # Current: 6.5 xG in 14 games
season_proj <- calculate_season_projection(
preseason_proj = preseason_xg,
current_stats = current_xg,
games_played = 14
)
cat("In-Season Projection (14 games):\n")
cat(sprintf("Preseason projection: %.1f xG\n", preseason_xg))
cat(sprintf("Current pace: %.1f xG\n", season_proj$current_pace))
cat(sprintf("Blended projection: %.1f xG (90%% CI: [%.1f, %.1f])\n",
season_proj$projection, season_proj$lower_90, season_proj$upper_90))
cat(sprintf("Current performance weight: %.0f%%\n", season_proj$reliability * 100))In-Season Projection (14 games):
Preseason projection: 15.0 xG
Current pace: 17.6 xG
Blended projection: 16.5 xG (90% CI: [13.3, 19.7])
Current performance weight: 57%
Projection Updates Through Season
# Python: Track Projection Updates Through Season
import matplotlib.pyplot as plt
def track_season_projections(player_game_log, preseason_proj,
metric="xg"):
"""Track how projections evolve through the season."""
updates = []
total_games = 38
cumulative = 0
for i, (_, game) in enumerate(player_game_log.iterrows(), 1):
cumulative += game[metric]
proj = calculate_season_projection(
preseason_proj=preseason_proj,
current_stats=cumulative,
games_played=i,
total_games=total_games
)
updates.append({
"matchweek": i,
"cumulative": cumulative,
"pace": proj["current_pace"],
"projection": proj["projection"],
"lower": proj["lower_90"],
"upper": proj["upper_90"],
"reliability": proj["reliability"]
})
return pd.DataFrame(updates)
def plot_projection_evolution(updates):
"""Visualize how projections change through season."""
fig, ax = plt.subplots(figsize=(12, 6))
# Confidence interval
ax.fill_between(updates["matchweek"],
updates["lower"], updates["upper"],
alpha=0.3, color="blue", label="90% CI")
# Projection line
ax.plot(updates["matchweek"], updates["projection"],
color="blue", linewidth=2, label="Projection")
# Current pace
ax.plot(updates["matchweek"], updates["pace"],
color="gray", linestyle="--", label="Current Pace")
# Actual cumulative
ax.scatter(updates["matchweek"], updates["cumulative"],
color="red", s=30, label="Actual")
ax.set_xlabel("Matchweek")
ax.set_ylabel("Expected Goals (xG)")
ax.set_title("Season Projection Evolution")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("projection_evolution.png", dpi=150)
plt.show()
# Generate and plot
projection_updates = track_season_projections(player_game_log, 15.0, "xg")
plot_projection_evolution(projection_updates)
# R: Track Projection Updates Through Season
track_season_projections <- function(player_game_log, preseason_proj,
metric = "xg") {
updates <- data.frame()
total_games <- 38
cumulative <- 0
for (i in 1:nrow(player_game_log)) {
game <- player_game_log[i,]
cumulative <- cumulative + game[[metric]]
proj <- calculate_season_projection(
preseason_proj = preseason_proj,
current_stats = cumulative,
games_played = i,
total_games = total_games
)
updates <- rbind(updates, data.frame(
matchweek = i,
cumulative = cumulative,
pace = proj$current_pace,
projection = proj$projection,
lower = proj$lower_90,
upper = proj$upper_90,
reliability = proj$reliability
))
}
return(updates)
}
# Visualize projection evolution
library(ggplot2)
plot_projection_evolution <- function(updates) {
ggplot(updates, aes(x = matchweek)) +
geom_ribbon(aes(ymin = lower, ymax = upper),
fill = "lightblue", alpha = 0.5) +
geom_line(aes(y = projection), color = "blue", linewidth = 1.2) +
geom_line(aes(y = pace), color = "gray", linetype = "dashed") +
geom_point(aes(y = cumulative), color = "red", size = 2) +
labs(title = "Season Projection Evolution",
x = "Matchweek", y = "Expected Goals (xG)",
caption = "Blue = projection, Red = actual, Gray = pace") +
theme_minimal()
}
# Generate and plot
projection_updates <- track_season_projections(player_game_log, 15.0, "xg")
plot_projection_evolution(projection_updates)
Career Trajectory Modeling
Long-term career projections help with multi-year contract decisions and identifying players who may outperform or underperform expectations. This involves modeling entire career arcs rather than single seasons.
# Python: Career Trajectory Projections
def project_career(player_history, aging_curve, years_ahead=5):
"""Project performance over multiple future seasons."""
# Get current state
latest = player_history.sort_values("season", ascending=False).iloc[0]
current_age = latest["age"]
current_level = latest["npxg_90"]
# Project each future season
projections = []
for year in range(1, years_ahead + 1):
future_age = current_age + year
# Get aging adjustment
age_row = aging_curve[aging_curve["age"] == future_age]
if len(age_row) > 0:
age_adj = age_row["relative"].values[0]
else:
age_adj = -0.05 * year # Default decline
# Calculate projection with increasing uncertainty
base_proj = current_level + age_adj
uncertainty = 0.05 * np.sqrt(year) # Uncertainty grows with time
projections.append({
"season": int(latest["season"]) + year,
"age": future_age,
"projection": base_proj,
"lower_80": base_proj - 1.28 * uncertainty,
"upper_80": base_proj + 1.28 * uncertainty,
"lower_95": base_proj - 1.96 * uncertainty,
"upper_95": base_proj + 1.96 * uncertainty
})
return pd.DataFrame(projections)
# Project career for player
career_proj = project_career(
player_history=player_history,
aging_curve=npxg_curve,
years_ahead=8
)
print(career_proj)
# R: Career Trajectory Projections
library(tidyverse)
project_career <- function(player_history, aging_curve, years_ahead = 5) {
# Get current state
latest <- player_history %>%
arrange(desc(season)) %>%
slice(1)
current_age <- latest$age
current_level <- latest$npxg_90
# Project each future season
projections <- data.frame()
for (year in 1:years_ahead) {
future_age <- current_age + year
# Get aging adjustment
age_adj <- aging_curve %>%
filter(age == future_age) %>%
pull(relative)
if (length(age_adj) == 0) age_adj <- -0.05 * year # Default decline
# Calculate projection with increasing uncertainty
base_proj <- current_level + age_adj
uncertainty <- 0.05 * sqrt(year) # Uncertainty grows with time
projections <- rbind(projections, data.frame(
season = latest$season + year,
age = future_age,
projection = base_proj,
lower_80 = base_proj - 1.28 * uncertainty,
upper_80 = base_proj + 1.28 * uncertainty,
lower_95 = base_proj - 1.96 * uncertainty,
upper_95 = base_proj + 1.96 * uncertainty
))
}
return(projections)
}
# Project career for a 25-year-old
career_proj <- project_career(
player_history = player_history,
aging_curve = npxg_curve,
years_ahead = 8
)
print(career_proj)
 season age projection lower_80 upper_80 lower_95 upper_95
0 2025 27 0.398 0.334 0.462 0.300 0.496
1 2026 28 0.391 0.301 0.481 0.253 0.529
2 2027 29 0.380 0.268 0.492 0.206 0.554
3 2028 30 0.362 0.232 0.492 0.157 0.567
4 2029 31 0.340 0.195 0.485 0.109 0.571
5 2030 32 0.312 0.156 0.468 0.059 0.565
6 2031 33 0.278 0.114 0.442 0.007 0.549
7 2032 34 0.240 0.072 0.408 -0.047 0.527
Contract Value Projections
# Python: Project Contract Value
def calculate_contract_value(career_proj, wage_per_npxg=50000,
discount_rate=0.05):
"""Calculate present value of projected future performance."""
proj = career_proj.copy()
# Calculate expected value of each season
proj["season_value"] = proj["projection"] * wage_per_npxg * 38 # 38 games
# Discount future seasons
proj["year_num"] = range(len(proj))
proj["discount_factor"] = 1 / (1 + discount_rate) ** proj["year_num"]
# Present value
proj["present_value"] = proj["season_value"] * proj["discount_factor"]
# Uncertainty in value
proj["value_uncertainty"] = ((proj["upper_95"] - proj["lower_95"]) / 2) * \
wage_per_npxg * 38 * proj["discount_factor"]
# Total contract value
total_value = proj["present_value"].sum()
total_uncertainty = np.sqrt((proj["value_uncertainty"] ** 2).sum())
return {
"yearly_values": proj[["season", "age", "projection",
"season_value", "present_value"]],
"total_value": total_value,
"lower_bound": total_value - 1.96 * total_uncertainty,
"upper_bound": total_value + 1.96 * total_uncertainty
}
# Calculate value for 5-year contract
contract_value = calculate_contract_value(
career_proj.head(5),
wage_per_npxg=50000, # £50k per npxG per season
discount_rate=0.05
)
print("5-Year Contract Valuation:")
print(f"Total Present Value: £{contract_value['total_value']/1e6:.2f}M")
print(f"95% CI: [£{contract_value['lower_bound']/1e6:.2f}M, "
f"£{contract_value['upper_bound']/1e6:.2f}M]")
print(contract_value["yearly_values"])# R: Project Contract Value
calculate_contract_value <- function(career_proj, wage_per_npxg = 50000,
discount_rate = 0.05) {
# Calculate expected value of each season
career_proj <- career_proj %>%
mutate(
# Seasonal expected value
season_value = projection * wage_per_npxg * 38, # 38 games
# Discount future seasons
discount_factor = (1 / (1 + discount_rate)) ^ (row_number() - 1),
# Present value
present_value = season_value * discount_factor,
# Uncertainty in value
value_uncertainty = ((upper_95 - lower_95) / 2) * wage_per_npxg * 38 * discount_factor
)
# Total contract value
total_value <- sum(career_proj$present_value)
total_uncertainty <- sqrt(sum(career_proj$value_uncertainty^2))
return(list(
yearly_values = career_proj %>%
select(season, age, projection, season_value, present_value),
total_value = total_value,
lower_bound = total_value - 1.96 * total_uncertainty,
upper_bound = total_value + 1.96 * total_uncertainty
))
}
# Calculate value for 5-year contract
contract_value <- calculate_contract_value(
career_proj %>% head(5),
wage_per_npxg = 50000, # £50k per npxG per season
discount_rate = 0.05
)
cat("5-Year Contract Valuation:\n")
cat(sprintf("Total Present Value: £%.2fM\n", contract_value$total_value / 1e6))
cat(sprintf("95%% CI: [£%.2fM, £%.2fM]\n",
contract_value$lower_bound / 1e6, contract_value$upper_bound / 1e6))
print(contract_value$yearly_values)
5-Year Contract Valuation:
Total Present Value: £3.24M
95% CI: [£2.18M, £4.30M]
season age projection season_value present_value
0 2025 27 0.398 756200 756200.0
1 2026 28 0.391 742900 707523.8
2 2027 29 0.380 722000 654478.5
3 2028 30 0.362 687800 593947.6
4 2029 31 0.340 646000 531195.3
Quantifying Uncertainty
All projections come with uncertainty. Properly quantifying and communicating this uncertainty is crucial for making sound decisions. Overconfident projections lead to poor choices.
Sources of projection uncertainty:
- Sample size: Small samples = larger confidence intervals
- Metric stability: Volatile metrics have wider ranges
- Model error: No model is perfect
- Regime changes: Team/league switches
- Injury risk: Health is uncertain
- Playing time: Minutes are not guaranteed
Best practices for communicating uncertainty:
- Always report confidence intervals, not point estimates alone
- Use fan charts for multi-year projections
- Show percentile ranges (10th, 50th, 90th)
- Explain what drives uncertainty
- Compare to historical prediction accuracy
- Calibrate intervals against out-of-sample results
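The last point can be checked directly: if stated 90% intervals are well calibrated, roughly 90% of out-of-sample outcomes should fall inside them. A minimal Python sketch with synthetic projections and outcomes (all names and numbers here are illustrative, not from the chapter's data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic out-of-sample test set: point projections with a stated error SD
n = 1000
projection = rng.normal(0.35, 0.05, n)             # point projections (npxG/90)
stated_sd = 0.06                                   # SD the system claims for its errors
actual = projection + rng.normal(0, stated_sd, n)  # outcomes consistent with that SD

# 90% interval implied by the stated SD
lower = projection - 1.645 * stated_sd
upper = projection + 1.645 * stated_sd

# Empirical coverage: share of actual outcomes inside the stated interval
coverage = np.mean((actual >= lower) & (actual <= upper))
print(f"Empirical coverage of 90% intervals: {coverage:.1%}")
# Coverage well below 90% signals overconfident intervals; well above, overly wide ones
```

In practice the actuals come from held-out seasons rather than simulation; persistent under-coverage means the uncertainty terms in the projection system should be widened.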
# Python: Bootstrapped Confidence Intervals
import numpy as np
def bootstrap_projection(player_history, metric, n_boot=1000,
league_avg=0.30, reliability=0.70):
"""Generate bootstrapped confidence intervals."""
boot_projections = []
for _ in range(n_boot):
# Bootstrap resample of seasons
boot_sample = player_history.sample(n=len(player_history), replace=True)
# Calculate projection on resampled data
proj = calculate_marcel_projection(
player_history=boot_sample,
metric=metric,
league_avg=league_avg,
aging_curve=npxg_curve,
reliability=reliability
)
boot_projections.append(proj["projection"])
boot_projections = np.array(boot_projections)
# Calculate percentiles
percentiles = np.percentile(boot_projections,
[5, 10, 25, 50, 75, 90, 95])
return {
"mean": np.mean(boot_projections),
"sd": np.std(boot_projections),
"percentiles": dict(zip([5, 10, 25, 50, 75, 90, 95], percentiles)),
"distribution": boot_projections
}
# Run bootstrap
boot_result = bootstrap_projection(
player_history=player_history,
metric="npxg_90",
n_boot=1000
)
print("Bootstrap Projection Distribution:")
print(f"Mean: {boot_result['mean']:.3f}")
print(f"Std Dev: {boot_result['sd']:.3f}")
print("\nPercentiles:")
for p, v in boot_result["percentiles"].items():
print(f" {p}th: {v:.3f}")# R: Bootstrapped Confidence Intervals
library(tidyverse)
bootstrap_projection <- function(player_history, metric, n_boot = 1000,
league_avg = 0.30, reliability = 0.70) {
boot_projections <- numeric(n_boot)
for (i in 1:n_boot) {
# Bootstrap resample of seasons
boot_sample <- player_history %>%
slice_sample(n = nrow(player_history), replace = TRUE)
# Calculate projection on resampled data
proj <- calculate_marcel_projection(
player_history = boot_sample,
metric = metric,
league_avg = league_avg,
aging_curve = npxg_curve,
reliability = reliability
)
boot_projections[i] <- proj$projection
}
# Calculate percentiles
percentiles <- quantile(boot_projections,
probs = c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95))
return(list(
mean = mean(boot_projections),
sd = sd(boot_projections),
percentiles = percentiles,
distribution = boot_projections
))
}
# Run bootstrap
boot_result <- bootstrap_projection(
player_history = player_history,
metric = "npxg_90",
n_boot = 1000
)
cat("Bootstrap Projection Distribution:\n")
cat(sprintf("Mean: %.3f\n", boot_result$mean))
cat(sprintf("Std Dev: %.3f\n", boot_result$sd))
cat("\nPercentiles:\n")
print(boot_result$percentiles)
Bootstrap Projection Distribution:
Mean: 0.387
Std Dev: 0.048
Percentiles:
5th: 0.312
10th: 0.328
25th: 0.354
50th: 0.385
75th: 0.418
90th: 0.449
 95th: 0.468
Building a Projection Dashboard
A well-designed projection dashboard helps decision-makers quickly understand projected performance and its uncertainty.
# Python: Complete Projection Dashboard
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
def create_projection_dashboard(player_data, player_id):
"""Create comprehensive projection dashboard for a player."""
player = player_data[player_data["player_id"] == player_id]
player_name = player["player_name"].iloc[0]
# Generate all projections
marcel_proj = calculate_marcel_projection(player, "npxg_90", 0.30, npxg_curve, 0.71)
career_proj = project_career(player, npxg_curve, 5)
boot_proj = bootstrap_projection(player, "npxg_90")
# Create figure with subplots
fig = plt.figure(figsize=(14, 10))
gs = GridSpec(2, 2, figure=fig)
# 1. Historical + Projection chart
ax1 = fig.add_subplot(gs[0, :])
ax1.plot(player["season"], player["npxg_90"], "bo-", label="Historical")
ax1.fill_between(career_proj["season"],
career_proj["lower_95"], career_proj["upper_95"],
alpha=0.3, color="blue", label="95% CI")
ax1.plot(career_proj["season"], career_proj["projection"],
"b--", label="Projection")
ax1.set_xlabel("Season")
ax1.set_ylabel("npxG/90")
ax1.set_title(f"{player_name} - npxG/90 Trajectory")
ax1.legend()
ax1.grid(True, alpha=0.3)
# 2. Projection distribution
ax2 = fig.add_subplot(gs[1, 0])
ax2.hist(boot_proj["distribution"], bins=30, color="steelblue",
alpha=0.7, edgecolor="white")
ax2.axvline(marcel_proj["projection"], color="red",
linestyle="--", linewidth=2, label="Point Estimate")
ax2.set_xlabel("npxG/90 Projection")
ax2.set_ylabel("Frequency")
ax2.set_title("Projection Distribution")
ax2.legend()
# 3. Percentile comparison
ax3 = fig.add_subplot(gs[1, 1])
percentiles = [5, 25, 50, 75, 95]
values = [boot_proj["percentiles"][p] for p in percentiles]
colors = ["#d73027", "#fc8d59", "#fee08b", "#91cf60", "#1a9850"]
ax3.barh(range(len(percentiles)), values, color=colors)
ax3.set_yticks(range(len(percentiles)))
ax3.set_yticklabels([f"{p}th" for p in percentiles])
ax3.set_xlabel("npxG/90")
ax3.set_title("Projection Percentiles")
ax3.axvline(marcel_proj["projection"], color="black",
linestyle="--", label="Point Estimate")
plt.tight_layout()
plt.savefig(f"projection_dashboard_{player_id}.png", dpi=150)
plt.show()
# Summary table
comparison = pd.DataFrame({
"Metric": ["Historical Avg", "Marcel Projection", "10th Percentile",
"Median", "90th Percentile"],
"Value": [
player["npxg_90"].mean(),
marcel_proj["projection"],
boot_proj["percentiles"][10],
boot_proj["percentiles"][50],
boot_proj["percentiles"][90]
]
})
return {
"comparison": comparison,
"career_proj": career_proj,
"boot_dist": boot_proj["distribution"]
}
# Generate dashboard
dashboard = create_projection_dashboard(squad_data, "player_123")
print(dashboard["comparison"])# R: Complete Projection Dashboard
library(tidyverse)
library(plotly)
create_projection_dashboard <- function(player_data, player_id) {
player <- player_data %>% filter(player_id == !!player_id)
player_name <- player$player_name[1]
# Generate all projections
marcel_proj <- calculate_marcel_projection(player, "npxg_90", 0.30, npxg_curve, 0.71)
career_proj <- project_career(player, npxg_curve, 5)
boot_proj <- bootstrap_projection(player, "npxg_90")
# Create visualizations
dashboard <- list()
# 1. Historical + Projection chart
history_chart <- ggplot() +
geom_line(data = player, aes(x = season, y = npxg_90), color = "blue") +
geom_point(data = player, aes(x = season, y = npxg_90), color = "blue") +
geom_ribbon(data = career_proj,
aes(x = season, ymin = lower_95, ymax = upper_95),
fill = "lightblue", alpha = 0.3) +
geom_line(data = career_proj, aes(x = season, y = projection),
color = "blue", linetype = "dashed") +
labs(title = paste(player_name, "- npxG/90 Trajectory"),
x = "Season", y = "npxG/90") +
theme_minimal()
# 2. Projection distribution
dist_chart <- ggplot(data.frame(x = boot_proj$distribution), aes(x = x)) +
geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
geom_vline(xintercept = marcel_proj$projection,
color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Projection Distribution",
x = "npxG/90 Projection", y = "Frequency") +
theme_minimal()
# 3. Comparison table
comparison <- data.frame(
Metric = c("Historical Avg", "Marcel Projection", "10th Percentile",
"Median", "90th Percentile"),
Value = c(
mean(player$npxg_90),
marcel_proj$projection,
boot_proj$percentiles["10%"],
boot_proj$percentiles["50%"],
boot_proj$percentiles["90%"]
)
)
return(list(
history_chart = history_chart,
dist_chart = dist_chart,
comparison = comparison,
career_proj = career_proj
))
}
# Generate dashboard for a player
dashboard <- create_projection_dashboard(squad_data, "player_123")
print(dashboard$comparison)
print(dashboard$history_chart)
 Metric Value
0 Historical Avg 0.383
1 Marcel Projection 0.387
2 10th Percentile 0.328
3 Median 0.385
4 90th Percentile 0.449
Practice Exercises
Exercise 29.1: Position-Specific Aging Curve Analysis System
Task: Build a comprehensive aging curve analysis system that creates position-specific curves using the delta method and identifies peak performance ages for different metrics.
Requirements:
- Use the delta method to calculate year-over-year performance changes
- Fit smooth curves using LOESS/GAM with confidence intervals
- Create separate curves for forwards, midfielders, and defenders
- Identify peak ages and decline rates for each position
- Control for survivorship bias with minimum minutes thresholds
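The Python solution below fits a smoothing spline to the binned deltas; the LOESS option named in the requirements can be sketched in plain NumPy as a tricube-weighted local linear fit (synthetic data and parameters here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic year-over-year deltas that peak near age 27 (illustrative data)
ages = rng.uniform(19, 35, 400)
deltas = 0.02 - 0.002 * (ages - 27) ** 2 + rng.normal(0, 0.05, 400)

def loess_point(x0, x, y, frac=0.4):
    """Locally weighted linear fit at x0 with a tricube kernel (minimal LOESS)."""
    k = int(frac * len(x))
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]                          # k nearest neighbours
    w = (1 - (dist[idx] / dist[idx].max()) ** 3) ** 3   # tricube weights
    sw = np.sqrt(w)                                     # weighted least squares
    X = np.column_stack([np.ones(k), x[idx]])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y[idx] * sw, rcond=None)
    return beta[0] + beta[1] * x0

# Evaluate the smoother on an age grid and locate the peak delta
grid = np.arange(19, 35.5, 0.5)
fitted = np.array([loess_point(a, ages, deltas) for a in grid])
print(f"LOESS-estimated peak delta near age {grid[np.argmax(fitted)]:.1f}")
```

statsmodels' `lowess` implements the same idea with robustness iterations; its `frac` parameter plays the role of the smoothing span used here.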
# Python: Position-Specific Aging Curve Analysis System
import pandas as pd
import numpy as np
from scipy.interpolate import UnivariateSpline
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# Generate synthetic career data
np.random.seed(42)
def generate_career_data(n_players=500):
careers = []
for player_id in range(n_players):
# Player characteristics
position = np.random.choice(['Forward', 'Midfielder', 'Defender'],
p=[0.25, 0.45, 0.30])
start_age = np.random.randint(18, 23)
career_length = np.random.randint(8, 17)
talent = np.random.normal(0, 0.15)
# Position-specific peak ages
peak_age = {
'Forward': 27 + np.random.normal(0, 1),
'Midfielder': 28 + np.random.normal(0, 1),
'Defender': 29 + np.random.normal(0, 1)
}[position]
for season in range(min(career_length, 38 - start_age)):
age = start_age + season
# Aging curve (quadratic decline from peak)
age_effect = -0.003 * (age - peak_age)**2
# Base performance by position
base_goals = {'Forward': 0.45, 'Midfielder': 0.15, 'Defender': 0.05}[position]
base_assists = {'Forward': 0.20, 'Midfielder': 0.25, 'Defender': 0.10}[position]
# Calculate metrics
goals_p90 = max(0, base_goals * (1 + talent + age_effect + np.random.normal(0, 0.08)))
assists_p90 = max(0, base_assists * (1 + talent + age_effect + np.random.normal(0, 0.06)))
# Minutes (declines with age)
minutes = max(500, 2500 * (1 - 0.02 * (age - 25)**2) + np.random.normal(0, 400))
careers.append({
'player_id': player_id,
'season': 2010 + season,
'age': age,
'position': position,
'minutes': minutes,
'goals_p90': goals_p90,
'assists_p90': assists_p90
})
return pd.DataFrame(careers)
# Generate data
player_data = generate_career_data(500)
print(f"Total player-seasons: {len(player_data)}")
print(f"Unique players: {player_data['player_id'].nunique()}")
# Apply Delta Method
player_data = player_data.sort_values(['player_id', 'season'])
delta_data = player_data.copy()
delta_data['next_season'] = delta_data.groupby('player_id')['season'].shift(-1)
delta_data['next_age'] = delta_data.groupby('player_id')['age'].shift(-1)
delta_data['next_goals_p90'] = delta_data.groupby('player_id')['goals_p90'].shift(-1)
delta_data['next_assists_p90'] = delta_data.groupby('player_id')['assists_p90'].shift(-1)
delta_data['next_minutes'] = delta_data.groupby('player_id')['minutes'].shift(-1)
# Filter for valid consecutive seasons with minimum minutes
delta_data = delta_data[
(delta_data['next_season'] == delta_data['season'] + 1) &
(delta_data['minutes'] >= 900) &
(delta_data['next_minutes'] >= 900)
].copy()
# Calculate deltas
delta_data['delta_goals'] = delta_data['next_goals_p90'] - delta_data['goals_p90']
delta_data['delta_assists'] = delta_data['next_assists_p90'] - delta_data['assists_p90']
delta_data['midpoint_age'] = (delta_data['age'] + delta_data['next_age']) / 2
print(f"\nDelta observations: {len(delta_data)}")
# Fit aging curves by position using LOESS-like smoothing
print("\n=== GOALS PER 90 AGING CURVES ===")
age_grid = np.arange(19, 36.5, 0.5)
curves = {}
peak_ages = {}
for position in ['Forward', 'Midfielder', 'Defender']:
pos_data = delta_data[delta_data['position'] == position].copy()
# Bin data by age and calculate mean delta
pos_data['age_bin'] = pd.cut(pos_data['midpoint_age'],
bins=np.arange(18.5, 37.5, 1),
labels=np.arange(19, 37))
binned = pos_data.groupby('age_bin')['delta_goals'].agg(['mean', 'std', 'count'])
binned = binned.dropna()
binned.index = binned.index.astype(float)
# Fit smoothing spline
if len(binned) >= 4:
spline = UnivariateSpline(binned.index, binned['mean'], s=0.01)
delta_pred = spline(age_grid)
# Cumulate to get aging curve
cumulative = np.cumsum(delta_pred) * 0.5
# Normalize so peak is at 0
cumulative = cumulative - np.max(cumulative)
curves[position] = {
'age': age_grid,
'cumulative': cumulative,
'delta': delta_pred
}
# Find peak age
peak_idx = np.argmax(cumulative)
peak_ages[position] = age_grid[peak_idx]
print("\nPeak Ages by Position:")
for pos, age in peak_ages.items():
print(f" {pos}: {age:.1f} years")
# Calculate decline rates
print("\nDecline Rates After Peak:")
for position, curve in curves.items():
peak_idx = np.argmax(curve['cumulative'])
post_peak = curve['cumulative'][peak_idx:]
post_peak_ages = curve['age'][peak_idx:]
if len(post_peak) > 1:
decline_per_year = (post_peak[-1] - post_peak[0]) / (post_peak_ages[-1] - post_peak_ages[0])
print(f" {position}: {decline_per_year*100:.2f}% per year")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors = {'Forward': 'red', 'Midfielder': 'blue', 'Defender': 'green'}
# Plot 1: Cumulative aging curves
ax1 = axes[0]
for position, curve in curves.items():
ax1.plot(curve['age'], curve['cumulative'],
color=colors[position], linewidth=2, label=position)
ax1.axvline(peak_ages[position], color=colors[position],
linestyle=':', alpha=0.5)
ax1.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax1.set_xlabel('Age')
ax1.set_ylabel('Cumulative Effect on Goals/90')
ax1.set_title('Goals/90 Aging Curves by Position')
ax1.legend()
ax1.set_xlim(20, 36)
ax1.grid(True, alpha=0.3)
# Plot 2: Raw delta data with fitted curves
ax2 = axes[1]
for position in ['Forward', 'Midfielder', 'Defender']:
pos_data = delta_data[delta_data['position'] == position]
ax2.scatter(pos_data['midpoint_age'], pos_data['delta_goals'],
color=colors[position], alpha=0.2, s=20)
if position in curves:
ax2.plot(curves[position]['age'], curves[position]['delta'],
color=colors[position], linewidth=2, label=position)
ax2.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('Age')
ax2.set_ylabel('Year-over-Year Change (Delta)')
ax2.set_title('Raw Delta Data with Fitted Curves')
ax2.legend()
ax2.set_xlim(20, 36)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('aging_curves_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
# Summary table
print("\n=== AGING CURVE SUMMARY ===")
summary = []
for position, curve in curves.items():
idx_22 = np.argmin(np.abs(curve['age'] - 22))
idx_peak = np.argmax(curve['cumulative'])
idx_34 = np.argmin(np.abs(curve['age'] - 34))
summary.append({
'Position': position,
'Peak Age': f"{curve['age'][idx_peak]:.1f}",
'Effect at 22': f"{curve['cumulative'][idx_22]:.3f}",
'Effect at Peak': f"{curve['cumulative'][idx_peak]:.3f}",
'Effect at 34': f"{curve['cumulative'][idx_34]:.3f}"
})
print(pd.DataFrame(summary).to_string(index=False))
# R: Position-Specific Aging Curve Analysis System
library(tidyverse)
library(mgcv)
# Simulate multi-season player data
set.seed(42)
# Generate synthetic player careers
generate_career_data <- function(n_players = 500) {
careers <- list()
for (i in 1:n_players) {
# Player characteristics
position <- sample(c("Forward", "Midfielder", "Defender"), 1,
prob = c(0.25, 0.45, 0.30))
start_age <- sample(18:22, 1)
career_length <- sample(8:16, 1)
talent <- rnorm(1, 0, 0.15)
# Position-specific peak ages
peak_age <- case_when(
position == "Forward" ~ 27 + rnorm(1, 0, 1),
position == "Midfielder" ~ 28 + rnorm(1, 0, 1),
position == "Defender" ~ 29 + rnorm(1, 0, 1)
)
for (season in 1:min(career_length, 38 - start_age)) {
age <- start_age + season - 1
# Aging curve shape (quadratic decline from peak)
age_effect <- -0.003 * (age - peak_age)^2
# Base performance by position
base_goals_p90 <- case_when(
position == "Forward" ~ 0.45,
position == "Midfielder" ~ 0.15,
position == "Defender" ~ 0.05
)
base_assists_p90 <- case_when(
position == "Forward" ~ 0.20,
position == "Midfielder" ~ 0.25,
position == "Defender" ~ 0.10
)
# Calculate metrics with aging and noise
goals_p90 <- max(0, base_goals_p90 * (1 + talent + age_effect + rnorm(1, 0, 0.08)))
assists_p90 <- max(0, base_assists_p90 * (1 + talent + age_effect + rnorm(1, 0, 0.06)))
# Minutes played (declines with age, survivorship)
minutes <- max(500, 2500 * (1 - 0.02 * (age - 25)^2) + rnorm(1, 0, 400))
careers[[length(careers) + 1]] <- data.frame(
player_id = i,
season = 2010 + season,
age = age,
position = position,
minutes = minutes,
goals_p90 = goals_p90,
assists_p90 = assists_p90
)
}
}
bind_rows(careers)
}
# Generate data
player_data <- generate_career_data(500)
cat("Total player-seasons:", nrow(player_data), "\n")
cat("Unique players:", n_distinct(player_data$player_id), "\n")
# Apply Delta Method for aging curves
# Pair consecutive seasons for the same player
delta_data <- player_data %>%
arrange(player_id, season) %>%
group_by(player_id) %>%
mutate(
next_season = lead(season),
next_age = lead(age),
next_goals_p90 = lead(goals_p90),
next_assists_p90 = lead(assists_p90),
next_minutes = lead(minutes)
) %>%
ungroup() %>%
filter(
!is.na(next_season),
next_season == season + 1, # Consecutive seasons
minutes >= 900, # Minimum minutes in year 1
next_minutes >= 900 # Minimum minutes in year 2
) %>%
mutate(
# Calculate deltas (change from age N to N+1)
delta_goals = next_goals_p90 - goals_p90,
delta_assists = next_assists_p90 - assists_p90,
midpoint_age = (age + next_age) / 2
)
cat("\nDelta observations:", nrow(delta_data), "\n")
# Fit aging curves by position using GAM
cat("\n=== GOALS PER 90 AGING CURVES ===\n")
# Fit GAM for each position
aging_models <- delta_data %>%
group_by(position) %>%
group_map(~ {
gam(delta_goals ~ s(midpoint_age, k = 8), data = .x)
})
# group_map drops group keys; groups are returned in alphabetical order
names(aging_models) <- c("Defender", "Forward", "Midfielder")
# Create prediction grid
age_grid <- expand.grid(
midpoint_age = seq(19, 36, by = 0.5),
position = c("Forward", "Midfielder", "Defender")
)
# Predict and cumulate to get absolute curve
predictions <- age_grid %>%
group_by(position) %>%
group_modify(~ {
model <- aging_models[[.y$position]]
pred <- predict(model, newdata = .x, se.fit = TRUE)
.x %>%
mutate(
delta_pred = pred$fit,
delta_se = pred$se.fit,
# Cumulate deltas to get aging curve (relative to age 27)
cumulative_effect = cumsum(delta_pred) * 0.5, # Scale by age step
ci_lower = cumulative_effect - 1.96 * delta_se,
ci_upper = cumulative_effect + 1.96 * delta_se
)
}) %>%
ungroup()
# Find peak ages
peak_ages <- predictions %>%
group_by(position) %>%
slice_max(cumulative_effect, n = 1) %>%
select(position, peak_age = midpoint_age, peak_effect = cumulative_effect)
cat("\nPeak Ages by Position:\n")
print(peak_ages)
# Calculate decline rates after peak
decline_rates <- predictions %>%
inner_join(peak_ages, by = "position") %>%
filter(midpoint_age >= peak_age) %>%
group_by(position) %>%
summarise(
decline_per_year = (last(cumulative_effect) - first(cumulative_effect)) /
(last(midpoint_age) - first(midpoint_age)),
years_from_peak = last(midpoint_age) - first(midpoint_age),
total_decline = last(cumulative_effect) - first(cumulative_effect)
)
cat("\nDecline Rates After Peak:\n")
print(decline_rates)
# Visualization
par(mfrow = c(1, 2))
# Goals P90 Aging Curves
colors <- c("Forward" = "red", "Midfielder" = "blue", "Defender" = "green")
plot(1, type = "n", xlim = c(20, 36), ylim = c(-0.15, 0.1),
xlab = "Age", ylab = "Cumulative Effect on Goals/90",
main = "Goals/90 Aging Curves by Position")
abline(h = 0, lty = 2, col = "gray")
for (pos in c("Forward", "Midfielder", "Defender")) {
pos_data <- predictions %>% filter(position == pos)
lines(pos_data$midpoint_age, pos_data$cumulative_effect,
col = colors[pos], lwd = 2)
polygon(c(pos_data$midpoint_age, rev(pos_data$midpoint_age)),
c(pos_data$ci_lower, rev(pos_data$ci_upper)),
col = adjustcolor(colors[pos], alpha = 0.1), border = NA)
}
legend("topright", legend = names(colors), col = colors, lwd = 2, cex = 0.8)
# Mark peak ages
for (i in 1:nrow(peak_ages)) {
abline(v = peak_ages$peak_age[i], col = colors[peak_ages$position[i]], lty = 3)
}
# Raw delta data with fitted curves
plot(1, type = "n", xlim = c(20, 36), ylim = c(-0.3, 0.3),
xlab = "Age", ylab = "Year-over-Year Change (Delta)",
main = "Raw Delta Data with Fitted Curves")
abline(h = 0, lty = 2, col = "gray")
for (pos in c("Forward", "Midfielder", "Defender")) {
pos_delta <- delta_data %>% filter(position == pos)
points(pos_delta$midpoint_age, pos_delta$delta_goals,
col = adjustcolor(colors[pos], alpha = 0.2), pch = 19, cex = 0.5)
# Add fitted line
pos_pred <- predictions %>% filter(position == pos)
lines(pos_pred$midpoint_age, pos_pred$delta_pred, # Fitted per-year delta, no need to re-derive via diff()
col = colors[pos], lwd = 2)
}
legend("topright", legend = names(colors), col = colors, pch = 19, cex = 0.8)
par(mfrow = c(1, 1))
# Summary statistics
cat("\n=== AGING CURVE SUMMARY ===\n")
summary_stats <- predictions %>%
group_by(position) %>%
summarise(
age_range = paste(min(midpoint_age), "-", max(midpoint_age)),
peak_age = midpoint_age[which.max(cumulative_effect)],
effect_at_22 = cumulative_effect[midpoint_age == 22],
effect_at_peak = max(cumulative_effect),
effect_at_34 = cumulative_effect[midpoint_age == 34]
)
print(summary_stats)
Exercise 29.2: Complete Marcel Projection System with Backtesting
Task: Build a full Marcel projection system that weights historical seasons, applies regression to the mean, adjusts for aging, and validates against actual outcomes.
Requirements:
- Implement 5/4/3 weighting for most recent three seasons
- Calculate reliability coefficients and apply appropriate regression
- Integrate aging curve adjustments for projected age
- Backtest projections against held-out seasons
- Calculate RMSE, MAE, and correlation metrics
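Before the full implementation below, a minimal numeric sketch of the core Marcel arithmetic may help: a 5/4/3 recency-weighted average, then regression toward a league mean in proportion to sample size. All the numbers here (season rates, the 0.35 league mean, the 2000-minute scale) are illustrative assumptions, not real data.

```python
import numpy as np

# Hypothetical three recent seasons of goals/90, most recent first
seasons_g90 = np.array([0.52, 0.44, 0.38])
weights = np.array([5, 4, 3]) / 12  # 5/4/3 weights, normalized to sum to 1

weighted = float(seasons_g90 @ weights)   # recency-weighted average -> ~0.458
league_mean = 0.35                        # assumed league rate
minutes = 6000                            # assumed total minutes over the 3 seasons
reg_rate = 0.5 / (0.5 + minutes / 2000)   # more minutes -> less regression (~0.143 here)
projection = weighted * (1 - reg_rate) + league_mean * reg_rate  # ~0.443
```

The aging adjustment in the full system is then added on top of this regressed estimate; note that a larger `minutes` total shrinks `reg_rate` and leaves the projection closer to the player's own weighted average.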
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# Generate multi-season player data
np.random.seed(42)
def generate_player_history(n_players=300, seasons=range(2015, 2024)):
players = []
for player_id in range(n_players):
# Player characteristics
true_talent = np.random.normal(0.35, 0.12) # Goals per 90
position = np.random.choice(['Forward', 'Midfielder'], p=[0.4, 0.6])
start_age = np.random.randint(20, 29)
if position == 'Forward':
true_talent += 0.15
for season in seasons:
age = start_age + (season - min(seasons))
# Skip if retired
if age > 36 or np.random.random() < 0.05:
continue
# Aging effect
peak_age = 27 if position == 'Forward' else 28
aging = -0.003 * (age - peak_age)**2
# Observed performance
minutes = max(800, np.random.normal(2200, 500))
observed_g90 = max(0, true_talent + aging + np.random.normal(0, 0.12))
players.append({
'player_id': player_id,
'player_name': f'Player_{player_id}',
'season': season,
'age': age,
'position': position,
'minutes': minutes,
'goals_p90': observed_g90,
'true_talent': true_talent
})
return pd.DataFrame(players)
# Generate data
player_data = generate_player_history(300)
print(f"Total player-seasons: {len(player_data)}")
print(f"Unique players: {player_data['player_id'].nunique()}")
# League average for regression
league_mean = player_data['goals_p90'].mean()
print(f"\nLeague average goals/90: {league_mean:.3f}")
# Marcel Projection Function
def marcel_project(player_history, target_season, league_mean,
weights=[5, 4, 3], reliability_factor=0.5):
"""Generate Marcel projection for a player."""
# Get recent seasons
recent = player_history[player_history['season'] < target_season].copy()
recent = recent.sort_values('season', ascending=False).head(3)
if len(recent) == 0:
return None
# Calculate weighted average
n_seasons = len(recent)
use_weights = np.array(weights[:n_seasons])
use_weights = use_weights / use_weights.sum()
weighted_g90 = (recent['goals_p90'].values * use_weights).sum()
total_minutes = recent['minutes'].sum()
# Regression rate based on sample size
regression_rate = reliability_factor / (reliability_factor + total_minutes / 2000)
# Regress to mean
regressed_g90 = weighted_g90 * (1 - regression_rate) + league_mean * regression_rate
# Aging adjustment
current_age = recent['age'].iloc[0]
projected_age = current_age + 1
peak_age = 27 if recent['position'].iloc[0] == 'Forward' else 28
current_aging = -0.003 * (current_age - peak_age)**2
projected_aging = -0.003 * (projected_age - peak_age)**2
aging_delta = projected_aging - current_aging
# Final projection
projected_g90 = regressed_g90 + aging_delta
return {
'player_id': recent['player_id'].iloc[0],
'player_name': recent['player_name'].iloc[0],
'position': recent['position'].iloc[0],
'target_season': target_season,
'projected_age': projected_age,
'seasons_used': n_seasons,
'weighted_avg': weighted_g90,
'total_minutes': total_minutes,
'regression_rate': regression_rate,
'regressed_g90': regressed_g90,
'aging_adjustment': aging_delta,
'projected_g90': projected_g90
}
# Generate projections
print("\n=== GENERATING MARCEL PROJECTIONS ===")
projections = []
for target_year in [2022, 2023]:
players_to_project = player_data[player_data['season'] == target_year - 1]['player_id'].unique()
for pid in players_to_project:
player_hist = player_data[player_data['player_id'] == pid]
proj = marcel_project(player_hist, target_year, league_mean)
if proj:
projections.append(proj)
projections_df = pd.DataFrame(projections)
print(f"Total projections generated: {len(projections_df)}")
# Merge with actuals for validation
actuals = player_data[['player_id', 'season', 'goals_p90', 'minutes']].copy()
actuals.columns = ['player_id', 'season', 'actual_g90', 'actual_minutes']
validation = projections_df.merge(
actuals,
left_on=['player_id', 'target_season'],
right_on=['player_id', 'season']
)
validation = validation[validation['actual_minutes'] >= 500]
print(f"Projections with actual data: {len(validation)}")
# Calculate accuracy metrics
print("\n=== MARCEL PROJECTION ACCURACY ===")
rmse = np.sqrt(mean_squared_error(validation['actual_g90'], validation['projected_g90']))
mae = mean_absolute_error(validation['actual_g90'], validation['projected_g90'])
correlation = validation['projected_g90'].corr(validation['actual_g90'])
print(f"\nOverall Metrics:")
print(f" RMSE: {rmse:.4f}")
print(f" MAE: {mae:.4f}")
print(f" Correlation: {correlation:.4f}")
# Naive baseline
naive_rmse = np.sqrt(mean_squared_error(validation['actual_g90'], validation['weighted_avg']))
naive_mae = mean_absolute_error(validation['actual_g90'], validation['weighted_avg'])
naive_corr = validation['weighted_avg'].corr(validation['actual_g90'])
print(f"\nNaive Baseline (weighted avg only):")
print(f" RMSE: {naive_rmse:.4f}")
print(f" MAE: {naive_mae:.4f}")
print(f" Correlation: {naive_corr:.4f}")
print(f"\nMarcel Improvement over Naive:")
print(f" RMSE improvement: {(naive_rmse - rmse) / naive_rmse * 100:.1f}%")
print(f" MAE improvement: {(naive_mae - mae) / naive_mae * 100:.1f}%")
# Accuracy by group
print("\n=== ACCURACY BY PLAYER GROUP ===")
by_position = validation.groupby('position').apply(
lambda x: pd.Series({
'n': len(x),
'rmse': np.sqrt(mean_squared_error(x['actual_g90'], x['projected_g90'])),
'mae': mean_absolute_error(x['actual_g90'], x['projected_g90']),
'correlation': x['projected_g90'].corr(x['actual_g90'])
})
).reset_index()
print("\nBy Position:")
print(by_position.to_string(index=False))
# By experience
validation['experience'] = pd.cut(validation['seasons_used'],
bins=[0, 1, 2, 3],
labels=['1 season', '2 seasons', '3 seasons'])
by_exp = validation.groupby('experience').apply(
lambda x: pd.Series({
'n': len(x),
'rmse': np.sqrt(mean_squared_error(x['actual_g90'], x['projected_g90'])),
'mae': mean_absolute_error(x['actual_g90'], x['projected_g90'])
})
).reset_index()
print("\nBy Seasons of History:")
print(by_exp.to_string(index=False))
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# 1. Projected vs Actual
ax1 = axes[0, 0]
ax1.scatter(validation['projected_g90'], validation['actual_g90'],
alpha=0.5, s=30, c='steelblue')
ax1.plot([0, 0.8], [0, 0.8], 'r-', linewidth=2, label='Perfect')
# Fit line
z = np.polyfit(validation['projected_g90'], validation['actual_g90'], 1)
p = np.poly1d(z)
ax1.plot([0, 0.8], [p(0), p(0.8)], 'b--', linewidth=2, label='Actual fit')
ax1.set_xlabel('Projected Goals/90')
ax1.set_ylabel('Actual Goals/90')
ax1.set_title('Marcel Projections vs Actuals')
ax1.legend()
ax1.grid(True, alpha=0.3)
# 2. Residuals
validation['residual'] = validation['actual_g90'] - validation['projected_g90']
ax2 = axes[0, 1]
ax2.scatter(validation['projected_g90'], validation['residual'],
alpha=0.5, s=30, c='steelblue')
ax2.axhline(0, color='red', linewidth=2)
ax2.set_xlabel('Projected Goals/90')
ax2.set_ylabel('Residual (Actual - Projected)')
ax2.set_title('Residual Analysis')
ax2.grid(True, alpha=0.3)
# 3. Residuals by regression rate
ax3 = axes[1, 0]
ax3.scatter(validation['regression_rate'], validation['residual'],
alpha=0.5, s=30, c='green')
ax3.axhline(0, color='red', linewidth=2)
ax3.set_xlabel('Regression Rate')
ax3.set_ylabel('Residual')
ax3.set_title('Residuals by Regression Rate')
ax3.grid(True, alpha=0.3)
# 4. Error distribution
ax4 = axes[1, 1]
ax4.hist(validation['residual'], bins=30, color='lightblue', edgecolor='white')
ax4.axvline(validation['residual'].mean(), color='red', linewidth=2, label='Mean')
ax4.axvline(0, color='black', linestyle='--', linewidth=2, label='Zero')
ax4.set_xlabel('Error (Actual - Projected)')
ax4.set_ylabel('Frequency')
ax4.set_title('Distribution of Projection Errors')
ax4.legend()
plt.tight_layout()
plt.savefig('marcel_projection_validation.png', dpi=150, bbox_inches='tight')
plt.show()
# Example projections
print("\n=== EXAMPLE PROJECTIONS ===")
example = validation.nlargest(10, 'actual_minutes')[
['player_name', 'position', 'projected_age', 'projected_g90', 'actual_g90', 'residual']
]
print(example.to_string(index=False))
library(tidyverse)
# Generate multi-season player data for Marcel system
set.seed(42)
generate_player_history <- function(n_players = 300, seasons = 2015:2023) {
players <- list()
for (i in 1:n_players) {
# Player true talent (constant)
true_talent <- rnorm(1, 0.35, 0.12) # Goals per 90
position <- sample(c("Forward", "Midfielder"), 1, prob = c(0.4, 0.6))
start_age <- sample(20:28, 1)
# Forward adjustment
if (position == "Forward") true_talent <- true_talent + 0.15
for (season in seasons) {
age <- start_age + (season - min(seasons))
# Skip if too old or retired
if (age > 36 || runif(1) < 0.05) next
# Aging effect
peak_age <- ifelse(position == "Forward", 27, 28)
aging <- -0.003 * (age - peak_age)^2
# Season performance = true talent + aging + noise
minutes <- max(800, rnorm(1, 2200, 500))
observed_g90 <- max(0, true_talent + aging + rnorm(1, 0, 0.12))
players[[length(players) + 1]] <- data.frame(
player_id = i,
player_name = paste0("Player_", i),
season = season,
age = age,
position = position,
minutes = minutes,
goals_p90 = observed_g90,
true_talent = true_talent
)
}
}
bind_rows(players)
}
# Generate data
player_data <- generate_player_history(300)
cat("Total player-seasons:", nrow(player_data), "\n")
cat("Unique players:", n_distinct(player_data$player_id), "\n\n")
# Calculate league average for regression target
league_avg <- player_data %>%
summarise(
mean_g90 = mean(goals_p90),
sd_g90 = sd(goals_p90)
)
cat("League average goals/90:", round(league_avg$mean_g90, 3), "\n")
# Marcel Projection Function
marcel_project <- function(player_history, target_season, league_mean,
weights = c(5, 4, 3), # Most recent to oldest
reliability_factor = 0.5) { # Higher = more regression to the mean
# Get player's recent seasons
recent <- player_history %>%
filter(season < target_season) %>%
arrange(desc(season)) %>%
head(3)
if (nrow(recent) == 0) return(NULL)
# Calculate weighted average
n_seasons <- nrow(recent)
use_weights <- weights[1:n_seasons]
use_weights <- use_weights / sum(use_weights)
weighted_g90 <- sum(recent$goals_p90 * use_weights)
total_minutes <- sum(recent$minutes)
# Calculate regression amount based on sample size
# More minutes = more reliable = less regression
regression_rate <- reliability_factor / (reliability_factor + total_minutes / 2000)
# Regress to mean
regressed_g90 <- weighted_g90 * (1 - regression_rate) + league_mean * regression_rate
# Apply aging adjustment
current_age <- recent$age[1]
projected_age <- current_age + 1
peak_age <- ifelse(recent$position[1] == "Forward", 27, 28)
# Simple quadratic aging adjustment
current_aging <- -0.003 * (current_age - peak_age)^2
projected_aging <- -0.003 * (projected_age - peak_age)^2
aging_delta <- projected_aging - current_aging
# Final projection
projected_g90 <- regressed_g90 + aging_delta
data.frame(
player_id = recent$player_id[1],
player_name = recent$player_name[1],
position = recent$position[1],
target_season = target_season,
projected_age = projected_age,
seasons_used = n_seasons,
weighted_avg = weighted_g90,
total_minutes = total_minutes,
regression_rate = regression_rate,
regressed_g90 = regressed_g90,
aging_adjustment = aging_delta,
projected_g90 = projected_g90
)
}
# Generate projections for 2022 and 2023 seasons
cat("\n=== GENERATING MARCEL PROJECTIONS ===\n")
projections <- list()
for (target_year in c(2022, 2023)) {
players_to_project <- player_data %>%
filter(season == target_year - 1) %>%
pull(player_id) %>%
unique()
for (pid in players_to_project) {
player_hist <- player_data %>% filter(player_id == pid)
proj <- marcel_project(player_hist, target_year, league_avg$mean_g90)
if (!is.null(proj)) {
projections[[length(projections) + 1]] <- proj
}
}
}
projections_df <- bind_rows(projections)
cat("Total projections generated:", nrow(projections_df), "\n")
# Merge with actual results for validation
actuals <- player_data %>%
select(player_id, season, actual_g90 = goals_p90, actual_minutes = minutes)
validation <- projections_df %>%
inner_join(actuals, by = c("player_id", "target_season" = "season")) %>%
filter(actual_minutes >= 500) # Only validate players who played enough
cat("Projections with actual data:", nrow(validation), "\n\n")
# Calculate accuracy metrics
cat("=== MARCEL PROJECTION ACCURACY ===\n")
# Overall metrics
rmse <- sqrt(mean((validation$projected_g90 - validation$actual_g90)^2))
mae <- mean(abs(validation$projected_g90 - validation$actual_g90))
correlation <- cor(validation$projected_g90, validation$actual_g90)
cat("\nOverall Metrics:\n")
cat(" RMSE:", round(rmse, 4), "\n")
cat(" MAE:", round(mae, 4), "\n")
cat(" Correlation:", round(correlation, 4), "\n")
# Compare to naive baseline (weighted average, no regression or aging)
naive_rmse <- sqrt(mean((validation$weighted_avg - validation$actual_g90)^2))
naive_mae <- mean(abs(validation$weighted_avg - validation$actual_g90))
naive_corr <- cor(validation$weighted_avg, validation$actual_g90)
cat("\nNaive Baseline (weighted avg only):\n")
cat(" RMSE:", round(naive_rmse, 4), "\n")
cat(" MAE:", round(naive_mae, 4), "\n")
cat(" Correlation:", round(naive_corr, 4), "\n")
# Improvement
cat("\nMarcel Improvement over Naive:\n")
cat(" RMSE improvement:", round((naive_rmse - rmse) / naive_rmse * 100, 1), "%\n")
cat(" MAE improvement:", round((naive_mae - mae) / naive_mae * 100, 1), "%\n")
# Accuracy by player group
cat("\n=== ACCURACY BY PLAYER GROUP ===\n")
by_position <- validation %>%
group_by(position) %>%
summarise(
n = n(),
rmse = sqrt(mean((projected_g90 - actual_g90)^2)),
mae = mean(abs(projected_g90 - actual_g90)),
correlation = cor(projected_g90, actual_g90),
.groups = "drop"
)
print(by_position)
by_experience <- validation %>%
mutate(experience = cut(seasons_used, breaks = c(0, 1, 2, 3),
labels = c("1 season", "2 seasons", "3 seasons"))) %>%
group_by(experience) %>%
summarise(
n = n(),
rmse = sqrt(mean((projected_g90 - actual_g90)^2)),
mae = mean(abs(projected_g90 - actual_g90)),
.groups = "drop"
)
cat("\nBy Seasons of History:\n")
print(by_experience)
# Visualization
par(mfrow = c(2, 2))
# 1. Projected vs Actual scatter
plot(validation$projected_g90, validation$actual_g90,
xlab = "Projected Goals/90", ylab = "Actual Goals/90",
main = "Marcel Projections vs Actuals",
pch = 19, col = adjustcolor("steelblue", 0.5))
abline(0, 1, col = "red", lwd = 2)
abline(lm(actual_g90 ~ projected_g90, data = validation), col = "blue", lty = 2)
legend("topleft", legend = c("Perfect", "Actual fit"),
col = c("red", "blue"), lty = c(1, 2), cex = 0.8)
# 2. Residual plot
validation$residual <- validation$actual_g90 - validation$projected_g90
plot(validation$projected_g90, validation$residual,
xlab = "Projected Goals/90", ylab = "Residual (Actual - Projected)",
main = "Residual Analysis",
pch = 19, col = adjustcolor("steelblue", 0.5))
abline(h = 0, col = "red", lwd = 2)
# 3. Residuals by regression rate
plot(validation$regression_rate, validation$residual,
xlab = "Regression Rate", ylab = "Residual",
main = "Residuals by Regression Rate",
pch = 19, col = adjustcolor("green", 0.5))
abline(h = 0, col = "red", lwd = 2)
# 4. Histogram of errors
hist(validation$residual, breaks = 30, main = "Distribution of Projection Errors",
xlab = "Error (Actual - Projected)", col = "lightblue", border = "white")
abline(v = mean(validation$residual), col = "red", lwd = 2)
abline(v = 0, col = "black", lty = 2)
par(mfrow = c(1, 1))
# Show example projections
cat("\n=== EXAMPLE PROJECTIONS ===\n")
example <- validation %>%
arrange(desc(actual_minutes)) %>%
head(10) %>%
select(player_name, position, projected_age, projected_g90, actual_g90, residual)
print(example)
Exercise 29.3: Dynamic In-Season Projection Updates
Task: Create an in-season projection system that blends preseason expectations with current season performance and updates confidence intervals as the season progresses.
Requirements:
- Implement Bayesian updating to blend prior (preseason) with observed data
- Dynamically adjust weights based on games/minutes played
- Calculate shrinking confidence intervals through the season
- Track and visualize projection evolution week-by-week
- Compare to end-of-season actuals for validation
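The blending step at the heart of this exercise can be shown in a few lines before the full system: the posterior is a weighted average of the preseason prior and the observed rate, with the observed weight growing as minutes accumulate. The 400-minute reliability constant and the example rates are illustrative assumptions.

```python
# Minimal sketch of Bayesian-style blending of prior and observed performance
prior_mean = 0.40      # assumed preseason projection (goals/90)
observed_g90 = 0.60    # assumed hot start to the season
minutes_played = 400   # minutes accumulated so far

reliability = 400      # minutes at which prior and observed get equal weight
obs_weight = minutes_played / (minutes_played + reliability)  # 0.5 here
posterior = (1 - obs_weight) * prior_mean + obs_weight * observed_g90  # 0.50
```

With 400 minutes played the prior and the observed rate count equally; early in the season `obs_weight` is small and the projection stays close to the preseason estimate.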
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# Simulate a player's season game-by-game
np.random.seed(42)
def generate_season_data(true_g90=0.45, games=38, minutes_per_game=85):
"""Generate game-by-game season data for a player."""
data = {
'matchweek': list(range(1, games + 1)),
'minutes': np.clip(np.random.normal(minutes_per_game, 15, games), 0, 90),
'goals': np.random.poisson(true_g90 * minutes_per_game / 90, games)
}
df = pd.DataFrame(data)
df['cumulative_minutes'] = df['minutes'].cumsum()
df['cumulative_goals'] = df['goals'].cumsum()
df['observed_g90'] = df['cumulative_goals'] / df['cumulative_minutes'] * 90
return df
# Generate season for a player
true_rate = 0.45
preseason_projection = 0.40 # Slightly underestimated
preseason_std = 0.10 # Uncertainty
season_data = generate_season_data(true_g90=true_rate, games=38)
print(f"True goals/90: {true_rate}")
print(f"Preseason projection: {preseason_projection}")
print(f"Final observed g/90: {season_data['observed_g90'].iloc[-1]:.3f}")
# In-Season Projection Update Function (Bayesian-style)
def update_projection(prior_mean, prior_std, observed_g90, minutes_played, reliability=400):
"""
Update projection using Bayesian-style weighting.
Args:
prior_mean: Preseason projection
prior_std: Preseason uncertainty
observed_g90: Current season rate
minutes_played: Total minutes this season
reliability: Minutes at which observed data and the prior get equal weight
"""
# Weight for observed data
obs_weight = minutes_played / (minutes_played + reliability)
prior_weight = 1 - obs_weight
# Posterior mean
posterior_mean = prior_weight * prior_mean + obs_weight * observed_g90
# Posterior std shrinks with more data
obs_std = np.sqrt(observed_g90 / max(1, minutes_played / 90))
posterior_std = np.sqrt(1 / (1/prior_std**2 + minutes_played / (reliability * obs_std**2 + 1)))
return {
'mean': posterior_mean,
'std': posterior_std,
'obs_weight': obs_weight
}
# Track projections through the season
projections = [{
'matchweek': 0, # Preseason
'projection': preseason_projection,
'ci_lower': preseason_projection - 1.96 * preseason_std,
'ci_upper': preseason_projection + 1.96 * preseason_std,
'std': preseason_std,
'obs_weight': 0,
'cumulative_minutes': 0,
'observed_g90': np.nan
}]
for week in range(1, 39):
current = season_data[season_data['matchweek'] == week].iloc[0]
updated = update_projection(
prior_mean=preseason_projection,
prior_std=preseason_std,
observed_g90=current['observed_g90'],
minutes_played=current['cumulative_minutes']
)
projections.append({
'matchweek': week,
'projection': updated['mean'],
'ci_lower': updated['mean'] - 1.96 * updated['std'],
'ci_upper': updated['mean'] + 1.96 * updated['std'],
'std': updated['std'],
'obs_weight': updated['obs_weight'],
'cumulative_minutes': current['cumulative_minutes'],
'observed_g90': current['observed_g90']
})
projections_df = pd.DataFrame(projections)
# Summary at key points
print("\n=== PROJECTION EVOLUTION ===")
key_weeks = [0, 5, 10, 19, 29, 38]
summary = projections_df[projections_df['matchweek'].isin(key_weeks)].round(4)
print(summary[['matchweek', 'projection', 'ci_lower', 'ci_upper', 'std', 'obs_weight']].to_string(index=False))
# Accuracy analysis
final_g90 = season_data['observed_g90'].iloc[-1]
accuracy = projections_df[projections_df['matchweek'] > 0].copy()
accuracy['error'] = accuracy['projection'] - final_g90
accuracy['within_ci'] = (accuracy['ci_lower'] <= final_g90) & (final_g90 <= accuracy['ci_upper'])
print("\n=== PROJECTION ACCURACY ===")
calibration = {
'pct_in_95_ci': accuracy['within_ci'].mean() * 100,
'avg_error': accuracy['error'].mean(),
'rmse': np.sqrt((accuracy['error']**2).mean()),
'early_season_error': accuracy[accuracy['matchweek'] <= 10]['error'].mean(),
'late_season_error': accuracy[accuracy['matchweek'] > 28]['error'].mean()
}
print(pd.Series(calibration).to_string())
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# 1. Projection evolution with CI
ax1 = axes[0, 0]
ax1.plot(projections_df['matchweek'], projections_df['projection'],
'b-', linewidth=2, label='Projection')
ax1.fill_between(projections_df['matchweek'],
projections_df['ci_lower'], projections_df['ci_upper'],
color='blue', alpha=0.2, label='95% CI')
ax1.axhline(true_rate, color='red', linestyle='--', linewidth=2, label='True Rate')
ax1.axhline(final_g90, color='green', linestyle=':', linewidth=2, label='Final Observed')
ax1.plot(projections_df['matchweek'][1:], projections_df['observed_g90'][1:],
'orange', linewidth=1, alpha=0.7, label='Running Observed')
ax1.set_xlabel('Matchweek')
ax1.set_ylabel('Goals per 90')
ax1.set_title('In-Season Projection Evolution')
ax1.legend(loc='upper right', fontsize=8)
ax1.set_ylim(0.2, 0.7)
ax1.grid(True, alpha=0.3)
# 2. Weight evolution
ax2 = axes[0, 1]
ax2.plot(projections_df['matchweek'], projections_df['obs_weight'] * 100,
'purple', linewidth=2)
ax2.axhline(50, color='gray', linestyle='--', alpha=0.5)
ax2.text(5, 52, '50% threshold', fontsize=9, color='gray')
ax2.set_xlabel('Matchweek')
ax2.set_ylabel('Weight on Observed Data (%)')
ax2.set_title('Data Weight Evolution')
ax2.grid(True, alpha=0.3)
# 3. CI width over time
ax3 = axes[1, 0]
ci_width = projections_df['std'] * 1.96 * 2
ax3.plot(projections_df['matchweek'], ci_width, 'darkgreen', linewidth=2)
ax3.set_xlabel('Matchweek')
ax3.set_ylabel('95% CI Width')
ax3.set_title('Uncertainty Reduction Over Season')
ax3.grid(True, alpha=0.3)
# 4. Error evolution
ax4 = axes[1, 1]
errors = projections_df[projections_df['matchweek'] > 0].copy()
errors['abs_error'] = np.abs(errors['projection'] - final_g90)
ax4.plot(errors['matchweek'], errors['abs_error'], 'red', linewidth=2)
ax4.set_xlabel('Matchweek')
ax4.set_ylabel('Absolute Error vs Final')
ax4.set_title('Projection Error Evolution')
ax4.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('inseason_projection_evolution.png', dpi=150, bbox_inches='tight')
plt.show()
# Multi-player validation
print("\n=== MULTI-PLAYER VALIDATION ===")
def simulate_multiple_players(n_players=50):
results = []
for i in range(n_players):
# Random player
true_rate = np.random.uniform(0.2, 0.7)
preseason_proj = true_rate + np.random.normal(0, 0.08)
preseason_std = np.random.uniform(0.08, 0.15)
season = generate_season_data(true_g90=true_rate)
final_g90 = season['observed_g90'].iloc[-1]
for week in [10, 19, 29, 38]:
if week <= len(season):
current = season[season['matchweek'] == week].iloc[0]
updated = update_projection(
prior_mean=preseason_proj,
prior_std=preseason_std,
observed_g90=current['observed_g90'],
minutes_played=current['cumulative_minutes']
)
results.append({
'player_id': i,
'week': week,
'true_rate': true_rate,
'preseason_proj': preseason_proj,
'in_season_proj': updated['mean'],
'ci_lower': updated['mean'] - 1.96 * updated['std'],
'ci_upper': updated['mean'] + 1.96 * updated['std'],
'final_g90': final_g90
})
return pd.DataFrame(results)
multi_results = simulate_multiple_players(100)
# Accuracy by week
multi_results['preseason_error'] = np.abs(multi_results['preseason_proj'] - multi_results['final_g90'])
multi_results['inseason_error'] = np.abs(multi_results['in_season_proj'] - multi_results['final_g90'])
multi_results['within_ci'] = ((multi_results['ci_lower'] <= multi_results['final_g90']) &
(multi_results['final_g90'] <= multi_results['ci_upper']))
accuracy_by_week = multi_results.groupby('week').agg({
'preseason_error': lambda x: np.sqrt((x**2).mean()),
'inseason_error': lambda x: np.sqrt((x**2).mean()),
'within_ci': lambda x: x.mean() * 100
}).reset_index()
accuracy_by_week.columns = ['week', 'preseason_rmse', 'inseason_rmse', 'calibration']
accuracy_by_week['improvement_pct'] = ((accuracy_by_week['preseason_rmse'] -
accuracy_by_week['inseason_rmse']) /
accuracy_by_week['preseason_rmse'] * 100)
print("\nProjection Accuracy by Matchweek:")
print(accuracy_by_week.round(3).to_string(index=False))
library(tidyverse)
# Simulate a player's season game-by-game
set.seed(42)
generate_season_data <- function(true_g90 = 0.45, games = 38, minutes_per_game = 85) {
tibble(
matchweek = 1:games,
minutes = pmin(90, pmax(0, rnorm(games, minutes_per_game, 15))),
# Poisson goals based on true rate
goals = rpois(games, true_g90 * minutes / 90)
) %>%
mutate(
cumulative_minutes = cumsum(minutes),
cumulative_goals = cumsum(goals),
observed_g90 = cumulative_goals / cumulative_minutes * 90
)
}
# Generate season for a player
true_rate <- 0.45
preseason_projection <- 0.40 # Slightly underestimated
preseason_std <- 0.10 # Uncertainty in preseason projection
season_data <- generate_season_data(true_g90 = true_rate, games = 38)
cat("True goals/90:", true_rate, "\n")
cat("Preseason projection:", preseason_projection, "\n")
cat("Final observed g/90:", round(tail(season_data$observed_g90, 1), 3), "\n\n")
# In-Season Projection Update Function (Bayesian-style)
update_projection <- function(prior_mean, prior_std, observed_g90, minutes_played,
reliability = 400) { # Minutes scale: higher = slower to trust new data
# Reliability parameter: minutes at which prior and observed get equal weight
# posterior_mean = weighted average of prior and observed
# Weight for observed data based on sample size
obs_weight <- minutes_played / (minutes_played + reliability)
prior_weight <- 1 - obs_weight
# Posterior mean
posterior_mean <- prior_weight * prior_mean + obs_weight * observed_g90
# Posterior standard deviation shrinks with more data
# Based on combining uncertainties
obs_std <- sqrt(observed_g90 / max(1, minutes_played / 90)) # Poisson-based
posterior_std <- sqrt(1 / (1/prior_std^2 + minutes_played / (reliability * obs_std^2 + 1)))
list(mean = posterior_mean, std = posterior_std, obs_weight = obs_weight)
}
# Track projections through the season
projections <- tibble(
matchweek = 0, # Preseason
projection = preseason_projection,
ci_lower = preseason_projection - 1.96 * preseason_std,
ci_upper = preseason_projection + 1.96 * preseason_std,
std = preseason_std,
obs_weight = 0,
cumulative_minutes = 0,
observed_g90 = NA_real_ # Numeric NA so bind_rows() keeps the column numeric
)
for (week in 1:38) {
current <- season_data %>% filter(matchweek == week)
# Update projection
updated <- update_projection(
prior_mean = preseason_projection,
prior_std = preseason_std,
observed_g90 = current$observed_g90,
minutes_played = current$cumulative_minutes
)
projections <- bind_rows(projections, tibble(
matchweek = week,
projection = updated$mean,
ci_lower = updated$mean - 1.96 * updated$std,
ci_upper = updated$mean + 1.96 * updated$std,
std = updated$std,
obs_weight = updated$obs_weight,
cumulative_minutes = current$cumulative_minutes,
observed_g90 = current$observed_g90
))
}
cat("=== PROJECTION EVOLUTION ===\n")
key_weeks <- c(0, 5, 10, 19, 29, 38)
projection_summary <- projections %>%
filter(matchweek %in% key_weeks) %>%
mutate(across(c(projection, ci_lower, ci_upper, std), ~round(., 4)))
print(projection_summary)
# Calculate projection accuracy at different points
cat("\n=== PROJECTION ACCURACY ===\n")
final_g90 <- tail(season_data$observed_g90, 1)
accuracy <- projections %>%
filter(matchweek > 0) %>%
mutate(
error = projection - final_g90,
within_ci = final_g90 >= ci_lower & final_g90 <= ci_upper
)
# Calibration check
calibration <- accuracy %>%
summarise(
pct_in_95_ci = mean(within_ci) * 100,
avg_error = mean(error),
rmse = sqrt(mean(error^2)),
early_season_error = mean(error[matchweek <= 10]),
late_season_error = mean(error[matchweek > 28])
)
print(calibration)
# Visualization
par(mfrow = c(2, 2))
# 1. Projection evolution with CI
plot(projections$matchweek, projections$projection, type = "l",
col = "blue", lwd = 2, ylim = c(0.2, 0.7),
xlab = "Matchweek", ylab = "Goals per 90",
main = "In-Season Projection Evolution")
# Confidence interval
polygon(c(projections$matchweek, rev(projections$matchweek)),
c(projections$ci_lower, rev(projections$ci_upper)),
col = adjustcolor("blue", 0.2), border = NA)
# True rate and final observed
abline(h = true_rate, col = "red", lwd = 2, lty = 2)
abline(h = final_g90, col = "green", lwd = 2, lty = 3)
# Add observed running rate
lines(projections$matchweek[-1], projections$observed_g90[-1],
col = "orange", lwd = 1, lty = 1)
legend("topright",
legend = c("Projection", "95% CI", "True Rate", "Final Observed", "Running Obs"),
col = c("blue", adjustcolor("blue", 0.3), "red", "green", "orange"),
lty = c(1, NA, 2, 3, 1), lwd = c(2, NA, 2, 2, 1),
fill = c(NA, adjustcolor("blue", 0.3), NA, NA, NA),
border = c(NA, NA, NA, NA, NA), cex = 0.7)
# 2. Weight evolution
plot(projections$matchweek, projections$obs_weight * 100, type = "l",
     col = "purple", lwd = 2,
     xlab = "Matchweek", ylab = "Weight on Observed Data (%)",
     main = "Data Weight Evolution")
abline(h = 50, col = "gray", lty = 2)
text(5, 55, "50% threshold", cex = 0.8, col = "gray")
# 3. Confidence interval width
plot(projections$matchweek, projections$std * 1.96 * 2, type = "l",
     col = "darkgreen", lwd = 2,
     xlab = "Matchweek", ylab = "95% CI Width",
     main = "Uncertainty Reduction Over Season")
# 4. Error evolution
errors <- projections %>%
  filter(matchweek > 0) %>%
  mutate(error = abs(projection - final_g90))
plot(errors$matchweek, errors$error, type = "l",
     col = "red", lwd = 2,
     xlab = "Matchweek", ylab = "Absolute Error vs Final",
     main = "Projection Error Evolution")
par(mfrow = c(1, 1))
# Multiple player simulation
cat("\n=== MULTI-PLAYER VALIDATION ===\n")
simulate_multiple_players <- function(n_players = 50) {
  results <- list()
  for (i in 1:n_players) {
    # Random player characteristics
    true_rate <- runif(1, 0.2, 0.7)
    preseason_proj <- true_rate + rnorm(1, 0, 0.08)  # Noisy preseason
    preseason_std <- runif(1, 0.08, 0.15)
    season <- generate_season_data(true_g90 = true_rate)
    final_g90 <- tail(season$observed_g90, 1)
    # Track key projection points
    for (week in c(10, 19, 29, 38)) {
      if (week <= nrow(season)) {
        current <- season[week, ]
        updated <- update_projection(
          prior_mean = preseason_proj,
          prior_std = preseason_std,
          observed_g90 = current$observed_g90,
          minutes_played = current$cumulative_minutes
        )
        results[[length(results) + 1]] <- data.frame(
          player_id = i,
          week = week,
          true_rate = true_rate,
          preseason_proj = preseason_proj,
          in_season_proj = updated$mean,
          ci_lower = updated$mean - 1.96 * updated$std,
          ci_upper = updated$mean + 1.96 * updated$std,
          final_g90 = final_g90
        )
      }
    }
  }
  bind_rows(results)
}
multi_results <- simulate_multiple_players(100)
# Accuracy by week
accuracy_by_week <- multi_results %>%
  mutate(
    preseason_error = abs(preseason_proj - final_g90),
    inseason_error = abs(in_season_proj - final_g90),
    within_ci = final_g90 >= ci_lower & final_g90 <= ci_upper
  ) %>%
  group_by(week) %>%
  summarise(
    preseason_rmse = sqrt(mean(preseason_error^2)),
    inseason_rmse = sqrt(mean(inseason_error^2)),
    improvement_pct = (preseason_rmse - inseason_rmse) / preseason_rmse * 100,
    calibration = mean(within_ci) * 100,
    .groups = "drop"
  )
cat("\nProjection Accuracy by Matchweek:\n")
print(accuracy_by_week)
Chapter Summary
Key Takeaways
- Aging curves model how performance changes over a career; different metrics peak at different ages
- Regression to the mean is essential - extreme performances tend to normalize
- Marcel projections provide a simple but effective framework combining weighting, aging, and regression
- Machine learning can capture non-linear patterns but requires more data and validation
- Ensemble methods often outperform any single projection approach
- In-season projections blend preseason expectations with observed performance
- Career trajectories help with long-term planning and contract decisions
- Uncertainty quantification is crucial - always report confidence intervals
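The first three takeaways combine into a single formula: weight recent seasons, regress toward the league mean by sample size, then apply an aging adjustment. A minimal Python sketch of that Marcel-style combination follows; the weights (5/4/3), the 2700-minute regression ballast, the league average of 0.35, and the linear aging slopes around a peak age of 27 are all illustrative assumptions, not fitted values.

```python
import numpy as np

def marcel_projection(rates, minutes, age, league_avg=0.35):
    """Minimal Marcel-style projection for a per-90 rate.

    rates/minutes: most recent season first (up to 3 seasons).
    All constants below are illustrative, not fitted values.
    """
    rates = np.asarray(rates, dtype=float)
    minutes = np.asarray(minutes, dtype=float)
    weights = np.array([5.0, 4.0, 3.0][: len(rates)])

    # 1. Recency- and minutes-weighted average of past rates
    w = weights * minutes
    base = np.sum(w * rates) / np.sum(w)

    # 2. Regress toward the league mean: add "ballast" minutes of
    #    average performance (2700 minutes is an assumed constant)
    ballast = 2700.0
    eff_minutes = np.sum(minutes * weights / weights.max())
    regressed = (base * eff_minutes + league_avg * ballast) / (eff_minutes + ballast)

    # 3. Linear aging adjustment around an assumed peak age of 27
    next_age = age + 1
    if next_age >= 27:
        aging = -0.004 * (next_age - 27)   # gradual decline past peak
    else:
        aging = 0.002 * (27 - next_age)    # improvement before peak
    return regressed + aging
```

For example, a 24-year-old with rates of 0.6, 0.5, 0.4 goals per 90 over the last three seasons projects below his raw weighted average (about 0.53) because the regression step pulls him toward the league mean, partially offset by a small pre-peak aging boost.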
Projection System Checklist
- Weight recent seasons more heavily
- Apply appropriate aging adjustments
- Regress to the mean based on sample size
- Use metric-specific reliability estimates
- Calculate confidence intervals
- Validate against historical accuracy
- Update projections as new data arrives
- Communicate uncertainty clearly
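The last three checklist items (confidence intervals, validation, in-season updating) can all be served by a single precision-weighted blend of the preseason prior with observed performance, in the spirit of the `update_projection` call used in the R simulation above. The sketch below is a conjugate-normal version with an assumed observation-noise scale (`rate_sd_per_90`); it is an illustration of the update logic, not the chapter's exact implementation.

```python
import numpy as np

def update_projection(prior_mean, prior_std, observed_g90, minutes_played,
                      rate_sd_per_90=1.0):
    """Blend a preseason prior with the season-to-date per-90 rate.

    Treats the observed rate as a noisy measurement whose variance
    shrinks with minutes played; rate_sd_per_90 is an assumed noise scale.
    Returns the posterior mean and standard deviation.
    """
    n90 = max(minutes_played / 90.0, 1e-9)
    obs_var = (rate_sd_per_90 ** 2) / n90      # measurement variance
    prior_var = prior_std ** 2
    # Precision-weighted (conjugate-normal) posterior
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + observed_g90 / obs_var)
    return post_mean, np.sqrt(post_var)

# A hot start (0.80 g/90 over 900 minutes) against a 0.45 prior:
mean, std = update_projection(0.45, 0.10, 0.80, minutes_played=900)
ci = (mean - 1.96 * std, mean + 1.96 * std)
```

Note how the update discounts the hot start: the posterior mean lands between prior and observation, much closer to the prior early in the season, and the posterior standard deviation is always narrower than the prior's, which is exactly the "uncertainty reduction over the season" pattern plotted earlier.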