Capstone - Complete Analytics System
Learning Objectives
- Design and implement a complete recruitment analytics pipeline
- Build player databases with integrated metrics and valuations
- Create automated shortlisting and scoring systems
- Implement similarity search and replacement analysis
- Generate comprehensive scouting reports
This chapter brings together everything we've learned into a practical case study: building a complete player recruitment system from scratch. We'll create a production-style pipeline of the kind professional clubs use to identify, evaluate, and track transfer targets.
Recruitment System Architecture
A modern recruitment system consists of several interconnected components. We'll build each piece and integrate them into a cohesive pipeline.
Data Layer
- Player database
- Match event data
- Market valuations
- Contract information
Analytics Layer
- Metric calculations
- League adjustments
- Similarity models
- Projection models
Application Layer
- Search & filtering
- Shortlist management
- Report generation
- Comparison tools
Output Layer
- Scouting reports
- Radar charts
- Dashboards
- Alerts & notifications
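Before building the real classes, the layer flow above can be sketched as a chain of DataFrame-in, DataFrame-out stages. The sketch below is a minimal illustration with stub stages standing in for each layer; all names and numbers are hypothetical.

```python
from typing import Callable, List

import pandas as pd

def run_pipeline(stages: List[Callable[[pd.DataFrame], pd.DataFrame]],
                 frame: pd.DataFrame) -> pd.DataFrame:
    """Thread a DataFrame through each layer's stage in order."""
    for stage in stages:
        frame = stage(frame)
    return frame

# Stub stages standing in for the data, analytics, and application layers
ingest = lambda df: df.assign(nineties=df["Min"] / 90)
analyse = lambda df: df.assign(Gls_p90=df["Gls"] / df["nineties"])
rank = lambda df: df.sort_values("Gls_p90", ascending=False)

raw = pd.DataFrame({"Player": ["A", "B"], "Gls": [5, 8], "Min": [900, 1800]})
result = run_pipeline([ingest, analyse, rank], raw)
print(result[["Player", "Gls_p90"]])
```

The real system below replaces each stub with a proper component, but the shape — data flowing down through the layers — stays the same.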
# Python: Define recruitment system architecture
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from datetime import date

import pandas as pd

@dataclass
class Player:
    """Core player entity for recruitment system."""
    id: str
    name: str
    position: str
    age: int
    nationality: str
    current_club: str
    league: str
    contract_expiry: date
    market_value: float  # in millions
    stats: Dict[str, float] = field(default_factory=dict)
    adjusted_stats: Dict[str, float] = field(default_factory=dict)
    notes: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict:
        """Convert player to dictionary."""
        return {
            "id": self.id,
            "name": self.name,
            "position": self.position,
            "age": self.age,
            "nationality": self.nationality,
            "club": self.current_club,
            "league": self.league,
            "contract_expiry": self.contract_expiry,
            "market_value": self.market_value,
            **self.stats
        }

class RecruitmentDatabase:
    """Central database for recruitment system."""

    def __init__(self):
        self.players: Dict[str, Player] = {}
        self.shortlists: Dict[str, List[str]] = {}
        self._dataframe_cache = None

    def add_player(self, player: Player):
        """Add player to database."""
        self.players[player.id] = player
        self._dataframe_cache = None  # Invalidate cache

    def get_player(self, player_id: str) -> Optional[Player]:
        """Retrieve player by ID."""
        return self.players.get(player_id)

    def to_dataframe(self) -> pd.DataFrame:
        """Convert all players to DataFrame."""
        if self._dataframe_cache is None:
            self._dataframe_cache = pd.DataFrame([
                p.to_dict() for p in self.players.values()
            ])
        return self._dataframe_cache

    def create_shortlist(self, name: str):
        """Create new shortlist."""
        self.shortlists[name] = []

    def add_to_shortlist(self, shortlist_name: str, player_id: str):
        """Add player to shortlist."""
        if shortlist_name in self.shortlists:
            self.shortlists[shortlist_name].append(player_id)
print("Player and RecruitmentDatabase classes defined")

# R: Define recruitment system architecture
library(tidyverse)
library(R6)
# Core Player class
Player <- R6Class("Player",
  public = list(
    id = NULL,
    name = NULL,
    position = NULL,
    age = NULL,
    nationality = NULL,
    current_club = NULL,
    league = NULL,
    contract_expiry = NULL,
    market_value = NULL,
    stats = NULL,
    adjusted_stats = NULL,

    initialize = function(id, name, position, age, nationality,
                          current_club, league, contract_expiry, market_value) {
      self$id <- id
      self$name <- name
      self$position <- position
      self$age <- age
      self$nationality <- nationality
      self$current_club <- current_club
      self$league <- league
      self$contract_expiry <- contract_expiry
      self$market_value <- market_value
      self$stats <- list()
      self$adjusted_stats <- list()
    },

    set_stats = function(stats_list) {
      self$stats <- stats_list
    },

    to_df = function() {
      tibble(
        id = self$id,
        name = self$name,
        position = self$position,
        age = self$age,
        nationality = self$nationality,
        club = self$current_club,
        league = self$league,
        contract_expiry = self$contract_expiry,
        market_value = self$market_value
      )
    }
  )
)

# Recruitment Database class
RecruitmentDB <- R6Class("RecruitmentDB",
  public = list(
    players = NULL,
    shortlists = NULL,

    initialize = function() {
      self$players <- list()
      self$shortlists <- list()
    },

    add_player = function(player) {
      self$players[[player$id]] <- player
    },

    get_player = function(player_id) {
      self$players[[player_id]]
    },

    search = function(filters) {
      # Implementation in the next section
    },

    create_shortlist = function(name) {
      self$shortlists[[name]] <- list()
    }
  )
)
cat("Player and RecruitmentDB classes defined\n")

Data Ingestion Pipeline
The first step is building a robust data ingestion pipeline that pulls from multiple sources and normalizes the data into our player database.
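As a quick sanity check on the per-90 logic used throughout the pipeline, here is a toy two-player example (hypothetical numbers): same goal tally, half the minutes, double the per-90 rate.

```python
import pandas as pd

# Two hypothetical players: same goal tally, different minutes
df = pd.DataFrame({"Player": ["A", "B"], "Gls": [10, 10], "Min": [1800, 900]})
df["nineties"] = df["Min"] / 90
df["Gls_p90"] = df["Gls"] / df["nineties"]
print(df[["Player", "Gls_p90"]])  # A: 0.50, B: 1.00
```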
# Python: Build data ingestion pipeline
from dataclasses import dataclass
from typing import Dict, List, Optional

import pandas as pd
import soccerdata as sd

@dataclass
class DataIngestion:
    """Pipeline for ingesting player data from multiple sources."""
    league_multipliers: Optional[Dict[str, float]] = None

    def __post_init__(self):
        if self.league_multipliers is None:
            self.league_multipliers = {
                "ENG-Premier League": 1.00,
                "ESP-La Liga": 0.92,
                "ITA-Serie A": 0.90,
                "GER-Bundesliga": 0.90,
                "FRA-Ligue 1": 0.80,
                "NED-Eredivisie": 0.70,
                "POR-Primeira Liga": 0.68,
                "ENG-Championship": 0.65
            }

    def load_fbref_data(self, league: str,
                        season: str = "2023-2024") -> Optional[pd.DataFrame]:
        """Load player data from FBref."""
        try:
            fbref = sd.FBref(leagues=[league], seasons=[season])
            stats = fbref.read_player_season_stats(stat_type="standard")
            return stats.reset_index()
        except Exception as e:
            print(f"Error loading FBref data: {e}")
            return None

    def calculate_per90(self, data: pd.DataFrame,
                        min_minutes: int = 450) -> pd.DataFrame:
        """Normalize statistics to per-90 minutes."""
        df = data[data["Min"] >= min_minutes].copy()
        df["nineties"] = df["Min"] / 90
        per90_cols = ["Gls", "Ast", "xG", "xAG", "npxG"]
        for col in per90_cols:
            if col in df.columns:
                df[f"{col}_p90"] = df[col] / df["nineties"]
        df["goal_contribution_p90"] = (df["Gls"] + df["Ast"]) / df["nineties"]
        return df

    def adjust_for_league(self, data: pd.DataFrame,
                          league: str) -> pd.DataFrame:
        """Apply league quality adjustments."""
        multiplier = self.league_multipliers.get(league, 0.75)
        df = data.copy()
        adj_cols = ["xG_p90", "xAG_p90", "npxG_p90"]
        for col in adj_cols:
            if col in df.columns:
                df[f"adj_{col}"] = df[col] * multiplier
        df["league_quality"] = multiplier
        return df

    def ingest_league(self, league: str,
                      season: str = "2023-2024") -> Optional[pd.DataFrame]:
        """Full ingestion pipeline for a league."""
        data = self.load_fbref_data(league, season)
        if data is None:
            return None
        data = self.calculate_per90(data)
        data = self.adjust_for_league(data, league)
        return data

    def ingest_multiple_leagues(self, leagues: List[str],
                                season: str = "2023-2024") -> pd.DataFrame:
        """Ingest data from multiple leagues."""
        all_data = []
        for league in leagues:
            data = self.ingest_league(league, season)
            if data is not None:
                data["source_league"] = league
                all_data.append(data)
        return pd.concat(all_data, ignore_index=True) if all_data else pd.DataFrame()

# Example usage
ingestion = DataIngestion()
print("Data ingestion pipeline ready")

# R: Build data ingestion pipeline
library(tidyverse)
library(worldfootballR)
library(R6)

# Data Ingestion class
DataIngestion <- R6Class("DataIngestion",
  public = list(
    # League quality adjustments
    league_multipliers = NULL,

    initialize = function() {
      self$league_multipliers <- c(
        "Premier League" = 1.00,
        "La Liga" = 0.92,
        "Serie A" = 0.90,
        "Bundesliga" = 0.90,
        "Ligue 1" = 0.80,
        "Eredivisie" = 0.70,
        "Liga Portugal" = 0.68,
        "Championship" = 0.65
      )
    },

    # Load player data from FBref
    load_fbref_data = function(league, season = 2024) {
      tryCatch({
        # Get standard stats
        standard <- fb_big5_advanced_season_stats(
          season_end_year = season,
          stat_type = "standard",
          team_or_player = "player"
        )
        # Filter to league
        standard %>%
          filter(Comp == league) %>%
          select(
            player = Player,
            team = Squad,
            position = Pos,
            age = Age,
            minutes = Min,
            goals = Gls,
            assists = Ast,
            xg = xG,
            xa = xAG,
            npxg = npxG
          )
      }, error = function(e) {
        message(paste("Error loading FBref data:", e$message))
        NULL
      })
    },

    # Normalize stats to per-90
    calculate_per90 = function(data, min_minutes = 450) {
      data %>%
        filter(minutes >= min_minutes) %>%
        mutate(
          nineties = minutes / 90,
          goals_p90 = goals / nineties,
          assists_p90 = assists / nineties,
          xg_p90 = xg / nineties,
          xa_p90 = xa / nineties,
          npxg_p90 = npxg / nineties,
          goal_contribution_p90 = (goals + assists) / nineties
        )
    },

    # Apply league adjustments
    adjust_for_league = function(data, league) {
      # Single-bracket lookup returns NA (not an error) for unknown
      # leagues, so the fallback to 0.75 actually triggers
      multiplier <- unname(self$league_multipliers[league])
      if (is.na(multiplier)) multiplier <- 0.75
      data %>%
        mutate(
          adj_xg_p90 = xg_p90 * multiplier,
          adj_xa_p90 = xa_p90 * multiplier,
          adj_npxg_p90 = npxg_p90 * multiplier,
          league_quality = multiplier
        )
    },

    # Full ingestion pipeline
    ingest_league = function(league, season = 2024) {
      data <- self$load_fbref_data(league, season)
      if (is.null(data)) return(NULL)
      data %>%
        self$calculate_per90() %>%
        self$adjust_for_league(league)
    }
  )
)

# Example usage
ingestion <- DataIngestion$new()
cat("Data ingestion pipeline ready\n")

Player Scoring & Ranking System
To efficiently filter thousands of players down to actionable shortlists, we need a systematic scoring system that weights metrics based on positional requirements.
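To make the weighting concrete before the full class, here is a toy three-striker example (hypothetical numbers): each metric is percentile-ranked across the pool, then the percentiles are blended with weights of 0.6 and 0.4.

```python
import pandas as pd
from scipy import stats

# Three hypothetical strikers scored on two metrics, weighted 0.6 / 0.4
df = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "npxG_p90": [0.60, 0.45, 0.30],
    "Gls_p90": [0.40, 0.55, 0.20],
})
weights = {"npxG_p90": 0.6, "Gls_p90": 0.4}

# Percentile-rank each metric (0-100), then blend by weight
df["position_score"] = sum(
    w * stats.rankdata(df[m]) / len(df) * 100 for m, w in weights.items()
)
print(df.sort_values("position_score", ascending=False))
```

Player A tops the list: a perfect npxG percentile carries more weight than B's edge in finished goals. The class below applies exactly this logic with larger, position-specific weight sets.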
# Python: Build position-based scoring system
import pandas as pd
from typing import List
from scipy import stats

class PlayerScorer:
    """Position-based player scoring system."""

    def __init__(self):
        # Position-specific metric weights
        self.position_weights = {
            "Striker": {
                "npxG_p90": 0.30,
                "Gls_p90": 0.25,
                "xAG_p90": 0.10,
                "AerWon_p90": 0.10,
                "Press_p90": 0.10,
                "PrgC_p90": 0.15
            },
            "Winger": {
                "xAG_p90": 0.25,
                "PrgC_p90": 0.20,
                "Succ_p90": 0.15,  # Successful dribbles
                "npxG_p90": 0.15,
                "Crs_p90": 0.15,
                "Press_p90": 0.10
            },
            "Central_Midfielder": {
                "PrgP_p90": 0.25,
                "Cmp%": 0.15,
                "TklW_p90": 0.15,
                "xAG_p90": 0.15,
                "Press_p90": 0.15,
                "npxG_p90": 0.15
            },
            "Center_Back": {
                "AerWon_p90": 0.20,
                "TklW_p90": 0.20,
                "Int_p90": 0.15,
                "PrgP_p90": 0.15,
                "Clr_p90": 0.15,
                "Blocks_p90": 0.15
            }
        }

    def calculate_percentiles(self, data: pd.DataFrame,
                              metrics: List[str]) -> pd.DataFrame:
        """Calculate percentile ranks for metrics."""
        df = data.copy()
        for metric in metrics:
            if metric in df.columns:
                df[f"{metric}_pct"] = stats.rankdata(
                    df[metric], method="average"
                ) / len(df) * 100
        return df

    def score_players(self, data: pd.DataFrame,
                      position: str) -> pd.DataFrame:
        """Score players for a specific position."""
        if position not in self.position_weights:
            raise ValueError(f"Unknown position: {position}")
        weights = self.position_weights[position]
        metrics = list(weights.keys())
        # Calculate percentiles
        df = self.calculate_percentiles(data, metrics)
        # Calculate weighted score
        df["position_score"] = 0.0
        for metric, weight in weights.items():
            pct_col = f"{metric}_pct"
            if pct_col in df.columns:
                df["position_score"] += weight * df[pct_col]
        # Rank players
        df = df.sort_values("position_score", ascending=False)
        df["rank"] = range(1, len(df) + 1)
        return df

    def generate_position_ranking(self, data: pd.DataFrame,
                                  position: str,
                                  top_n: int = 20) -> pd.DataFrame:
        """Generate top N players for a position."""
        scored = self.score_players(data, position)
        return scored.head(top_n)[[
            "Player", "Squad", "Age", "Min",
            "position_score", "rank"
        ]]

# Example usage
scorer = PlayerScorer()
print("Scoring system initialized with position weights:")
for pos, weights in scorer.position_weights.items():
    print(f"  {pos}: {list(weights.keys())}")

# R: Build position-based scoring system
library(tidyverse)

# Define position-specific metric weights
position_weights <- list(
  "Striker" = c(
    npxg_p90 = 0.30,
    goals_p90 = 0.25,
    xa_p90 = 0.10,
    aerial_wins_p90 = 0.10,
    pressures_p90 = 0.10,
    progressive_carries_p90 = 0.15
  ),
  "Winger" = c(
    xa_p90 = 0.25,
    progressive_carries_p90 = 0.20,
    successful_dribbles_p90 = 0.15,
    npxg_p90 = 0.15,
    crosses_p90 = 0.15,
    pressures_p90 = 0.10
  ),
  "Central_Midfielder" = c(
    progressive_passes_p90 = 0.25,
    pass_completion = 0.15,
    tackles_won_p90 = 0.15,
    xa_p90 = 0.15,
    pressures_p90 = 0.15,
    npxg_p90 = 0.15
  ),
  "Center_Back" = c(
    aerial_wins_p90 = 0.20,
    tackles_won_p90 = 0.20,
    interceptions_p90 = 0.15,
    progressive_passes_p90 = 0.15,
    clearances_p90 = 0.15,
    blocks_p90 = 0.15
  )
)

# Score a single player; expects metrics already on a percentile scale
score_player <- function(player_data, position, weights_list = position_weights) {
  weights <- weights_list[[position]]
  if (is.null(weights)) {
    warning(paste("No weights defined for position:", position))
    return(NA)
  }
  # Calculate weighted score
  score <- 0
  for (metric in names(weights)) {
    if (metric %in% names(player_data)) {
      score <- score + weights[[metric]] * player_data[[metric]]
    }
  }
  score
}

# Calculate percentile ranks for all metrics
calculate_percentiles <- function(data, metrics) {
  for (metric in metrics) {
    if (metric %in% names(data)) {
      data[[paste0(metric, "_pct")]] <- percent_rank(data[[metric]]) * 100
    }
  }
  data
}

# Full scoring pipeline
score_players_for_position <- function(data, position,
                                       weights_list = position_weights) {
  weights <- weights_list[[position]]
  metrics <- names(weights)
  # Calculate percentiles
  data <- calculate_percentiles(data, metrics)
  # Calculate weighted score
  data$position_score <- 0
  for (metric in metrics) {
    pct_col <- paste0(metric, "_pct")
    if (pct_col %in% names(data)) {
      data$position_score <- data$position_score +
        weights[[metric]] * data[[pct_col]]
    }
  }
  # Rank players
  data %>%
    arrange(desc(position_score)) %>%
    mutate(rank = row_number())
}
cat("Scoring system defined with position-specific weights\n")

Player Similarity Engine
Finding similar players is essential for replacement analysis and identifying alternatives. We'll implement cosine similarity on standardized metrics.
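As a refresher, cosine similarity measures the angle between two stat vectors, ignoring their magnitude: near +1 means the same stylistic profile, -1 an opposite one. A minimal sketch with hypothetical z-scored vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical z-scored stat vectors
a = np.array([1.2, -0.5, 0.8])
b = np.array([1.0, -0.4, 0.9])  # similar profile to a
c = -a                          # exact opposite profile

sims = cosine_similarity([a], np.vstack([b, c]))[0]
print(sims.round(3))  # close to +1 for b, exactly -1 for c
```

Standardizing the metrics first matters: without z-scoring, high-volume metrics like pressures would dominate the angle calculation.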
# Python: Build player similarity engine
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
from typing import Dict, List, Optional

class SimilarityEngine:
    """Engine for finding similar players."""

    def __init__(self, feature_cols: List[str]):
        self.feature_cols = feature_cols
        self.scaler = StandardScaler()
        self.similarity_matrix = None
        self.player_index = None

    def fit(self, data: pd.DataFrame):
        """Fit the similarity model on player data."""
        # Extract features
        features = data[self.feature_cols].fillna(0)
        # Standardize
        scaled_features = self.scaler.fit_transform(features)
        # Calculate similarity matrix
        self.similarity_matrix = cosine_similarity(scaled_features)
        # Store player index mapping
        self.player_index = {
            player: idx for idx, player in enumerate(data["Player"])
        }
        return self

    def find_similar(self, target_player: str,
                     top_n: int = 10) -> Dict[str, float]:
        """Find most similar players to target."""
        if target_player not in self.player_index:
            raise ValueError(f"Player not found: {target_player}")
        idx = self.player_index[target_player]
        similarities = self.similarity_matrix[idx]
        # Get indices sorted by similarity (descending)
        similar_indices = np.argsort(similarities)[::-1]
        # Build result dict (excluding the target player)
        players = list(self.player_index)  # index -> name, insertion order
        results = {}
        for i in similar_indices:
            player_name = players[i]
            if player_name != target_player:
                results[player_name] = float(similarities[i])
                if len(results) >= top_n:
                    break
        return results

    def find_similar_with_filters(self, target_player: str,
                                  data: pd.DataFrame,
                                  max_age: Optional[int] = None,
                                  max_value: Optional[float] = None,
                                  min_minutes: int = 900,
                                  leagues: Optional[List[str]] = None,
                                  top_n: int = 10) -> pd.DataFrame:
        """Find similar players with filters applied."""
        # Get similarity scores for a wide candidate pool
        all_similar = self.find_similar(target_player, top_n=100)
        # Create results dataframe
        results = pd.DataFrame([
            {"Player": name, "similarity": score}
            for name, score in all_similar.items()
        ])
        # Merge with full data
        results = results.merge(data, on="Player", how="left")
        # Apply filters
        if max_age is not None:
            results = results[results["Age"] <= max_age]
        if max_value is not None:
            results = results[results["market_value"] <= max_value]
        if min_minutes is not None:
            results = results[results["Min"] >= min_minutes]
        if leagues is not None:
            results = results[results["league"].isin(leagues)]
        return results.head(top_n)[[
            "Player", "similarity", "Age", "Squad",
            "market_value", "Min"
        ]]

# Example usage
feature_cols = ["npxG_p90", "xAG_p90", "PrgC_p90", "PrgP_p90", "Press_p90"]
similarity_engine = SimilarityEngine(feature_cols)
print(f"Similarity engine initialized with features: {feature_cols}")

# R: Build player similarity engine
library(tidyverse)

# Cosine similarity between two numeric vectors
calculate_cosine_similarity <- function(vec1, vec2) {
  sum(vec1 * vec2) / (sqrt(sum(vec1^2)) * sqrt(sum(vec2^2)))
}

# Build similarity matrix
build_similarity_matrix <- function(data, feature_cols) {
  # Standardize features
  feature_data <- data %>%
    select(all_of(feature_cols)) %>%
    mutate(across(everything(), ~ scale(.)[, 1]))
  # Replace NA with 0
  feature_data[is.na(feature_data)] <- 0
  # Convert to matrix
  feature_matrix <- as.matrix(feature_data)
  # Calculate similarity for all pairs
  n <- nrow(feature_matrix)
  similarity_matrix <- matrix(0, n, n)
  for (i in 1:n) {
    for (j in 1:n) {
      similarity_matrix[i, j] <- calculate_cosine_similarity(
        feature_matrix[i, ], feature_matrix[j, ]
      )
    }
  }
  # Set row/column names
  rownames(similarity_matrix) <- data$player
  colnames(similarity_matrix) <- data$player
  similarity_matrix
}

# Find most similar players
find_similar_players <- function(target_player, similarity_matrix, top_n = 10) {
  if (!target_player %in% rownames(similarity_matrix)) {
    stop(paste("Player not found:", target_player))
  }
  similarities <- similarity_matrix[target_player, ]
  similarities <- similarities[names(similarities) != target_player]
  sorted_sim <- sort(similarities, decreasing = TRUE)
  head(sorted_sim, top_n)
}

# Enhanced similarity with filters
find_similar_with_filters <- function(target_player, data, similarity_matrix,
                                      max_age = NULL, max_value = NULL,
                                      min_minutes = 900, top_n = 10) {
  # Get base similar players
  similarities <- find_similar_players(target_player, similarity_matrix, top_n = 50)
  # Create result dataframe
  results <- tibble(
    player = names(similarities),
    similarity = as.numeric(similarities)
  ) %>%
    left_join(data, by = "player")
  # Apply filters
  if (!is.null(max_age)) {
    results <- results %>% filter(age <= max_age)
  }
  if (!is.null(max_value)) {
    results <- results %>% filter(market_value <= max_value)
  }
  results <- results %>% filter(minutes >= min_minutes)
  # Return top N after filters
  results %>%
    head(top_n) %>%
    select(player, similarity, age, team, market_value, minutes)
}
cat("Similarity engine functions defined\n")

Replacement Analysis
Once similarity search works, replacement analysis is a constrained version of the same problem: find the players closest to a departing starter who also fit the club's budget and age profile, then rank them on a blended recommendation score.
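The recommendation score below blends similarity (50%), value for money (30%), and age headroom (20%). A quick worked example with a hypothetical candidate and assumed pool maxima for the normalization step:

```python
# Hypothetical candidate: 24 years old, valued at 30M, 0.85 similarity
similarity, market_value, age = 0.85, 30.0, 24

value_score = similarity * 100 / (market_value + 1)  # ~2.74
development_potential = (28 - age) * 2               # 8

# Assumed pool maxima for normalization: 5.0 (value), 20 (potential)
overall = (similarity * 0.5
           + (value_score / 5.0) * 0.3
           + (development_potential / 20) * 0.2)
print(round(overall, 3))  # 0.67
```

In the full system, the maxima come from the candidate pool itself, so the score is always relative to the alternatives on the table.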
# Python: Replacement analysis workflow
import pandas as pd
from typing import Optional, Tuple

class ReplacementAnalyzer:
    """Analyze and recommend player replacements."""

    def __init__(self, similarity_engine: SimilarityEngine):
        self.similarity_engine = similarity_engine

    def find_replacements(self, departing_player: str,
                          data: pd.DataFrame,
                          budget: float,
                          position_filter: Optional[str] = None,
                          age_range: Tuple[int, int] = (18, 28),
                          top_n: int = 10) -> pd.DataFrame:
        """Find replacement candidates for a departing player."""
        # Get similar players within age and budget limits
        candidates = self.similarity_engine.find_similar_with_filters(
            departing_player,
            data,
            max_age=age_range[1],
            max_value=budget,
            top_n=50
        )
        # Filter by position
        if position_filter and "Pos" in candidates.columns:
            candidates = candidates[
                candidates["Pos"].str.contains(position_filter, na=False)
            ]
        # Filter by age range
        candidates = candidates[
            (candidates["Age"] >= age_range[0]) &
            (candidates["Age"] <= age_range[1])
        ]
        # Calculate recommendation scores
        candidates = candidates.copy()
        candidates["value_score"] = (
            candidates["similarity"] * 100 /
            (candidates["market_value"] + 1)
        )
        candidates["development_potential"] = (28 - candidates["Age"]) * 2
        # Normalize scores against the candidate pool
        max_value_score = candidates["value_score"].max()
        max_dev = candidates["development_potential"].max()
        candidates["overall_recommendation"] = (
            candidates["similarity"] * 0.5 +
            (candidates["value_score"] / max_value_score) * 0.3 +
            (candidates["development_potential"] / max_dev) * 0.2
        )
        return candidates.sort_values(
            "overall_recommendation", ascending=False
        ).head(top_n)

    def generate_report(self, departing_player: str,
                        replacements: pd.DataFrame) -> str:
        """Generate text report for replacements."""
        report = f"""
=== REPLACEMENT ANALYSIS FOR: {departing_player} ===
TOP 5 RECOMMENDED REPLACEMENTS:
{"-" * 50}
"""
        for i, (_, r) in enumerate(replacements.head(5).iterrows(), 1):
            report += f"""
{i}. {r["Player"]} ({r.get("Squad", "N/A")})
   Age: {r["Age"]} | Value: €{r.get("market_value", 0):.1f}M
   Similarity: {r["similarity"]*100:.1f}% | Value Score: {r["value_score"]:.2f}
   Recommendation Score: {r["overall_recommendation"]:.2f}
"""
        return report

# Example
print("Replacement analysis system ready")

# R: Replacement analysis workflow
library(tidyverse)

# Find replacement candidates for a departing player
find_replacements <- function(departing_player, data, similarity_matrix,
                              budget, position_filter = NULL,
                              age_range = c(18, 28)) {
  # Find similar players
  candidates <- find_similar_with_filters(
    departing_player, data, similarity_matrix,
    max_age = age_range[2],
    max_value = budget,
    min_minutes = 900,
    top_n = 20
  )
  # Filter by position if specified (and if the column survived selection)
  if (!is.null(position_filter) && "position" %in% names(candidates)) {
    candidates <- candidates %>%
      filter(grepl(position_filter, position))
  }
  # Filter by age
  candidates <- candidates %>%
    filter(age >= age_range[1], age <= age_range[2])
  # Calculate value score
  candidates %>%
    mutate(
      value_score = similarity * 100 / (market_value + 1),
      development_potential = (28 - age) * 2,
      overall_recommendation = similarity * 0.5 +
        (value_score / max(value_score)) * 0.3 +
        (development_potential / max(development_potential)) * 0.2
    ) %>%
    arrange(desc(overall_recommendation))
}

# Generate replacement report
generate_replacement_report <- function(departing_player, replacements) {
  cat(sprintf("\n=== REPLACEMENT ANALYSIS FOR: %s ===\n\n", departing_player))
  cat("TOP 5 RECOMMENDED REPLACEMENTS:\n")
  cat(paste(rep("-", 50), collapse = ""), "\n")
  for (i in 1:min(5, nrow(replacements))) {
    r <- replacements[i, ]
    cat(sprintf("%d. %s (%s)\n", i, r$player, r$team))
    cat(sprintf("   Age: %.0f | Value: €%.1fM\n", r$age, r$market_value))
    cat(sprintf("   Similarity: %.1f%% | Value Score: %.2f\n",
                r$similarity * 100, r$value_score))
    cat(sprintf("   Recommendation Score: %.2f\n\n", r$overall_recommendation))
  }
}

Scouting Report Generator
The final output of our recruitment system is comprehensive scouting reports that combine quantitative analysis with formatted output suitable for decision-makers.
# Python: Generate comprehensive scouting reports
import numpy as np
import pandas as pd
from typing import Dict, List
from datetime import datetime
import matplotlib.pyplot as plt

class ScoutingReportGenerator:
    """Generate comprehensive scouting reports."""

    def __init__(self, similarity_engine: SimilarityEngine):
        self.similarity_engine = similarity_engine

    def create_radar_chart(self, player_data: Dict,
                           metrics: List[str],
                           title: str) -> plt.Figure:
        """Create radar chart for player profile."""
        # Number of metrics
        N = len(metrics)
        angles = [n / float(N) * 2 * np.pi for n in range(N)]
        angles += angles[:1]  # Complete the loop
        # Get values
        values = [player_data.get(m, 0) for m in metrics]
        values += values[:1]
        # Create plot
        fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
        ax.plot(angles, values, "o-", linewidth=2)
        ax.fill(angles, values, alpha=0.25)
        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(metrics, size=10)
        ax.set_title(title, size=14, fontweight="bold", y=1.08)
        return fig

    def generate_report(self, player_name: str,
                        data: pd.DataFrame) -> str:
        """Generate full scouting report."""
        player = data[data["Player"] == player_name]
        if player.empty:
            raise ValueError(f"Player not found: {player_name}")
        p = player.iloc[0]
        # Build report sections
        header = f"""
================================================================================
SCOUTING REPORT
================================================================================
Player: {p.get("Player", "N/A")}
Position: {p.get("Pos", "N/A")} | Age: {p.get("Age", "N/A")}
Current Club: {p.get("Squad", "N/A")} ({p.get("league", "N/A")})
Market Value: €{p.get("market_value", 0):.1f}M
Minutes Played: {p.get("Min", 0)} ({p.get("Min", 0)/90:.1f} 90s)
================================================================================
"""
        performance = f"""
PERFORMANCE METRICS (Per 90 Minutes):
-------------------------------------
Goals: {p.get("Gls_p90", 0):.2f} | xG: {p.get("xG_p90", 0):.2f}
Assists: {p.get("Ast_p90", 0):.2f} | xA: {p.get("xAG_p90", 0):.2f}
Non-Penalty xG: {p.get("npxG_p90", 0):.2f}
Goal Contribution: {p.get("goal_contribution_p90", 0):.2f}
"""
        # Get similar players
        try:
            similar = self.similarity_engine.find_similar(player_name, top_n=5)
            similar_text = "\n".join([
                f"  - {name} ({score*100:.1f}% similar)"
                for name, score in similar.items()
            ])
        except (ValueError, AttributeError):
            similar_text = "  (Unable to calculate)"
        similar_section = f"""
SIMILAR PLAYERS:
----------------
{similar_text}
"""
        recommendation = """
RECOMMENDATION:
---------------
[To be filled by scout based on video analysis]
Strengths:
-
Weaknesses:
-
Fit Assessment:
-
Risk Factors:
-
================================================================================
"""
        report = header + performance + similar_section + recommendation
        # Add metadata
        report += f"""
Report Generated: {datetime.now().strftime("%Y-%m-%d %H:%M")}
Data Source: FBref / StatsBomb
"""
        return report

    def export_report(self, report: str, filename: str):
        """Export report to file."""
        with open(filename, "w") as f:
            f.write(report)
        print(f"Report saved to {filename}")

# Example usage
print("Scouting report generator ready")

# R: Generate comprehensive scouting reports
library(tidyverse)
library(ggplot2)

# Create radar-style chart for player (polar bar chart)
create_player_radar <- function(player_data, metrics, title) {
  # Prepare data for radar chart
  radar_data <- player_data %>%
    select(all_of(metrics)) %>%
    pivot_longer(everything(), names_to = "metric", values_to = "value")
  # Create plot
  ggplot(radar_data, aes(x = metric, y = value)) +
    geom_bar(stat = "identity", fill = "steelblue", alpha = 0.7) +
    coord_polar() +
    theme_minimal() +
    labs(title = title) +
    theme(axis.text.x = element_text(size = 8))
}

# Generate full scouting report
generate_scouting_report <- function(player_name, data, similarity_matrix,
                                     league_adjustments) {
  player <- data %>% filter(player == player_name)
  if (nrow(player) == 0) {
    stop(paste("Player not found:", player_name))
  }
  # Build report
  report <- list()

  # Header section
  report$header <- sprintf("
================================================================================
SCOUTING REPORT
================================================================================
Player: %s
Position: %s | Age: %.0f | Nationality: %s
Current Club: %s (%s)
Contract Expires: %s
Market Value: €%.1fM
Minutes Played: %.0f (%.1f 90s)
================================================================================
",
    player$player,
    player$position,
    player$age,
    player$nationality,
    player$team,
    player$league,
    player$contract_expiry,
    player$market_value,
    player$minutes,
    player$minutes / 90
  )

  # Performance metrics
  report$performance <- sprintf("
PERFORMANCE METRICS (Per 90 Minutes):
-------------------------------------
Goals: %.2f | xG: %.2f | Overperformance: %+.2f
Assists: %.2f | xA: %.2f | Overperformance: %+.2f
Non-Penalty xG: %.2f
League-Adjusted Metrics (to Premier League):
Goals (adj): %.2f | xG (adj): %.2f | xA (adj): %.2f
",
    player$goals_p90, player$xg_p90, player$goals_p90 - player$xg_p90,
    player$assists_p90, player$xa_p90, player$assists_p90 - player$xa_p90,
    player$npxg_p90,
    player$adj_goals_p90, player$adj_xg_p90, player$adj_xa_p90
  )

  # Percentile rankings
  report$rankings <- "
PERCENTILE RANKINGS (vs Position):
----------------------------------
See attached radar chart
"

  # Similar players
  similar <- find_similar_players(player_name, similarity_matrix, top_n = 5)
  similar_text <- paste(
    sprintf("  - %s (%.1f%% similar)", names(similar), similar * 100),
    collapse = "\n"
  )
  report$similar <- sprintf("
SIMILAR PLAYERS:
----------------
%s
", similar_text)

  # Recommendation
  report$recommendation <- "
RECOMMENDATION:
---------------
[To be filled by scout based on video analysis]
Strengths:
-
Weaknesses:
-
Fit Assessment:
-
Risk Factors:
-
================================================================================
"

  # Combine all sections
  full_report <- paste(
    report$header,
    report$performance,
    report$rankings,
    report$similar,
    report$recommendation,
    sep = "\n"
  )
  return(full_report)
}

Practice Exercises
1. Implement the complete recruitment system from scratch. Load data from FBref for the top 5 leagues, calculate per-90 metrics, apply league adjustments, and create a searchable database with filtering capabilities.
2. Design custom scoring weights for a modern "inverted full-back" role. Identify the key metrics that define this position and create a scoring system. Test it by finding the top 10 inverted full-backs in Europe.
3. A top-6 Premier League club is losing their starting striker (28 years old, €60M value). Use the replacement analysis system to identify the top 10 replacement candidates with a budget of €40M and a maximum age of 26.
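As a starting scaffold for these exercises, here is a compact, self-contained sketch of the whole loop on toy data — weighted percentile scoring, similarity to a departing player, and a budget/age-filtered shortlist. Every name and number is made up; swap in real ingested data.

```python
import pandas as pd
from scipy import stats
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Toy squad (all names and numbers hypothetical)
df = pd.DataFrame({
    "Player": ["Departing", "Alt1", "Alt2", "Alt3"],
    "Age": [28, 24, 31, 22],
    "market_value": [60.0, 35.0, 25.0, 38.0],
    "npxG_p90": [0.55, 0.50, 0.52, 0.30],
    "xAG_p90": [0.20, 0.22, 0.18, 0.35],
})
features = ["npxG_p90", "xAG_p90"]

# 1. Position score: weighted percentile ranks
weights = {"npxG_p90": 0.6, "xAG_p90": 0.4}
df["score"] = sum(w * stats.rankdata(df[m]) / len(df) * 100
                  for m, w in weights.items())

# 2. Similarity of everyone to the departing player (row 0)
X = StandardScaler().fit_transform(df[features])
df["similarity"] = cosine_similarity(X[[0]], X)[0]

# 3. Shortlist: within budget and age cap, excluding the player himself
shortlist = df[(df["Player"] != "Departing")
               & (df["market_value"] <= 40)
               & (df["Age"] <= 26)].sort_values("similarity", ascending=False)
print(shortlist[["Player", "similarity", "score"]])
```

Here Alt2 falls out on age despite being the closest stylistic match within budget, which is exactly the kind of trade-off the blended recommendation score in the chapter is designed to surface.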
Summary
Key Takeaways
- System architecture: Modern recruitment systems have data, analytics, application, and output layers
- Data pipeline: Robust ingestion with league adjustments is foundational
- Position-specific scoring: Different metrics matter for different positions
- Similarity search: Cosine similarity on standardized metrics finds comparable players
- Report generation: Combine quantitative analysis with formatted output for decision-makers
System Components Built
- Player data model and database structure
- Multi-source data ingestion pipeline
- Position-based scoring and ranking system
- Player similarity engine with filtering
- Replacement analysis workflow
- Automated scouting report generator