Chapter 60

Capstone - Complete Analytics System

Learning Objectives
  • Design and implement a complete recruitment analytics pipeline
  • Build player databases with integrated metrics and valuations
  • Create automated shortlisting and scoring systems
  • Implement similarity search and replacement analysis
  • Generate comprehensive scouting reports

This chapter brings together everything we've learned into a practical case study: building a complete player recruitment system from scratch. We'll create a production-style pipeline of the kind professional clubs use to identify, evaluate, and track transfer targets.

Recruitment System Architecture

A modern recruitment system consists of several interconnected components. We'll build each piece and integrate them into a cohesive pipeline.

System Components
Data Layer
  • Player database
  • Match event data
  • Market valuations
  • Contract information
Analytics Layer
  • Metric calculations
  • League adjustments
  • Similarity models
  • Projection models
Application Layer
  • Search & filtering
  • Shortlist management
  • Report generation
  • Comparison tools
Output Layer
  • Scouting reports
  • Radar charts
  • Dashboards
  • Alerts & notifications
recruitment_architecture.R / recruitment_architecture.py
# Python: Define recruitment system architecture
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from datetime import date
import pandas as pd

@dataclass
class Player:
    """Core player entity for recruitment system."""
    id: str
    name: str
    position: str
    age: int
    nationality: str
    current_club: str
    league: str
    contract_expiry: date
    market_value: float  # in millions
    stats: Dict[str, float] = field(default_factory=dict)
    adjusted_stats: Dict[str, float] = field(default_factory=dict)
    notes: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict:
        """Convert player to dictionary."""
        return {
            "id": self.id,
            "name": self.name,
            "position": self.position,
            "age": self.age,
            "nationality": self.nationality,
            "club": self.current_club,
            "league": self.league,
            "contract_expiry": self.contract_expiry,
            "market_value": self.market_value,
            **self.stats
        }

class RecruitmentDatabase:
    """Central database for recruitment system."""

    def __init__(self):
        self.players: Dict[str, Player] = {}
        self.shortlists: Dict[str, List[str]] = {}
        self._dataframe_cache = None

    def add_player(self, player: Player):
        """Add player to database."""
        self.players[player.id] = player
        self._dataframe_cache = None  # Invalidate cache

    def get_player(self, player_id: str) -> Optional[Player]:
        """Retrieve player by ID."""
        return self.players.get(player_id)

    def to_dataframe(self) -> pd.DataFrame:
        """Convert all players to DataFrame."""
        if self._dataframe_cache is None:
            self._dataframe_cache = pd.DataFrame([
                p.to_dict() for p in self.players.values()
            ])
        return self._dataframe_cache

    def create_shortlist(self, name: str):
        """Create new shortlist."""
        self.shortlists[name] = []

    def add_to_shortlist(self, shortlist_name: str, player_id: str):
        """Add player to shortlist."""
        if shortlist_name in self.shortlists:
            self.shortlists[shortlist_name].append(player_id)

print("Player and RecruitmentDatabase classes defined")
# R: Define recruitment system architecture
library(tidyverse)
library(R6)

# Core Player class
Player <- R6Class("Player",
  public = list(
    id = NULL,
    name = NULL,
    position = NULL,
    age = NULL,
    nationality = NULL,
    current_club = NULL,
    league = NULL,
    contract_expiry = NULL,
    market_value = NULL,
    stats = NULL,
    adjusted_stats = NULL,

    initialize = function(id, name, position, age, nationality,
                         current_club, league, contract_expiry, market_value) {
      self$id <- id
      self$name <- name
      self$position <- position
      self$age <- age
      self$nationality <- nationality
      self$current_club <- current_club
      self$league <- league
      self$contract_expiry <- contract_expiry
      self$market_value <- market_value
      self$stats <- list()
      self$adjusted_stats <- list()
    },

    set_stats = function(stats_list) {
      self$stats <- stats_list
    },

    to_df = function() {
      tibble(
        id = self$id,
        name = self$name,
        position = self$position,
        age = self$age,
        nationality = self$nationality,
        club = self$current_club,
        league = self$league,
        contract_expiry = self$contract_expiry,
        market_value = self$market_value
      )
    }
  )
)

# Recruitment Database class
RecruitmentDB <- R6Class("RecruitmentDB",
  public = list(
    players = NULL,
    shortlists = NULL,

    initialize = function() {
      self$players <- list()
      self$shortlists <- list()
    },

    add_player = function(player) {
      self$players[[player$id]] <- player
    },

    get_player = function(player_id) {
      self$players[[player_id]]
    },

    search = function(filters) {
      # Implementation in next section
    },

    create_shortlist = function(name) {
      self$shortlists[[name]] <- list()
    }
  )
)

cat("Player and RecruitmentDB classes defined\n")
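Before wiring in real data, a quick usage sketch of the data layer. This is a deliberately trimmed re-definition of the `Player` and `RecruitmentDatabase` classes above (a few fields only, with invented players, so the flow stays readable), not the full implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List
import pandas as pd

# Trimmed versions of the classes above: just enough fields
# to show the register -> shortlist -> DataFrame flow.
@dataclass
class Player:
    id: str
    name: str
    position: str
    age: int
    stats: Dict[str, float] = field(default_factory=dict)

class RecruitmentDatabase:
    def __init__(self):
        self.players: Dict[str, Player] = {}
        self.shortlists: Dict[str, List[str]] = {}

    def add_player(self, player: Player) -> None:
        self.players[player.id] = player

    def to_dataframe(self) -> pd.DataFrame:
        # Flatten each player's fields and stats into one row
        return pd.DataFrame([
            {"id": p.id, "name": p.name, "position": p.position,
             "age": p.age, **p.stats}
            for p in self.players.values()
        ])

db = RecruitmentDatabase()
db.add_player(Player("p1", "A. Striker", "FW", 24, {"npxG_p90": 0.55}))
db.add_player(Player("p2", "B. Winger", "FW", 21, {"npxG_p90": 0.31}))
db.shortlists["summer_targets"] = ["p1"]

df = db.to_dataframe()
print(df.shape)       # (2, 5)
print(db.shortlists)  # {'summer_targets': ['p1']}
```

The full classes add cache invalidation and shortlist helpers, but the shape of the workflow is the same: register players once, group them into named shortlists, and flatten to a DataFrame whenever analysis needs tabular data.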

Data Ingestion Pipeline

The first step is building a robust data ingestion pipeline that pulls from multiple sources and normalizes the data into our player database.

data_ingestion.R / data_ingestion.py
# Python: Build data ingestion pipeline
import pandas as pd
from typing import Optional, Dict
from dataclasses import dataclass
import soccerdata as sd

@dataclass
class DataIngestion:
    """Pipeline for ingesting player data from multiple sources."""

    league_multipliers: Optional[Dict[str, float]] = None

    def __post_init__(self):
        if self.league_multipliers is None:
            self.league_multipliers = {
                "ENG-Premier League": 1.00,
                "ESP-La Liga": 0.92,
                "ITA-Serie A": 0.90,
                "GER-Bundesliga": 0.90,
                "FRA-Ligue 1": 0.80,
                "NED-Eredivisie": 0.70,
                "POR-Primeira Liga": 0.68,
                "ENG-Championship": 0.65
            }

    def load_fbref_data(self, league: str, season: str = "2023-2024") -> Optional[pd.DataFrame]:
        """Load player data from FBref."""
        try:
            fbref = sd.FBref(leagues=[league], seasons=[season])
            stats = fbref.read_player_season_stats(stat_type="standard")
            return stats.reset_index()
        except Exception as e:
            print(f"Error loading FBref data: {e}")
            return None

    def calculate_per90(self, data: pd.DataFrame,
                        min_minutes: int = 450) -> pd.DataFrame:
        """Normalize statistics to per-90 minutes."""
        df = data[data["Min"] >= min_minutes].copy()

        df["nineties"] = df["Min"] / 90

        per90_cols = ["Gls", "Ast", "xG", "xAG", "npxG"]
        for col in per90_cols:
            if col in df.columns:
                df[f"{col}_p90"] = df[col] / df["nineties"]

        df["goal_contribution_p90"] = (df["Gls"] + df["Ast"]) / df["nineties"]

        return df

    def adjust_for_league(self, data: pd.DataFrame,
                          league: str) -> pd.DataFrame:
        """Apply league quality adjustments."""
        multiplier = self.league_multipliers.get(league, 0.75)
        df = data.copy()

        adj_cols = ["xG_p90", "xAG_p90", "npxG_p90"]
        for col in adj_cols:
            if col in df.columns:
                df[f"adj_{col}"] = df[col] * multiplier

        df["league_quality"] = multiplier
        return df

    def ingest_league(self, league: str,
                      season: str = "2023-2024") -> Optional[pd.DataFrame]:
        """Full ingestion pipeline for a league."""
        data = self.load_fbref_data(league, season)
        if data is None:
            return None

        data = self.calculate_per90(data)
        data = self.adjust_for_league(data, league)
        return data

    def ingest_multiple_leagues(self, leagues: list,
                                season: str = "2023-2024") -> pd.DataFrame:
        """Ingest data from multiple leagues."""
        all_data = []
        for league in leagues:
            data = self.ingest_league(league, season)
            if data is not None:
                data["source_league"] = league
                all_data.append(data)

        return pd.concat(all_data, ignore_index=True) if all_data else pd.DataFrame()

# Example usage
ingestion = DataIngestion()
print("Data ingestion pipeline ready")
# R: Build data ingestion pipeline
library(tidyverse)
library(worldfootballR)

# Data Ingestion class
DataIngestion <- R6Class("DataIngestion",
  public = list(
    # League quality adjustments
    league_multipliers = NULL,

    initialize = function() {
      self$league_multipliers <- c(
        "Premier League" = 1.00,
        "La Liga" = 0.92,
        "Serie A" = 0.90,
        "Bundesliga" = 0.90,
        "Ligue 1" = 0.80,
        "Eredivisie" = 0.70,
        "Liga Portugal" = 0.68,
        "Championship" = 0.65
      )
    },

    # Load player data from FBref
    load_fbref_data = function(league, season = 2024) {
      tryCatch({
        # Get standard stats
        standard <- fb_big5_advanced_season_stats(
          season_end_year = season,
          stat_type = "standard",
          team_or_player = "player"
        )

        # Filter to league
        standard %>%
          filter(Comp == league) %>%
          select(
            player = Player,
            team = Squad,
            position = Pos,
            age = Age,
            minutes = Min,
            goals = Gls,
            assists = Ast,
            xg = xG,
            xa = xAG,
            npxg = npxG
          )
      }, error = function(e) {
        message(paste("Error loading FBref data:", e$message))
        NULL
      })
    },

    # Normalize stats to per-90
    calculate_per90 = function(data, min_minutes = 450) {
      data %>%
        filter(minutes >= min_minutes) %>%
        mutate(
          nineties = minutes / 90,
          goals_p90 = goals / nineties,
          assists_p90 = assists / nineties,
          xg_p90 = xg / nineties,
          xa_p90 = xa / nineties,
          npxg_p90 = npxg / nineties,
          goal_contribution_p90 = (goals + assists) / nineties
        )
    },

    # Apply league adjustments
    adjust_for_league = function(data, league) {
      # Note: `[[` errors on a missing name in a named vector, so use `[` + is.na()
      multiplier <- unname(self$league_multipliers[league])
      if (is.na(multiplier)) multiplier <- 0.75

      data %>%
        mutate(
          adj_xg_p90 = xg_p90 * multiplier,
          adj_xa_p90 = xa_p90 * multiplier,
          adj_npxg_p90 = npxg_p90 * multiplier,
          league_quality = multiplier
        )
    },

    # Full ingestion pipeline
    ingest_league = function(league, season = 2024) {
      data <- self$load_fbref_data(league, season)
      if (is.null(data)) return(NULL)

      data %>%
        self$calculate_per90() %>%
        self$adjust_for_league(league)
    }
  )
)

# Example usage
ingestion <- DataIngestion$new()
cat("Data ingestion pipeline ready\n")
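To see the two core transforms without hitting FBref (the real pipeline depends on network access), here is a sketch on a hand-made toy frame. The column names (`Min`, `Gls`, `Ast`, `xG`) mirror the Python pipeline above, and the 0.70 multiplier is the Eredivisie entry from the league table; all player values are invented:

```python
import pandas as pd

# Toy frame standing in for FBref output
raw = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Min": [1800, 300, 900],
    "Gls": [10, 2, 3],
    "Ast": [4, 1, 6],
    "xG": [9.0, 1.5, 2.7],
})

# Step 1: drop low-minute players, then normalize to per-90
df = raw[raw["Min"] >= 450].copy()          # player B (300 min) is dropped
df["nineties"] = df["Min"] / 90
df["xG_p90"] = df["xG"] / df["nineties"]
df["goal_contribution_p90"] = (df["Gls"] + df["Ast"]) / df["nineties"]

# Step 2: scale by league quality (0.70 = Eredivisie above)
multiplier = 0.70
df["adj_xG_p90"] = df["xG_p90"] * multiplier

print(df[["Player", "xG_p90", "adj_xG_p90"]].round(3))
```

Player A ends up with 20 nineties and an xG per 90 of 0.45, adjusted down to roughly 0.315 once the league multiplier is applied: the same numbers the full `ingest_league` pipeline would produce for this input.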

Player Scoring & Ranking System

To efficiently filter thousands of players down to actionable shortlists, we need a systematic scoring system that weights metrics based on positional requirements.

scoring_system.R / scoring_system.py
# Python: Build position-based scoring system
import pandas as pd
from typing import Dict, List
from scipy import stats

class PlayerScorer:
    """Position-based player scoring system."""

    def __init__(self):
        # Position-specific metric weights
        self.position_weights = {
            "Striker": {
                "npxG_p90": 0.30,
                "Gls_p90": 0.25,
                "xAG_p90": 0.10,
                "AerWon_p90": 0.10,
                "Press_p90": 0.10,
                "PrgC_p90": 0.15
            },
            "Winger": {
                "xAG_p90": 0.25,
                "PrgC_p90": 0.20,
                "Succ_p90": 0.15,  # Successful dribbles
                "npxG_p90": 0.15,
                "Crs_p90": 0.15,
                "Press_p90": 0.10
            },
            "Central_Midfielder": {
                "PrgP_p90": 0.25,
                "Cmp%": 0.15,
                "TklW_p90": 0.15,
                "xAG_p90": 0.15,
                "Press_p90": 0.15,
                "npxG_p90": 0.15
            },
            "Center_Back": {
                "AerWon_p90": 0.20,
                "TklW_p90": 0.20,
                "Int_p90": 0.15,
                "PrgP_p90": 0.15,
                "Clr_p90": 0.15,
                "Blocks_p90": 0.15
            }
        }

    def calculate_percentiles(self, data: pd.DataFrame,
                              metrics: List[str]) -> pd.DataFrame:
        """Calculate percentile ranks for metrics."""
        df = data.copy()
        for metric in metrics:
            if metric in df.columns:
                df[f"{metric}_pct"] = stats.rankdata(
                    df[metric], method="average"
                ) / len(df) * 100
        return df

    def score_players(self, data: pd.DataFrame,
                      position: str) -> pd.DataFrame:
        """Score players for a specific position."""
        if position not in self.position_weights:
            raise ValueError(f"Unknown position: {position}")

        weights = self.position_weights[position]
        metrics = list(weights.keys())

        # Calculate percentiles
        df = self.calculate_percentiles(data, metrics)

        # Calculate weighted score
        df["position_score"] = 0
        for metric, weight in weights.items():
            pct_col = f"{metric}_pct"
            if pct_col in df.columns:
                df["position_score"] += weight * df[pct_col]

        # Rank players
        df = df.sort_values("position_score", ascending=False)
        df["rank"] = range(1, len(df) + 1)

        return df

    def generate_position_ranking(self, data: pd.DataFrame,
                                   position: str,
                                   top_n: int = 20) -> pd.DataFrame:
        """Generate top N players for a position."""
        scored = self.score_players(data, position)
        return scored.head(top_n)[[
            "Player", "Squad", "Age", "Min",
            "position_score", "rank"
        ]]

# Example usage
scorer = PlayerScorer()
print("Scoring system initialized with position weights:")
for pos, weights in scorer.position_weights.items():
    print(f"  {pos}: {list(weights.keys())}")
# R: Build position-based scoring system
library(tidyverse)

# Define position-specific metric weights
position_weights <- list(
  "Striker" = c(
    npxg_p90 = 0.30,
    goals_p90 = 0.25,
    xa_p90 = 0.10,
    aerial_wins_p90 = 0.10,
    pressures_p90 = 0.10,
    progressive_carries_p90 = 0.15
  ),
  "Winger" = c(
    xa_p90 = 0.25,
    progressive_carries_p90 = 0.20,
    successful_dribbles_p90 = 0.15,
    npxg_p90 = 0.15,
    crosses_p90 = 0.15,
    pressures_p90 = 0.10
  ),
  "Central_Midfielder" = c(
    progressive_passes_p90 = 0.25,
    pass_completion = 0.15,
    tackles_won_p90 = 0.15,
    xa_p90 = 0.15,
    pressures_p90 = 0.15,
    npxg_p90 = 0.15
  ),
  "Center_Back" = c(
    aerial_wins_p90 = 0.20,
    tackles_won_p90 = 0.20,
    interceptions_p90 = 0.15,
    progressive_passes_p90 = 0.15,
    clearances_p90 = 0.15,
    blocks_p90 = 0.15
  )
)

# Scoring function
score_player <- function(player_data, position, weights_list = position_weights) {
  weights <- weights_list[[position]]
  if (is.null(weights)) {
    warning(paste("No weights defined for position:", position))
    return(NA)
  }

  # Calculate weighted score (expects metric values that are already
  # percentile-ranked, e.g. by calculate_percentiles() below)
  score <- 0
  for (metric in names(weights)) {
    if (metric %in% names(player_data)) {
      score <- score + weights[[metric]] * player_data[[metric]]
    }
  }

  return(score)
}

# Calculate percentile ranks for all metrics
calculate_percentiles <- function(data, metrics) {
  for (metric in metrics) {
    if (metric %in% names(data)) {
      data[[paste0(metric, "_pct")]] <- percent_rank(data[[metric]]) * 100
    }
  }
  return(data)
}

# Full scoring pipeline
score_players_for_position <- function(data, position,
                                        weights_list = position_weights) {
  weights <- weights_list[[position]]
  metrics <- names(weights)

  # Calculate percentiles
  data <- calculate_percentiles(data, metrics)

  # Calculate weighted score
  data$position_score <- 0
  for (metric in metrics) {
    pct_col <- paste0(metric, "_pct")
    if (pct_col %in% names(data)) {
      data$position_score <- data$position_score +
        weights[[metric]] * data[[pct_col]]
    }
  }

  # Rank players
  data %>%
    arrange(desc(position_score)) %>%
    mutate(rank = row_number())
}

cat("Scoring system defined with position-specific weights\n")
Output
Scoring system initialized with position weights:
  Striker: ['npxG_p90', 'Gls_p90', 'xAG_p90', 'AerWon_p90', 'Press_p90', 'PrgC_p90']
  Winger: ['xAG_p90', 'PrgC_p90', 'Succ_p90', 'npxG_p90', 'Crs_p90', 'Press_p90']
  Central_Midfielder: ['PrgP_p90', 'Cmp%', 'TklW_p90', 'xAG_p90', 'Press_p90', 'npxG_p90']
  Center_Back: ['AerWon_p90', 'TklW_p90', 'Int_p90', 'PrgP_p90', 'Clr_p90', 'Blocks_p90']
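The percentile-then-weight logic is easy to verify on toy data. This sketch uses pandas' `rank(pct=True)` in place of `scipy.stats.rankdata` (equivalent for untied data, up to the factor of 100 applied here); the three players and two-metric weight set are invented for illustration:

```python
import pandas as pd

# Three toy strikers, two metrics; weights sum to 1.0
data = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "npxG_p90": [0.60, 0.40, 0.20],
    "Gls_p90": [0.30, 0.55, 0.25],
})
weights = {"npxG_p90": 0.6, "Gls_p90": 0.4}

# Percentile-rank each metric (0-100), then blend with the weights
data["position_score"] = sum(
    w * data[m].rank(pct=True) * 100 for m, w in weights.items()
)

ranking = data.sort_values("position_score", ascending=False)
print(ranking["Player"].tolist())  # ['A', 'B', 'C']
```

Player A wins despite B's better raw goal rate because the weights put 60% of the score on non-penalty xG, where A tops the percentile table: exactly the trade-off the position-weight dictionaries above are meant to encode.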

Player Similarity Engine

Finding similar players is essential for replacement analysis and identifying alternatives. We'll implement cosine similarity on standardized metrics.

similarity_engine.R / similarity_engine.py
# Python: Build player similarity engine
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Optional, Dict

class SimilarityEngine:
    """Engine for finding similar players."""

    def __init__(self, feature_cols: List[str]):
        self.feature_cols = feature_cols
        self.scaler = StandardScaler()
        self.similarity_matrix = None
        self.player_index = None

    def fit(self, data: pd.DataFrame):
        """Fit the similarity model on player data."""
        # Extract features
        features = data[self.feature_cols].fillna(0)

        # Standardize
        scaled_features = self.scaler.fit_transform(features)

        # Calculate similarity matrix
        self.similarity_matrix = cosine_similarity(scaled_features)

        # Store player index mapping
        self.player_index = {
            player: idx for idx, player in enumerate(data["Player"])
        }

        return self

    def find_similar(self, target_player: str,
                     top_n: int = 10) -> Dict[str, float]:
        """Find most similar players to target."""
        if target_player not in self.player_index:
            raise ValueError(f"Player not found: {target_player}")

        idx = self.player_index[target_player]
        similarities = self.similarity_matrix[idx]

        # Get indices sorted by similarity
        similar_indices = np.argsort(similarities)[::-1]

        # Build result dict (excluding self); precompute the name list
        # instead of rebuilding it on every iteration
        player_names = list(self.player_index.keys())
        results = {}
        for i in similar_indices:
            name = player_names[i]
            if name == target_player:
                continue
            results[name] = similarities[i]
            if len(results) >= top_n:
                break

        return results

    def find_similar_with_filters(self, target_player: str,
                                   data: pd.DataFrame,
                                   max_age: Optional[int] = None,
                                   max_value: Optional[float] = None,
                                   min_minutes: int = 900,
                                   leagues: Optional[List[str]] = None,
                                   top_n: int = 10) -> pd.DataFrame:
        """Find similar players with filters applied."""
        # Get similarity scores
        all_similar = self.find_similar(target_player, top_n=100)

        # Create results dataframe
        results = pd.DataFrame([
            {"Player": name, "similarity": score}
            for name, score in all_similar.items()
        ])

        # Merge with full data
        results = results.merge(data, on="Player", how="left")

        # Apply filters
        if max_age is not None:
            results = results[results["Age"] <= max_age]
        if max_value is not None:
            results = results[results["market_value"] <= max_value]
        if min_minutes is not None:
            results = results[results["Min"] >= min_minutes]
        if leagues is not None:
            results = results[results["league"].isin(leagues)]

        return results.head(top_n)[[
            "Player", "similarity", "Age", "Squad",
            "market_value", "Min"
        ]]

# Example usage
feature_cols = ["npxG_p90", "xAG_p90", "PrgC_p90", "PrgP_p90", "Press_p90"]
similarity_engine = SimilarityEngine(feature_cols)
print(f"Similarity engine initialized with features: {feature_cols}")
# R: Build player similarity engine
library(tidyverse)

# Similarity Engine using cosine similarity
calculate_cosine_similarity <- function(vec1, vec2) {
  sum(vec1 * vec2) / (sqrt(sum(vec1^2)) * sqrt(sum(vec2^2)))
}

# Build similarity matrix
build_similarity_matrix <- function(data, feature_cols) {
  # Standardize features
  feature_data <- data %>%
    select(all_of(feature_cols)) %>%
    mutate(across(everything(), ~scale(.)[,1]))

  # Replace NA with 0
  feature_data[is.na(feature_data)] <- 0

  # Convert to matrix
  feature_matrix <- as.matrix(feature_data)

  # Calculate similarity for all pairs
  n <- nrow(feature_matrix)
  similarity_matrix <- matrix(0, n, n)

  for (i in 1:n) {
    for (j in 1:n) {
      similarity_matrix[i, j] <- calculate_cosine_similarity(
        feature_matrix[i,], feature_matrix[j,]
      )
    }
  }

  # Set row/column names
  rownames(similarity_matrix) <- data$player
  colnames(similarity_matrix) <- data$player

  return(similarity_matrix)
}

# Find most similar players
find_similar_players <- function(target_player, similarity_matrix, top_n = 10) {
  if (!target_player %in% rownames(similarity_matrix)) {
    stop(paste("Player not found:", target_player))
  }

  similarities <- similarity_matrix[target_player, ]
  similarities <- similarities[names(similarities) != target_player]

  sorted_sim <- sort(similarities, decreasing = TRUE)
  head(sorted_sim, top_n)
}

# Enhanced similarity with filters
find_similar_with_filters <- function(target_player, data, similarity_matrix,
                                       max_age = NULL, max_value = NULL,
                                       min_minutes = 900, top_n = 10) {
  # Get base similar players
  similarities <- find_similar_players(target_player, similarity_matrix, top_n = 50)

  # Create result dataframe
  results <- tibble(
    player = names(similarities),
    similarity = as.numeric(similarities)
  ) %>%
    left_join(data, by = "player")

  # Apply filters
  if (!is.null(max_age)) {
    results <- results %>% filter(age <= max_age)
  }
  if (!is.null(max_value)) {
    results <- results %>% filter(market_value <= max_value)
  }
  results <- results %>% filter(minutes >= min_minutes)

  # Return top N after filters
  results %>%
    head(top_n) %>%
    select(player, similarity, age, team, market_value, minutes)
}

cat("Similarity engine functions defined\n")
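Under the hood, cosine similarity needs nothing beyond NumPy: z-score the feature columns (what `StandardScaler` does), normalize each player row to unit length, and take dot products. A toy check with an obvious near-duplicate and an obvious opposite profile (all values invented):

```python
import numpy as np

# Three players x three feature columns
features = np.array([
    [0.90, 0.80, 0.10],   # player X
    [0.85, 0.75, 0.20],   # player Y: near-duplicate of X
    [0.10, 0.20, 0.90],   # player Z: opposite profile
])

# z-score each column, as StandardScaler does in the engine above
z = (features - features.mean(axis=0)) / features.std(axis=0)

# cosine similarity = dot product of unit-length rows
unit = z / np.linalg.norm(z, axis=1, keepdims=True)
sim = unit @ unit.T

print(round(float(sim[0, 1]), 3))  # X vs Y: close to +1
print(round(float(sim[0, 2]), 3))  # X vs Z: close to -1
```

The diagonal of `sim` is always 1 (every player is identical to themselves), which is why both the Python engine and the R `find_similar_players` function explicitly exclude the target from their results.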

Replacement Analysis

replacement_analysis.R / replacement_analysis.py
# Python: Replacement analysis workflow
import pandas as pd
from typing import Optional, Tuple

class ReplacementAnalyzer:
    """Analyze and recommend player replacements."""

    def __init__(self, similarity_engine: SimilarityEngine):
        self.similarity_engine = similarity_engine

    def find_replacements(self, departing_player: str,
                          data: pd.DataFrame,
                          budget: float,
                          position_filter: Optional[str] = None,
                          age_range: Tuple[int, int] = (18, 28),
                          top_n: int = 10) -> pd.DataFrame:
        """Find replacement candidates for a departing player."""

        # Get similar players
        candidates = self.similarity_engine.find_similar_with_filters(
            departing_player,
            data,
            max_age=age_range[1],
            max_value=budget,
            top_n=50
        )

        # Filter by position
        if position_filter and "Pos" in candidates.columns:
            candidates = candidates[
                candidates["Pos"].str.contains(position_filter, na=False)
            ]

        # Filter by age range
        candidates = candidates[
            (candidates["Age"] >= age_range[0]) &
            (candidates["Age"] <= age_range[1])
        ]

        # Calculate recommendation scores
        candidates = candidates.copy()
        candidates["value_score"] = (
            candidates["similarity"] * 100 /
            (candidates["market_value"] + 1)
        )
        candidates["development_potential"] = (28 - candidates["Age"]) * 2

        # Normalize scores
        max_value_score = candidates["value_score"].max()
        max_dev = candidates["development_potential"].max()

        candidates["overall_recommendation"] = (
            candidates["similarity"] * 0.5 +
            (candidates["value_score"] / max_value_score) * 0.3 +
            (candidates["development_potential"] / max_dev) * 0.2
        )

        return candidates.sort_values(
            "overall_recommendation", ascending=False
        ).head(top_n)

    def generate_report(self, departing_player: str,
                        replacements: pd.DataFrame) -> str:
        """Generate text report for replacements."""
        report = f"""
=== REPLACEMENT ANALYSIS FOR: {departing_player} ===

TOP 5 RECOMMENDED REPLACEMENTS:
{"-" * 50}
"""
        for i, (_, r) in enumerate(replacements.head(5).iterrows(), 1):
            report += f"""
{i}. {r["Player"]} ({r.get("Squad", "N/A")})
   Age: {r["Age"]} | Value: €{r.get("market_value", 0):.1f}M
   Similarity: {r["similarity"]*100:.1f}% | Value Score: {r["value_score"]:.2f}
   Recommendation Score: {r["overall_recommendation"]:.2f}
"""
        return report

# Example
print("Replacement analysis system ready")
# R: Replacement analysis workflow
library(tidyverse)

# Find replacement candidates for a departing player
find_replacements <- function(departing_player, data, similarity_matrix,
                              budget, position_filter = NULL,
                              age_range = c(18, 28)) {

  # Find similar players
  candidates <- find_similar_with_filters(
    departing_player, data, similarity_matrix,
    max_age = age_range[2],
    max_value = budget,
    min_minutes = 900,
    top_n = 20
  )

  # Filter by position if specified
  if (!is.null(position_filter)) {
    candidates <- candidates %>%
      filter(grepl(position_filter, position))
  }

  # Filter by age
  candidates <- candidates %>%
    filter(age >= age_range[1], age <= age_range[2])

  # Calculate value score
  candidates <- candidates %>%
    mutate(
      value_score = similarity * 100 / (market_value + 1),
      development_potential = (28 - age) * 2,
      overall_recommendation = similarity * 0.5 +
                               (value_score / max(value_score)) * 0.3 +
                               (development_potential / max(development_potential)) * 0.2
    ) %>%
    arrange(desc(overall_recommendation))

  return(candidates)
}

# Generate replacement report
generate_replacement_report <- function(departing_player, replacements) {
  cat(sprintf("\n=== REPLACEMENT ANALYSIS FOR: %s ===\n\n", departing_player))

  cat("TOP 5 RECOMMENDED REPLACEMENTS:\n")
  cat(paste(rep("-", 50), collapse = ""), "\n")

  for (i in seq_len(min(5, nrow(replacements)))) {
    r <- replacements[i, ]
    cat(sprintf("%d. %s (%s)\n", i, r$player, r$team))
    cat(sprintf("   Age: %d | Value: €%.1fM\n", r$age, r$market_value))
    cat(sprintf("   Similarity: %.1f%% | Value Score: %.2f\n",
                r$similarity * 100, r$value_score))
    cat(sprintf("   Recommendation Score: %.2f\n\n", r$overall_recommendation))
  }
}
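The recommendation blend rewards fit, value for money, and youth, which is why it can prefer a slightly less similar but cheaper, younger candidate. A two-player toy check of the same formula (candidate names and numbers invented):

```python
import pandas as pd

# Toy candidate table as produced by the similarity step above
cands = pd.DataFrame({
    "Player": ["A", "B"],
    "similarity": [0.90, 0.80],
    "Age": [27, 21],
    "market_value": [40.0, 15.0],  # millions
})

# Same three components as the analyzer: fit, value for money, youth
cands["value_score"] = cands["similarity"] * 100 / (cands["market_value"] + 1)
cands["development_potential"] = (28 - cands["Age"]) * 2

cands["overall_recommendation"] = (
    cands["similarity"] * 0.5
    + (cands["value_score"] / cands["value_score"].max()) * 0.3
    + (cands["development_potential"] / cands["development_potential"].max()) * 0.2
)

best = cands.sort_values("overall_recommendation", ascending=False).iloc[0]
print(best["Player"])  # B
```

Player B is only 80% similar but tops both the value-for-money and development components, so the 0.5/0.3/0.2 blend ranks B first. Tuning those three weights is how a club shifts the system between "closest stylistic match" and "best market opportunity".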

Scouting Report Generator

The final output of our recruitment system is comprehensive scouting reports that combine quantitative analysis with formatted output suitable for decision-makers.

scouting_reports.R / scouting_reports.py
# Python: Generate comprehensive scouting reports
import pandas as pd
import numpy as np
from typing import Dict, List, Optional
from datetime import datetime
import matplotlib.pyplot as plt

class ScoutingReportGenerator:
    """Generate comprehensive scouting reports."""

    def __init__(self, similarity_engine: SimilarityEngine):
        self.similarity_engine = similarity_engine

    def create_radar_chart(self, player_data: Dict,
                           metrics: List[str],
                           title: str) -> plt.Figure:
        """Create radar chart for player profile."""
        # Number of metrics
        N = len(metrics)
        angles = [n / float(N) * 2 * np.pi for n in range(N)]
        angles += angles[:1]  # Complete the loop

        # Get values
        values = [player_data.get(m, 0) for m in metrics]
        values += values[:1]

        # Create plot
        fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
        ax.plot(angles, values, "o-", linewidth=2)
        ax.fill(angles, values, alpha=0.25)
        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(metrics, size=10)
        ax.set_title(title, size=14, fontweight="bold", y=1.08)

        return fig

    def generate_report(self, player_name: str,
                        data: pd.DataFrame) -> str:
        """Generate full scouting report."""

        player = data[data["Player"] == player_name]
        if player.empty:
            raise ValueError(f"Player not found: {player_name}")

        p = player.iloc[0]

        # Build report sections
        header = f"""
================================================================================
                         SCOUTING REPORT
================================================================================
Player: {p.get("Player", "N/A")}
Position: {p.get("Pos", "N/A")} | Age: {p.get("Age", "N/A")}
Current Club: {p.get("Squad", "N/A")} ({p.get("league", "N/A")})
Market Value: €{p.get("market_value", 0):.1f}M
Minutes Played: {p.get("Min", 0)} ({p.get("Min", 0)/90:.1f} 90s)
================================================================================
"""

        performance = f"""
PERFORMANCE METRICS (Per 90 Minutes):
-------------------------------------
Goals: {p.get("Gls_p90", 0):.2f}          | xG: {p.get("xG_p90", 0):.2f}
Assists: {p.get("Ast_p90", 0):.2f}        | xA: {p.get("xAG_p90", 0):.2f}
Non-Penalty xG: {p.get("npxG_p90", 0):.2f}
Goal Contribution: {p.get("goal_contribution_p90", 0):.2f}
"""

        # Get similar players
        try:
            similar = self.similarity_engine.find_similar(player_name, top_n=5)
            similar_text = "\n".join([
                f"  - {name} ({score*100:.1f}% similar)"
                for name, score in similar.items()
            ])
        except Exception:
            similar_text = "  (Unable to calculate)"

        similar_section = f"""
SIMILAR PLAYERS:
----------------
{similar_text}
"""

        recommendation = """
RECOMMENDATION:
---------------
[To be filled by scout based on video analysis]

Strengths:
-

Weaknesses:
-

Fit Assessment:
-

Risk Factors:
-

================================================================================
"""
        report = header + performance + similar_section + recommendation

        # Add metadata (requires: from datetime import datetime)
        report += f"""
Report Generated: {datetime.now().strftime("%Y-%m-%d %H:%M")}
Data Source: FBref / StatsBomb
"""

        return report

    def export_report(self, report: str, filename: str):
        """Export report to file."""
        with open(filename, "w") as f:
            f.write(report)
        print(f"Report saved to {filename}")

# Example usage
print("Scouting report generator ready")
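The radar chart above expects percentile values for each metric. A minimal sketch of that step, assuming the FBref-style column names used throughout this chapter (a `Player` column plus per-90 metric columns), might look like:

```python
import pandas as pd

def percentile_values(data: pd.DataFrame, player_name: str,
                      metrics: list) -> list:
    """Rank each metric as a 0-100 percentile, then pull one player's row."""
    pct = data[metrics].rank(pct=True) * 100
    return pct[data["Player"] == player_name].iloc[0].tolist()

# Toy data for illustration
players = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Gls_p90": [0.6, 0.3, 0.1, 0.4],
    "xG_p90": [0.5, 0.4, 0.2, 0.3],
})
print(percentile_values(players, "A", ["Gls_p90", "xG_p90"]))
```

These percentiles are what you would pass as `values` into `create_radar_chart` (after closing the loop by repeating the first value).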
# R: Generate comprehensive scouting reports
library(tidyverse)
library(ggplot2)

# Create radar-style chart (polar bar chart) for a player
create_player_radar <- function(player_data, metrics, title) {
  # Reshape selected metrics to long format for plotting
  radar_data <- player_data %>%
    select(all_of(metrics)) %>%
    pivot_longer(everything(), names_to = "metric", values_to = "value")

  # Create plot
  ggplot(radar_data, aes(x = metric, y = value)) +
    geom_col(fill = "steelblue", alpha = 0.7) +
    coord_polar() +
    theme_minimal() +
    labs(title = title) +
    theme(axis.text.x = element_text(size = 8))
}

# Generate full scouting report
generate_scouting_report <- function(player_name, data, similarity_matrix,
                                      league_adjustments) {

  player <- data %>% filter(player == player_name)

  if (nrow(player) == 0) {
    stop(paste("Player not found:", player_name))
  }

  # Build report
  report <- list()

  # Header section
  report$header <- sprintf("
================================================================================
                         SCOUTING REPORT
================================================================================
Player: %s
Position: %s | Age: %d | Nationality: %s
Current Club: %s (%s)
Contract Expires: %s
Market Value: €%.1fM
Minutes Played: %d (%.1f 90s)
================================================================================
",
    player$player,
    player$position,
    player$age,
    player$nationality,
    player$team,
    player$league,
    player$contract_expiry,
    player$market_value,
    player$minutes,
    player$minutes / 90
  )

  # Performance metrics
  report$performance <- sprintf("
PERFORMANCE METRICS (Per 90 Minutes):
-------------------------------------
Goals: %.2f          | xG: %.2f           | Overperformance: %+.2f
Assists: %.2f        | xA: %.2f           | Overperformance: %+.2f
Non-Penalty xG: %.2f

League-Adjusted Metrics (to Premier League):
Goals (adj): %.2f    | xG (adj): %.2f     | xA (adj): %.2f
",
    player$goals_p90, player$xg_p90, player$goals_p90 - player$xg_p90,
    player$assists_p90, player$xa_p90, player$assists_p90 - player$xa_p90,
    player$npxg_p90,
    player$adj_goals_p90, player$adj_xg_p90, player$adj_xa_p90
  )

  # Percentile rankings
  report$rankings <- "
PERCENTILE RANKINGS (vs Position):
----------------------------------
See attached radar chart
"

  # Similar players
  similar <- find_similar_players(player_name, similarity_matrix, top_n = 5)
  similar_text <- paste(
    sprintf("  - %s (%.1f%% similar)", names(similar), similar * 100),
    collapse = "\n"
  )
  report$similar <- sprintf("
SIMILAR PLAYERS:
----------------
%s
", similar_text)

  # Recommendation
  report$recommendation <- "
RECOMMENDATION:
---------------
[To be filled by scout based on video analysis]

Strengths:
-

Weaknesses:
-

Fit Assessment:
-

Risk Factors:
-

================================================================================
"

  # Combine all sections
  full_report <- paste(
    report$header,
    report$performance,
    report$rankings,
    report$similar,
    report$recommendation,
    sep = "\n"
  )

  return(full_report)
}

Practice Exercises

Exercise 1: Build a Complete Recruitment Pipeline

Implement the complete recruitment system from scratch. Load data from FBref for the top 5 leagues, calculate per-90 metrics, apply league adjustments, and create a searchable database with filtering capabilities.
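As a starting point for the pipeline, the per-90 conversion step can be sketched as follows. The column names are assumptions in the FBref style used earlier in the chapter:

```python
import pandas as pd

def add_per90(df: pd.DataFrame, cols: list,
              minutes_col: str = "Min") -> pd.DataFrame:
    """Divide raw season totals by the number of 90s played."""
    out = df.copy()
    nineties = out[minutes_col] / 90
    for c in cols:
        out[f"{c}_p90"] = out[c] / nineties
    return out

# Toy data for illustration
squad = pd.DataFrame({"Player": ["A", "B"], "Min": [1800, 900], "Gls": [10, 4]})
print(add_per90(squad, ["Gls"]))
```

In a full pipeline you would also filter out low-minute players before ranking, since tiny samples produce unstable per-90 rates.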

Exercise 2: Custom Position Weights

Design custom scoring weights for a modern "inverted full-back" role. Identify the key metrics that define this position and create a scoring system. Test it by finding the top 10 inverted full-backs in Europe.
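One way to start: a weighted z-score composite. The metric names and weights below are illustrative assumptions, not a definitive inverted full-back profile:

```python
import pandas as pd

def weighted_score(df: pd.DataFrame, weights: dict) -> pd.Series:
    """Standardize each metric, then combine with role-specific weights."""
    cols = list(weights)
    z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
    return sum(w * z[m] for m, w in weights.items())

# Toy data for illustration
fullbacks = pd.DataFrame({
    "prog_passes_p90": [6.0, 3.0, 9.0],
    "tackles_p90": [2.0, 3.0, 1.0],
})
scores = weighted_score(fullbacks, {"prog_passes_p90": 0.7, "tackles_p90": 0.3})
print(scores.round(2).tolist())
```

Standardizing first keeps high-volume metrics (passes) from drowning out low-volume ones (tackles) in the composite.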

Exercise 3: Replacement Analysis Case Study

A top-6 Premier League club is losing their starting striker (28 years old, €60M value). Use the replacement analysis system to identify the top 10 replacement candidates with a budget of €40M and maximum age of 26.
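A filtering skeleton for the candidate pool might look like this; the column names and the precomputed `score` column are assumptions carried over from the scoring system built earlier:

```python
import pandas as pd

def replacement_shortlist(df: pd.DataFrame, position: str, budget: float,
                          max_age: int, top_n: int = 10) -> pd.DataFrame:
    """Filter by position, price, and age, then rank by a precomputed score."""
    pool = df[(df["Pos"] == position)
              & (df["market_value"] <= budget)
              & (df["Age"] <= max_age)]
    return pool.sort_values("score", ascending=False).head(top_n)

# Toy data for illustration
strikers = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Pos": ["FW", "FW", "FW", "MF"],
    "Age": [24, 27, 25, 23],
    "market_value": [35.0, 30.0, 55.0, 20.0],
    "score": [0.8, 0.9, 0.95, 0.7],
})
print(replacement_shortlist(strikers, "FW", budget=40, max_age=26)["Player"].tolist())
```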

Summary

Key Takeaways
  • System architecture: Modern recruitment systems have data, analytics, application, and output layers
  • Data pipeline: Robust ingestion with league adjustments is foundational
  • Position-specific scoring: Different metrics matter for different positions
  • Similarity search: Cosine similarity on standardized metrics finds comparable players
  • Report generation: Combine quantitative analysis with formatted output for decision-makers
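The cosine-similarity step named above can be sketched in a few lines; this is a minimal version without the position filtering used in the full engine:

```python
import numpy as np
import pandas as pd

def cosine_similarity(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise cosine similarity on z-scored metric columns."""
    z = ((df - df.mean()) / df.std(ddof=0)).to_numpy()
    unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    return pd.DataFrame(unit @ unit.T, index=df.index, columns=df.index)

# Toy data for illustration
metrics = pd.DataFrame(
    {"xG_p90": [0.5, 0.45, 0.1], "xAG_p90": [0.2, 0.25, 0.6]},
    index=["A", "B", "C"],
)
sim = cosine_similarity(metrics)
print(sim.round(2))
```

Each player's similarity to themselves is 1.0, and profiles with the same shape score higher than profiles with opposite tendencies.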
System Components Built
  • Player data model and database structure
  • Multi-source data ingestion pipeline
  • Position-based scoring and ranking system
  • Player similarity engine with filtering
  • Replacement analysis workflow
  • Automated scouting report generator