Chapter 60

Capstone - Complete Analytics System

Learning Objectives
  • Design and implement a complete recruitment analytics pipeline
  • Build player databases with integrated metrics and valuations
  • Create automated shortlisting and scoring systems
  • Implement similarity search and replacement analysis
  • Generate comprehensive scouting reports

This chapter brings together everything we've learned into a practical case study: building a complete player recruitment system from scratch. We'll create a production-style pipeline of the kind professional clubs use to identify, evaluate, and track transfer targets.

Recruitment System Architecture

A modern recruitment system consists of several interconnected components. We'll build each piece and integrate them into a cohesive pipeline.

System Components
Data Layer
  • Player database
  • Match event data
  • Market valuations
  • Contract information
Analytics Layer
  • Metric calculations
  • League adjustments
  • Similarity models
  • Projection models
Application Layer
  • Search & filtering
  • Shortlist management
  • Report generation
  • Comparison tools
Output Layer
  • Scouting reports
  • Radar charts
  • Dashboards
  • Alerts & notifications
recruitment_architecture.R / recruitment_architecture.py
# Python: Define recruitment system architecture
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from datetime import date
import pandas as pd

@dataclass
class Player:
    """Core player entity for recruitment system."""
    id: str
    name: str
    position: str
    age: int
    nationality: str
    current_club: str
    league: str
    contract_expiry: date
    market_value: float  # in millions
    stats: Dict[str, float] = field(default_factory=dict)
    adjusted_stats: Dict[str, float] = field(default_factory=dict)
    notes: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict:
        """Convert player to dictionary."""
        return {
            "id": self.id,
            "name": self.name,
            "position": self.position,
            "age": self.age,
            "nationality": self.nationality,
            "club": self.current_club,
            "league": self.league,
            "contract_expiry": self.contract_expiry,
            "market_value": self.market_value,
            **self.stats
        }

class RecruitmentDatabase:
    """Central database for recruitment system."""

    def __init__(self):
        self.players: Dict[str, Player] = {}
        self.shortlists: Dict[str, List[str]] = {}
        self._dataframe_cache = None

    def add_player(self, player: Player):
        """Add player to database."""
        self.players[player.id] = player
        self._dataframe_cache = None  # Invalidate cache

    def get_player(self, player_id: str) -> Optional[Player]:
        """Retrieve player by ID."""
        return self.players.get(player_id)

    def to_dataframe(self) -> pd.DataFrame:
        """Convert all players to DataFrame."""
        if self._dataframe_cache is None:
            self._dataframe_cache = pd.DataFrame([
                p.to_dict() for p in self.players.values()
            ])
        return self._dataframe_cache

    def create_shortlist(self, name: str):
        """Create new shortlist."""
        self.shortlists[name] = []

    def add_to_shortlist(self, shortlist_name: str, player_id: str):
        """Add player to shortlist."""
        if shortlist_name in self.shortlists:
            self.shortlists[shortlist_name].append(player_id)

print("Player and RecruitmentDatabase classes defined")
# R: Define recruitment system architecture
library(tidyverse)
library(R6)

# Core Player class
Player <- R6Class("Player",
  public = list(
    id = NULL,
    name = NULL,
    position = NULL,
    age = NULL,
    nationality = NULL,
    current_club = NULL,
    league = NULL,
    contract_expiry = NULL,
    market_value = NULL,
    stats = NULL,
    adjusted_stats = NULL,

    initialize = function(id, name, position, age, nationality,
                         current_club, league, contract_expiry, market_value) {
      self$id <- id
      self$name <- name
      self$position <- position
      self$age <- age
      self$nationality <- nationality
      self$current_club <- current_club
      self$league <- league
      self$contract_expiry <- contract_expiry
      self$market_value <- market_value
      self$stats <- list()
      self$adjusted_stats <- list()
    },

    set_stats = function(stats_list) {
      self$stats <- stats_list
    },

    to_df = function() {
      tibble(
        id = self$id,
        name = self$name,
        position = self$position,
        age = self$age,
        nationality = self$nationality,
        club = self$current_club,
        league = self$league,
        contract_expiry = self$contract_expiry,
        market_value = self$market_value
      )
    }
  )
)

# Recruitment Database class
RecruitmentDB <- R6Class("RecruitmentDB",
  public = list(
    players = NULL,
    shortlists = NULL,

    initialize = function() {
      self$players <- list()
      self$shortlists <- list()
    },

    add_player = function(player) {
      self$players[[player$id]] <- player
    },

    get_player = function(player_id) {
      self$players[[player_id]]
    },

    search = function(filters) {
      # Implementation in next section
    },

    create_shortlist = function(name) {
      self$shortlists[[name]] <- list()
    }
  )
)

cat("Player and RecruitmentDB classes defined\n")
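Before wiring in real data, a quick usage sketch of the data layer. This is a deliberately trimmed re-definition of the `Player` and `RecruitmentDatabase` classes above (a few fields only, with invented players, so the flow stays readable), not the full implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List
import pandas as pd

# Trimmed versions of the classes above: just enough fields
# to show the register -> shortlist -> DataFrame flow.
@dataclass
class Player:
    id: str
    name: str
    position: str
    age: int
    stats: Dict[str, float] = field(default_factory=dict)

class RecruitmentDatabase:
    def __init__(self):
        self.players: Dict[str, Player] = {}
        self.shortlists: Dict[str, List[str]] = {}

    def add_player(self, player: Player) -> None:
        self.players[player.id] = player

    def to_dataframe(self) -> pd.DataFrame:
        # Flatten each player's fields and stats into one row
        return pd.DataFrame([
            {"id": p.id, "name": p.name, "position": p.position,
             "age": p.age, **p.stats}
            for p in self.players.values()
        ])

db = RecruitmentDatabase()
db.add_player(Player("p1", "A. Striker", "FW", 24, {"npxG_p90": 0.55}))
db.add_player(Player("p2", "B. Winger", "FW", 21, {"npxG_p90": 0.31}))
db.shortlists["summer_targets"] = ["p1"]

df = db.to_dataframe()
print(df.shape)       # (2, 5)
print(db.shortlists)  # {'summer_targets': ['p1']}
```

The full classes add cache invalidation and shortlist helpers, but the shape of the workflow is the same: register players once, group them into named shortlists, and flatten to a DataFrame whenever analysis needs tabular data.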

Data Ingestion Pipeline

The first step is building a robust data ingestion pipeline that pulls from multiple sources and normalizes the data into our player database.

data_ingestion.R / data_ingestion.py
# Python: Build data ingestion pipeline
import pandas as pd
from typing import Optional, Dict
from dataclasses import dataclass
import soccerdata as sd

@dataclass
class DataIngestion:
    """Pipeline for ingesting player data from multiple sources."""

    league_multipliers: Optional[Dict[str, float]] = None

    def __post_init__(self):
        if self.league_multipliers is None:
            self.league_multipliers = {
                "ENG-Premier League": 1.00,
                "ESP-La Liga": 0.92,
                "ITA-Serie A": 0.90,
                "GER-Bundesliga": 0.90,
                "FRA-Ligue 1": 0.80,
                "NED-Eredivisie": 0.70,
                "POR-Primeira Liga": 0.68,
                "ENG-Championship": 0.65
            }

    def load_fbref_data(self, league: str, season: str = "2023-2024") -> Optional[pd.DataFrame]:
        """Load player data from FBref."""
        try:
            fbref = sd.FBref(leagues=[league], seasons=[season])
            stats = fbref.read_player_season_stats(stat_type="standard")
            return stats.reset_index()
        except Exception as e:
            print(f"Error loading FBref data: {e}")
            return None

    def calculate_per90(self, data: pd.DataFrame,
                        min_minutes: int = 450) -> pd.DataFrame:
        """Normalize statistics to per-90 minutes."""
        df = data[data["Min"] >= min_minutes].copy()

        df["nineties"] = df["Min"] / 90

        per90_cols = ["Gls", "Ast", "xG", "xAG", "npxG"]
        for col in per90_cols:
            if col in df.columns:
                df[f"{col}_p90"] = df[col] / df["nineties"]

        df["goal_contribution_p90"] = (df["Gls"] + df["Ast"]) / df["nineties"]

        return df

    def adjust_for_league(self, data: pd.DataFrame,
                          league: str) -> pd.DataFrame:
        """Apply league quality adjustments."""
        multiplier = self.league_multipliers.get(league, 0.75)
        df = data.copy()

        adj_cols = ["xG_p90", "xAG_p90", "npxG_p90"]
        for col in adj_cols:
            if col in df.columns:
                df[f"adj_{col}"] = df[col] * multiplier

        df["league_quality"] = multiplier
        return df

    def ingest_league(self, league: str,
                      season: str = "2023-2024") -> Optional[pd.DataFrame]:
        """Full ingestion pipeline for a league."""
        data = self.load_fbref_data(league, season)
        if data is None:
            return None

        data = self.calculate_per90(data)
        data = self.adjust_for_league(data, league)
        return data

    def ingest_multiple_leagues(self, leagues: list,
                                season: str = "2023-2024") -> pd.DataFrame:
        """Ingest data from multiple leagues."""
        all_data = []
        for league in leagues:
            data = self.ingest_league(league, season)
            if data is not None:
                data["source_league"] = league
                all_data.append(data)

        return pd.concat(all_data, ignore_index=True) if all_data else pd.DataFrame()

# Example usage
ingestion = DataIngestion()
print("Data ingestion pipeline ready")
# R: Build data ingestion pipeline
library(tidyverse)
library(worldfootballR)

# Data Ingestion class
DataIngestion <- R6Class("DataIngestion",
  public = list(
    # League quality adjustments
    league_multipliers = NULL,

    initialize = function() {
      self$league_multipliers <- c(
        "Premier League" = 1.00,
        "La Liga" = 0.92,
        "Serie A" = 0.90,
        "Bundesliga" = 0.90,
        "Ligue 1" = 0.80,
        "Eredivisie" = 0.70,
        "Liga Portugal" = 0.68,
        "Championship" = 0.65
      )
    },

    # Load player data from FBref
    load_fbref_data = function(league, season = 2024) {
      tryCatch({
        # Get standard stats
        standard <- fb_big5_advanced_season_stats(
          season_end_year = season,
          stat_type = "standard",
          team_or_player = "player"
        )

        # Filter to league
        standard %>%
          filter(Comp == league) %>%
          select(
            player = Player,
            team = Squad,
            position = Pos,
            age = Age,
            minutes = Min,
            goals = Gls,
            assists = Ast,
            xg = xG,
            xa = xAG,
            npxg = npxG
          )
      }, error = function(e) {
        message(paste("Error loading FBref data:", e$message))
        NULL
      })
    },

    # Normalize stats to per-90
    calculate_per90 = function(data, min_minutes = 450) {
      data %>%
        filter(minutes >= min_minutes) %>%
        mutate(
          nineties = minutes / 90,
          goals_p90 = goals / nineties,
          assists_p90 = assists / nineties,
          xg_p90 = xg / nineties,
          xa_p90 = xa / nineties,
          npxg_p90 = npxg / nineties,
          goal_contribution_p90 = (goals + assists) / nineties
        )
    },

    # Apply league adjustments
    adjust_for_league = function(data, league) {
      # Note: `[[` errors on a missing name in a named vector, so use `[` + is.na()
      multiplier <- unname(self$league_multipliers[league])
      if (is.na(multiplier)) multiplier <- 0.75

      data %>%
        mutate(
          adj_xg_p90 = xg_p90 * multiplier,
          adj_xa_p90 = xa_p90 * multiplier,
          adj_npxg_p90 = npxg_p90 * multiplier,
          league_quality = multiplier
        )
    },

    # Full ingestion pipeline
    ingest_league = function(league, season = 2024) {
      data <- self$load_fbref_data(league, season)
      if (is.null(data)) return(NULL)

      data %>%
        self$calculate_per90() %>%
        self$adjust_for_league(league)
    }
  )
)

# Example usage
ingestion <- DataIngestion$new()
cat("Data ingestion pipeline ready\n")
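To see the two core transforms without hitting FBref (the real pipeline depends on network access), here is a sketch on a hand-made toy frame. The column names (`Min`, `Gls`, `Ast`, `xG`) mirror the Python pipeline above, and the 0.70 multiplier is the Eredivisie entry from the league table; all player values are invented:

```python
import pandas as pd

# Toy frame standing in for FBref output
raw = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Min": [1800, 300, 900],
    "Gls": [10, 2, 3],
    "Ast": [4, 1, 6],
    "xG": [9.0, 1.5, 2.7],
})

# Step 1: drop low-minute players, then normalize to per-90
df = raw[raw["Min"] >= 450].copy()          # player B (300 min) is dropped
df["nineties"] = df["Min"] / 90
df["xG_p90"] = df["xG"] / df["nineties"]
df["goal_contribution_p90"] = (df["Gls"] + df["Ast"]) / df["nineties"]

# Step 2: scale by league quality (0.70 = Eredivisie above)
multiplier = 0.70
df["adj_xG_p90"] = df["xG_p90"] * multiplier

print(df[["Player", "xG_p90", "adj_xG_p90"]].round(3))
```

Player A ends up with 20 nineties and an xG per 90 of 0.45, adjusted down to roughly 0.315 once the league multiplier is applied: the same numbers the full `ingest_league` pipeline would produce for this input.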

Player Scoring & Ranking System

To efficiently filter thousands of players down to actionable shortlists, we need a systematic scoring system that weights metrics based on positional requirements.

scoring_system.R / scoring_system.py
# Python: Build position-based scoring system
import pandas as pd
from typing import Dict, List
from scipy import stats

class PlayerScorer:
    """Position-based player scoring system."""

    def __init__(self):
        # Position-specific metric weights
        self.position_weights = {
            "Striker": {
                "npxG_p90": 0.30,
                "Gls_p90": 0.25,
                "xAG_p90": 0.10,
                "AerWon_p90": 0.10,
                "Press_p90": 0.10,
                "PrgC_p90": 0.15
            },
            "Winger": {
                "xAG_p90": 0.25,
                "PrgC_p90": 0.20,
                "Succ_p90": 0.15,  # Successful dribbles
                "npxG_p90": 0.15,
                "Crs_p90": 0.15,
                "Press_p90": 0.10
            },
            "Central_Midfielder": {
                "PrgP_p90": 0.25,
                "Cmp%": 0.15,
                "TklW_p90": 0.15,
                "xAG_p90": 0.15,
                "Press_p90": 0.15,
                "npxG_p90": 0.15
            },
            "Center_Back": {
                "AerWon_p90": 0.20,
                "TklW_p90": 0.20,
                "Int_p90": 0.15,
                "PrgP_p90": 0.15,
                "Clr_p90": 0.15,
                "Blocks_p90": 0.15
            }
        }

    def calculate_percentiles(self, data: pd.DataFrame,
                              metrics: List[str]) -> pd.DataFrame:
        """Calculate percentile ranks for metrics."""
        df = data.copy()
        for metric in metrics:
            if metric in df.columns:
                df[f"{metric}_pct"] = stats.rankdata(
                    df[metric], method="average"
                ) / len(df) * 100
        return df

    def score_players(self, data: pd.DataFrame,
                      position: str) -> pd.DataFrame:
        """Score players for a specific position."""
        if position not in self.position_weights:
            raise ValueError(f"Unknown position: {position}")

        weights = self.position_weights[position]
        metrics = list(weights.keys())

        # Calculate percentiles
        df = self.calculate_percentiles(data, metrics)

        # Calculate weighted score
        df["position_score"] = 0
        for metric, weight in weights.items():
            pct_col = f"{metric}_pct"
            if pct_col in df.columns:
                df["position_score"] += weight * df[pct_col]

        # Rank players
        df = df.sort_values("position_score", ascending=False)
        df["rank"] = range(1, len(df) + 1)

        return df

    def generate_position_ranking(self, data: pd.DataFrame,
                                   position: str,
                                   top_n: int = 20) -> pd.DataFrame:
        """Generate top N players for a position."""
        scored = self.score_players(data, position)
        return scored.head(top_n)[[
            "Player", "Squad", "Age", "Min",
            "position_score", "rank"
        ]]

# Example usage
scorer = PlayerScorer()
print("Scoring system initialized with position weights:")
for pos, weights in scorer.position_weights.items():
    print(f"  {pos}: {list(weights.keys())}")
# R: Build position-based scoring system
library(tidyverse)

# Define position-specific metric weights
position_weights <- list(
  "Striker" = c(
    npxg_p90 = 0.30,
    goals_p90 = 0.25,
    xa_p90 = 0.10,
    aerial_wins_p90 = 0.10,
    pressures_p90 = 0.10,
    progressive_carries_p90 = 0.15
  ),
  "Winger" = c(
    xa_p90 = 0.25,
    progressive_carries_p90 = 0.20,
    successful_dribbles_p90 = 0.15,
    npxg_p90 = 0.15,
    crosses_p90 = 0.15,
    pressures_p90 = 0.10
  ),
  "Central_Midfielder" = c(
    progressive_passes_p90 = 0.25,
    pass_completion = 0.15,
    tackles_won_p90 = 0.15,
    xa_p90 = 0.15,
    pressures_p90 = 0.15,
    npxg_p90 = 0.15
  ),
  "Center_Back" = c(
    aerial_wins_p90 = 0.20,
    tackles_won_p90 = 0.20,
    interceptions_p90 = 0.15,
    progressive_passes_p90 = 0.15,
    clearances_p90 = 0.15,
    blocks_p90 = 0.15
  )
)

# Scoring function
score_player <- function(player_data, position, weights_list = position_weights) {
  weights <- weights_list[[position]]
  if (is.null(weights)) {
    warning(paste("No weights defined for position:", position))
    return(NA)
  }

  # Calculate weighted score (expects metric values that are already
  # percentile-ranked, e.g. by calculate_percentiles() below)
  score <- 0
  for (metric in names(weights)) {
    if (metric %in% names(player_data)) {
      score <- score + weights[[metric]] * player_data[[metric]]
    }
  }

  return(score)
}

# Calculate percentile ranks for all metrics
calculate_percentiles <- function(data, metrics) {
  for (metric in metrics) {
    if (metric %in% names(data)) {
      data[[paste0(metric, "_pct")]] <- percent_rank(data[[metric]]) * 100
    }
  }
  return(data)
}

# Full scoring pipeline
score_players_for_position <- function(data, position,
                                        weights_list = position_weights) {
  weights <- weights_list[[position]]
  metrics <- names(weights)

  # Calculate percentiles
  data <- calculate_percentiles(data, metrics)

  # Calculate weighted score
  data$position_score <- 0
  for (metric in metrics) {
    pct_col <- paste0(metric, "_pct")
    if (pct_col %in% names(data)) {
      data$position_score <- data$position_score +
        weights[[metric]] * data[[pct_col]]
    }
  }

  # Rank players
  data %>%
    arrange(desc(position_score)) %>%
    mutate(rank = row_number())
}

cat("Scoring system defined with position-specific weights\n")
Output
Scoring system initialized with position weights:
  Striker: ['npxG_p90', 'Gls_p90', 'xAG_p90', 'AerWon_p90', 'Press_p90', 'PrgC_p90']
  Winger: ['xAG_p90', 'PrgC_p90', 'Succ_p90', 'npxG_p90', 'Crs_p90', 'Press_p90']
  Central_Midfielder: ['PrgP_p90', 'Cmp%', 'TklW_p90', 'xAG_p90', 'Press_p90', 'npxG_p90']
  Center_Back: ['AerWon_p90', 'TklW_p90', 'Int_p90', 'PrgP_p90', 'Clr_p90', 'Blocks_p90']
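The percentile-then-weight logic is easy to verify on toy data. This sketch uses pandas' `rank(pct=True)` in place of `scipy.stats.rankdata` (equivalent for untied data, up to the factor of 100 applied here); the three players and two-metric weight set are invented for illustration:

```python
import pandas as pd

# Three toy strikers, two metrics; weights sum to 1.0
data = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "npxG_p90": [0.60, 0.40, 0.20],
    "Gls_p90": [0.30, 0.55, 0.25],
})
weights = {"npxG_p90": 0.6, "Gls_p90": 0.4}

# Percentile-rank each metric (0-100), then blend with the weights
data["position_score"] = sum(
    w * data[m].rank(pct=True) * 100 for m, w in weights.items()
)

ranking = data.sort_values("position_score", ascending=False)
print(ranking["Player"].tolist())  # ['A', 'B', 'C']
```

Player A wins despite B's better raw goal rate because the weights put 60% of the score on non-penalty xG, where A tops the percentile table: exactly the trade-off the position-weight dictionaries above are meant to encode.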

Player Similarity Engine

Finding similar players is essential for replacement analysis and identifying alternatives. We'll implement cosine similarity on standardized metrics.

similarity_engine.R / similarity_engine.py
# Python: Build player similarity engine
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Optional, Dict

class SimilarityEngine:
    """Engine for finding similar players."""

    def __init__(self, feature_cols: List[str]):
        self.feature_cols = feature_cols
        self.scaler = StandardScaler()
        self.similarity_matrix = None
        self.player_index = None

    def fit(self, data: pd.DataFrame):
        """Fit the similarity model on player data."""
        # Extract features
        features = data[self.feature_cols].fillna(0)

        # Standardize
        scaled_features = self.scaler.fit_transform(features)

        # Calculate similarity matrix
        self.similarity_matrix = cosine_similarity(scaled_features)

        # Store player index mapping
        self.player_index = {
            player: idx for idx, player in enumerate(data["Player"])
        }

        return self

    def find_similar(self, target_player: str,
                     top_n: int = 10) -> Dict[str, float]:
        """Find most similar players to target."""
        if target_player not in self.player_index:
            raise ValueError(f"Player not found: {target_player}")

        idx = self.player_index[target_player]
        similarities = self.similarity_matrix[idx]

        # Get indices sorted by similarity
        similar_indices = np.argsort(similarities)[::-1]

        # Build result dict (excluding self); precompute the name list
        # instead of rebuilding it on every iteration
        player_names = list(self.player_index.keys())
        results = {}
        for i in similar_indices:
            name = player_names[i]
            if name == target_player:
                continue
            results[name] = similarities[i]
            if len(results) >= top_n:
                break

        return results

    def find_similar_with_filters(self, target_player: str,
                                   data: pd.DataFrame,
                                   max_age: Optional[int] = None,
                                   max_value: Optional[float] = None,
                                   min_minutes: int = 900,
                                   leagues: Optional[List[str]] = None,
                                   top_n: int = 10) -> pd.DataFrame:
        """Find similar players with filters applied."""
        # Get similarity scores
        all_similar = self.find_similar(target_player, top_n=100)

        # Create results dataframe
        results = pd.DataFrame([
            {"Player": name, "similarity": score}
            for name, score in all_similar.items()
        ])

        # Merge with full data
        results = results.merge(data, on="Player", how="left")

        # Apply filters
        if max_age is not None:
            results = results[results["Age"] <= max_age]
        if max_value is not None:
            results = results[results["market_value"] <= max_value]
        if min_minutes is not None:
            results = results[results["Min"] >= min_minutes]
        if leagues is not None:
            results = results[results["league"].isin(leagues)]

        return results.head(top_n)[[
            "Player", "similarity", "Age", "Squad",
            "market_value", "Min"
        ]]

# Example usage
feature_cols = ["npxG_p90", "xAG_p90", "PrgC_p90", "PrgP_p90", "Press_p90"]
similarity_engine = SimilarityEngine(feature_cols)
print(f"Similarity engine initialized with features: {feature_cols}")
# R: Build player similarity engine
library(tidyverse)

# Similarity Engine using cosine similarity
calculate_cosine_similarity <- function(vec1, vec2) {
  sum(vec1 * vec2) / (sqrt(sum(vec1^2)) * sqrt(sum(vec2^2)))
}

# Build similarity matrix
build_similarity_matrix <- function(data, feature_cols) {
  # Standardize features
  feature_data <- data %>%
    select(all_of(feature_cols)) %>%
    mutate(across(everything(), ~scale(.)[,1]))

  # Replace NA with 0
  feature_data[is.na(feature_data)] <- 0

  # Convert to matrix
  feature_matrix <- as.matrix(feature_data)

  # Calculate similarity for all pairs
  n <- nrow(feature_matrix)
  similarity_matrix <- matrix(0, n, n)

  for (i in 1:n) {
    for (j in 1:n) {
      similarity_matrix[i, j] <- calculate_cosine_similarity(
        feature_matrix[i,], feature_matrix[j,]
      )
    }
  }

  # Set row/column names
  rownames(similarity_matrix) <- data$player
  colnames(similarity_matrix) <- data$player

  return(similarity_matrix)
}

# Find most similar players
find_similar_players <- function(target_player, similarity_matrix, top_n = 10) {
  if (!target_player %in% rownames(similarity_matrix)) {
    stop(paste("Player not found:", target_player))
  }

  similarities <- similarity_matrix[target_player, ]
  similarities <- similarities[names(similarities) != target_player]

  sorted_sim <- sort(similarities, decreasing = TRUE)
  head(sorted_sim, top_n)
}

# Enhanced similarity with filters
find_similar_with_filters <- function(target_player, data, similarity_matrix,
                                       max_age = NULL, max_value = NULL,
                                       min_minutes = 900, top_n = 10) {
  # Get base similar players
  similarities <- find_similar_players(target_player, similarity_matrix, top_n = 50)

  # Create result dataframe
  results <- tibble(
    player = names(similarities),
    similarity = as.numeric(similarities)
  ) %>%
    left_join(data, by = "player")

  # Apply filters
  if (!is.null(max_age)) {
    results <- results %>% filter(age <= max_age)
  }
  if (!is.null(max_value)) {
    results <- results %>% filter(market_value <= max_value)
  }
  results <- results %>% filter(minutes >= min_minutes)

  # Return top N after filters
  results %>%
    head(top_n) %>%
    select(player, similarity, age, team, market_value, minutes)
}

cat("Similarity engine functions defined\n")
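Under the hood, cosine similarity needs nothing beyond NumPy: z-score the feature columns (what `StandardScaler` does), normalize each player row to unit length, and take dot products. A toy check with an obvious near-duplicate and an obvious opposite profile (all values invented):

```python
import numpy as np

# Three players x three feature columns
features = np.array([
    [0.90, 0.80, 0.10],   # player X
    [0.85, 0.75, 0.20],   # player Y: near-duplicate of X
    [0.10, 0.20, 0.90],   # player Z: opposite profile
])

# z-score each column, as StandardScaler does in the engine above
z = (features - features.mean(axis=0)) / features.std(axis=0)

# cosine similarity = dot product of unit-length rows
unit = z / np.linalg.norm(z, axis=1, keepdims=True)
sim = unit @ unit.T

print(round(float(sim[0, 1]), 3))  # X vs Y: close to +1
print(round(float(sim[0, 2]), 3))  # X vs Z: close to -1
```

The diagonal of `sim` is always 1 (every player is identical to themselves), which is why both the Python engine and the R `find_similar_players` function explicitly exclude the target from their results.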

Replacement Analysis

replacement_analysis.R / replacement_analysis.py
# Python: Replacement analysis workflow
import pandas as pd
from typing import Optional, Tuple

class ReplacementAnalyzer:
    """Analyze and recommend player replacements."""

    def __init__(self, similarity_engine: SimilarityEngine):
        self.similarity_engine = similarity_engine

    def find_replacements(self, departing_player: str,
                          data: pd.DataFrame,
                          budget: float,
                          position_filter: Optional[str] = None,
                          age_range: Tuple[int, int] = (18, 28),
                          top_n: int = 10) -> pd.DataFrame:
        """Find replacement candidates for a departing player."""

        # Get similar players
        candidates = self.similarity_engine.find_similar_with_filters(
            departing_player,
            data,
            max_age=age_range[1],
            max_value=budget,
            top_n=50
        )

        # Filter by position
        if position_filter and "Pos" in candidates.columns:
            candidates = candidates[
                candidates["Pos"].str.contains(position_filter, na=False)
            ]

        # Filter by age range
        candidates = candidates[
            (candidates["Age"] >= age_range[0]) &
            (candidates["Age"] <= age_range[1])
        ]

        # Calculate recommendation scores
        candidates = candidates.copy()
        candidates["value_score"] = (
            candidates["similarity"] * 100 /
            (candidates["market_value"] + 1)
        )
        candidates["development_potential"] = (28 - candidates["Age"]) * 2

        # Normalize scores
        max_value_score = candidates["value_score"].max()
        max_dev = candidates["development_potential"].max()

        candidates["overall_recommendation"] = (
            candidates["similarity"] * 0.5 +
            (candidates["value_score"] / max_value_score) * 0.3 +
            (candidates["development_potential"] / max_dev) * 0.2
        )

        return candidates.sort_values(
            "overall_recommendation", ascending=False
        ).head(top_n)

    def generate_report(self, departing_player: str,
                        replacements: pd.DataFrame) -> str:
        """Generate text report for replacements."""
        report = f"""
=== REPLACEMENT ANALYSIS FOR: {departing_player} ===

TOP 5 RECOMMENDED REPLACEMENTS:
{"-" * 50}
"""
        for i, (_, r) in enumerate(replacements.head(5).iterrows(), 1):
            report += f"""
{i}. {r["Player"]} ({r.get("Squad", "N/A")})
   Age: {r["Age"]} | Value: €{r.get("market_value", 0):.1f}M
   Similarity: {r["similarity"]*100:.1f}% | Value Score: {r["value_score"]:.2f}
   Recommendation Score: {r["overall_recommendation"]:.2f}
"""
        return report

# Example
print("Replacement analysis system ready")
# R: Replacement analysis workflow
library(tidyverse)

# Find replacement candidates for a departing player
find_replacements <- function(departing_player, data, similarity_matrix,
                              budget, position_filter = NULL,
                              age_range = c(18, 28)) {

  # Find similar players
  candidates <- find_similar_with_filters(
    departing_player, data, similarity_matrix,
    max_age = age_range[2],
    max_value = budget,
    min_minutes = 900,
    top_n = 20
  )

  # Filter by position if specified
  if (!is.null(position_filter)) {
    candidates <- candidates %>%
      filter(grepl(position_filter, position))
  }

  # Filter by age
  candidates <- candidates %>%
    filter(age >= age_range[1], age <= age_range[2])

  # Calculate value score
  candidates <- candidates %>%
    mutate(
      value_score = similarity * 100 / (market_value + 1),
      development_potential = (28 - age) * 2,
      overall_recommendation = similarity * 0.5 +
                               (value_score / max(value_score)) * 0.3 +
                               (development_potential / max(development_potential)) * 0.2
    ) %>%
    arrange(desc(overall_recommendation))

  return(candidates)
}

# Generate replacement report
generate_replacement_report <- function(departing_player, replacements) {
  cat(sprintf("\n=== REPLACEMENT ANALYSIS FOR: %s ===\n\n", departing_player))

  cat("TOP 5 RECOMMENDED REPLACEMENTS:\n")
  cat(paste(rep("-", 50), collapse = ""), "\n")

  for (i in seq_len(min(5, nrow(replacements)))) {
    r <- replacements[i, ]
    cat(sprintf("%d. %s (%s)\n", i, r$player, r$team))
    cat(sprintf("   Age: %d | Value: €%.1fM\n", r$age, r$market_value))
    cat(sprintf("   Similarity: %.1f%% | Value Score: %.2f\n",
                r$similarity * 100, r$value_score))
    cat(sprintf("   Recommendation Score: %.2f\n\n", r$overall_recommendation))
  }
}
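The recommendation blend rewards fit, value for money, and youth, which is why it can prefer a slightly less similar but cheaper, younger candidate. A two-player toy check of the same formula (candidate names and numbers invented):

```python
import pandas as pd

# Toy candidate table as produced by the similarity step above
cands = pd.DataFrame({
    "Player": ["A", "B"],
    "similarity": [0.90, 0.80],
    "Age": [27, 21],
    "market_value": [40.0, 15.0],  # millions
})

# Same three components as the analyzer: fit, value for money, youth
cands["value_score"] = cands["similarity"] * 100 / (cands["market_value"] + 1)
cands["development_potential"] = (28 - cands["Age"]) * 2

cands["overall_recommendation"] = (
    cands["similarity"] * 0.5
    + (cands["value_score"] / cands["value_score"].max()) * 0.3
    + (cands["development_potential"] / cands["development_potential"].max()) * 0.2
)

best = cands.sort_values("overall_recommendation", ascending=False).iloc[0]
print(best["Player"])  # B
```

Player B is only 80% similar but tops both the value-for-money and development components, so the 0.5/0.3/0.2 blend ranks B first. Tuning those three weights is how a club shifts the system between "closest stylistic match" and "best market opportunity".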

Scouting Report Generator

The final output of our recruitment system is comprehensive scouting reports that combine quantitative analysis with formatted output suitable for decision-makers.

scouting_reports.R / scouting_reports.py
# Python: Generate comprehensive scouting reports
import pandas as pd
import numpy as np
from typing import Dict, List, Optional
from datetime import datetime
import matplotlib.pyplot as plt

class ScoutingReportGenerator:
    """Generate comprehensive scouting reports."""

    def __init__(self, similarity_engine: SimilarityEngine):
        self.similarity_engine = similarity_engine

    def create_radar_chart(self, player_data: Dict,
                           metrics: List[str],
                           title: str) -> plt.Figure:
        """Create radar chart for player profile."""
        # Number of metrics
        N = len(metrics)
        angles = [n / float(N) * 2 * np.pi for n in range(N)]
        angles += angles[:1]  # Complete the loop

        # Get values
        values = [player_data.get(m, 0) for m in metrics]
        values += values[:1]

        # Create plot
        fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
        ax.plot(angles, values, "o-", linewidth=2)
        ax.fill(angles, values, alpha=0.25)
        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(metrics, size=10)
        ax.set_title(title, size=14, fontweight="bold", y=1.08)

        return fig

    def generate_report(self, player_name: str,
                        data: pd.DataFrame) -> str:
        """Generate full scouting report."""

        player = data[data["Player"] == player_name]
        if player.empty:
            raise ValueError(f"Player not found: {player_name}")

        p = player.iloc[0]

        # Build report sections
        header = f"""
================================================================================
                         SCOUTING REPORT
================================================================================
Player: {p.get("Player", "N/A")}
Position: {p.get("Pos", "N/A")} | Age: {p.get("Age", "N/A")}
Current Club: {p.get("Squad", "N/A")} ({p.get("league", "N/A")})
Market Value: €{p.get("market_value", 0):.1f}M
Minutes Played: {p.get("Min", 0)} ({p.get("Min", 0)/90:.1f} 90s)
================================================================================
"""

        performance = f"""
PERFORMANCE METRICS (Per 90 Minutes):
-------------------------------------
Goals: {p.get("Gls_p90", 0):.2f}          | xG: {p.get("xG_p90", 0):.2f}
Assists: {p.get("Ast_p90", 0):.2f}        | xA: {p.get("xAG_p90", 0):.2f}
Non-Penalty xG: {p.get("npxG_p90", 0):.2f}
Goal Contribution: {p.get("goal_contribution_p90", 0):.2f}
"""

        # Get similar players
        try:
            similar = self.similarity_engine.find_similar(player_name, top_n=5)
            similar_text = "\n".join([
                f"  - {name} ({score*100:.1f}% similar)"
                for name, score in similar.items()
            ])
        except Exception:
            similar_text = "  (Unable to calculate)"

        similar_section = f"""
SIMILAR PLAYERS:
----------------
{similar_text}
"""

        recommendation = """
RECOMMENDATION:
---------------
[To be filled by scout based on video analysis]

Strengths:
-

Weaknesses:
-

Fit Assessment:
-

Risk Factors:
-

================================================================================
"""
        report = header + performance + similar_section + recommendation

        # Add metadata (requires: from datetime import datetime)
        report += f"""
Report Generated: {datetime.now().strftime("%Y-%m-%d %H:%M")}
Data Source: FBref / StatsBomb
"""

        return report

    def export_report(self, report: str, filename: str):
        """Export report to file."""
        with open(filename, "w") as f:
            f.write(report)
        print(f"Report saved to {filename}")

# Example usage
print("Scouting report generator ready")
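The radar chart above expects percentile values for each metric. A minimal sketch of that step, assuming the FBref-style column names used throughout this chapter (a `Player` column plus per-90 metric columns), might look like:

```python
import pandas as pd

def percentile_values(data: pd.DataFrame, player_name: str,
                      metrics: list) -> list:
    """Rank each metric as a 0-100 percentile, then pull one player's row."""
    pct = data[metrics].rank(pct=True) * 100
    return pct[data["Player"] == player_name].iloc[0].tolist()

# Toy data for illustration
players = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Gls_p90": [0.6, 0.3, 0.1, 0.4],
    "xG_p90": [0.5, 0.4, 0.2, 0.3],
})
print(percentile_values(players, "A", ["Gls_p90", "xG_p90"]))
```

These percentiles are what you would pass as `values` into `create_radar_chart` (after closing the loop by repeating the first value).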
# R: Generate comprehensive scouting reports
library(tidyverse)
library(ggplot2)

# Create radar-style chart (polar bar chart) for a player
create_player_radar <- function(player_data, metrics, title) {
  # Reshape selected metrics to long format for plotting
  radar_data <- player_data %>%
    select(all_of(metrics)) %>%
    pivot_longer(everything(), names_to = "metric", values_to = "value")

  # Create plot
  ggplot(radar_data, aes(x = metric, y = value)) +
    geom_col(fill = "steelblue", alpha = 0.7) +
    coord_polar() +
    theme_minimal() +
    labs(title = title) +
    theme(axis.text.x = element_text(size = 8))
}

# Generate full scouting report
generate_scouting_report <- function(player_name, data, similarity_matrix,
                                      league_adjustments) {

  player <- data %>% filter(player == player_name)

  if (nrow(player) == 0) {
    stop(paste("Player not found:", player_name))
  }

  # Build report
  report <- list()

  # Header section
  report$header <- sprintf("
================================================================================
                         SCOUTING REPORT
================================================================================
Player: %s
Position: %s | Age: %d | Nationality: %s
Current Club: %s (%s)
Contract Expires: %s
Market Value: €%.1fM
Minutes Played: %d (%.1f 90s)
================================================================================
",
    player$player,
    player$position,
    player$age,
    player$nationality,
    player$team,
    player$league,
    player$contract_expiry,
    player$market_value,
    player$minutes,
    player$minutes / 90
  )

  # Performance metrics
  report$performance <- sprintf("
PERFORMANCE METRICS (Per 90 Minutes):
-------------------------------------
Goals: %.2f          | xG: %.2f           | Overperformance: %+.2f
Assists: %.2f        | xA: %.2f           | Overperformance: %+.2f
Non-Penalty xG: %.2f

League-Adjusted Metrics (to Premier League):
Goals (adj): %.2f    | xG (adj): %.2f     | xA (adj): %.2f
",
    player$goals_p90, player$xg_p90, player$goals_p90 - player$xg_p90,
    player$assists_p90, player$xa_p90, player$assists_p90 - player$xa_p90,
    player$npxg_p90,
    player$adj_goals_p90, player$adj_xg_p90, player$adj_xa_p90
  )

  # Percentile rankings
  report$rankings <- "
PERCENTILE RANKINGS (vs Position):
----------------------------------
See attached radar chart
"

  # Similar players
  similar <- find_similar_players(player_name, similarity_matrix, top_n = 5)
  similar_text <- paste(
    sprintf("  - %s (%.1f%% similar)", names(similar), similar * 100),
    collapse = "\n"
  )
  report$similar <- sprintf("
SIMILAR PLAYERS:
----------------
%s
", similar_text)

  # Recommendation
  report$recommendation <- "
RECOMMENDATION:
---------------
[To be filled by scout based on video analysis]

Strengths:
-

Weaknesses:
-

Fit Assessment:
-

Risk Factors:
-

================================================================================
"

  # Combine all sections
  full_report <- paste(
    report$header,
    report$performance,
    report$rankings,
    report$similar,
    report$recommendation,
    sep = "\n"
  )

  return(full_report)
}

Practice Exercises

Exercise 1: Build a Complete Recruitment Pipeline

Implement the complete recruitment system from scratch. Load data from FBref for the top 5 leagues, calculate per-90 metrics, apply league adjustments, and create a searchable database with filtering capabilities.
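As a starting point for the pipeline, the per-90 conversion step can be sketched as follows. The column names are assumptions in the FBref style used earlier in the chapter:

```python
import pandas as pd

def add_per90(df: pd.DataFrame, cols: list,
              minutes_col: str = "Min") -> pd.DataFrame:
    """Divide raw season totals by the number of 90s played."""
    out = df.copy()
    nineties = out[minutes_col] / 90
    for c in cols:
        out[f"{c}_p90"] = out[c] / nineties
    return out

# Toy data for illustration
squad = pd.DataFrame({"Player": ["A", "B"], "Min": [1800, 900], "Gls": [10, 4]})
print(add_per90(squad, ["Gls"]))
```

In a full pipeline you would also filter out low-minute players before ranking, since tiny samples produce unstable per-90 rates.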

Exercise 2: Custom Position Weights

Design custom scoring weights for a modern "inverted full-back" role. Identify the key metrics that define this position and create a scoring system. Test it by finding the top 10 inverted full-backs in Europe.
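One way to start: a weighted z-score composite. The metric names and weights below are illustrative assumptions, not a definitive inverted full-back profile:

```python
import pandas as pd

def weighted_score(df: pd.DataFrame, weights: dict) -> pd.Series:
    """Standardize each metric, then combine with role-specific weights."""
    cols = list(weights)
    z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
    return sum(w * z[m] for m, w in weights.items())

# Toy data for illustration
fullbacks = pd.DataFrame({
    "prog_passes_p90": [6.0, 3.0, 9.0],
    "tackles_p90": [2.0, 3.0, 1.0],
})
scores = weighted_score(fullbacks, {"prog_passes_p90": 0.7, "tackles_p90": 0.3})
print(scores.round(2).tolist())
```

Standardizing first keeps high-volume metrics (passes) from drowning out low-volume ones (tackles) in the composite.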

Exercise 3: Replacement Analysis Case Study

A top-6 Premier League club is losing their starting striker (28 years old, €60M value). Use the replacement analysis system to identify the top 10 replacement candidates with a budget of €40M and maximum age of 26.
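A filtering skeleton for the candidate pool might look like this; the column names and the precomputed `score` column are assumptions carried over from the scoring system built earlier:

```python
import pandas as pd

def replacement_shortlist(df: pd.DataFrame, position: str, budget: float,
                          max_age: int, top_n: int = 10) -> pd.DataFrame:
    """Filter by position, price, and age, then rank by a precomputed score."""
    pool = df[(df["Pos"] == position)
              & (df["market_value"] <= budget)
              & (df["Age"] <= max_age)]
    return pool.sort_values("score", ascending=False).head(top_n)

# Toy data for illustration
strikers = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Pos": ["FW", "FW", "FW", "MF"],
    "Age": [24, 27, 25, 23],
    "market_value": [35.0, 30.0, 55.0, 20.0],
    "score": [0.8, 0.9, 0.95, 0.7],
})
print(replacement_shortlist(strikers, "FW", budget=40, max_age=26)["Player"].tolist())
```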

Summary

Key Takeaways
  • System architecture: Modern recruitment systems have data, analytics, application, and output layers
  • Data pipeline: Robust ingestion with league adjustments is foundational
  • Position-specific scoring: Different metrics matter for different positions
  • Similarity search: Cosine similarity on standardized metrics finds comparable players
  • Report generation: Combine quantitative analysis with formatted output for decision-makers
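The cosine-similarity step named above can be sketched in a few lines; this is a minimal version without the position filtering used in the full engine:

```python
import numpy as np
import pandas as pd

def cosine_similarity(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise cosine similarity on z-scored metric columns."""
    z = ((df - df.mean()) / df.std(ddof=0)).to_numpy()
    unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    return pd.DataFrame(unit @ unit.T, index=df.index, columns=df.index)

# Toy data for illustration
metrics = pd.DataFrame(
    {"xG_p90": [0.5, 0.45, 0.1], "xAG_p90": [0.2, 0.25, 0.6]},
    index=["A", "B", "C"],
)
sim = cosine_similarity(metrics)
print(sim.round(2))
```

Each player's similarity to themselves is 1.0, and profiles with the same shape score higher than profiles with opposite tendencies.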
System Components Built
  • Player data model and database structure
  • Multi-source data ingestion pipeline
  • Position-based scoring and ranking system
  • Player similarity engine with filtering
  • Replacement analysis workflow
  • Automated scouting report generator