Chapter 60

Capstone - Complete Analytics System

Learning Objectives
  • Understand the concept of data-driven player role classification
  • Select and engineer features for player clustering
  • Apply K-Means clustering to identify player archetypes
  • Use hierarchical clustering for role taxonomy
  • Implement dimensionality reduction (PCA, t-SNE, UMAP) for visualization
  • Validate and interpret clustering results
  • Build player similarity search systems
  • Apply role classification to recruitment and tactical planning

Beyond Traditional Positions

Traditional positions (striker, midfielder, defender) are increasingly inadequate for describing modern player roles. A "midfielder" can be a defensive destroyer, a creative playmaker, or a box-to-box runner—each requiring completely different skill sets. Data-driven clustering reveals the true archetypes that exist in football.

Why Cluster Players?

Clustering allows us to: (1) discover natural player groupings without predefined categories, (2) find similar players for recruitment, (3) understand tactical roles beyond positional labels, and (4) identify unique or hybrid players who don't fit standard molds.

Traditional Positions
  • GK, DEF, MID, FWD
  • Based on starting location
  • Ignores playing style differences
  • Same label, very different players
Data-Driven Roles
  • Ball-Playing CB, Destroyer, Regista, False 9...
  • Based on actual actions and behaviors
  • Captures style, not just position
  • Similar label = truly similar players
player_variance.py
# Python: Introduction to Player Clustering
import pandas as pd
import numpy as np

# Example: Same position, different players
midfielders = player_stats[
    (player_stats["position"] == "Midfielder") &
    (player_stats["minutes"] >= 900)
].copy()

# Look at variance within "Midfielder"
metrics = ["tackles_90", "progressive_passes_90", "xa_90", "npxg_90"]
midfielder_variance = midfielders[metrics].agg(["mean", "std", lambda x: x.max() - x.min()])
midfielder_variance.index = ["mean", "std", "range"]

print("Variance within Midfielders:")
print(midfielder_variance)

# Create a simple style score
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

midfielders["style_score"] = (
    scaler.fit_transform(midfielders[["xa_90"]]).flatten() -
    scaler.fit_transform(midfielders[["tackles_90"]]).flatten()
)

extremes = midfielders.sort_values("style_score")[
    ["player", "team", "tackles_90", "xa_90", "style_score"]
]

print("\nMost Defensive Midfielders:")
print(extremes.head())

print("\nMost Creative Midfielders:")
print(extremes.tail())
# R: Introduction to Player Clustering
library(tidyverse)

# Example: Same position, different players
midfielders <- player_stats %>%
    filter(position == "Midfielder", minutes >= 900)

# Look at the variance within "Midfielder"
midfielder_variance <- midfielders %>%
    summarize(
        across(c(tackles_90, progressive_passes_90, xa_90, npxg_90),
               list(mean = mean, sd = sd, range = ~ max(.) - min(.)))
    )

print("Variance within Midfielders:")
print(midfielder_variance)

# Clearly, "Midfielder" contains very different player types!
# Let's see the extremes
extremes <- midfielders %>%
    mutate(
        style_score = scale(xa_90) - scale(tackles_90)
    ) %>%
    arrange(style_score) %>%
    select(player, team, tackles_90, xa_90, style_score)

cat("\nMost Defensive Midfielders:\n")
print(head(extremes, 5))

cat("\nMost Creative Midfielders:\n")
print(tail(extremes, 5))
Output
Variance within Midfielders:
       tackles_90  progressive_passes_90    xa_90  npxg_90
mean        2.45                   4.82     0.12     0.08
std         1.23                   2.15     0.09     0.06
range       5.87                  11.24     0.48     0.32

Most Defensive Midfielders:
              player         team  tackles_90  xa_90  style_score
42   N'Golo Kanté      Chelsea        4.82   0.03        -2.31
18  Wilfred Ndidi    Leicester        4.56   0.05        -2.12

Most Creative Midfielders:
              player         team  tackles_90  xa_90  style_score
87  Kevin De Bruyne   Man City        1.23   0.42         2.45
91   Bruno Fernandes   Man United      1.45   0.38         2.21

Feature Engineering for Clustering

The quality of clustering depends heavily on feature selection and engineering. Features should capture distinct aspects of playing style and be normalized appropriately.

Feature Category    Example Metrics                                     What It Captures
Attacking Output    npxG/90, shots/90, xA/90                            Goal threat and chance creation
Passing Profile     Pass completion %, progressive passes, long balls   Passing style and risk tolerance
Defensive Actions   Tackles, interceptions, pressures, blocks           Defensive contribution and aggression
Ball Carrying       Progressive carries, dribbles, touches in box       Ball progression and directness
Spatial Profile     Average position, touch zones, action heatmap       Where they operate on the pitch
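
The scaling step matters more than it might look: with raw per-90 numbers, high-magnitude features like touches swamp low-magnitude ones like xA in any Euclidean distance. A quick sketch with made-up values (not from the chapter's dataset):

```python
import numpy as np

# Three hypothetical midfielders described by [touches_90, xa_90] (made-up values)
a = np.array([70.0, 0.40])  # high volume, creative
b = np.array([66.0, 0.02])  # similar volume, not creative
c = np.array([55.0, 0.38])  # lower volume, similarly creative

# Unscaled, Euclidean distance is dominated by the touches column:
# a looks almost 4x closer to b than to c, despite b's very different creativity
print(np.linalg.norm(a - b), np.linalg.norm(a - c))

# After z-scoring each column, both dimensions contribute on a comparable scale
X = np.vstack([a, b, c])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.linalg.norm(Z[0] - Z[1]), np.linalg.norm(Z[0] - Z[2]))
```

This is why both the Python and R versions below standardize every feature before clustering.
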
feature_engineering.py
# Python: Feature Engineering for Player Clustering
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

def create_clustering_features(player_stats):
    """Create and scale features for player clustering."""
    # Filter for minimum minutes
    df = player_stats[player_stats["minutes"] >= 900].copy()

    # Create composite features
    df["goal_threat"] = df["npxg_90"] + (df["shots_90"] * 0.1)
    df["chance_creation"] = df["xa_90"] + (df["key_passes_90"] * 0.05)

    df["pass_risk"] = df["long_balls_90"] / (df["passes_90"] + 1)
    df["progressive_rate"] = df["progressive_passes_90"] / (df["passes_90"] + 1)

    df["defensive_actions"] = (
        df["tackles_90"] + df["interceptions_90"] + df["blocks_90"]
    )
    df["pressing_intensity"] = df["pressures_90"]

    df["ball_progression"] = df["progressive_carries_90"] + df["dribbles_90"]
    df["box_presence"] = df["touches_att_pen_90"]

    df["total_involvement"] = df["touches_90"]

    # Select features for clustering
    feature_cols = [
        "goal_threat", "chance_creation",
        "pass_completion_pct", "pass_risk", "progressive_rate",
        "defensive_actions", "pressing_intensity",
        "ball_progression", "box_presence",
        "total_involvement"
    ]

    # Extract features
    features_raw = df[["player", "team", "position", "minutes"] + feature_cols].copy()

    # Scale features
    scaler = StandardScaler()
    features_scaled = features_raw.copy()
    features_scaled[feature_cols] = scaler.fit_transform(features_raw[feature_cols])

    return {
        "raw": features_raw,
        "scaled": features_scaled,
        "feature_cols": feature_cols,
        "scaler": scaler
    }

clustering_data = create_clustering_features(player_stats)
print(clustering_data["scaled"].head())
# R: Feature Engineering for Player Clustering
library(tidyverse)
library(recipes)

create_clustering_features <- function(player_stats) {
    # Select and create features
    features <- player_stats %>%
        filter(minutes >= 900) %>%  # Minimum sample size
        mutate(
            # Attacking
            goal_threat = npxg_90 + (shots_90 * 0.1),
            chance_creation = xa_90 + (key_passes_90 * 0.05),

            # Passing
            pass_risk = long_balls_90 / (passes_90 + 1),
            progressive_rate = progressive_passes_90 / (passes_90 + 1),

            # Defensive
            defensive_actions = tackles_90 + interceptions_90 + blocks_90,
            pressing_intensity = pressures_90,

            # Ball carrying
            ball_progression = progressive_carries_90 + dribbles_90,
            box_presence = touches_att_pen_90,

            # Involvement
            total_involvement = touches_90
        ) %>%
        select(
            player, team, position, minutes,
            goal_threat, chance_creation,
            pass_completion_pct, pass_risk, progressive_rate,
            defensive_actions, pressing_intensity,
            ball_progression, box_presence,
            total_involvement
        )

    # Scale features using recipes
    feature_cols <- c("goal_threat", "chance_creation", "pass_completion_pct",
                      "pass_risk", "progressive_rate", "defensive_actions",
                      "pressing_intensity", "ball_progression", "box_presence",
                      "total_involvement")

    # Standardize (z-score)
    recipe <- recipe(~ ., data = features) %>%
        step_normalize(all_of(feature_cols))

    prepped <- prep(recipe)
    features_scaled <- bake(prepped, features)

    return(list(raw = features, scaled = features_scaled, feature_cols = feature_cols))
}

clustering_data <- create_clustering_features(player_stats)
print(head(clustering_data$scaled))
Output
         player       team    position  minutes  goal_threat  chance_creation  ...
0      Mo Salah  Liverpool     Forward     2890         2.31             1.85  ...
1       H. Kane  Tottenham     Forward     3102         2.45             0.92  ...
2  K. De Bruyne   Man City  Midfielder     2456         0.87             2.67  ...
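
The scaled matrix also supports the player-similarity search named in the learning objectives: nearest neighbours in standardized feature space. A minimal sketch with random stand-in data and invented names; in practice you would use clustering_data["scaled"] and its feature_cols:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the z-scored feature matrix; names invented for illustration
players = ["Player A", "Player B", "Player C", "Player D", "Player E", "Player F"]
Z = rng.normal(size=(len(players), 4))

def most_similar(name, players, Z, top_n=3):
    """Rank other players by Euclidean distance in scaled feature space."""
    i = players.index(name)
    dists = np.linalg.norm(Z - Z[i], axis=1)
    order = [j for j in np.argsort(dists) if j != i]  # skip the query player
    return [(players[j], round(float(dists[j]), 2)) for j in order[:top_n]]

print(most_similar("Player A", players, Z))
```

Because the features are z-scored, a small distance means two players behave similarly across every category, not just the one with the largest raw numbers.
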

K-Means Clustering

K-Means is a popular clustering algorithm that partitions players into K distinct groups based on similarity. Choosing the right K is crucial.

kmeans_clustering.py
# Python: K-Means Clustering for Player Roles
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

def perform_kmeans_clustering(data, feature_cols, max_k=15):
    """Perform K-Means clustering with optimal K selection."""
    # Extract feature matrix
    X = data["scaled"][feature_cols].values

    # Find optimal K
    wss = []
    sil_scores = []

    for k in range(2, max_k + 1):
        km = KMeans(n_clusters=k, n_init=25, max_iter=100, random_state=42)
        km.fit(X)
        wss.append(km.inertia_)
        sil_scores.append(silhouette_score(X, km.labels_))

    # Plot elbow and silhouette
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Elbow plot
    axes[0].plot(range(2, max_k + 1), wss, "bo-")
    axes[0].set_xlabel("Number of Clusters")
    axes[0].set_ylabel("Within-cluster SS")
    axes[0].set_title("Elbow Method")

    # Silhouette plot
    axes[1].plot(range(2, max_k + 1), sil_scores, "go-")
    axes[1].set_xlabel("Number of Clusters")
    axes[1].set_ylabel("Silhouette Score")
    axes[1].set_title("Silhouette Analysis")

    plt.tight_layout()
    plt.show()

    # Best K based on silhouette
    best_k = np.argmax(sil_scores) + 2

    # Final clustering
    final_km = KMeans(n_clusters=best_k, n_init=50, max_iter=100, random_state=42)
    data["scaled"]["cluster"] = final_km.fit_predict(X)

    return {
        "model": final_km,
        "data": data["scaled"],
        "best_k": best_k,
        "silhouette": max(sil_scores)
    }

km_result = perform_kmeans_clustering(clustering_data, clustering_data["feature_cols"])

print(f"Optimal number of clusters: {km_result['best_k']}")
print(f"Silhouette score: {km_result['silhouette']:.3f}")
print("\nCluster distribution:")
print(km_result["data"]["cluster"].value_counts().sort_index())
# R: K-Means Clustering for Player Roles
library(tidyverse)
library(cluster)
library(factoextra)

perform_kmeans_clustering <- function(data, feature_cols, max_k = 15) {
    # Extract feature matrix
    X <- data$scaled %>%
        select(all_of(feature_cols)) %>%
        as.matrix()

    # Find optimal K using elbow method and silhouette
    wss <- numeric(max_k)
    sil <- numeric(max_k)

    for (k in 2:max_k) {
        km <- kmeans(X, centers = k, nstart = 25, iter.max = 100)
        wss[k] <- km$tot.withinss
        sil[k] <- mean(silhouette(km$cluster, dist(X))[, 3])
    }

    # Elbow plot
    elbow_plot <- tibble(k = 1:max_k, wss = c(NA, wss[2:max_k])) %>%
        ggplot(aes(x = k, y = wss)) +
        geom_line() +
        geom_point() +
        labs(title = "Elbow Method", x = "Number of Clusters", y = "Within-cluster SS") +
        theme_minimal()

    # Silhouette plot
    sil_plot <- tibble(k = 2:max_k, silhouette = sil[2:max_k]) %>%
        ggplot(aes(x = k, y = silhouette)) +
        geom_line() +
        geom_point() +
        labs(title = "Silhouette Analysis", x = "Number of Clusters", y = "Avg Silhouette") +
        theme_minimal()

    # Best K based on silhouette
    best_k <- which.max(sil[2:max_k]) + 1

    # Final clustering
    final_km <- kmeans(X, centers = best_k, nstart = 50, iter.max = 100)

    # Add cluster labels
    data$scaled$cluster <- factor(final_km$cluster)

    return(list(
        model = final_km,
        data = data$scaled,
        elbow_plot = elbow_plot,
        sil_plot = sil_plot,
        best_k = best_k
    ))
}

km_result <- perform_kmeans_clustering(clustering_data, clustering_data$feature_cols)

cat(sprintf("Optimal number of clusters: %d\n", km_result$best_k))
print(table(km_result$data$cluster))
Output
Optimal number of clusters: 8
Silhouette score: 0.342

Cluster distribution:
0    87
1    65
2    52
3    48
4    43
5    38
6    31
7    24

Interpreting Clusters

interpret_clusters.py
# Python: Interpret and Label Clusters
import pandas as pd
import numpy as np

def interpret_clusters(km_result, feature_cols):
    """Interpret clusters and assign meaningful labels."""
    # Get cluster centers
    centers = pd.DataFrame(
        km_result["model"].cluster_centers_,
        columns=feature_cols
    )
    centers["cluster"] = range(len(centers))

    # Cluster summaries
    data = km_result["data"]
    cluster_summary = data.groupby("cluster").agg(
        n_players=("player", "count"),
        common_positions=("position", lambda x: ", ".join(
            x.value_counts().head(2).index.tolist()
        )),
        example_players=("player", lambda x: ", ".join(x.head(3).tolist()))
    ).reset_index()

    # Get distinguishing features
    def get_key_features(row):
        features = []
        for col in feature_cols:
            z = row[col]
            if z > 0.5:
                features.append(f"{col}: HIGH")
            elif z < -0.5:
                features.append(f"{col}: LOW")

        return "; ".join(sorted(features, key=lambda x: -abs(row[x.split(":")[0].strip()]))[:3])

    centers["key_features"] = centers.apply(get_key_features, axis=1)

    # Auto-label clusters
    def label_cluster(profile):
        if "goal_threat: HIGH" in profile:
            if "chance_creation: HIGH" in profile:
                return "Complete Forward"
            return "Goal Poacher"
        if "chance_creation: HIGH" in profile:
            return "Creative Playmaker"
        if "defensive_actions: HIGH" in profile:
            if "ball_progression: HIGH" in profile:
                return "Box-to-Box"
            return "Defensive Anchor"
        if "ball_progression: HIGH" in profile:
            return "Progressive Carrier"
        if "pressing_intensity: HIGH" in profile:
            return "High Press Forward"
        return "Utility Player"

    centers["label"] = centers["key_features"].apply(label_cluster)

    # Combine
    result = cluster_summary.merge(
        centers[["cluster", "key_features", "label"]],
        on="cluster"
    )

    return result

cluster_interpretation = interpret_clusters(km_result, clustering_data["feature_cols"])
print(cluster_interpretation[["cluster", "label", "n_players", "example_players"]])
# R: Interpret and Label Clusters
library(tidyverse)

interpret_clusters <- function(km_result, feature_cols) {
    # Get cluster centers
    centers <- as.data.frame(km_result$model$centers)
    names(centers) <- feature_cols
    centers$cluster <- factor(1:nrow(centers))  # factor to match km_result$data$cluster for the join

    # Cluster summaries
    cluster_summary <- km_result$data %>%
        group_by(cluster) %>%
        summarize(
            n_players = n(),
            common_positions = paste(names(sort(table(position), decreasing = TRUE)[1:2]),
                                    collapse = ", "),
            example_players = paste(head(player, 3), collapse = ", "),
            .groups = "drop"
        )

    # Get distinguishing features for each cluster
    cluster_profiles <- centers %>%
        pivot_longer(-cluster, names_to = "feature", values_to = "z_score") %>%
        group_by(cluster) %>%
        mutate(
            rank = rank(-abs(z_score)),
            direction = ifelse(z_score > 0.5, "HIGH",
                              ifelse(z_score < -0.5, "LOW", "AVG"))
        ) %>%
        filter(rank <= 3) %>%
        arrange(cluster, rank) %>%
        summarize(
            key_features = paste(
                paste(feature, direction, sep = ": "),
                collapse = "; "
            ),
            .groups = "drop"
        )

    # Auto-label clusters based on profiles
    label_cluster <- function(profile) {
        features <- strsplit(profile, "; ")[[1]]
        if (grepl("goal_threat: HIGH", profile)) {
            if (grepl("chance_creation: HIGH", profile)) return("Complete Forward")
            return("Goal Poacher")
        }
        if (grepl("chance_creation: HIGH", profile)) return("Creative Playmaker")
        if (grepl("defensive_actions: HIGH", profile)) {
            if (grepl("ball_progression: HIGH", profile)) return("Box-to-Box")
            return("Defensive Anchor")
        }
        if (grepl("ball_progression: HIGH", profile)) return("Progressive Carrier")
        if (grepl("pressing_intensity: HIGH", profile)) return("High Press Forward")
        return("Utility Player")
    }

    cluster_profiles$label <- sapply(cluster_profiles$key_features, label_cluster)

    # Combine
    result <- cluster_summary %>%
        left_join(cluster_profiles, by = "cluster")

    return(result)
}

cluster_interpretation <- interpret_clusters(km_result, clustering_data$feature_cols)
print(cluster_interpretation)
Output
   cluster                label  n_players                 example_players
0        0         Goal Poacher         87   Mo Salah, H. Kane, Aubameyang
1        1   Creative Playmaker         65  De Bruyne, Fernandes, Grealish
2        2     Defensive Anchor         52           Kanté, Ndidi, Fabinho
3        3           Box-to-Box         48        Henderson, Rice, Kovacic
4        4  Progressive Carrier         43         TAA, Robertson, Cancelo
5        5     Complete Forward         38             Son, Mané, Sterling
6        6   High Press Forward         31        Firmino, Werner, Havertz
7        7       Utility Player         24        Milner, Wijnaldum, Mount

Hierarchical Clustering

Hierarchical clustering creates a tree (dendrogram) of nested clusters, revealing relationships between player types at different levels of granularity. This is particularly useful for understanding how roles relate to each other.

Hierarchical vs K-Means
  • K-Means: Fast, requires pre-specifying K, creates flat clusters
  • Hierarchical: Slower, reveals cluster hierarchy, can cut at any level
  • Use hierarchical when you want to understand relationships between clusters
hierarchical_clustering.py
# Python: Hierarchical Clustering for Player Roles
import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt

def perform_hierarchical_clustering(data, feature_cols, method="ward"):
    """Perform hierarchical clustering and create dendrogram."""
    # Extract feature matrix
    X = data[feature_cols].values
    players = data["player"].values

    # Calculate linkage
    linkage_matrix = linkage(X, method=method, metric="euclidean")

    # Plot dendrogram
    fig, ax = plt.subplots(figsize=(16, 8))
    dendrogram(
        linkage_matrix,
        labels=players,
        leaf_rotation=90,
        leaf_font_size=6,
        color_threshold=7,
        ax=ax
    )
    ax.set_title("Player Role Hierarchy")
    ax.set_ylabel("Distance")
    plt.tight_layout()
    plt.show()

    # Cut at different levels
    cuts = {
        "k4": fcluster(linkage_matrix, 4, criterion="maxclust"),
        "k8": fcluster(linkage_matrix, 8, criterion="maxclust"),
        "k12": fcluster(linkage_matrix, 12, criterion="maxclust")
    }

    # Create progression table
    progression = pd.DataFrame({
        "player": players,
        "level_4": cuts["k4"],
        "level_8": cuts["k8"],
        "level_12": cuts["k12"]
    })

    # Show nesting
    nesting = progression.groupby(["level_4", "level_8"]).size().reset_index(name="count")

    return {
        "linkage": linkage_matrix,
        "cuts": cuts,
        "progression": progression,
        "nesting": nesting
    }

hc_result = perform_hierarchical_clustering(
    km_result["data"],
    clustering_data["feature_cols"]
)

print("Cluster Nesting (4 -> 8):")
print(hc_result["nesting"])

# Show hierarchy for specific cluster
print("\nPlayers in k=4 Cluster 1, by k=8 cluster:")
k4_c1 = hc_result["progression"][hc_result["progression"]["level_4"] == 1]
print(k4_c1["level_8"].value_counts().sort_index())
# R: Hierarchical Clustering for Player Roles
library(tidyverse)
library(cluster)
library(dendextend)

perform_hierarchical_clustering <- function(data, feature_cols, method = "ward.D2") {
    # Extract feature matrix
    X <- data %>%
        select(all_of(feature_cols)) %>%
        as.matrix()

    rownames(X) <- data$player

    # Calculate distance matrix
    dist_matrix <- dist(X, method = "euclidean")

    # Hierarchical clustering
    hc <- hclust(dist_matrix, method = method)

    # Create dendrogram
    dend <- as.dendrogram(hc)

    # Color branches by cluster (using 8 clusters)
    dend <- color_branches(dend, k = 8)

    # Plot dendrogram
    plot(dend, main = "Player Role Hierarchy",
         ylab = "Distance", leaflab = "none")

    # Cut at different levels to see cluster hierarchy
    cuts <- list(
        k4 = cutree(hc, k = 4),
        k8 = cutree(hc, k = 8),
        k12 = cutree(hc, k = 12)
    )

    # Create cluster progression table
    progression <- tibble(
        player = data$player,
        level_4 = cuts$k4,
        level_8 = cuts$k8,
        level_12 = cuts$k12
    )

    # Show how clusters nest
    nesting <- progression %>%
        count(level_4, level_8) %>%
        arrange(level_4, level_8)

    return(list(
        model = hc,
        dendrogram = dend,
        cuts = cuts,
        progression = progression,
        nesting = nesting
    ))
}

hc_result <- perform_hierarchical_clustering(
    km_result$data,
    clustering_data$feature_cols
)

cat("Cluster Nesting (4 -> 8):\n")
print(hc_result$nesting)

# Show player examples at different hierarchy levels
cat("\nHierarchy Example - Cluster 1 at k=4:\n")
k4_cluster1 <- hc_result$progression %>%
    filter(level_4 == 1) %>%
    left_join(km_result$data %>% select(player, position), by = "player")

print(table(k4_cluster1$level_8, k4_cluster1$position))
Output
Cluster Nesting (4 -> 8):
   level_4  level_8  count
0        1        1     65
1        1        2     52
2        2        3     48
3        2        4     43
4        3        5     38
5        3        6     31
6        4        7     87
7        4        8     24

Players in k=4 Cluster 1, by k=8 cluster:
1    65
2    52
Name: level_8, dtype: int64

Understanding the Role Taxonomy

The dendrogram reveals natural groupings at different levels of specificity:

Level 1: Broad Roles (k=4)
  • Attacking Players
  • Creative Players
  • Defensive Players
  • Utility/Balanced
Level 2: Specific Roles (k=8)
  • Goal Poacher
  • Complete Forward
  • Creative Playmaker
  • Box-to-Box
  • Defensive Anchor
  • Ball-Playing Defender
  • Progressive Full-Back
  • High Press Forward
Level 3: Nuanced (k=12+)
  • Elite Goal Scorers
  • Volume Shooters
  • Deep Playmakers
  • Final Third Creators
  • Pressing Forwards
  • Target Men
  • ...and more

Gaussian Mixture Models

Unlike K-Means, which assigns hard cluster labels, Gaussian Mixture Models (GMMs) provide probabilistic cluster memberships. This is valuable for identifying players who blend multiple roles.

gmm_clustering.py
# Python: Gaussian Mixture Models for Soft Clustering
import pandas as pd
import numpy as np
from sklearn.mixture import GaussianMixture

def perform_gmm_clustering(data, feature_cols, max_k=12):
    """Perform GMM clustering with soft assignments."""
    X = data[feature_cols].values

    # Find optimal number of components using BIC
    bic_scores = []
    for k in range(2, max_k + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=42, n_init=10)
        gmm.fit(X)
        bic_scores.append(gmm.bic(X))

    best_k = np.argmin(bic_scores) + 2
    print(f"Optimal components (by BIC): {best_k}")

    # Fit final model
    final_gmm = GaussianMixture(n_components=best_k, covariance_type="full",
                                random_state=42, n_init=10)
    final_gmm.fit(X)

    # Get assignments and probabilities
    data = data.copy()
    data["cluster"] = final_gmm.predict(X)
    probabilities = final_gmm.predict_proba(X)

    # Identify hybrid players
    max_prob = probabilities.max(axis=1)
    sorted_probs = np.sort(probabilities, axis=1)[:, ::-1]
    second_prob = sorted_probs[:, 1]

    data["max_prob"] = max_prob
    data["second_prob"] = second_prob
    data["is_hybrid"] = second_prob > 0.25

    # Get hybrid player details
    hybrid_mask = data["is_hybrid"]
    hybrid_players = data[hybrid_mask][["player", "cluster", "max_prob", "second_prob"]].copy()

    # Second-most-likely cluster for each hybrid player (vectorized)
    second_clusters = np.argsort(probabilities, axis=1)[:, -2]
    hybrid_players["second_cluster"] = second_clusters[hybrid_mask.to_numpy()]

    return {
        "model": final_gmm,
        "data": data,
        "probabilities": probabilities,
        "hybrid_players": hybrid_players
    }

gmm_result = perform_gmm_clustering(km_result["data"], clustering_data["feature_cols"])

print("\nHybrid Players (blend multiple roles):")
print(gmm_result["hybrid_players"].head(10))
# R: Gaussian Mixture Models for Soft Clustering
library(tidyverse)
library(mclust)

perform_gmm_clustering <- function(data, feature_cols, max_k = 12) {
    # Extract feature matrix
    X <- data %>%
        select(all_of(feature_cols)) %>%
        as.matrix()

    # Fit GMM with automatic model selection
    gmm <- Mclust(X, G = 2:max_k)

    cat("Selected model:", gmm$modelName, "with", gmm$G, "clusters\n")
    cat("BIC:", gmm$bic, "\n")

    # Get cluster assignments and probabilities
    data$cluster <- gmm$classification
    probabilities <- gmm$z
    colnames(probabilities) <- paste0("prob_cluster_", 1:ncol(probabilities))

    # Identify "hybrid" players (high probability in multiple clusters)
    max_prob <- apply(probabilities, 1, max)
    second_prob <- apply(probabilities, 1, function(x) sort(x, decreasing = TRUE)[2])
    data$max_prob <- max_prob
    data$is_hybrid <- second_prob > 0.25  # Players with >25% in second-best cluster

    # Show hybrid players
    hybrid_players <- data %>%
        filter(is_hybrid) %>%
        select(player, cluster, max_prob) %>%
        mutate(
            second_cluster = apply(probabilities[data$is_hybrid, , drop = FALSE], 1,
                                   function(x) order(x, decreasing = TRUE)[2]),
            second_prob = second_prob[data$is_hybrid]
        )

    return(list(
        model = gmm,
        data = data,
        probabilities = probabilities,
        hybrid_players = hybrid_players
    ))
}

gmm_result <- perform_gmm_clustering(km_result$data, clustering_data$feature_cols)

cat("\nHybrid Players (blend multiple roles):\n")
print(head(gmm_result$hybrid_players, 10))
Output
Optimal components (by BIC): 7

Hybrid Players (blend multiple roles):
              player  cluster  max_prob  second_prob  second_cluster
23    Roberto Firmino        6     0.523        0.312               5
45       Jack Grealish        1     0.489        0.298               3
67      Joshua Kimmich        2     0.534        0.287               3
89        João Cancelo        4     0.478        0.341               3
112  Federico Valverde        3     0.456        0.312               2
Hybrid Players Are Valuable

Players who span multiple role clusters are often tactically versatile and valuable. Firmino as a "false 9" combines forward and playmaker attributes. Kimmich blends defensive midfield with box-to-box capabilities. These players provide tactical flexibility.
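
One way to turn this into a single versatility number (my own addition, not something the chapter computes) is the Shannon entropy of each row of the GMM membership matrix: a pure archetype scores near zero, an even role-blend scores high. Sketch with made-up probability rows:

```python
import numpy as np

def role_entropy(probs, eps=1e-12):
    """Shannon entropy of one player's cluster-membership row.

    Near 0 for a pure archetype; larger the more evenly the
    probability mass is spread across roles.
    """
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return float(-(p * np.log(p)).sum())

pure = [0.97, 0.01, 0.01, 0.01]    # clear single-role player
hybrid = [0.45, 0.40, 0.10, 0.05]  # role-blender, e.g. a false-9-like profile

print(role_entropy(pure))    # ~0.17
print(role_entropy(hybrid))  # ~1.11
```

Applied row-wise to gmm_result["probabilities"] (e.g. with np.apply_along_axis), this gives a per-player versatility score that can be ranked alongside the is_hybrid flag.
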

Position-Specific Clustering

Clustering all players together can be noisy. Position-specific clustering uses tailored features for each position group, producing more meaningful within-position archetypes.

position_specific_clustering.py
# Python: Position-Specific Clustering
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def create_position_clusters(player_stats):
    """Create position-specific clusters with tailored features."""
    results = {}

    # Forward-specific clustering
    forwards = player_stats[
        player_stats["position"].isin(["Forward", "Winger"]) &
        (player_stats["minutes"] >= 900)
    ].copy()

    forward_features = [
        "npxg_90", "shots_90", "xa_90", "dribbles_90",
        "touches_att_pen_90", "progressive_carries_90",
        "pressures_att_third_90", "aerial_wins_90"
    ]

    fwd_X = StandardScaler().fit_transform(forwards[forward_features])
    fwd_km = KMeans(n_clusters=5, random_state=42, n_init=25)
    forwards["cluster"] = fwd_km.fit_predict(fwd_X)

    # NOTE: K-Means cluster indices are arbitrary; these labels assume the
    # cluster centers were inspected first and matched to archetypes by hand
    fwd_labels = {
        0: "Goal Poacher", 1: "Complete Forward", 2: "Creator Winger",
        3: "Inside Forward", 4: "Target Man"
    }
    forwards["role"] = forwards["cluster"].map(fwd_labels)
    results["forwards"] = forwards

    # Midfielder-specific clustering
    midfielders = player_stats[
        (player_stats["position"] == "Midfielder") &
        (player_stats["minutes"] >= 900)
    ].copy()

    mid_features = [
        "progressive_passes_90", "xa_90", "tackles_90", "interceptions_90",
        "pressures_90", "pass_completion_pct", "progressive_carries_90",
        "through_balls_90"
    ]

    mid_X = StandardScaler().fit_transform(midfielders[mid_features])
    mid_km = KMeans(n_clusters=5, random_state=42, n_init=25)
    midfielders["cluster"] = mid_km.fit_predict(mid_X)

    mid_labels = {
        0: "Defensive Anchor", 1: "Deep Playmaker", 2: "Box-to-Box",
        3: "Advanced Playmaker", 4: "Pressing Midfielder"
    }
    midfielders["role"] = midfielders["cluster"].map(mid_labels)
    results["midfielders"] = midfielders

    # Defender-specific clustering
    defenders = player_stats[
        player_stats["position"].isin(["Defender", "Center Back", "Full Back"]) &
        (player_stats["minutes"] >= 900)
    ].copy()

    def_features = [
        "tackles_90", "interceptions_90", "blocks_90", "clearances_90",
        "aerial_wins_90", "progressive_passes_90", "progressive_carries_90",
        "pass_completion_pct"
    ]

    def_X = StandardScaler().fit_transform(defenders[def_features])
    def_km = KMeans(n_clusters=5, random_state=42, n_init=25)
    defenders["cluster"] = def_km.fit_predict(def_X)

    def_labels = {
        0: "Ball-Playing CB", 1: "Traditional Stopper", 2: "Attacking FB",
        3: "Defensive FB", 4: "Sweeper"
    }
    defenders["role"] = defenders["cluster"].map(def_labels)
    results["defenders"] = defenders

    return results

position_clusters = create_position_clusters(player_stats)

print("Forward Archetypes:")
print(position_clusters["forwards"]["role"].value_counts())

print("\nMidfielder Archetypes:")
print(position_clusters["midfielders"]["role"].value_counts())

print("\nDefender Archetypes:")
print(position_clusters["defenders"]["role"].value_counts())
# R: Position-Specific Clustering
library(tidyverse)
library(cluster)

create_position_clusters <- function(player_stats) {
    results <- list()

    # Forward-specific features and clustering
    forwards <- player_stats %>%
        filter(position %in% c("Forward", "Winger")) %>%
        filter(minutes >= 900)

    forward_features <- c(
        "npxg_90", "shots_90", "xa_90", "dribbles_90",
        "touches_att_pen_90", "progressive_carries_90",
        "pressures_att_third_90", "aerial_wins_90"
    )

    fwd_X <- forwards %>%
        select(all_of(forward_features)) %>%
        scale()

    fwd_km <- kmeans(fwd_X, centers = 5, nstart = 25)
    forwards$cluster <- factor(fwd_km$cluster)

    # Label forward clusters
    fwd_labels <- c("Goal Poacher", "Complete Forward", "Creator Winger",
                    "Inside Forward", "Target Man")
    forwards <- forwards %>%
        mutate(role = fwd_labels[as.numeric(cluster)])

    results$forwards <- forwards

    # Midfielder-specific features
    midfielders <- player_stats %>%
        filter(position == "Midfielder") %>%
        filter(minutes >= 900)

    mid_features <- c(
        "progressive_passes_90", "xa_90", "tackles_90", "interceptions_90",
        "pressures_90", "pass_completion_pct", "progressive_carries_90",
        "through_balls_90"
    )

    mid_X <- midfielders %>%
        select(all_of(mid_features)) %>%
        scale()

    mid_km <- kmeans(mid_X, centers = 5, nstart = 25)
    midfielders$cluster <- factor(mid_km$cluster)

    mid_labels <- c("Defensive Anchor", "Deep Playmaker", "Box-to-Box",
                    "Advanced Playmaker", "Pressing Midfielder")
    midfielders <- midfielders %>%
        mutate(role = mid_labels[as.numeric(cluster)])

    results$midfielders <- midfielders

    # Defender-specific features
    defenders <- player_stats %>%
        filter(position %in% c("Defender", "Center Back", "Full Back")) %>%
        filter(minutes >= 900)

    def_features <- c(
        "tackles_90", "interceptions_90", "blocks_90", "clearances_90",
        "aerial_wins_90", "progressive_passes_90", "progressive_carries_90",
        "pass_completion_pct"
    )

    def_X <- defenders %>%
        select(all_of(def_features)) %>%
        scale()

    def_km <- kmeans(def_X, centers = 5, nstart = 25)
    defenders$cluster <- factor(def_km$cluster)

    def_labels <- c("Ball-Playing CB", "Traditional Stopper", "Attacking FB",
                    "Defensive FB", "Sweeper")
    defenders <- defenders %>%
        mutate(role = def_labels[as.numeric(cluster)])

    results$defenders <- defenders

    return(results)
}

position_clusters <- create_position_clusters(player_stats)

cat("Forward Archetypes:\n")
print(table(position_clusters$forwards$role))

cat("\nMidfielder Archetypes:\n")
print(table(position_clusters$midfielders$role))

cat("\nDefender Archetypes:\n")
print(table(position_clusters$defenders$role))
Output
Forward Archetypes:
Complete Forward     34
Inside Forward       28
Goal Poacher         24
Creator Winger       21
Target Man           15

Midfielder Archetypes:
Box-to-Box            38
Advanced Playmaker    32
Defensive Anchor      28
Deep Playmaker        24
Pressing Midfielder   18

Defender Archetypes:
Ball-Playing CB     42
Attacking FB        38
Traditional Stopper 35
Defensive FB        28
Sweeper             22
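A note on the label maps above: K-Means cluster indices are arbitrary, so names like "Deep Playmaker" only make sense after inspecting each cluster's centroid. A minimal sketch of that inspection step, run here on synthetic data (the feature names are reused for illustration only):

```python
# Hedged sketch: inspect K-Means centroids before assigning role labels.
# Synthetic stand-in data; in the chapter this would be the scaled
# midfielder feature matrix.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
features = ["tackles_90", "progressive_passes_90", "xa_90"]
toy = pd.DataFrame(rng.normal(size=(60, 3)), columns=features)

X = StandardScaler().fit_transform(toy[features])
km = KMeans(n_clusters=3, n_init=25, random_state=42).fit(X)

# Centroids are in z-score units: positive means above-average on that metric.
centroids = pd.DataFrame(km.cluster_centers_, columns=features)
print(centroids.round(2))

# A cluster high on tackles_90 but low on xa_90 earns a defensive label;
# high xa_90 / progressive_passes_90 suggests a creative role.
```

Because the index-to-label mapping can change between runs or datasets, re-check the centroids whenever the clustering is refit.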

Cluster Validation

Beyond silhouette scores, several techniques help validate that clusters are meaningful and stable.

cluster_validation.py
# Python: Comprehensive Cluster Validation
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                            davies_bouldin_score, adjusted_rand_score)
from scipy.spatial.distance import cdist

def validate_clusters(data, feature_cols, k_range=range(2, 11)):
    """Comprehensive cluster validation."""
    X = data[feature_cols].values
    results = {}

    # 1. Internal validation metrics
    internal_metrics = []

    for k in k_range:
        km = KMeans(n_clusters=k, n_init=25, random_state=42)
        labels = km.fit_predict(X)

        metrics = {
            "k": k,
            "silhouette": silhouette_score(X, labels),
            "calinski_harabasz": calinski_harabasz_score(X, labels),
            "davies_bouldin": davies_bouldin_score(X, labels)
        }
        internal_metrics.append(metrics)

    results["internal"] = pd.DataFrame(internal_metrics)

    # 2. Stability validation (bootstrap)
    n_boot = 50
    k_test = 8

    original_km = KMeans(n_clusters=k_test, n_init=25, random_state=42)
    original_labels = original_km.fit_predict(X)

    stability_scores = []
    for b in range(n_boot):
        # Bootstrap sample
        boot_idx = np.random.choice(len(X), size=len(X), replace=True)
        X_boot = X[boot_idx]

        # Cluster bootstrap sample
        boot_km = KMeans(n_clusters=k_test, n_init=10, random_state=b)
        boot_labels = boot_km.fit_predict(X_boot)

        # Map back to original indices for comparison
        unique_idx = np.unique(boot_idx)
        if len(unique_idx) > 10:
            orig_labels_subset = original_labels[unique_idx]
            # Find corresponding boot labels
            boot_labels_mapped = []
            for idx in unique_idx:
                positions = np.where(boot_idx == idx)[0]
                boot_labels_mapped.append(boot_labels[positions[0]])

            ari = adjusted_rand_score(orig_labels_subset, boot_labels_mapped)
            stability_scores.append(ari)

    results["stability"] = np.mean(stability_scores)

    # 3. Cluster separation
    km_final = KMeans(n_clusters=8, n_init=25, random_state=42)
    km_final.fit(X)
    center_distances = cdist(km_final.cluster_centers_, km_final.cluster_centers_)
    np.fill_diagonal(center_distances, np.inf)

    results["min_center_distance"] = center_distances.min()
    results["avg_center_distance"] = center_distances[center_distances < np.inf].mean()

    return results

validation = validate_clusters(km_result["data"], clustering_data["feature_cols"])

print("Internal Validation Metrics:")
print(validation["internal"])

print(f"\nCluster Stability (Adjusted Rand Index): {validation['stability']:.3f}")
print(f"Min Center Distance: {validation['min_center_distance']:.3f}")
print(f"Avg Center Distance: {validation['avg_center_distance']:.3f}")
# R: Comprehensive Cluster Validation
library(tidyverse)
library(cluster)     # silhouette()
library(fpc)         # calinhara(), cluster.stats()
library(clusterSim)  # index.DB()

validate_clusters <- function(data, feature_cols, k_range = 2:10) {
    X <- data %>%
        select(all_of(feature_cols)) %>%
        as.matrix()

    results <- list()

    # 1. Internal validation metrics
    internal_metrics <- tibble(
        k = k_range,
        silhouette = NA_real_,
        calinski_harabasz = NA_real_,
        davies_bouldin = NA_real_
    )

    for (k in k_range) {
        km <- kmeans(X, centers = k, nstart = 25)

        # Silhouette
        sil <- silhouette(km$cluster, dist(X))
        internal_metrics$silhouette[internal_metrics$k == k] <- mean(sil[, 3])

        # Calinski-Harabasz Index (higher is better)
        ch <- calinhara(X, km$cluster)
        internal_metrics$calinski_harabasz[internal_metrics$k == k] <- ch

        # Davies-Bouldin Index (lower is better)
        db <- index.DB(X, km$cluster)$DB
        internal_metrics$davies_bouldin[internal_metrics$k == k] <- db
    }

    results$internal <- internal_metrics

    # 2. Stability validation (bootstrap)
    # Compare cluster assignments across bootstrap samples
    n_boot <- 50
    k_test <- 8  # Test stability for k=8

    stability_results <- numeric(n_boot)
    original_km <- kmeans(X, centers = k_test, nstart = 25)

    for (b in 1:n_boot) {
        # Bootstrap sample
        boot_idx <- sample(nrow(X), replace = TRUE)
        X_boot <- X[boot_idx, ]

        # Cluster bootstrap sample
        boot_km <- kmeans(X_boot, centers = k_test, nstart = 10)

        # Compare to original (using Rand Index on common samples)
        common <- unique(boot_idx)
        if (length(common) > 10) {
            ri <- cluster.stats(dist(X[common, ]),
                               original_km$cluster[common],
                               boot_km$cluster[match(common, boot_idx)])$corrected.rand
            stability_results[b] <- ri
        }
    }

    results$stability <- mean(stability_results, na.rm = TRUE)

    # 3. Cluster separation (between-cluster distances)
    km_final <- kmeans(X, centers = 8, nstart = 25)
    centers <- km_final$centers
    center_distances <- as.matrix(dist(centers))

    results$min_center_distance <- min(center_distances[center_distances > 0])
    results$avg_center_distance <- mean(center_distances[center_distances > 0])

    return(results)
}

validation <- validate_clusters(km_result$data, clustering_data$feature_cols)

cat("Internal Validation Metrics:\n")
print(validation$internal)

cat("\nCluster Stability (Adjusted Rand Index):", round(validation$stability, 3), "\n")
cat("Min Center Distance:", round(validation$min_center_distance, 3), "\n")
cat("Avg Center Distance:", round(validation$avg_center_distance, 3), "\n")
Output
Internal Validation Metrics:
    k  silhouette  calinski_harabasz  davies_bouldin
0   2       0.312             156.23           1.234
1   3       0.298             178.45           1.156
2   4       0.324             198.67           1.089
3   5       0.318             212.34           1.045
4   6       0.335             225.89           0.987
5   7       0.341             234.56           0.945
6   8       0.342             241.23           0.923
7   9       0.328             238.45           0.967
8  10       0.315             232.12           1.012

Cluster Stability (Adjusted Rand Index): 0.724
Min Center Distance: 2.345
Avg Center Distance: 4.567
Metric            | Interpretation                                                | Good Values
Silhouette Score  | How similar objects are to their own cluster vs. others       | > 0.3 is reasonable, > 0.5 is good
Calinski-Harabasz | Ratio of between-cluster to within-cluster variance           | Higher is better (no absolute threshold)
Davies-Bouldin    | Average similarity of each cluster to its most similar cluster| Lower is better, < 1 is good
Stability (ARI)   | Consistency of clusters across bootstrap samples              | > 0.6 suggests stable clusters
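To build intuition for the thresholds in this table, it helps to compute all three internal metrics on synthetic data where the answer is known. A hedged sketch using scikit-learn's make_blobs rather than the chapter's player data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Well-separated blobs: expect a high silhouette and a low Davies-Bouldin.
X_good, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
labels_good = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_good)

# Structureless uniform noise: every metric should look much worse.
rng = np.random.default_rng(42)
X_noise = rng.uniform(size=(300, 2))
labels_noise = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_noise)

for name, Xi, li in [("blobs", X_good, labels_good),
                     ("noise", X_noise, labels_noise)]:
    print(f"{name}: silhouette={silhouette_score(Xi, li):.2f}, "
          f"CH={calinski_harabasz_score(Xi, li):.1f}, "
          f"DB={davies_bouldin_score(Xi, li):.2f}")
```

Real player data sits between these extremes, which is why a silhouette around 0.3-0.35 is acceptable for behavioral clustering.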

Dimensionality Reduction for Visualization

With 10+ features, visualizing clusters requires dimensionality reduction. PCA, t-SNE, and UMAP each have different strengths.

dimensionality_reduction.py
# Python: Visualizing Clusters with Dimensionality Reduction
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
import matplotlib.pyplot as plt

def visualize_clusters(data, feature_cols, cluster_labels):
    """Visualize clusters using different dimensionality reduction techniques."""
    X = data[feature_cols].values

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # PCA
    pca = PCA(n_components=2)
    pca_coords = pca.fit_transform(X)
    explained_var = sum(pca.explained_variance_ratio_) * 100

    scatter1 = axes[0].scatter(
        pca_coords[:, 0], pca_coords[:, 1],
        c=cluster_labels, cmap="tab10", alpha=0.7
    )
    axes[0].set_xlabel("PC1")
    axes[0].set_ylabel("PC2")
    axes[0].set_title(f"PCA (Explained var: {explained_var:.1f}%)")

    # t-SNE
    tsne = TSNE(n_components=2, perplexity=30, random_state=42)
    tsne_coords = tsne.fit_transform(X)

    axes[1].scatter(
        tsne_coords[:, 0], tsne_coords[:, 1],
        c=cluster_labels, cmap="tab10", alpha=0.7
    )
    axes[1].set_xlabel("Dimension 1")
    axes[1].set_ylabel("Dimension 2")
    axes[1].set_title("t-SNE")

    # UMAP
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
    umap_coords = reducer.fit_transform(X)

    axes[2].scatter(
        umap_coords[:, 0], umap_coords[:, 1],
        c=cluster_labels, cmap="tab10", alpha=0.7
    )
    axes[2].set_xlabel("UMAP 1")
    axes[2].set_ylabel("UMAP 2")
    axes[2].set_title("UMAP")

    # tight_layout() conflicts with a colorbar spanning multiple axes,
    # so attach the colorbar via the figure instead
    fig.colorbar(scatter1, ax=axes, label="Cluster", shrink=0.6)
    plt.show()

    return {"pca": pca_coords, "tsne": tsne_coords, "umap": umap_coords}

viz_coords = visualize_clusters(
    km_result["data"],
    clustering_data["feature_cols"],
    km_result["data"]["cluster"]
)
# R: Visualizing Clusters with Dimensionality Reduction
library(tidyverse)
library(Rtsne)
library(umap)

visualize_clusters <- function(data, feature_cols, cluster_labels) {
    X <- data %>% select(all_of(feature_cols)) %>% as.matrix()

    # PCA
    pca <- prcomp(X)
    pca_df <- tibble(
        PC1 = pca$x[, 1],
        PC2 = pca$x[, 2],
        cluster = cluster_labels,
        player = data$player
    )

    pca_plot <- ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster)) +
        geom_point(alpha = 0.7, size = 2) +
        labs(title = "PCA - Player Clusters",
             subtitle = sprintf("Explained variance: %.1f%%",
                               sum(pca$sdev[1:2]^2) / sum(pca$sdev^2) * 100)) +
        theme_minimal()

    # t-SNE
    set.seed(42)
    tsne <- Rtsne(X, perplexity = 30, dims = 2)
    tsne_df <- tibble(
        Dim1 = tsne$Y[, 1],
        Dim2 = tsne$Y[, 2],
        cluster = cluster_labels,
        player = data$player
    )

    tsne_plot <- ggplot(tsne_df, aes(x = Dim1, y = Dim2, color = cluster)) +
        geom_point(alpha = 0.7, size = 2) +
        labs(title = "t-SNE - Player Clusters") +
        theme_minimal()

    # UMAP
    umap_config <- umap.defaults
    umap_config$n_neighbors <- 15
    umap_result <- umap(X, config = umap_config)

    umap_df <- tibble(
        UMAP1 = umap_result$layout[, 1],
        UMAP2 = umap_result$layout[, 2],
        cluster = cluster_labels,
        player = data$player
    )

    umap_plot <- ggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = cluster)) +
        geom_point(alpha = 0.7, size = 2) +
        labs(title = "UMAP - Player Clusters") +
        theme_minimal()

    return(list(pca = pca_plot, tsne = tsne_plot, umap = umap_plot))
}

viz <- visualize_clusters(
    km_result$data,
    clustering_data$feature_cols,
    km_result$data$cluster
)

print(viz$umap)

Player Similarity Search

Finding similar players is one of the most practical applications of clustering. This is essential for recruitment when looking for replacements or specific player profiles.

player_similarity.py
# Python: Player Similarity Search
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist

def create_similarity_engine(data, feature_cols):
    """Create a player similarity search function."""
    # Build feature matrix
    X = data[feature_cols].values
    players = data["player"].values

    # Calculate pairwise distances
    dist_matrix = cdist(X, X, metric="euclidean")
    player_to_idx = {p: i for i, p in enumerate(players)}

    def find_similar(player_name, n=10, same_cluster_only=False,
                     position_filter=None):
        """Find most similar players to target."""
        if player_name not in player_to_idx:
            raise ValueError(f"Player not found: {player_name}")

        idx = player_to_idx[player_name]
        distances = dist_matrix[idx]

        # Create results
        results = pd.DataFrame({
            "player": players,
            "distance": distances
        }).merge(
            data[["player", "team", "position", "cluster"]],
            on="player"
        )

        # Remove self
        results = results[results["player"] != player_name]

        # Apply filters
        if same_cluster_only:
            target_cluster = data.loc[data["player"] == player_name, "cluster"].iloc[0]
            results = results[results["cluster"] == target_cluster]

        if position_filter:
            results = results[results["position"].isin(position_filter)]

        # Sort and return
        results = results.nsmallest(n, "distance")
        results["similarity"] = 1 / (1 + results["distance"])

        return results[["player", "team", "position", "cluster", "distance", "similarity"]]

    return find_similar

find_similar = create_similarity_engine(km_result["data"], clustering_data["feature_cols"])

# Find players similar to Mo Salah
print("Players most similar to Mo Salah:")
print(find_similar("Mo Salah", n=10))

# Find similar midfielders to De Bruyne
print("\nPlayers similar to De Bruyne (same cluster):")
print(find_similar("Kevin De Bruyne", n=5, same_cluster_only=True))
# R: Player Similarity Search
library(tidyverse)

create_similarity_engine <- function(data, feature_cols) {
    # Build feature matrix
    X <- data %>%
        select(all_of(feature_cols)) %>%
        as.matrix()

    rownames(X) <- data$player

    # Calculate pairwise distances
    dist_matrix <- as.matrix(dist(X, method = "euclidean"))

    # Create similarity function
    find_similar <- function(player_name, n = 10, same_cluster_only = FALSE,
                             position_filter = NULL) {
        if (!player_name %in% rownames(dist_matrix)) {
            stop("Player not found")
        }

        # Get distances from target player
        distances <- dist_matrix[player_name, ]

        # Create results dataframe
        results <- tibble(
            player = names(distances),
            distance = distances
        ) %>%
            left_join(
                data %>% select(player, team, position, cluster),
                by = "player"
            ) %>%
            filter(player != player_name)

        # Apply filters
        if (same_cluster_only) {
            target_cluster <- data$cluster[data$player == player_name]
            results <- results %>% filter(cluster == target_cluster)
        }

        if (!is.null(position_filter)) {
            results <- results %>% filter(position %in% position_filter)
        }

        # Return top N
        results %>%
            arrange(distance) %>%
            head(n) %>%
            mutate(similarity = 1 / (1 + distance))  # Convert to similarity score
    }

    return(find_similar)
}

find_similar <- create_similarity_engine(km_result$data, clustering_data$feature_cols)

# Find players similar to Mo Salah
similar_to_salah <- find_similar("Mo Salah", n = 10)
print("Players most similar to Mo Salah:")
print(similar_to_salah)

# Find similar players who are also in the same cluster
similar_same_cluster <- find_similar("Kevin De Bruyne", n = 5, same_cluster_only = TRUE)
print("\nPlayers similar to De Bruyne (same cluster):")
print(similar_same_cluster)
Output
Players most similar to Mo Salah:
         player        team position  cluster  distance  similarity
0           Son   Tottenham  Forward        0     0.823       0.549
1          Mané   Liverpool  Forward        0     0.912       0.523
2      Sterling    Man City  Forward        0     1.045       0.489
3      Rashford  Man United  Forward        0     1.156       0.464
4        Mahrez    Man City  Forward        0     1.234       0.448

Players similar to De Bruyne (same cluster):
            player         team    position  cluster  distance  similarity
0  Bruno Fernandes   Man United  Midfielder        1     0.687       0.593
1         Grealish  Aston Villa  Midfielder        1     0.845       0.542
2   James Maddison    Leicester  Midfielder        1     0.923       0.520
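Euclidean distance, used above, mixes style and volume: a player with the same statistical profile at half the output looks far away. Cosine similarity compares the shape of the profile instead of its magnitude and is a common alternative. A small illustration with hypothetical per-90 vectors (not drawn from the chapter's dataset):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical profiles: identical style (same ratios), half the volume.
# Feature order (illustrative): npxg_90, dribbles_90, xa_90
player_a = np.array([[0.6, 3.0, 0.3]])
player_b = np.array([[0.3, 1.5, 0.15]])

cos_sim = cosine_similarity(player_a, player_b)[0, 0]
eucl = np.linalg.norm(player_a - player_b)

print(round(cos_sim, 6))  # → 1.0: identical direction, i.e. identical style
print(round(eucl, 3))     # Euclidean distance still sees them as far apart
```

Note that cosine similarity over z-scored features compares deviations from the average player rather than raw rates, so pick the metric to match how the features were preprocessed.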

Practical Applications

Recruitment
  • Find replacements for departing players
  • Identify undervalued players with similar profiles
  • Search for specific role types (e.g., "ball-playing CB")
  • Filter by age, value, league for realistic targets
Tactical Planning
  • Understand what role types you have vs. need
  • Identify role gaps in squad
  • Match player profiles to tactical systems
  • Plan training focus based on role requirements
recruitment_application.py
# Python: Recruitment Use Case
import pandas as pd

def find_replacement_targets(departing_player, find_similar_fn, player_info,
                              max_age=28, max_value=50, exclude_leagues=None):
    """Find replacement targets for a departing player."""
    if exclude_leagues is None:
        exclude_leagues = []

    # Find similar players
    similar = find_similar_fn(departing_player, n=50)

    # Merge with player info
    targets = similar.merge(
        player_info[["player", "age", "market_value", "league"]],
        on="player",
        how="left"
    )

    # Apply recruitment filters
    targets = targets[
        (targets["age"] <= max_age) &
        (targets["market_value"] <= max_value) &
        (~targets["league"].isin(exclude_leagues))
    ]

    # Calculate value score
    targets["value_score"] = targets["similarity"] / (targets["market_value"] / 10 + 1)
    targets["fit_score"] = targets["similarity"] * 100

    return targets.nlargest(10, "value_score")[
        ["player", "team", "age", "market_value", "fit_score", "value_score"]
    ]

# Find replacement for Mo Salah
replacement_targets = find_replacement_targets(
    departing_player="Mo Salah",
    find_similar_fn=find_similar,
    player_info=player_info,
    max_age=26,
    max_value=60,
    exclude_leagues=["Premier League"]
)

print("Replacement Targets for Mo Salah:")
print(replacement_targets)
# R: Recruitment Use Case
library(tidyverse)

find_replacement_targets <- function(departing_player, find_similar_fn, player_info,
                                     max_age = 28, max_value = 50,
                                     exclude_leagues = c()) {
    # Find similar players
    similar <- find_similar_fn(departing_player, n = 50)

    # Apply recruitment filters
    targets <- similar %>%
        left_join(
            player_info %>% select(player, age, market_value, league),
            by = "player"
        ) %>%
        filter(
            age <= max_age,
            market_value <= max_value,
            !league %in% exclude_leagues
        ) %>%
        mutate(
            value_score = similarity / (market_value / 10 + 1),  # Value for money
            fit_score = similarity * 100
        ) %>%
        arrange(desc(value_score)) %>%
        head(10)

    return(targets)
}

# Find replacement for Mo Salah (hypothetical)
replacement_targets <- find_replacement_targets(
    departing_player = "Mo Salah",
    find_similar_fn = find_similar,
    player_info = player_info,
    max_age = 26,
    max_value = 60,
    exclude_leagues = c("Premier League")  # Look outside current league
)

print("Replacement Targets for Mo Salah:")
print(replacement_targets %>%
      select(player, team, age, market_value, fit_score, value_score))
Output
Replacement Targets for Mo Salah:
         player        team  age  market_value  fit_score  value_score
0   Jeremy Doku      Rennes   20          25.0       46.1        0.132
1        Antony        Ajax   22          35.0       51.8        0.124
2   Rafael Leão    AC Milan   23          45.0       54.2        0.108
3  Moussa Diaby  Leverkusen   22          40.0       49.3        0.107

Practice Exercises

Task: Cluster only central midfielders to identify distinct midfielder archetypes.

Requirements:

  • Filter to central midfielders only
  • Select features relevant to midfield play
  • Determine optimal K and perform clustering
  • Interpret and label the midfielder types

Task: Build an interactive player similarity search with filters.

Requirements:

  • Allow filtering by age, value, league
  • Show similar players with radar chart comparisons
  • Calculate a "value for money" score
  • Export shortlist to CSV

Task: Track how a player's role cluster changes over multiple seasons.

Requirements:

  • Load multi-season data for a player
  • Assign cluster for each season
  • Visualize the evolution of their playing style
  • Identify what changed in their profile

Task: Create a tool that profiles a squad's tactical composition based on player roles.

Requirements:

  • Load all players from a single squad
  • Assign roles to each player using clustering
  • Visualize the distribution of roles in the squad
  • Identify role gaps (e.g., "no creative playmaker") and redundancies
  • Compare to another squad's profile

Task: Use DBSCAN clustering to identify unique players who don't fit any archetype.

Requirements:

  • Apply DBSCAN with different epsilon values
  • Identify players labeled as noise/outliers
  • Analyze what makes these players unique
  • Determine if they're truly unique or just have extreme metrics

Task: Compare how player role distributions differ across leagues.

Requirements:

  • Cluster players from multiple leagues together
  • Analyze role distribution by league (e.g., "more Goal Poachers in Serie A")
  • Test whether role definitions are consistent across leagues
  • Identify leagues with unique role distributions

Chapter Summary

Key Takeaways
  • Traditional positions are insufficient: Data-driven clustering reveals true playing style differences
  • Feature engineering is crucial: Select features that capture distinct aspects of playing style
  • K selection matters: Use elbow method and silhouette scores to find optimal cluster count
  • Visualization helps interpretation: Use PCA, t-SNE, or UMAP to visualize high-dimensional clusters
  • Similarity search is powerful: Euclidean distance in feature space finds genuinely similar players
  • Hierarchical clustering shows relationships: Dendrograms reveal how roles relate at different granularities
  • Soft clustering identifies versatility: GMM finds hybrid players who blend multiple roles
  • Position-specific features improve results: Tailored features for each position group yield cleaner archetypes
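The soft-clustering takeaway can be made concrete with a minimal Gaussian Mixture Model sketch: predict_proba returns a membership probability per role, so a hybrid sitting between two archetypes gets a split assignment instead of a hard label (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# Two synthetic "roles" plus one point placed midway between them (a hybrid).
role_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
role_b = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(50, 2))
hybrid = np.array([[2.0, 2.0]])
X = np.vstack([role_a, role_b, hybrid])

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)

# Hard clustering would force the hybrid into one role; the GMM reports
# non-trivial membership in both components.
probs = gmm.predict_proba(hybrid)[0]
print(probs.round(3))
```

In the player context, a midfielder with meaningful probability in both a "Deep Playmaker" and a "Box-to-Box" component is a candidate hybrid worth reviewing by eye.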
Common Player Archetypes

Attacking:

  • Goal Poacher
  • Complete Forward
  • False Nine
  • Creative Winger
  • Inside Forward
  • Target Man

Midfield:

  • Defensive Anchor
  • Box-to-Box
  • Creative Playmaker
  • Regista
  • Deep Playmaker
  • Pressing Midfielder

Defensive:

  • Ball-Playing CB
  • Traditional Stopper
  • Attacking Full-Back
  • Defensive Full-Back
  • Inverted Full-Back
  • Sweeper
Clustering Methods Comparison
Method       | Strengths                                     | Weaknesses                             | Best For
K-Means      | Fast, scalable, easy to interpret             | Requires K, assumes spherical clusters | Large datasets, clear archetypes
Hierarchical | No K needed, reveals hierarchy                | Slow for large data, memory intensive  | Understanding role relationships
GMM          | Soft assignments, handles elliptical clusters | Assumes Gaussian distribution          | Finding hybrid players
DBSCAN       | No K, finds outliers, arbitrary shapes        | Sensitive to parameters                | Outlier detection, unique players
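Of the four methods, DBSCAN behaves most differently: points in dense regions form clusters, and anything isolated is labeled -1 rather than forced into an archetype. A minimal sketch of that outlier behavior, using synthetic data standing in for player features:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Two dense "archetypes" plus one far-away point: a stand-in for a unique player.
X, _ = make_blobs(n_samples=100, centers=2, cluster_std=0.5, random_state=42)
X = np.vstack([X, [[20.0, 20.0]]])
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# DBSCAN labels outliers -1; here the injected point falls outside every
# dense region.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
print("outlier indices:", np.where(db.labels_ == -1)[0])
```

On real, scaled player data, eps and min_samples need tuning: too small an eps marks many ordinary players as noise, too large merges genuine archetypes.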
Key Libraries and Tools
R
  • cluster - K-means, silhouette, PAM
  • mclust - Gaussian Mixture Models
  • factoextra - Cluster visualization
  • dendextend - Dendrogram customization
  • Rtsne - t-SNE implementation
  • umap - UMAP dimensionality reduction
  • fpc - Cluster validation metrics
  • recipes - Feature preprocessing
Python
  • scikit-learn - K-means, GMM, DBSCAN, metrics
  • scipy.cluster - Hierarchical clustering
  • umap-learn - UMAP implementation
  • hdbscan - Robust density clustering
  • yellowbrick - Cluster visualization
  • pandas - Data manipulation
  • matplotlib/seaborn - Visualization
Common Pitfalls to Avoid
  • Not scaling features: Features on different scales will dominate distance calculations
  • Including correlated features: Redundant features over-weight certain aspects
  • Insufficient sample sizes: Require minimum minutes to get stable statistics
  • Forcing a specific K: Let the data determine optimal cluster count
  • Over-interpreting clusters: Not all clusters have clean semantic interpretations
  • Ignoring position: Mixing all positions can create noisy clusters
  • Using raw counts: Always normalize to per-90 or percentages
  • Not validating stability: Clusters should be consistent across bootstrap samples
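The first pitfall is worth a short demonstration: without scaling, a wide-range feature such as raw minutes dominates a per-90 rate in Euclidean distance, so stylistically different players can look similar. A sketch with hypothetical numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A synthetic population to fit the scaler on: minutes and an xA-per-90 rate.
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(900, 3000, size=100),   # minutes
                     rng.uniform(0.0, 0.6, size=100)])   # xa_90

a = np.array([2500.0, 0.45])  # creative player, many minutes
b = np.array([2450.0, 0.44])  # near-identical style, slightly fewer minutes
c = np.array([2500.0, 0.02])  # same minutes, completely different style

# Raw distances: the minutes column dominates, so c looks closer to a than b.
print(np.linalg.norm(a - b), np.linalg.norm(a - c))

# After z-scoring, style differences carry comparable weight and b is closer.
a_s, b_s, c_s = StandardScaler().fit(X).transform(np.vstack([a, b, c]))
print(np.linalg.norm(a_s - b_s), np.linalg.norm(a_s - c_s))
```

This is why every clustering example in this chapter scales its feature matrix before calling K-Means.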