Capstone - Complete Analytics System
Learning Objectives:
- Understand the concept of data-driven player role classification
- Select and engineer features for player clustering
- Apply K-Means clustering to identify player archetypes
- Use hierarchical clustering for role taxonomy
- Implement dimensionality reduction (PCA, t-SNE, UMAP) for visualization
- Validate and interpret clustering results
- Build player similarity search systems
- Apply role classification to recruitment and tactical planning
Beyond Traditional Positions
Traditional positions (striker, midfielder, defender) are increasingly inadequate for describing modern player roles. A "midfielder" can be a defensive destroyer, a creative playmaker, or a box-to-box runner—each requiring completely different skill sets. Data-driven clustering reveals the true archetypes that exist in football.
Why Cluster Players?
Clustering allows us to: (1) discover natural player groupings without predefined categories, (2) find similar players for recruitment, (3) understand tactical roles beyond positional labels, and (4) identify unique or hybrid players who don't fit standard molds.
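Point (2), finding similar players for recruitment, can be sketched as a nearest-neighbour search over scaled feature vectors. This is a minimal illustration, not the chapter's method; `player_stats` and the metric columns are assumed to exist as in the examples below.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(player_stats, target, feature_cols, n=5):
    """Return the n players whose scaled stat profiles are closest to `target`."""
    # Z-score each feature so no single metric dominates the comparison
    X = StandardScaler().fit_transform(player_stats[feature_cols])
    # Pairwise cosine similarity, shape (n_players, n_players)
    sim = cosine_similarity(X)
    pos = int(np.where(player_stats["player"].values == target)[0][0])
    order = np.argsort(-sim[pos])      # most similar first
    order = order[order != pos][:n]    # drop the target player himself
    return player_stats["player"].values[order].tolist()
```

The same idea, applied to the engineered features built later in this chapter, underpins the "player similarity search" objective listed above.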
Traditional Positions:
- GK, DEF, MID, FWD
- Based on starting location
- Ignores playing style differences
- Same label, very different players
Data-Driven Roles:
- Ball-Playing CB, Destroyer, Regista, False 9...
- Based on actual actions and behaviors
- Captures style, not just position
- Similar label = truly similar players
# Python: Introduction to Player Clustering
import pandas as pd
import numpy as np
# Example: Same position, different players
midfielders = player_stats[
(player_stats["position"] == "Midfielder") &
(player_stats["minutes"] >= 900)
].copy()
# Look at variance within "Midfielder"
metrics = ["tackles_90", "progressive_passes_90", "xa_90", "npxg_90"]
midfielder_variance = midfielders[metrics].agg(["mean", "std", lambda x: x.max() - x.min()])
midfielder_variance.index = ["mean", "std", "range"]
print("Variance within Midfielders:")
print(midfielder_variance)
# Create a simple style score
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
midfielders["style_score"] = (
scaler.fit_transform(midfielders[["xa_90"]]).flatten() -
scaler.fit_transform(midfielders[["tackles_90"]]).flatten()
)
extremes = midfielders.sort_values("style_score")[
["player", "team", "tackles_90", "xa_90", "style_score"]
]
print("\nMost Defensive Midfielders:")
print(extremes.head())
print("\nMost Creative Midfielders:")
print(extremes.tail())
# R: Introduction to Player Clustering
library(tidyverse)
# Example: Same position, different players
midfielders <- player_stats %>%
filter(position == "Midfielder", minutes >= 900)
# Look at the variance within "Midfielder"
midfielder_variance <- midfielders %>%
summarize(
across(c(tackles_90, progressive_passes_90, xa_90, npxg_90),
list(mean = mean, sd = sd, range = ~ max(.) - min(.)))
)
print("Variance within Midfielders:")
print(midfielder_variance)
# Clearly, "Midfielder" contains very different player types!
# Let's see the extremes
extremes <- midfielders %>%
mutate(
style_score = scale(xa_90) - scale(tackles_90)
) %>%
arrange(style_score) %>%
select(player, team, tackles_90, xa_90, style_score)
cat("\nMost Defensive Midfielders:\n")
print(head(extremes, 5))
cat("\nMost Creative Midfielders:\n")
print(tail(extremes, 5))
Variance within Midfielders:
tackles_90 progressive_passes_90 xa_90 npxg_90
mean 2.45 4.82 0.12 0.08
std 1.23 2.15 0.09 0.06
range 5.87 11.24 0.48 0.32
Most Defensive Midfielders:
player team tackles_90 xa_90 style_score
42 N'Golo Kanté Chelsea 4.82 0.03 -2.31
18 Wilfred Ndidi Leicester 4.56 0.05 -2.12
Most Creative Midfielders:
player team tackles_90 xa_90 style_score
87 Kevin De Bruyne Man City 1.23 0.42 2.45
91 Bruno Fernandes Man United 1.45 0.38 2.21
Feature Engineering for Clustering
The quality of clustering depends heavily on feature selection and engineering. Features should capture distinct aspects of playing style and be normalized appropriately.
| Feature Category | Example Metrics | What It Captures |
|---|---|---|
| Attacking Output | npxG/90, shots/90, xA/90 | Goal threat and chance creation |
| Passing Profile | Pass completion %, progressive passes, long balls | Passing style and risk tolerance |
| Defensive Actions | Tackles, interceptions, pressures, blocks | Defensive contribution and aggression |
| Ball Carrying | Progressive carries, dribbles, touches in box | Ball progression and directness |
| Spatial Profile | Average position, touch zones, action heatmap | Where they operate on the pitch |
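Why normalization matters: distance-based algorithms implicitly weight each feature by its raw scale, so a high-volume metric like touches can drown out a small-scale one like xA. A sketch with illustrative numbers (not from the chapter's dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three hypothetical per-90 profiles: [touches_90, xa_90].
# touches_90 lives on a far larger scale than xa_90.
a = np.array([55.0, 0.10])
b = np.array([80.0, 0.12])   # similar creativity to a, many more touches
c = np.array([56.0, 0.45])   # similar touches to a, very different creativity

# Raw Euclidean distance is dominated by the touches column,
# so a looks far from b but very close to c.
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# After z-scoring each column, both differences are weighted
# on equal footing and the two distances become comparable.
X = StandardScaler().fit_transform(np.vstack([a, b, c]))
scaled_ab = np.linalg.norm(X[0] - X[1])
scaled_ac = np.linalg.norm(X[0] - X[2])
print(raw_ab, raw_ac, scaled_ab, scaled_ac)
```

This is exactly why the feature-engineering functions below standardize every column before clustering.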
# Python: Feature Engineering for Player Clustering
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
def create_clustering_features(player_stats):
"""Create and scale features for player clustering."""
# Filter for minimum minutes
df = player_stats[player_stats["minutes"] >= 900].copy()
# Create composite features
df["goal_threat"] = df["npxg_90"] + (df["shots_90"] * 0.1)
df["chance_creation"] = df["xa_90"] + (df["key_passes_90"] * 0.05)
df["pass_risk"] = df["long_balls_90"] / (df["passes_90"] + 1)
df["progressive_rate"] = df["progressive_passes_90"] / (df["passes_90"] + 1)
df["defensive_actions"] = (
df["tackles_90"] + df["interceptions_90"] + df["blocks_90"]
)
df["pressing_intensity"] = df["pressures_90"]
df["ball_progression"] = df["progressive_carries_90"] + df["dribbles_90"]
df["box_presence"] = df["touches_att_pen_90"]
df["total_involvement"] = df["touches_90"]
# Select features for clustering
feature_cols = [
"goal_threat", "chance_creation",
"pass_completion_pct", "pass_risk", "progressive_rate",
"defensive_actions", "pressing_intensity",
"ball_progression", "box_presence",
"total_involvement"
]
# Extract features
features_raw = df[["player", "team", "position", "minutes"] + feature_cols].copy()
# Scale features
scaler = StandardScaler()
features_scaled = features_raw.copy()
features_scaled[feature_cols] = scaler.fit_transform(features_raw[feature_cols])
return {
"raw": features_raw,
"scaled": features_scaled,
"feature_cols": feature_cols,
"scaler": scaler
}
clustering_data = create_clustering_features(player_stats)
print(clustering_data["scaled"].head())
# R: Feature Engineering for Player Clustering
library(tidyverse)
library(recipes)
create_clustering_features <- function(player_stats) {
# Select and create features
features <- player_stats %>%
filter(minutes >= 900) %>% # Minimum sample size
mutate(
# Attacking
goal_threat = npxg_90 + (shots_90 * 0.1),
chance_creation = xa_90 + (key_passes_90 * 0.05),
# Passing
pass_risk = long_balls_90 / (passes_90 + 1),
progressive_rate = progressive_passes_90 / (passes_90 + 1),
# Defensive
defensive_actions = tackles_90 + interceptions_90 + blocks_90,
pressing_intensity = pressures_90,
# Ball carrying
ball_progression = progressive_carries_90 + dribbles_90,
box_presence = touches_att_pen_90,
# Involvement
total_involvement = touches_90
) %>%
select(
player, team, position, minutes,
goal_threat, chance_creation,
pass_completion_pct, pass_risk, progressive_rate,
defensive_actions, pressing_intensity,
ball_progression, box_presence,
total_involvement
)
# Scale features using recipes
feature_cols <- c("goal_threat", "chance_creation", "pass_completion_pct",
"pass_risk", "progressive_rate", "defensive_actions",
"pressing_intensity", "ball_progression", "box_presence",
"total_involvement")
# Standardize (z-score)
recipe <- recipe(~ ., data = features) %>%
step_normalize(all_of(feature_cols))
prepped <- prep(recipe)
features_scaled <- bake(prepped, features)
return(list(raw = features, scaled = features_scaled, feature_cols = feature_cols))
}
clustering_data <- create_clustering_features(player_stats)
print(head(clustering_data$scaled))
player team position minutes goal_threat chance_creation ...
0 Mo Salah Liverpool Forward 2890 2.31 1.85 ...
1 H. Kane Tottenham Forward 3102 2.45 0.92 ...
2 K. De Bruyne Man City Midfielder 2456 0.87 2.67 ...
K-Means Clustering
K-Means is a popular clustering algorithm that partitions players into K distinct groups based on similarity. Choosing the right K is crucial.
# Python: K-Means Clustering for Player Roles
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
def perform_kmeans_clustering(data, feature_cols, max_k=15):
"""Perform K-Means clustering with optimal K selection."""
# Extract feature matrix
X = data["scaled"][feature_cols].values
# Find optimal K
wss = []
sil_scores = []
for k in range(2, max_k + 1):
km = KMeans(n_clusters=k, n_init=25, max_iter=100, random_state=42)
km.fit(X)
wss.append(km.inertia_)
sil_scores.append(silhouette_score(X, km.labels_))
# Plot elbow and silhouette
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Elbow plot
axes[0].plot(range(2, max_k + 1), wss, "bo-")
axes[0].set_xlabel("Number of Clusters")
axes[0].set_ylabel("Within-cluster SS")
axes[0].set_title("Elbow Method")
# Silhouette plot
axes[1].plot(range(2, max_k + 1), sil_scores, "go-")
axes[1].set_xlabel("Number of Clusters")
axes[1].set_ylabel("Silhouette Score")
axes[1].set_title("Silhouette Analysis")
plt.tight_layout()
plt.show()
# Best K based on silhouette
best_k = np.argmax(sil_scores) + 2
# Final clustering
final_km = KMeans(n_clusters=best_k, n_init=50, max_iter=100, random_state=42)
data["scaled"]["cluster"] = final_km.fit_predict(X)
return {
"model": final_km,
"data": data["scaled"],
"best_k": best_k,
"silhouette": max(sil_scores)
}
km_result = perform_kmeans_clustering(clustering_data, clustering_data["feature_cols"])
print(f"Optimal number of clusters: {km_result['best_k']}")
print(f"Silhouette score: {km_result['silhouette']:.3f}")
print("\nCluster distribution:")
print(km_result["data"]["cluster"].value_counts().sort_index())
# R: K-Means Clustering for Player Roles
library(tidyverse)
library(cluster)
library(factoextra)
perform_kmeans_clustering <- function(data, feature_cols, max_k = 15) {
# Extract feature matrix
X <- data$scaled %>%
select(all_of(feature_cols)) %>%
as.matrix()
# Find optimal K using elbow method and silhouette
wss <- numeric(max_k)
sil <- numeric(max_k)
for (k in 2:max_k) {
km <- kmeans(X, centers = k, nstart = 25, iter.max = 100)
wss[k] <- km$tot.withinss
sil[k] <- mean(silhouette(km$cluster, dist(X))[, 3])
}
# Elbow plot
elbow_plot <- tibble(k = 1:max_k, wss = c(NA, wss[2:max_k])) %>%
ggplot(aes(x = k, y = wss)) +
geom_line() +
geom_point() +
labs(title = "Elbow Method", x = "Number of Clusters", y = "Within-cluster SS") +
theme_minimal()
# Silhouette plot
sil_plot <- tibble(k = 2:max_k, silhouette = sil[2:max_k]) %>%
ggplot(aes(x = k, y = silhouette)) +
geom_line() +
geom_point() +
labs(title = "Silhouette Analysis", x = "Number of Clusters", y = "Avg Silhouette") +
theme_minimal()
# Best K based on silhouette
best_k <- which.max(sil[2:max_k]) + 1
# Final clustering
final_km <- kmeans(X, centers = best_k, nstart = 50, iter.max = 100)
# Add cluster labels
data$scaled$cluster <- factor(final_km$cluster)
return(list(
model = final_km,
data = data$scaled,
elbow_plot = elbow_plot,
sil_plot = sil_plot,
best_k = best_k
))
}
km_result <- perform_kmeans_clustering(clustering_data, clustering_data$feature_cols)
cat(sprintf("Optimal number of clusters: %d\n", km_result$best_k))
print(table(km_result$data$cluster))
Optimal number of clusters: 8
Silhouette score: 0.342
Cluster distribution:
0 87
1 65
2 52
3 48
4 43
5 38
6 31
7 24
Interpreting Clusters
# Python: Interpret and Label Clusters
import pandas as pd
import numpy as np
def interpret_clusters(km_result, feature_cols):
"""Interpret clusters and assign meaningful labels."""
# Get cluster centers
centers = pd.DataFrame(
km_result["model"].cluster_centers_,
columns=feature_cols
)
centers["cluster"] = range(len(centers))
# Cluster summaries
data = km_result["data"]
cluster_summary = data.groupby("cluster").agg(
n_players=("player", "count"),
common_positions=("position", lambda x: ", ".join(
x.value_counts().head(2).index.tolist()
)),
example_players=("player", lambda x: ", ".join(x.head(3).tolist()))
).reset_index()
# Get distinguishing features
def get_key_features(row):
features = []
for col in feature_cols:
z = row[col]
if z > 0.5:
features.append(f"{col}: HIGH")
elif z < -0.5:
features.append(f"{col}: LOW")
return "; ".join(sorted(features, key=lambda x: -abs(row[x.split(":")[0].strip()]))[:3])
centers["key_features"] = centers.apply(get_key_features, axis=1)
# Auto-label clusters
def label_cluster(profile):
if "goal_threat: HIGH" in profile:
if "chance_creation: HIGH" in profile:
return "Complete Forward"
return "Goal Poacher"
if "chance_creation: HIGH" in profile:
return "Creative Playmaker"
if "defensive_actions: HIGH" in profile:
if "ball_progression: HIGH" in profile:
return "Box-to-Box"
return "Defensive Anchor"
if "ball_progression: HIGH" in profile:
return "Progressive Carrier"
if "pressing_intensity: HIGH" in profile:
return "High Press Forward"
return "Utility Player"
centers["label"] = centers["key_features"].apply(label_cluster)
# Combine
result = cluster_summary.merge(
centers[["cluster", "key_features", "label"]],
on="cluster"
)
return result
cluster_interpretation = interpret_clusters(km_result, clustering_data["feature_cols"])
print(cluster_interpretation[["cluster", "label", "n_players", "example_players"]])
# R: Interpret and Label Clusters
library(tidyverse)
interpret_clusters <- function(km_result, feature_cols) {
# Get cluster centers
centers <- as.data.frame(km_result$model$centers)
names(centers) <- feature_cols
centers$cluster <- 1:nrow(centers)
# Cluster summaries
cluster_summary <- km_result$data %>%
group_by(cluster) %>%
summarize(
n_players = n(),
common_positions = paste(names(sort(table(position), decreasing = TRUE)[1:2]),
collapse = ", "),
example_players = paste(head(player, 3), collapse = ", "),
.groups = "drop"
)
# Get distinguishing features for each cluster
cluster_profiles <- centers %>%
pivot_longer(-cluster, names_to = "feature", values_to = "z_score") %>%
group_by(cluster) %>%
mutate(
rank = rank(-abs(z_score)),
direction = ifelse(z_score > 0.5, "HIGH",
ifelse(z_score < -0.5, "LOW", "AVG"))
) %>%
filter(rank <= 3) %>%
arrange(cluster, rank) %>%
summarize(
key_features = paste(
paste(feature, direction, sep = ": "),
collapse = "; "
),
.groups = "drop"
)
# Auto-label clusters based on profiles
label_cluster <- function(profile) {
features <- strsplit(profile, "; ")[[1]]
if (grepl("goal_threat: HIGH", profile)) {
if (grepl("chance_creation: HIGH", profile)) return("Complete Forward")
return("Goal Poacher")
}
if (grepl("chance_creation: HIGH", profile)) return("Creative Playmaker")
if (grepl("defensive_actions: HIGH", profile)) {
if (grepl("ball_progression: HIGH", profile)) return("Box-to-Box")
return("Defensive Anchor")
}
if (grepl("ball_progression: HIGH", profile)) return("Progressive Carrier")
if (grepl("pressing_intensity: HIGH", profile)) return("High Press Forward")
return("Utility Player")
}
cluster_profiles$label <- sapply(cluster_profiles$key_features, label_cluster)
# Combine
result <- cluster_summary %>%
left_join(cluster_profiles, by = "cluster")
return(result)
}
cluster_interpretation <- interpret_clusters(km_result, clustering_data$feature_cols)
print(cluster_interpretation)
cluster label n_players example_players
0 0 Goal Poacher 87 Mo Salah, H. Kane, Aubameyang
1 1 Creative Playmaker 65 De Bruyne, Fernandes, Grealish
2 2 Defensive Anchor 52 Kanté, Ndidi, Fabinho
3 3 Box-to-Box 48 Henderson, Rice, Kovacic
4 4 Progressive Carrier 43 TAA, Robertson, Cancelo
5 5 Complete Forward 38 Son, Mané, Sterling
6 6 High Press Forward 31 Firmino, Werner, Havertz
7 7 Utility Player 24 Milner, Wijnaldum, Mount
Hierarchical Clustering
Hierarchical clustering creates a tree (dendrogram) of nested clusters, revealing relationships between player types at different levels of granularity. This is particularly useful for understanding how roles relate to each other.
Hierarchical vs K-Means
- K-Means: Fast, requires pre-specifying K, creates flat clusters
- Hierarchical: Slower, reveals cluster hierarchy, can cut at any level
- Use hierarchical when you want to understand relationships between clusters
# Python: Hierarchical Clustering for Player Roles
import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
def perform_hierarchical_clustering(data, feature_cols, method="ward"):
"""Perform hierarchical clustering and create dendrogram."""
# Extract feature matrix
X = data[feature_cols].values
players = data["player"].values
# Calculate linkage
linkage_matrix = linkage(X, method=method, metric="euclidean")
# Plot dendrogram
fig, ax = plt.subplots(figsize=(16, 8))
dendrogram(
linkage_matrix,
labels=players,
leaf_rotation=90,
leaf_font_size=6,
color_threshold=7,
ax=ax
)
ax.set_title("Player Role Hierarchy")
ax.set_ylabel("Distance")
plt.tight_layout()
plt.show()
# Cut at different levels
cuts = {
"k4": fcluster(linkage_matrix, 4, criterion="maxclust"),
"k8": fcluster(linkage_matrix, 8, criterion="maxclust"),
"k12": fcluster(linkage_matrix, 12, criterion="maxclust")
}
# Create progression table
progression = pd.DataFrame({
"player": players,
"level_4": cuts["k4"],
"level_8": cuts["k8"],
"level_12": cuts["k12"]
})
# Show nesting
nesting = progression.groupby(["level_4", "level_8"]).size().reset_index(name="count")
return {
"linkage": linkage_matrix,
"cuts": cuts,
"progression": progression,
"nesting": nesting
}
hc_result = perform_hierarchical_clustering(
km_result["data"],
clustering_data["feature_cols"]
)
print("Cluster Nesting (4 -> 8):")
print(hc_result["nesting"])
# Show hierarchy for specific cluster
print("\nPlayers in k=4 Cluster 1, by k=8 cluster:")
k4_c1 = hc_result["progression"][hc_result["progression"]["level_4"] == 1]
print(k4_c1["level_8"].value_counts().sort_index())
# R: Hierarchical Clustering for Player Roles
library(tidyverse)
library(cluster)
library(dendextend)
perform_hierarchical_clustering <- function(data, feature_cols, method = "ward.D2") {
# Extract feature matrix
X <- data %>%
select(all_of(feature_cols)) %>%
as.matrix()
rownames(X) <- data$player
# Calculate distance matrix
dist_matrix <- dist(X, method = "euclidean")
# Hierarchical clustering
hc <- hclust(dist_matrix, method = method)
# Create dendrogram
dend <- as.dendrogram(hc)
# Color branches by cluster (using 8 clusters)
dend <- color_branches(dend, k = 8)
# Plot dendrogram
plot(dend, main = "Player Role Hierarchy",
ylab = "Distance", leaflab = "none")
# Cut at different levels to see cluster hierarchy
cuts <- list(
k4 = cutree(hc, k = 4),
k8 = cutree(hc, k = 8),
k12 = cutree(hc, k = 12)
)
# Create cluster progression table
progression <- tibble(
player = data$player,
level_4 = cuts$k4,
level_8 = cuts$k8,
level_12 = cuts$k12
)
# Show how clusters nest
nesting <- progression %>%
count(level_4, level_8) %>%
arrange(level_4, level_8)
return(list(
model = hc,
dendrogram = dend,
cuts = cuts,
progression = progression,
nesting = nesting
))
}
hc_result <- perform_hierarchical_clustering(
km_result$data,
clustering_data$feature_cols
)
cat("Cluster Nesting (4 -> 8):\n")
print(hc_result$nesting)
# Show player examples at different hierarchy levels
cat("\nHierarchy Example - Cluster 1 at k=4:\n")
k4_cluster1 <- hc_result$progression %>%
filter(level_4 == 1) %>%
left_join(km_result$data %>% select(player, position), by = "player")
print(table(k4_cluster1$level_8, k4_cluster1$position))
Cluster Nesting (4 -> 8):
level_4 level_8 count
0 1 1 65
1 1 2 52
2 2 3 48
3 2 4 43
4 3 5 38
5 3 6 31
6 4 7 87
7 4 8 24
Players in k=4 Cluster 1, by k=8 cluster:
1 65
2 52
Name: level_8, dtype: int64
Understanding the Role Taxonomy
The dendrogram reveals natural groupings at different levels of specificity:
At 4 clusters (broad families):
- Attacking Players
- Creative Players
- Defensive Players
- Utility/Balanced
At 8 clusters (tactical roles):
- Goal Poacher
- Complete Forward
- Creative Playmaker
- Box-to-Box
- Defensive Anchor
- Ball-Playing Defender
- Progressive Full-Back
- High Press Forward
At 12+ clusters (specialist roles):
- Elite Goal Scorers
- Volume Shooters
- Deep Playmakers
- Final Third Creators
- Pressing Forwards
- Target Men
- ...and more
Gaussian Mixture Models
Unlike K-Means which assigns hard cluster labels, Gaussian Mixture Models (GMM) provide probabilistic cluster memberships. This is valuable for identifying players who blend multiple roles.
# Python: Gaussian Mixture Models for Soft Clustering
import pandas as pd
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
def perform_gmm_clustering(data, feature_cols, max_k=12):
"""Perform GMM clustering with soft assignments."""
X = data[feature_cols].values
# Find optimal number of components using BIC
bic_scores = []
for k in range(2, max_k + 1):
gmm = GaussianMixture(n_components=k, covariance_type="full",
random_state=42, n_init=10)
gmm.fit(X)
bic_scores.append(gmm.bic(X))
best_k = np.argmin(bic_scores) + 2
print(f"Optimal components (by BIC): {best_k}")
# Fit final model
final_gmm = GaussianMixture(n_components=best_k, covariance_type="full",
random_state=42, n_init=10)
final_gmm.fit(X)
# Get assignments and probabilities
data = data.copy()
data["cluster"] = final_gmm.predict(X)
probabilities = final_gmm.predict_proba(X)
# Identify hybrid players
max_prob = probabilities.max(axis=1)
sorted_probs = np.sort(probabilities, axis=1)[:, ::-1]
second_prob = sorted_probs[:, 1]
data["max_prob"] = max_prob
data["second_prob"] = second_prob
data["is_hybrid"] = second_prob > 0.25
# Get hybrid player details
hybrid_mask = data["is_hybrid"]
hybrid_players = data[hybrid_mask][["player", "cluster", "max_prob", "second_prob"]].copy()
second_clusters = []
for i, (idx, row) in enumerate(data[hybrid_mask].iterrows()):
probs = probabilities[data.index.get_loc(idx)]
sorted_indices = np.argsort(probs)[::-1]
second_clusters.append(sorted_indices[1])
hybrid_players["second_cluster"] = second_clusters
return {
"model": final_gmm,
"data": data,
"probabilities": probabilities,
"hybrid_players": hybrid_players
}
gmm_result = perform_gmm_clustering(km_result["data"], clustering_data["feature_cols"])
print("\nHybrid Players (blend multiple roles):")
print(gmm_result["hybrid_players"].head(10))
# R: Gaussian Mixture Models for Soft Clustering
library(tidyverse)
library(mclust)
perform_gmm_clustering <- function(data, feature_cols, max_k = 12) {
# Extract feature matrix
X <- data %>%
select(all_of(feature_cols)) %>%
as.matrix()
# Fit GMM with automatic model selection
gmm <- Mclust(X, G = 2:max_k)
cat("Selected model:", gmm$modelName, "with", gmm$G, "clusters\n")
cat("BIC:", gmm$bic, "\n")
# Get cluster assignments and probabilities
data$cluster <- gmm$classification
probabilities <- gmm$z
colnames(probabilities) <- paste0("prob_cluster_", 1:ncol(probabilities))
# Identify "hybrid" players (high probability in multiple clusters)
max_prob <- apply(probabilities, 1, max)
second_prob <- apply(probabilities, 1, function(x) sort(x, decreasing = TRUE)[2])
data$max_prob <- max_prob
data$is_hybrid <- second_prob > 0.25 # Players with >25% in second-best cluster
# Show hybrid players
hybrid_players <- data %>%
filter(is_hybrid) %>%
select(player, cluster, max_prob) %>%
mutate(
second_cluster = apply(probabilities[data$is_hybrid, ], 1,
function(x) which(x == sort(x, decreasing = TRUE)[2])),
second_prob = second_prob[data$is_hybrid]
)
return(list(
model = gmm,
data = data,
probabilities = probabilities,
hybrid_players = hybrid_players
))
}
gmm_result <- perform_gmm_clustering(km_result$data, clustering_data$feature_cols)
cat("\nHybrid Players (blend multiple roles):\n")
print(head(gmm_result$hybrid_players, 10))
Optimal components (by BIC): 7
Hybrid Players (blend multiple roles):
player cluster max_prob second_prob second_cluster
23 Roberto Firmino 6 0.523 0.312 5
45 Jack Grealish 1 0.489 0.298 3
67 Joshua Kimmich 2 0.534 0.287 3
89 João Cancelo 4 0.478 0.341 3
112 Federico Valverde 3 0.456 0.312 2
Hybrid Players Are Valuable
Players who span multiple role clusters are often tactically versatile and valuable. Firmino as a "false 9" combines forward and playmaker attributes. Kimmich blends defensive midfield with box-to-box capabilities. These players provide tactical flexibility.
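One way to turn this intuition into a number is to score each player's versatility as the entropy of their cluster-membership distribution. A sketch with illustrative probabilities; in practice you would pass in something like `gmm_result["probabilities"]` from the GMM example above:

```python
import numpy as np

def role_versatility(probabilities):
    """Shannon entropy of each player's cluster-membership distribution.

    0 means fully committed to one role; higher values mean the player's
    profile is spread across several role clusters (a hybrid).
    """
    p = np.clip(probabilities, 1e-12, 1.0)   # guard against log(0)
    return -(p * np.log(p)).sum(axis=1)

# Illustrative membership probabilities for two players over 3 clusters
probs = np.array([
    [0.97, 0.02, 0.01],   # a specialist: almost all mass in one role
    [0.50, 0.30, 0.20],   # a hybrid: mass spread across roles
])
v = role_versatility(probs)
print(v)  # the hybrid player scores markedly higher
```

Ranking a squad by this score surfaces the Firmino/Kimmich-style players discussed above without a hand-picked probability threshold.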
Position-Specific Clustering
Clustering all players together can be noisy. Position-specific clustering uses tailored features for each position group, producing more meaningful within-position archetypes.
# Python: Position-Specific Clustering
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
def create_position_clusters(player_stats):
"""Create position-specific clusters with tailored features."""
results = {}
# Forward-specific clustering
forwards = player_stats[
player_stats["position"].isin(["Forward", "Winger"]) &
(player_stats["minutes"] >= 900)
].copy()
forward_features = [
"npxg_90", "shots_90", "xa_90", "dribbles_90",
"touches_att_pen_90", "progressive_carries_90",
"pressures_att_third_90", "aerial_wins_90"
]
fwd_X = StandardScaler().fit_transform(forwards[forward_features])
fwd_km = KMeans(n_clusters=5, random_state=42, n_init=25)
forwards["cluster"] = fwd_km.fit_predict(fwd_X)
fwd_labels = {
0: "Goal Poacher", 1: "Complete Forward", 2: "Creator Winger",
3: "Inside Forward", 4: "Target Man"
}
forwards["role"] = forwards["cluster"].map(fwd_labels)
results["forwards"] = forwards
# Midfielder-specific clustering
midfielders = player_stats[
(player_stats["position"] == "Midfielder") &
(player_stats["minutes"] >= 900)
].copy()
mid_features = [
"progressive_passes_90", "xa_90", "tackles_90", "interceptions_90",
"pressures_90", "pass_completion_pct", "progressive_carries_90",
"through_balls_90"
]
mid_X = StandardScaler().fit_transform(midfielders[mid_features])
mid_km = KMeans(n_clusters=5, random_state=42, n_init=25)
midfielders["cluster"] = mid_km.fit_predict(mid_X)
mid_labels = {
0: "Defensive Anchor", 1: "Deep Playmaker", 2: "Box-to-Box",
3: "Advanced Playmaker", 4: "Pressing Midfielder"
}
midfielders["role"] = midfielders["cluster"].map(mid_labels)
results["midfielders"] = midfielders
# Defender-specific clustering
defenders = player_stats[
player_stats["position"].isin(["Defender", "Center Back", "Full Back"]) &
(player_stats["minutes"] >= 900)
].copy()
def_features = [
"tackles_90", "interceptions_90", "blocks_90", "clearances_90",
"aerial_wins_90", "progressive_passes_90", "progressive_carries_90",
"pass_completion_pct"
]
def_X = StandardScaler().fit_transform(defenders[def_features])
def_km = KMeans(n_clusters=5, random_state=42, n_init=25)
defenders["cluster"] = def_km.fit_predict(def_X)
def_labels = {
0: "Ball-Playing CB", 1: "Traditional Stopper", 2: "Attacking FB",
3: "Defensive FB", 4: "Sweeper"
}
defenders["role"] = defenders["cluster"].map(def_labels)
results["defenders"] = defenders
return results
position_clusters = create_position_clusters(player_stats)
print("Forward Archetypes:")
print(position_clusters["forwards"]["role"].value_counts())
print("\nMidfielder Archetypes:")
print(position_clusters["midfielders"]["role"].value_counts())
print("\nDefender Archetypes:")
print(position_clusters["defenders"]["role"].value_counts())
# R: Position-Specific Clustering
library(tidyverse)
library(cluster)
create_position_clusters <- function(player_stats) {
results <- list()
# Forward-specific features and clustering
forwards <- player_stats %>%
filter(position %in% c("Forward", "Winger")) %>%
filter(minutes >= 900)
forward_features <- c(
"npxg_90", "shots_90", "xa_90", "dribbles_90",
"touches_att_pen_90", "progressive_carries_90",
"pressures_att_third_90", "aerial_wins_90"
)
fwd_X <- forwards %>%
select(all_of(forward_features)) %>%
scale()
fwd_km <- kmeans(fwd_X, centers = 5, nstart = 25)
forwards$cluster <- factor(fwd_km$cluster)
# Label forward clusters
fwd_labels <- c("Goal Poacher", "Complete Forward", "Creator Winger",
"Inside Forward", "Target Man")
forwards <- forwards %>%
mutate(role = fwd_labels[as.numeric(cluster)])
results$forwards <- forwards
# Midfielder-specific features
midfielders <- player_stats %>%
filter(position == "Midfielder") %>%
filter(minutes >= 900)
mid_features <- c(
"progressive_passes_90", "xa_90", "tackles_90", "interceptions_90",
"pressures_90", "pass_completion_pct", "progressive_carries_90",
"through_balls_90"
)
mid_X <- midfielders %>%
select(all_of(mid_features)) %>%
scale()
mid_km <- kmeans(mid_X, centers = 5, nstart = 25)
midfielders$cluster <- factor(mid_km$cluster)
mid_labels <- c("Defensive Anchor", "Deep Playmaker", "Box-to-Box",
"Advanced Playmaker", "Pressing Midfielder")
midfielders <- midfielders %>%
mutate(role = mid_labels[as.numeric(cluster)])
results$midfielders <- midfielders
# Defender-specific features
defenders <- player_stats %>%
filter(position %in% c("Defender", "Center Back", "Full Back")) %>%
filter(minutes >= 900)
def_features <- c(
"tackles_90", "interceptions_90", "blocks_90", "clearances_90",
"aerial_wins_90", "progressive_passes_90", "progressive_carries_90",
"pass_completion_pct"
)
def_X <- defenders %>%
select(all_of(def_features)) %>%
scale()
def_km <- kmeans(def_X, centers = 5, nstart = 25)
defenders$cluster <- factor(def_km$cluster)
def_labels <- c("Ball-Playing CB", "Traditional Stopper", "Attacking FB",
"Defensive FB", "Sweeper")
defenders <- defenders %>%
mutate(role = def_labels[as.numeric(cluster)])
results$defenders <- defenders
return(results)
}
position_clusters <- create_position_clusters(player_stats)
cat("Forward Archetypes:\n")
print(table(position_clusters$forwards$role))
cat("\nMidfielder Archetypes:\n")
print(table(position_clusters$midfielders$role))
cat("\nDefender Archetypes:\n")
print(table(position_clusters$defenders$role))
Forward Archetypes:
Complete Forward 34
Inside Forward 28
Goal Poacher 24
Creator Winger 21
Target Man 15
Midfielder Archetypes:
Box-to-Box 38
Advanced Playmaker 32
Defensive Anchor 28
Deep Playmaker 24
Pressing Midfielder 18
Defender Archetypes:
Ball-Playing CB 42
Attacking FB 38
Traditional Stopper 35
Defensive FB 28
Sweeper 22
Cluster Validation
Beyond silhouette scores, several techniques help validate that clusters are meaningful and stable.
# Python: Comprehensive Cluster Validation
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score, adjusted_rand_score)
from scipy.spatial.distance import cdist

def validate_clusters(data, feature_cols, k_range=range(2, 13)):
    """Comprehensive cluster validation."""
    X = data[feature_cols].values
    results = {}

    # 1. Internal validation metrics
    internal_metrics = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=25, random_state=42)
        labels = km.fit_predict(X)
        metrics = {
            "k": k,
            "silhouette": silhouette_score(X, labels),
            "calinski_harabasz": calinski_harabasz_score(X, labels),
            "davies_bouldin": davies_bouldin_score(X, labels)
        }
        internal_metrics.append(metrics)
    results["internal"] = pd.DataFrame(internal_metrics)

    # 2. Stability validation (bootstrap)
    n_boot = 50
    k_test = 8
    original_km = KMeans(n_clusters=k_test, n_init=25, random_state=42)
    original_labels = original_km.fit_predict(X)
    stability_scores = []
    for b in range(n_boot):
        # Bootstrap sample
        boot_idx = np.random.choice(len(X), size=len(X), replace=True)
        X_boot = X[boot_idx]
        # Cluster the bootstrap sample
        boot_km = KMeans(n_clusters=k_test, n_init=10, random_state=b)
        boot_labels = boot_km.fit_predict(X_boot)
        # Map back to original indices for comparison
        unique_idx = np.unique(boot_idx)
        if len(unique_idx) > 10:
            orig_labels_subset = original_labels[unique_idx]
            # Use the boot label of each index's first occurrence
            boot_labels_mapped = []
            for idx in unique_idx:
                positions = np.where(boot_idx == idx)[0]
                boot_labels_mapped.append(boot_labels[positions[0]])
            ari = adjusted_rand_score(orig_labels_subset, boot_labels_mapped)
            stability_scores.append(ari)
    results["stability"] = np.mean(stability_scores)

    # 3. Cluster separation
    km_final = KMeans(n_clusters=8, n_init=25, random_state=42)
    km_final.fit(X)
    center_distances = cdist(km_final.cluster_centers_, km_final.cluster_centers_)
    np.fill_diagonal(center_distances, np.inf)
    results["min_center_distance"] = center_distances.min()
    results["avg_center_distance"] = center_distances[center_distances < np.inf].mean()

    return results

validation = validate_clusters(km_result["data"], clustering_data["feature_cols"])
print("Internal Validation Metrics:")
print(validation["internal"])
print(f"\nCluster Stability (Adjusted Rand Index): {validation['stability']:.3f}")
print(f"Min Center Distance: {validation['min_center_distance']:.3f}")
print(f"Avg Center Distance: {validation['avg_center_distance']:.3f}")
# R: Comprehensive Cluster Validation
library(tidyverse)
library(cluster)
library(clusterSim)  # index.DB()
library(fpc)         # calinhara(), cluster.stats()

validate_clusters <- function(data, feature_cols, k_range = 2:12) {
  X <- data %>%
    select(all_of(feature_cols)) %>%
    as.matrix()
  results <- list()

  # 1. Internal validation metrics
  internal_metrics <- tibble(
    k = k_range,
    silhouette = NA_real_,
    calinski_harabasz = NA_real_,
    davies_bouldin = NA_real_
  )
  for (k in k_range) {
    km <- kmeans(X, centers = k, nstart = 25)
    # Silhouette
    sil <- silhouette(km$cluster, dist(X))
    internal_metrics$silhouette[internal_metrics$k == k] <- mean(sil[, 3])
    # Calinski-Harabasz Index (higher is better)
    internal_metrics$calinski_harabasz[internal_metrics$k == k] <- calinhara(X, km$cluster)
    # Davies-Bouldin Index (lower is better)
    internal_metrics$davies_bouldin[internal_metrics$k == k] <- index.DB(X, km$cluster)$DB
  }
  results$internal <- internal_metrics

  # 2. Stability validation (bootstrap)
  # Compare cluster assignments across bootstrap samples
  n_boot <- 50
  k_test <- 8  # Test stability for k = 8
  stability_results <- rep(NA_real_, n_boot)
  original_km <- kmeans(X, centers = k_test, nstart = 25)
  for (b in 1:n_boot) {
    # Bootstrap sample
    boot_idx <- sample(nrow(X), replace = TRUE)
    X_boot <- X[boot_idx, ]
    # Cluster the bootstrap sample
    boot_km <- kmeans(X_boot, centers = k_test, nstart = 10)
    # Compare to original (corrected Rand index on common samples)
    common <- unique(boot_idx)
    if (length(common) > 10) {
      ri <- cluster.stats(dist(X[common, ]),
                          original_km$cluster[common],
                          boot_km$cluster[match(common, boot_idx)])$corrected.rand
      stability_results[b] <- ri
    }
  }
  results$stability <- mean(stability_results, na.rm = TRUE)

  # 3. Cluster separation (between-cluster distances)
  km_final <- kmeans(X, centers = 8, nstart = 25)
  center_distances <- as.matrix(dist(km_final$centers))
  results$min_center_distance <- min(center_distances[center_distances > 0])
  results$avg_center_distance <- mean(center_distances[center_distances > 0])

  return(results)
}

validation <- validate_clusters(km_result$data, clustering_data$feature_cols)
cat("Internal Validation Metrics:\n")
print(validation$internal)
cat("\nCluster Stability (Adjusted Rand Index):", round(validation$stability, 3), "\n")
cat("Min Center Distance:", round(validation$min_center_distance, 3), "\n")
cat("Avg Center Distance:", round(validation$avg_center_distance, 3), "\n")
Internal Validation Metrics:
k silhouette calinski_harabasz davies_bouldin
0 2 0.312 156.23 1.234
1 3 0.298 178.45 1.156
2 4 0.324 198.67 1.089
3 5 0.318 212.34 1.045
4 6 0.335 225.89 0.987
5 7 0.341 234.56 0.945
6 8 0.342 241.23 0.923
7 9 0.328 238.45 0.967
8 10 0.315 232.12 1.012
Cluster Stability (Adjusted Rand Index): 0.724
Min Center Distance: 2.345
Avg Center Distance: 4.567
| Metric | Interpretation | Good Values |
|---|---|---|
| Silhouette Score | How similar objects are to own cluster vs other clusters | > 0.3 is reasonable, > 0.5 is good |
| Calinski-Harabasz | Ratio of between-cluster to within-cluster variance | Higher is better (no absolute threshold) |
| Davies-Bouldin | Average similarity of each cluster to its most similar cluster | Lower is better, < 1 is good |
| Stability (ARI) | Consistency of clusters across bootstrap samples | > 0.6 suggests stable clusters |
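As a quick sanity check of the thresholds in the table, here is a minimal sketch on synthetic, well-separated data (toy blobs, not real player stats): all three metrics should land comfortably in the "good" range.

```python
# Toy check of the validation thresholds: well-separated synthetic data
# should score silhouette > 0.5 and Davies-Bouldin < 1.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.6,
                  random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

sil = silhouette_score(X, labels)        # higher is better
ch = calinski_harabasz_score(X, labels)  # higher is better
db = davies_bouldin_score(X, labels)     # lower is better
print(f"silhouette={sil:.3f}  CH={ch:.1f}  DB={db:.3f}")
```

Real player data rarely separates this cleanly, which is why the working thresholds in the table are looser than textbook examples suggest.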
Dimensionality Reduction for Visualization
With 10+ features, visualizing clusters requires dimensionality reduction. PCA, t-SNE, and UMAP each have different strengths.
# Python: Visualizing Clusters with Dimensionality Reduction
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
import matplotlib.pyplot as plt
def visualize_clusters(data, feature_cols, cluster_labels):
    """Visualize clusters using different dimensionality reduction techniques."""
    X = data[feature_cols].values
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # PCA
    pca = PCA(n_components=2)
    pca_coords = pca.fit_transform(X)
    explained_var = sum(pca.explained_variance_ratio_) * 100
    scatter1 = axes[0].scatter(
        pca_coords[:, 0], pca_coords[:, 1],
        c=cluster_labels, cmap="tab10", alpha=0.7
    )
    axes[0].set_xlabel("PC1")
    axes[0].set_ylabel("PC2")
    axes[0].set_title(f"PCA (Explained var: {explained_var:.1f}%)")

    # t-SNE
    tsne = TSNE(n_components=2, perplexity=30, random_state=42)
    tsne_coords = tsne.fit_transform(X)
    axes[1].scatter(
        tsne_coords[:, 0], tsne_coords[:, 1],
        c=cluster_labels, cmap="tab10", alpha=0.7
    )
    axes[1].set_xlabel("Dimension 1")
    axes[1].set_ylabel("Dimension 2")
    axes[1].set_title("t-SNE")

    # UMAP
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
    umap_coords = reducer.fit_transform(X)
    axes[2].scatter(
        umap_coords[:, 0], umap_coords[:, 1],
        c=cluster_labels, cmap="tab10", alpha=0.7
    )
    axes[2].set_xlabel("UMAP 1")
    axes[2].set_ylabel("UMAP 2")
    axes[2].set_title("UMAP")

    # Figure-level colorbar shared by all three panels
    fig.colorbar(scatter1, ax=axes, label="Cluster", shrink=0.6)
    plt.show()
    return {"pca": pca_coords, "tsne": tsne_coords, "umap": umap_coords}

viz_coords = visualize_clusters(
    km_result["data"],
    clustering_data["feature_cols"],
    km_result["data"]["cluster"]
)
# R: Visualizing Clusters with Dimensionality Reduction
library(tidyverse)
library(Rtsne)
library(umap)
visualize_clusters <- function(data, feature_cols, cluster_labels) {
  X <- data %>% select(all_of(feature_cols)) %>% as.matrix()

  # PCA
  pca <- prcomp(X)
  pca_df <- tibble(
    PC1 = pca$x[, 1],
    PC2 = pca$x[, 2],
    cluster = factor(cluster_labels),  # factor for a discrete palette
    player = data$player
  )
  pca_plot <- ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster)) +
    geom_point(alpha = 0.7, size = 2) +
    labs(title = "PCA - Player Clusters",
         subtitle = sprintf("Explained variance: %.1f%%",
                            sum(pca$sdev[1:2]^2) / sum(pca$sdev^2) * 100)) +
    theme_minimal()

  # t-SNE
  set.seed(42)
  tsne <- Rtsne(X, perplexity = 30, dims = 2)
  tsne_df <- tibble(
    Dim1 = tsne$Y[, 1],
    Dim2 = tsne$Y[, 2],
    cluster = factor(cluster_labels),
    player = data$player
  )
  tsne_plot <- ggplot(tsne_df, aes(x = Dim1, y = Dim2, color = cluster)) +
    geom_point(alpha = 0.7, size = 2) +
    labs(title = "t-SNE - Player Clusters") +
    theme_minimal()

  # UMAP
  umap_config <- umap.defaults
  umap_config$n_neighbors <- 15
  umap_result <- umap(X, config = umap_config)
  umap_df <- tibble(
    UMAP1 = umap_result$layout[, 1],
    UMAP2 = umap_result$layout[, 2],
    cluster = factor(cluster_labels),
    player = data$player
  )
  umap_plot <- ggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = cluster)) +
    geom_point(alpha = 0.7, size = 2) +
    labs(title = "UMAP - Player Clusters") +
    theme_minimal()

  return(list(pca = pca_plot, tsne = tsne_plot, umap = umap_plot))
}

viz <- visualize_clusters(
  km_result$data,
  clustering_data$feature_cols,
  km_result$data$cluster
)
print(viz$umap)
Player Similarity Search
Finding similar players is one of the most practical applications of clustering. This is essential for recruitment when looking for replacements or specific player profiles.
# Python: Player Similarity Search
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
def create_similarity_engine(data, feature_cols):
    """Create a player similarity search function."""
    # Build feature matrix
    X = data[feature_cols].values
    players = data["player"].values
    # Calculate pairwise distances once, up front
    dist_matrix = cdist(X, X, metric="euclidean")
    player_to_idx = {p: i for i, p in enumerate(players)}

    def find_similar(player_name, n=10, same_cluster_only=False,
                     position_filter=None):
        """Find the most similar players to the target."""
        if player_name not in player_to_idx:
            raise ValueError(f"Player not found: {player_name}")
        idx = player_to_idx[player_name]
        distances = dist_matrix[idx]
        # Create results
        results = pd.DataFrame({
            "player": players,
            "distance": distances
        }).merge(
            data[["player", "team", "position", "cluster"]],
            on="player"
        )
        # Remove self
        results = results[results["player"] != player_name]
        # Apply filters
        if same_cluster_only:
            target_cluster = data.loc[data["player"] == player_name, "cluster"].iloc[0]
            results = results[results["cluster"] == target_cluster]
        if position_filter:
            results = results[results["position"].isin(position_filter)]
        # Sort, convert distance to a similarity score, and return
        results = results.nsmallest(n, "distance")
        results["similarity"] = 1 / (1 + results["distance"])
        return results[["player", "team", "position", "cluster", "distance", "similarity"]]

    return find_similar

find_similar = create_similarity_engine(km_result["data"], clustering_data["feature_cols"])

# Find players similar to Mo Salah
print("Players most similar to Mo Salah:")
print(find_similar("Mo Salah", n=10))

# Find similar midfielders to De Bruyne
print("\nPlayers similar to De Bruyne (same cluster):")
print(find_similar("Kevin De Bruyne", n=5, same_cluster_only=True))
# R: Player Similarity Search
library(tidyverse)
create_similarity_engine <- function(data, feature_cols) {
  # Build feature matrix
  X <- data %>%
    select(all_of(feature_cols)) %>%
    as.matrix()
  rownames(X) <- data$player
  # Calculate pairwise distances once, up front
  dist_matrix <- as.matrix(dist(X, method = "euclidean"))

  # Create similarity function
  find_similar <- function(player_name, n = 10, same_cluster_only = FALSE,
                           position_filter = NULL) {
    if (!player_name %in% rownames(dist_matrix)) {
      stop("Player not found: ", player_name)
    }
    # Get distances from target player
    distances <- dist_matrix[player_name, ]
    # Create results dataframe
    results <- tibble(
      player = names(distances),
      distance = distances
    ) %>%
      left_join(
        data %>% select(player, team, position, cluster),
        by = "player"
      ) %>%
      filter(player != player_name)
    # Apply filters
    if (same_cluster_only) {
      target_cluster <- data$cluster[data$player == player_name]
      results <- results %>% filter(cluster == target_cluster)
    }
    if (!is.null(position_filter)) {
      results <- results %>% filter(position %in% position_filter)
    }
    # Return top N with a similarity score
    results %>%
      arrange(distance) %>%
      head(n) %>%
      mutate(similarity = 1 / (1 + distance))  # Convert distance to similarity
  }

  return(find_similar)
}

find_similar <- create_similarity_engine(km_result$data, clustering_data$feature_cols)

# Find players similar to Mo Salah
similar_to_salah <- find_similar("Mo Salah", n = 10)
print("Players most similar to Mo Salah:")
print(similar_to_salah)

# Find similar players who are also in the same cluster
similar_same_cluster <- find_similar("Kevin De Bruyne", n = 5, same_cluster_only = TRUE)
print("\nPlayers similar to De Bruyne (same cluster):")
print(similar_same_cluster)
Players most similar to Mo Salah:
player team position cluster distance similarity
0 Son Tottenham Forward 0 0.823 0.549
1 Mané Liverpool Forward 0 0.912 0.523
2 Sterling Man City Forward 0 1.045 0.489
3 Rashford Man United Forward 0 1.156 0.464
4 Mahrez Man City Forward 0 1.234 0.448
Players similar to De Bruyne (same cluster):
player team position cluster distance similarity
0 Bruno Fernandes Man United Midfielder 1 0.687 0.593
1 Grealish Aston Villa Midfielder 1 0.845 0.542
2 James Maddison Leicester Midfielder 1 0.923 0.520
Practical Applications
Recruitment:
- Find replacements for departing players
- Identify undervalued players with similar profiles
- Search for specific role types (e.g., "ball-playing CB")
- Filter by age, value, and league for realistic targets
Squad planning:
- Understand what role types you have vs. what you need
- Identify role gaps in the squad
- Match player profiles to tactical systems
- Plan training focus based on role requirements
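The squad-planning idea reduces to a simple tally of role labels per team. A minimal sketch with a hypothetical `squad` DataFrame (in practice the roles would come from the cluster assignments above):

```python
# Sketch of squad planning: tally role labels per team to expose gaps.
# `squad` is hypothetical toy data, not real cluster output.
import pandas as pd

squad = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B", "B"],
    "role": ["Regista", "Box-to-Box", "Box-to-Box",
             "Regista", "Regista", "Defensive Anchor"],
})
role_counts = (squad.groupby("team")["role"]
               .value_counts()
               .unstack(fill_value=0))
print(role_counts)  # a zero in a row marks a candidate role gap
```

Comparing two squads' rows side by side is the basis of the squad-comparison exercise later in the chapter.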
# Python: Recruitment Use Case
import pandas as pd
def find_replacement_targets(departing_player, find_similar_fn, player_info,
                             max_age=28, max_value=50, exclude_leagues=None):
    """Find replacement targets for a departing player."""
    if exclude_leagues is None:
        exclude_leagues = []
    # Find similar players
    similar = find_similar_fn(departing_player, n=50)
    # Merge with player info
    targets = similar.merge(
        player_info[["player", "age", "market_value", "league"]],
        on="player",
        how="left"
    )
    # Apply recruitment filters
    targets = targets[
        (targets["age"] <= max_age) &
        (targets["market_value"] <= max_value) &
        (~targets["league"].isin(exclude_leagues))
    ]
    # Calculate value-for-money and fit scores
    targets["value_score"] = targets["similarity"] / (targets["market_value"] / 10 + 1)
    targets["fit_score"] = targets["similarity"] * 100
    return targets.nlargest(10, "value_score")[
        ["player", "team", "age", "market_value", "fit_score", "value_score"]
    ]

# Find a replacement for Mo Salah (hypothetical)
replacement_targets = find_replacement_targets(
    departing_player="Mo Salah",
    find_similar_fn=find_similar,
    player_info=player_info,
    max_age=26,
    max_value=60,
    exclude_leagues=["Premier League"]  # Look outside the current league
)
print("Replacement Targets for Mo Salah:")
print(replacement_targets)
# R: Recruitment Use Case
library(tidyverse)
find_replacement_targets <- function(departing_player, player_info, find_similar_fn,
                                     max_age = 28, max_value = 50,
                                     exclude_leagues = c()) {
  # Find similar players
  similar <- find_similar_fn(departing_player, n = 50)
  # Merge with player info and apply recruitment filters
  targets <- similar %>%
    left_join(
      player_info %>% select(player, age, market_value, league),
      by = "player"
    ) %>%
    filter(
      age <= max_age,
      market_value <= max_value,
      !league %in% exclude_leagues
    ) %>%
    mutate(
      value_score = similarity / (market_value / 10 + 1),  # Value for money
      fit_score = similarity * 100
    ) %>%
    arrange(desc(value_score)) %>%
    head(10)
  return(targets)
}

# Find a replacement for Mo Salah (hypothetical)
replacement_targets <- find_replacement_targets(
  departing_player = "Mo Salah",
  player_info = player_info,
  find_similar_fn = find_similar,
  max_age = 26,
  max_value = 60,
  exclude_leagues = c("Premier League")  # Look outside the current league
)
print("Replacement Targets for Mo Salah:")
print(replacement_targets %>%
        select(player, team, age, market_value, fit_score, value_score))
Replacement Targets for Mo Salah:
player team age market_value fit_score value_score
0 Rafael Leão AC Milan 23 45.0 54.2 0.108
1 Antony Ajax 22 35.0 51.8 0.124
2 Moussa Diaby Leverkusen 22 40.0 49.3 0.107
3 Jeremy Doku Rennes 20 25.0 46.1 0.132
Practice Exercises
Task: Cluster only central midfielders to identify distinct midfielder archetypes.
Requirements:
- Filter to central midfielders only
- Select features relevant to midfield play
- Determine optimal K and perform clustering
- Interpret and label the midfielder types
Task: Build an interactive player similarity search with filters.
Requirements:
- Allow filtering by age, value, league
- Show similar players with radar chart comparisons
- Calculate a "value for money" score
- Export shortlist to CSV
Task: Track how a player's role cluster changes over multiple seasons.
Requirements:
- Load multi-season data for a player
- Assign cluster for each season
- Visualize the evolution of their playing style
- Identify what changed in their profile
Task: Create a tool that profiles a squad's tactical composition based on player roles.
Requirements:
- Load all players from a single squad
- Assign roles to each player using clustering
- Visualize the distribution of roles in the squad
- Identify role gaps (e.g., "no creative playmaker") and redundancies
- Compare to another squad's profile
Task: Use DBSCAN clustering to identify unique players who don't fit any archetype.
Requirements:
- Apply DBSCAN with different epsilon values
- Identify players labeled as noise/outliers
- Analyze what makes these players unique
- Determine if they're truly unique or just have extreme metrics
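As a starting point for this exercise, here is a minimal sketch on synthetic data (made-up 2-D points, not real player stats): DBSCAN labels any point that fits no dense group as -1 ("noise"), which is exactly how the unique-player candidates fall out.

```python
# Starter sketch for the DBSCAN exercise: two dense synthetic groups
# plus two extreme profiles; the extremes come back labeled -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
core = np.vstack([
    rng.normal([0, 0], 0.3, size=(50, 2)),  # archetype A
    rng.normal([3, 3], 0.3, size=(50, 2)),  # archetype B
])
outliers = np.array([[8.0, -2.0], [-4.0, 6.0]])  # extreme profiles
X = StandardScaler().fit_transform(np.vstack([core, outliers]))

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters={n_clusters}, outlier labels={list(labels[-2:])}")
```

Sweeping `eps`, as the exercise asks, shifts the boundary between "fringe member of an archetype" and "genuinely unique".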
Task: Compare how player role distributions differ across leagues.
Requirements:
- Cluster players from multiple leagues together
- Analyze role distribution by league (e.g., "more Goal Poachers in Serie A")
- Test whether role definitions are consistent across leagues
- Identify leagues with unique role distributions
Chapter Summary
Key Takeaways
- Traditional positions are insufficient: Data-driven clustering reveals true playing style differences
- Feature engineering is crucial: Select features that capture distinct aspects of playing style
- K selection matters: Use elbow method and silhouette scores to find optimal cluster count
- Visualization helps interpretation: Use PCA, t-SNE, or UMAP to visualize high-dimensional clusters
- Similarity search is powerful: Euclidean distance in feature space finds genuinely similar players
- Hierarchical clustering shows relationships: Dendrograms reveal how roles relate at different granularities
- Soft clustering identifies versatility: GMM finds hybrid players who blend multiple roles
- Position-specific features improve results: Tailored features for each position group yield cleaner archetypes
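The soft-clustering takeaway can be made concrete with a small sketch: a GMM fitted to two synthetic archetypes gives a between-roles "hybrid" point meaningful probability for both components, while a point at an archetype's core is assigned almost entirely to one. The two features and all numbers are illustrative, not real player stats.

```python
# Sketch of GMM soft assignment on two toy archetypes.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
poachers = rng.normal([0.8, 0.1], 0.15, size=(60, 2))    # finishing-heavy
playmakers = rng.normal([0.1, 0.8], 0.15, size=(60, 2))  # creation-heavy
X = np.vstack([poachers, playmakers])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
core_probs = gmm.predict_proba([[0.8, 0.1]])[0]      # clear poacher
hybrid_probs = gmm.predict_proba([[0.45, 0.45]])[0]  # blends both roles
print("core:", core_probs.round(3), "hybrid:", hybrid_probs.round(3))
```

Thresholding the maximum responsibility (e.g., below 0.8) is one simple way to flag versatile players.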
Common Player Archetypes
Attacking:
- Goal Poacher
- Complete Forward
- False Nine
- Creative Winger
- Inside Forward
- Target Man
Midfield:
- Defensive Anchor
- Box-to-Box
- Creative Playmaker
- Regista
- Deep Playmaker
- Pressing Midfielder
Defensive:
- Ball-Playing CB
- Traditional Stopper
- Attacking Full-Back
- Defensive Full-Back
- Inverted Full-Back
- Sweeper
Clustering Methods Comparison
| Method | Strengths | Weaknesses | Best For |
|---|---|---|---|
| K-Means | Fast, scalable, easy to interpret | Requires K, assumes spherical clusters | Large datasets, clear archetypes |
| Hierarchical | No K needed, reveals hierarchy | Slow for large data, memory intensive | Understanding role relationships |
| GMM | Soft assignments, handles elliptical clusters | Assumes Gaussian distribution | Finding hybrid players |
| DBSCAN | No K, finds outliers, arbitrary shapes | Sensitive to parameters | Outlier detection, unique players |
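The "reveals hierarchy" cell in the table is worth a quick illustration: one linkage tree fitted on toy data can be cut high for broad groups or low for finer archetypes, without refitting.

```python
# Sketch: one Ward tree, two cut heights (toy 2-D data, four true groups
# arranged as two nearby pairs).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.2, size=(20, 2))
               for c in ([0, 0], [0, 1], [4, 4], [4, 5])])

Z = linkage(X, method="ward")
coarse = fcluster(Z, t=2, criterion="maxclust")  # cut high: 2 broad groups
fine = fcluster(Z, t=4, criterion="maxclust")    # cut low: 4 finer groups
print(len(set(coarse)), "coarse clusters,", len(set(fine)), "fine clusters")
```

In role terms, the coarse cut might separate attackers from defenders while the fine cut separates, say, Goal Poachers from Target Men.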
Key Libraries and Tools
R
- cluster - K-means, silhouette, PAM
- mclust - Gaussian Mixture Models
- factoextra - Cluster visualization
- dendextend - Dendrogram customization
- Rtsne - t-SNE implementation
- umap - UMAP dimensionality reduction
- fpc - Cluster validation metrics
- recipes - Feature preprocessing
Python
- scikit-learn - K-means, GMM, DBSCAN, metrics
- scipy.cluster - Hierarchical clustering
- umap-learn - UMAP implementation
- hdbscan - Robust density clustering
- yellowbrick - Cluster visualization
- pandas - Data manipulation
- matplotlib/seaborn - Visualization
Common Pitfalls to Avoid
- Not scaling features: Features on different scales will dominate distance calculations
- Including correlated features: Redundant features over-weight certain aspects
- Insufficient sample sizes: Require minimum minutes to get stable statistics
- Forcing a specific K: Let the data determine optimal cluster count
- Over-interpreting clusters: Not all clusters have clean semantic interpretations
- Ignoring position: Mixing all positions can create noisy clusters
- Using raw counts: Always normalize to per-90 or percentages
- Not validating stability: Clusters should be consistent across bootstrap samples
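The first pitfall is the easiest to demonstrate. With made-up numbers for three hypothetical players, raw minutes dwarf the per-90 features, so unscaled Euclidean distance ranks a stylistic twin as farther away than a genuinely different profile; standardizing flips the ordering.

```python
# Illustration of the scaling pitfall (toy numbers, not real stats).
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: minutes, xA per 90, progressive passes per 90
a = np.array([2500.0, 0.30, 5.1])  # target player
b = np.array([1200.0, 0.31, 5.0])  # same style, fewer minutes
c = np.array([2500.0, 0.05, 1.2])  # same minutes, different style

raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

X = StandardScaler().fit_transform(np.vstack([a, b, c]))
scaled_ab = np.linalg.norm(X[0] - X[1])
scaled_ac = np.linalg.norm(X[0] - X[2])
print(f"raw: a-b={raw_ab:.1f} vs a-c={raw_ac:.2f}; "
      f"scaled: a-b={scaled_ab:.2f} vs a-c={scaled_ac:.2f}")
```

This is why every pipeline in this chapter standardizes features before computing distances or fitting K-Means.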