
Deep Learning for Football

Deep learning has revolutionized many domains, and football analytics is no exception. Neural networks can learn complex patterns from tracking and event data, improve match outcome predictions, and generate player embeddings that capture playing style. This chapter introduces deep learning fundamentals and their applications to football.

Deep Learning Fundamentals

Deep learning uses neural networks with multiple layers to learn hierarchical representations of data. For football, this enables learning complex patterns from raw event data that traditional methods might miss.

Feedforward Networks

Basic neural networks for tabular data and predictions.

Use: Match prediction, player ratings

Recurrent Networks

Process sequential data with memory of past events.

Use: Event sequences, match progression

Graph Networks

Learn from graph-structured data (players as nodes).

Use: Pass networks, team interactions

dl_fundamentals
# Python: Deep learning with PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class MatchPredictor(nn.Module):
    """Simple feedforward network for match outcome prediction."""

    def __init__(self, input_dim=20, hidden_dims=[64, 32], num_classes=3):
        super().__init__()

        layers = []
        prev_dim = input_dim

        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.3),
                nn.BatchNorm1d(hidden_dim)
            ])
            prev_dim = hidden_dim

        layers.append(nn.Linear(prev_dim, num_classes))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# Create model
model = MatchPredictor(input_dim=20, hidden_dims=[64, 32], num_classes=3)
print(model)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop example
def train_epoch(model, dataloader, criterion, optimizer):
    model.train()
    total_loss = 0

    for batch_x, batch_y in dataloader:
        optimizer.zero_grad()
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    return total_loss / len(dataloader)

# Example training
# for epoch in range(50):
#     loss = train_epoch(model, train_loader, criterion, optimizer)
#     print(f"Epoch {epoch+1}, Loss: {loss:.4f}")
# R: Deep learning with keras/tensorflow
library(keras)
library(tensorflow)

# Simple feedforward network example
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(20)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 3, activation = "softmax")  # 3 classes: W/D/L

# Compile model
model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001),
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)

# Model summary
summary(model)

# Training would be:
# history <- model %>% fit(
#   x_train, y_train,
#   epochs = 50,
#   batch_size = 32,
#   validation_split = 0.2
# )
Output
MatchPredictor(
  (network): Sequential(
    (0): Linear(in_features=20, out_features=64)
    (1): ReLU()
    (2): Dropout(p=0.3)
    (3): BatchNorm1d(64)
    (4): Linear(in_features=64, out_features=32)
    (5): ReLU()
    (6): Dropout(p=0.3)
    (7): BatchNorm1d(32)
    (8): Linear(in_features=32, out_features=3)
  )
)
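The network above outputs raw logits, which is what `CrossEntropyLoss` expects; at inference time a softmax converts them into win/draw/loss probabilities. A minimal NumPy sketch with made-up logits (the values are illustrative, not model output):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits for one match, ordered [home win, draw, away win]
logits = np.array([1.2, 0.3, -0.5])
probs = softmax(logits)

print({label: round(float(p), 3)
       for label, p in zip(["H", "D", "A"], probs)})
# → {'H': 0.629, 'D': 0.256, 'A': 0.115}
```

The same conversion applies to the R model, except that its final `softmax` activation already bakes this step into the network.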

Match Outcome Prediction

Neural networks can combine multiple feature types (team stats, recent form, head-to-head records) to predict match outcomes more accurately than traditional models.

match_prediction
# Python: Match prediction with PyTorch
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

class MatchPredictorWithEmbeddings(nn.Module):
    """
    Neural network for match prediction with team embeddings.
    """

    def __init__(self, num_teams, num_numerical_features,
                 team_embed_dim=16, hidden_dims=[128, 64]):
        super().__init__()

        # Team embeddings
        self.team_embedding = nn.Embedding(num_teams, team_embed_dim)

        # Calculate input dimension for dense layers
        # numerical + home_embed + away_embed
        input_dim = num_numerical_features + 2 * team_embed_dim

        # Build dense layers
        layers = []
        prev_dim = input_dim

        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
                nn.BatchNorm1d(hidden_dim),
                nn.Dropout(0.3)
            ])
            prev_dim = hidden_dim

        # Output layer (3 classes: home win, draw, away win)
        layers.append(nn.Linear(prev_dim, 3))

        self.classifier = nn.Sequential(*layers)

    def forward(self, numerical_features, home_team_id, away_team_id):
        # Get team embeddings
        home_embed = self.team_embedding(home_team_id)
        away_embed = self.team_embedding(away_team_id)

        # Concatenate all features
        x = torch.cat([numerical_features, home_embed, away_embed], dim=1)

        # Pass through classifier
        return self.classifier(x)

# Prepare data
def prepare_match_data(matches_df):
    """Prepare features for match prediction."""

    # Encode teams
    le = LabelEncoder()
    all_teams = pd.concat([matches_df["home_team"], matches_df["away_team"]])
    le.fit(all_teams)

    matches_df["home_team_id"] = le.transform(matches_df["home_team"])
    matches_df["away_team_id"] = le.transform(matches_df["away_team"])

    # Numerical features
    numerical_cols = [
        "home_xg_avg", "home_xga_avg", "away_xg_avg", "away_xga_avg",
        "home_form", "away_form", "home_goals_avg", "away_goals_avg",
        "home_shots_avg", "away_shots_avg"
    ]

    X_numerical = matches_df[numerical_cols].values
    X_home_team = matches_df["home_team_id"].values
    X_away_team = matches_df["away_team_id"].values

    # Target: 0=away win, 1=draw, 2=home win
    y = matches_df["result"].map({"H": 2, "D": 1, "A": 0}).values

    return X_numerical, X_home_team, X_away_team, y, le

# Training function
def train_match_predictor(model, train_loader, val_loader, epochs=50):
    """Train match prediction model."""

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, patience=5, factor=0.5
    )

    best_val_acc = 0
    history = {"train_loss": [], "val_loss": [], "val_acc": []}

    for epoch in range(epochs):
        # Training
        model.train()
        train_loss = 0

        for numerical, home_id, away_id, target in train_loader:
            optimizer.zero_grad()
            output = model(numerical, home_id, away_id)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        # Validation
        model.eval()
        val_loss = 0
        correct = 0
        total = 0

        with torch.no_grad():
            for numerical, home_id, away_id, target in val_loader:
                output = model(numerical, home_id, away_id)
                val_loss += criterion(output, target).item()
                _, predicted = torch.max(output, 1)
                total += target.size(0)
                correct += (predicted == target).sum().item()

        val_acc = correct / total
        scheduler.step(val_loss)

        history["train_loss"].append(train_loss / len(train_loader))
        history["val_loss"].append(val_loss / len(val_loader))
        history["val_acc"].append(val_acc)

        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), "best_match_model.pt")

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}: Val Acc = {val_acc:.3f}")

    return history

# Example usage
model = MatchPredictorWithEmbeddings(
    num_teams=40,
    num_numerical_features=10,
    team_embed_dim=16
)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
# R: Match prediction with keras
library(keras)
library(tidyverse)

# Prepare match features
prepare_match_features <- function(matches) {
  matches %>%
    mutate(
      # Team strength metrics
      home_xg_for_avg = home_xg_for / matches_played,
      home_xg_against_avg = home_xg_against / matches_played,
      away_xg_for_avg = away_xg_for / matches_played,
      away_xg_against_avg = away_xg_against / matches_played,

      # Form (last 5 matches)
      home_form = home_points_l5 / 15,  # Normalized 0-1
      away_form = away_points_l5 / 15,

      # Historical head-to-head
      h2h_home_win_rate = h2h_home_wins / (h2h_home_wins + h2h_draws + h2h_away_wins + 0.1),

      # Relative strength
      xg_diff = home_xg_for_avg - away_xg_for_avg,
      form_diff = home_form - away_form
    ) %>%
    select(home_xg_for_avg, home_xg_against_avg, away_xg_for_avg,
           away_xg_against_avg, home_form, away_form,
           h2h_home_win_rate, xg_diff, form_diff)
}

# Build model with embedding for categorical features
build_match_model <- function(num_numerical = 9, num_teams = 40,
                              team_embed_dim = 8) {
  # Numerical input
  numerical_input <- layer_input(shape = num_numerical, name = "numerical")

  # Team embedding inputs
  home_team_input <- layer_input(shape = 1, name = "home_team")
  away_team_input <- layer_input(shape = 1, name = "away_team")

  # Embedding layer (shared)
  team_embedding <- layer_embedding(
    input_dim = num_teams,
    output_dim = team_embed_dim,
    name = "team_embedding"
  )

  home_embed <- home_team_input %>%
    team_embedding() %>%
    layer_flatten()

  away_embed <- away_team_input %>%
    team_embedding() %>%
    layer_flatten()

  # Concatenate all features
  combined <- layer_concatenate(list(numerical_input, home_embed, away_embed))

  # Dense layers
  output <- combined %>%
    layer_dense(64, activation = "relu") %>%
    layer_dropout(0.3) %>%
    layer_dense(32, activation = "relu") %>%
    layer_dense(3, activation = "softmax")  # W/D/L

  keras_model(
    inputs = list(numerical_input, home_team_input, away_team_input),
    outputs = output
  )
}

model <- build_match_model()
summary(model)
Output
Model parameters: 14,851
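`nn.Embedding` is just a trainable lookup table, and the classifier's input is the numerical features concatenated with the two looked-up rows. A NumPy sketch of that wiring with a random stand-in table and hypothetical team indices, confirming the `input_dim` arithmetic (10 + 2 × 16 = 42):

```python
import numpy as np

rng = np.random.default_rng(0)

num_teams, embed_dim, num_numerical = 40, 16, 10
embedding_table = rng.normal(size=(num_teams, embed_dim))  # stands in for nn.Embedding

numerical = rng.normal(size=(num_numerical,))  # one match's numerical features
home_id, away_id = 3, 27                       # hypothetical team indices

# Embedding lookup is plain row indexing; the classifier sees the concatenation
x = np.concatenate([numerical, embedding_table[home_id], embedding_table[away_id]])
print(x.shape)  # (42,) == num_numerical + 2 * embed_dim
```

During training, gradients flow back into the indexed rows, so teams that behave similarly drift toward similar embedding vectors.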

Sequence Models for Event Data

Football matches are sequences of events. Recurrent Neural Networks (RNNs), particularly LSTMs and GRUs, can model these sequences to predict future events or classify possession outcomes.

sequence_models
# Python: LSTM for event sequence modeling
import torch
import torch.nn as nn
import numpy as np
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class EventSequenceModel(nn.Module):
    """
    LSTM model for predicting possession outcomes from event sequences.
    """

    def __init__(self, num_event_types=20, embed_dim=32, hidden_dim=64,
                 num_layers=2, dropout=0.3, num_spatial_features=4):
        super().__init__()

        # Event type embedding
        self.event_embedding = nn.Embedding(num_event_types, embed_dim)

        # Spatial features (x, y coordinates, distance, angle)
        self.spatial_dim = num_spatial_features

        # LSTM
        lstm_input_dim = embed_dim + num_spatial_features
        self.lstm = nn.LSTM(
            input_size=lstm_input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=True
        )

        # Output layers
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, 64),  # *2 for bidirectional
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )

    def forward(self, event_types, spatial_features, lengths):
        """
        Args:
            event_types: (batch, max_seq_len) event type indices
            spatial_features: (batch, max_seq_len, 4) x,y,dist,angle
            lengths: (batch,) actual sequence lengths
        """
        # Get embeddings
        event_embeds = self.event_embedding(event_types)

        # Concatenate with spatial features
        x = torch.cat([event_embeds, spatial_features], dim=-1)

        # Pack sequences
        packed = pack_padded_sequence(x, lengths.cpu(),
                                      batch_first=True,
                                      enforce_sorted=False)

        # LSTM forward
        packed_output, (hidden, cell) = self.lstm(packed)

        # Use final hidden state (concatenate forward and backward)
        hidden_concat = torch.cat([hidden[-2], hidden[-1]], dim=1)

        # Classify
        return self.classifier(hidden_concat)

# Event vocabulary
EVENT_VOCAB = {
    "Pass": 0, "Carry": 1, "Dribble": 2, "Shot": 3, "Cross": 4,
    "Clearance": 5, "Tackle": 6, "Interception": 7, "Foul": 8,
    "Ball Receipt": 9, "Pressure": 10, "Block": 11
}

def prepare_possession_sequences(events_df, max_length=30):
    """Convert event data to sequences for training."""

    sequences = []

    for poss_id, poss_events in events_df.groupby("possession_id"):
        # Get event types
        event_types = [EVENT_VOCAB.get(e, 0) for e in poss_events["type"]]

        # Get spatial features (normalized to 0-1)
        spatial = poss_events[["x", "y"]].values / 100

        # Add distance and angle to goal
        goal_x, goal_y = 100, 50
        distances = np.sqrt((goal_x - poss_events["x"])**2 +
                            (goal_y - poss_events["y"])**2) / 100
        angles = np.arctan2(goal_y - poss_events["y"],
                            goal_x - poss_events["x"]) / np.pi

        spatial = np.column_stack([spatial, distances, angles])

        # Truncate or pad
        seq_len = min(len(event_types), max_length)
        event_types = event_types[:max_length]
        spatial = spatial[:max_length]

        # Padding
        if len(event_types) < max_length:
            pad_len = max_length - len(event_types)
            event_types.extend([0] * pad_len)
            spatial = np.vstack([spatial, np.zeros((pad_len, 4))])

        # Target: did possession end in shot?
        target = int(poss_events["type"].iloc[-1] == "Shot")

        sequences.append({
            "event_types": event_types,
            "spatial": spatial,
            "length": seq_len,
            "target": target
        })

    return sequences

# Create model
model = EventSequenceModel(
    num_event_types=len(EVENT_VOCAB),
    embed_dim=32,
    hidden_dim=64
)
print(f"Sequence model parameters: {sum(p.numel() for p in model.parameters()):,}")
# R: LSTM for sequence prediction
library(keras)

# Build LSTM model for possession outcome prediction
build_lstm_model <- function(vocab_size = 50, embed_dim = 32,
                             lstm_units = 64, max_seq_length = 30) {
  model <- keras_model_sequential() %>%
    # Embedding for event types
    layer_embedding(input_dim = vocab_size, output_dim = embed_dim,
                    input_length = max_seq_length) %>%

    # LSTM layers
    layer_lstm(units = lstm_units, return_sequences = TRUE) %>%
    layer_dropout(0.3) %>%
    layer_lstm(units = lstm_units %/% 2) %>%
    layer_dropout(0.2) %>%

    # Output: probability of shot/goal at end of possession
    layer_dense(32, activation = "relu") %>%
    layer_dense(1, activation = "sigmoid")

  model %>% compile(
    optimizer = "adam",
    loss = "binary_crossentropy",
    metrics = c("accuracy", "AUC")
  )

  model
}

# Prepare sequence data
prepare_event_sequences <- function(events, max_length = 30) {
  # Event type encoding
  event_vocab <- c("Pass", "Carry", "Dribble", "Shot", "Cross",
                   "Clearance", "Tackle", "Interception", "Foul")

  # Convert possessions to sequences
  sequences <- events %>%
    group_by(possession_id) %>%
    arrange(event_id) %>%
    summarise(
      event_seq = list(match(type, event_vocab)),
      ends_in_shot = any(type == "Shot"),
      .groups = "drop"
    )

  # Pad sequences
  # pad_sequences() would be used here

  sequences
}
Output
Sequence model parameters: 53,569
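The truncate-and-pad step in `prepare_possession_sequences` is easy to get wrong, so here it is isolated on two toy possessions (note the code above reuses id 0 both for `Pass` and for padding, which a dedicated pad token would avoid):

```python
import numpy as np

MAX_LEN = 6
PAD_ID = 0  # the code above also reuses 0 for padding

# Two hypothetical possessions of different lengths (event-type ids)
possessions = [[0, 1, 0, 3], [1, 0, 9, 0, 1, 0, 0, 3]]

event_ids = np.full((len(possessions), MAX_LEN), PAD_ID, dtype=np.int64)
lengths = np.zeros(len(possessions), dtype=np.int64)

for i, seq in enumerate(possessions):
    n = min(len(seq), MAX_LEN)       # truncate long possessions
    event_ids[i, :n] = seq[:n]       # left-aligned, zero-padded
    lengths[i] = n

print(event_ids)
print(lengths)  # [4 6] -- what pack_padded_sequence consumes
```

`pack_padded_sequence` then uses `lengths` to skip the padded tail, so the LSTM never updates its hidden state on padding.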

Player Embeddings

Player embeddings are dense vector representations that capture playing style. Similar to word embeddings in NLP, they enable similarity search, clustering, and transfer learning.

player_embeddings
# Python: Player embeddings with neural networks
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.preprocessing import StandardScaler
import numpy as np

class PlayerEmbeddingModel(nn.Module):
    """
    Learn player embeddings from their statistics.
    Uses an autoencoder architecture to compress stats into embeddings.
    """

    def __init__(self, input_dim, embed_dim=32, hidden_dims=[128, 64]):
        super().__init__()

        # Encoder
        encoder_layers = []
        prev_dim = input_dim

        for hidden_dim in hidden_dims:
            encoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
                nn.BatchNorm1d(hidden_dim),
                nn.Dropout(0.2)
            ])
            prev_dim = hidden_dim

        encoder_layers.append(nn.Linear(prev_dim, embed_dim))
        self.encoder = nn.Sequential(*encoder_layers)

        # Decoder (mirror of encoder)
        decoder_layers = []
        prev_dim = embed_dim

        for hidden_dim in reversed(hidden_dims):
            decoder_layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
                nn.BatchNorm1d(hidden_dim)
            ])
            prev_dim = hidden_dim

        decoder_layers.append(nn.Linear(prev_dim, input_dim))
        self.decoder = nn.Sequential(*decoder_layers)

    def forward(self, x):
        embedding = self.encoder(x)
        reconstruction = self.decoder(embedding)
        return reconstruction, embedding

    def get_embedding(self, x):
        """Get only the embedding without reconstruction."""
        return self.encoder(x)

class ContrastivePlayerEmbedding(nn.Module):
    """
    Learn player embeddings using contrastive learning.
    Similar players (same position/role) should have similar embeddings.
    """

    def __init__(self, input_dim, embed_dim=32):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.BatchNorm1d(128),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, embed_dim)
        )

        # Projection head for contrastive learning
        self.projection = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, x):
        embedding = self.encoder(x)
        projection = self.projection(embedding)
        return F.normalize(projection, dim=1), embedding

def train_contrastive(model, dataloader, epochs=100, temperature=0.1):
    """Train with NT-Xent contrastive loss."""

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(epochs):
        total_loss = 0

        for batch in dataloader:
            # batch contains (anchor, positive) pairs
            anchor, positive = batch

            optimizer.zero_grad()

            # Get projections
            z_anchor, _ = model(anchor)
            z_positive, _ = model(positive)

            # NT-Xent loss
            similarity = torch.mm(z_anchor, z_positive.t()) / temperature
            labels = torch.arange(z_anchor.size(0))
            loss = F.cross_entropy(similarity, labels)

            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1}, Loss: {total_loss/len(dataloader):.4f}")

# Finding similar players
def find_similar_players(target_embedding, all_embeddings, player_names, k=5):
    """Find k most similar players based on embedding similarity."""

    # Cosine similarity
    similarities = F.cosine_similarity(
        target_embedding.unsqueeze(0),
        all_embeddings
    )

    # Get top k
    top_k = torch.topk(similarities, k=k)

    results = []
    for idx, sim in zip(top_k.indices, top_k.values):
        results.append({
            "player": player_names[idx],
            "similarity": sim.item()
        })

    return results

# Example usage
model = PlayerEmbeddingModel(input_dim=50, embed_dim=32)
print(f"Embedding model parameters: {sum(p.numel() for p in model.parameters()):,}")
# R: Player embeddings concept
library(tidyverse)

# Player embedding idea:
# Learn a low-dimensional representation from high-dimensional stats

# Method 1: Autoencoder for dimensionality reduction
build_player_autoencoder <- function(input_dim, embed_dim = 16) {
  # Encoder
  encoder_input <- layer_input(shape = input_dim)
  encoded <- encoder_input %>%
    layer_dense(64, activation = "relu") %>%
    layer_dense(32, activation = "relu") %>%
    layer_dense(embed_dim, activation = "linear", name = "embedding")

  # Decoder
  decoded <- encoded %>%
    layer_dense(32, activation = "relu") %>%
    layer_dense(64, activation = "relu") %>%
    layer_dense(input_dim, activation = "linear")

  # Full autoencoder
  autoencoder <- keras_model(encoder_input, decoded)

  # Encoder only (for extracting embeddings)
  encoder <- keras_model(encoder_input, encoded)

  list(autoencoder = autoencoder, encoder = encoder)
}

# Train autoencoder
# models <- build_player_autoencoder(input_dim = 50, embed_dim = 16)
# models$autoencoder %>% compile(optimizer = "adam", loss = "mse")
# models$autoencoder %>% fit(player_stats, player_stats, epochs = 100)

# Extract embeddings
# player_embeddings <- models$encoder %>% predict(player_stats)
Output
Embedding model parameters: 18,434
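Once embeddings are extracted, `find_similar_players` reduces to cosine similarity over rows of an embedding matrix. A NumPy sketch with hand-made 4-dimensional vectors (player names and values are purely illustrative):

```python
import numpy as np

def cosine_sim(a, B):
    """Cosine similarity between one vector and each row of B."""
    a = a / np.linalg.norm(a)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return B @ a

# Hypothetical 4-d embeddings for five players
players = ["Kroos", "Modric", "Kante", "Rodri", "Haaland"]
E = np.array([
    [0.9, 0.1, 0.2, 0.0],
    [0.8, 0.2, 0.3, 0.1],
    [0.1, 0.9, 0.1, 0.0],
    [0.7, 0.4, 0.2, 0.1],
    [0.0, 0.1, 0.1, 0.9],
])

target = E[0]                          # query: Kroos
sims = cosine_sim(target, E)
order = np.argsort(-sims)[1:3]         # skip index 0, the player himself
print([(players[i], round(float(sims[i]), 3)) for i in order])
# → [('Modric', 0.977), ('Rodri', 0.915)]
```

This is the same computation `F.cosine_similarity` plus `torch.topk` perform in the PyTorch version, minus the self-match filtering shown here.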

Graph Neural Networks

Graph Neural Networks (GNNs) are naturally suited for football analysis where players form networks through passes and spatial relationships. GNNs can learn team-level representations that capture interaction patterns.

gnn
# Python: Graph Neural Networks with PyTorch Geometric
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GATConv, global_mean_pool
from torch_geometric.data import Data, Batch

class FootballGNN(nn.Module):
    """
    Graph Neural Network for team-level analysis.
    Nodes are players, edges are passes.
    """

    def __init__(self, node_features, hidden_dim=64, output_dim=32,
                 num_layers=3, dropout=0.3):
        super().__init__()

        self.convs = nn.ModuleList()
        self.bns = nn.ModuleList()

        # First layer
        self.convs.append(GCNConv(node_features, hidden_dim))
        self.bns.append(nn.BatchNorm1d(hidden_dim))

        # Hidden layers
        for _ in range(num_layers - 2):
            self.convs.append(GCNConv(hidden_dim, hidden_dim))
            self.bns.append(nn.BatchNorm1d(hidden_dim))

        # Final conv layer
        self.convs.append(GCNConv(hidden_dim, output_dim))

        self.dropout = dropout

    def forward(self, x, edge_index, batch=None):
        """
        Args:
            x: Node features (num_nodes, node_features)
            edge_index: Edge connections (2, num_edges)
            batch: Batch assignment for multiple graphs
        """
        for conv, bn in zip(self.convs[:-1], self.bns):
            x = conv(x, edge_index)
            x = bn(x)
            x = F.relu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)

        x = self.convs[-1](x, edge_index)

        # Global pooling if processing multiple graphs
        if batch is not None:
            x = global_mean_pool(x, batch)

        return x

class TeamMatchPredictor(nn.Module):
    """
    Predict match outcome from two team graphs.
    """

    def __init__(self, node_features, hidden_dim=64):
        super().__init__()

        # GNN for encoding teams
        self.team_encoder = FootballGNN(
            node_features=node_features,
            hidden_dim=hidden_dim,
            output_dim=32
        )

        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(64, 32),  # 32*2 for both teams
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(32, 3)  # H/D/A
        )

    def forward(self, home_graph, away_graph):
        # Encode both teams
        home_embed = self.team_encoder(
            home_graph.x, home_graph.edge_index, home_graph.batch
        )
        away_embed = self.team_encoder(
            away_graph.x, away_graph.edge_index, away_graph.batch
        )

        # Concatenate and classify
        combined = torch.cat([home_embed, away_embed], dim=1)
        return self.classifier(combined)

def create_team_graph(events_df, players_df):
    """Create PyTorch Geometric graph from match data."""

    # Node features (per player)
    player_ids = players_df["player_id"].unique()
    id_to_idx = {pid: i for i, pid in enumerate(player_ids)}

    # Build node feature matrix
    node_features = []
    for pid in player_ids:
        player_data = players_df[players_df["player_id"] == pid].iloc[0]
        features = [
            player_data["x_avg"] / 100,
            player_data["y_avg"] / 100,
            player_data["passes"] / 50,
            player_data["touches"] / 100,
            player_data["duels_won_pct"]
        ]
        node_features.append(features)

    x = torch.tensor(node_features, dtype=torch.float)

    # Build edge index from passes
    passes = events_df[events_df["type"] == "Pass"]
    pass_counts = passes.groupby(["player_id", "recipient_id"]).size()

    edges = []
    edge_weights = []

    for (passer, receiver), count in pass_counts.items():
        if passer in id_to_idx and receiver in id_to_idx:
            edges.append([id_to_idx[passer], id_to_idx[receiver]])
            edge_weights.append(count)

    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    edge_attr = torch.tensor(edge_weights, dtype=torch.float).unsqueeze(1)

    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr)

# Create model
model = FootballGNN(node_features=5, hidden_dim=64)
print(f"GNN parameters: {sum(p.numel() for p in model.parameters()):,}")
# R: Graph neural networks concept
# GNNs in R typically use Python via reticulate

library(reticulate)

# Conceptual structure for football GNN:
# - Nodes: Players
# - Edges: Passes / spatial proximity
# - Node features: Player stats, position
# - Edge features: Pass count, distance

# Graph structure representation
create_match_graph <- function(events, players) {
  # Build adjacency from passes
  pass_edges <- events %>%
    filter(type == "Pass") %>%
    group_by(from = player_id, to = recipient_id) %>%
    summarise(weight = n(), .groups = "drop")

  # Node features (player stats for this match)
  node_features <- players %>%
    select(player_id, x_avg, y_avg, passes, touches, duels_won)

  list(
    edges = pass_edges,
    nodes = node_features
  )
}
Output
GNN parameters: 12,768
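The core of each `GCNConv` layer is neighbourhood aggregation over the pass network. A simplified NumPy sketch of one message-passing step on a toy 4-player graph, using row-normalised adjacency with self-loops (GCN proper uses symmetric normalisation plus a learned linear map on top):

```python
import numpy as np

# Toy pass network: 4 players, entry (i, j) = passes from i to j
A = np.array([
    [0, 5, 2, 0],
    [3, 0, 4, 1],
    [1, 2, 0, 6],
    [0, 1, 3, 0],
], dtype=float)

X = np.array([                     # per-player features, e.g. (x_avg, y_avg)
    [0.2, 0.5],
    [0.4, 0.3],
    [0.6, 0.7],
    [0.8, 0.5],
])

# One (simplified) message-passing step: add self-loops, row-normalise,
# then average each player's neighbourhood features
A_hat = A + np.eye(4)
A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)
H = A_norm @ X

print(H.round(3))
```

Stacking several such steps lets information propagate along multi-pass chains, which is what gives the GNN its team-level view.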

Attention Mechanisms

Attention mechanisms allow models to focus on relevant parts of the input. For football, this helps identify key events in a sequence or important player interactions.

attention
# Python: Transformer attention for event sequences
import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    """Multi-head self-attention layer."""

    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()

        assert d_model % num_heads == 0

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.size()

        # Linear projections
        Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # Attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            # mask: (batch, seq_len) with 0 at padded positions; broadcast
            # across heads and query positions before masking
            scores = scores.masked_fill(mask[:, None, None, :] == 0, -1e9)

        attention_weights = torch.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)

        # Apply attention to values
        context = torch.matmul(attention_weights, V)

        # Concatenate heads
        context = context.transpose(1, 2).contiguous().view(
            batch_size, seq_len, self.d_model
        )

        return self.W_o(context), attention_weights

class EventTransformer(nn.Module):
    """
    Transformer model for football event sequences.
    Uses attention to identify important events.
    """

    def __init__(self, num_event_types, d_model=64, num_heads=4,
                 num_layers=3, dropout=0.1):
        super().__init__()

        self.d_model = d_model

        # Event embedding
        self.event_embedding = nn.Embedding(num_event_types, d_model)

        # Positional encoding
        self.pos_embedding = nn.Embedding(500, d_model)  # Max sequence length

        # Transformer layers
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=d_model,
                nhead=num_heads,
                dim_feedforward=d_model * 4,
                dropout=dropout,
                batch_first=True
            )
            for _ in range(num_layers)
        ])

        # Custom attention for interpretability
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)

        # Output
        self.classifier = nn.Sequential(
            nn.Linear(d_model, 32),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, event_types, mask=None):
        batch_size, seq_len = event_types.size()

        # Get embeddings
        positions = torch.arange(seq_len, device=event_types.device)
        x = self.event_embedding(event_types) + self.pos_embedding(positions)

        # Apply transformer layers
        for layer in self.layers:
            x = layer(x, src_key_padding_mask=mask)

        # Get attention weights for interpretability; the custom layer
        # masks where mask == 0, whereas src_key_padding_mask uses
        # True = padding, so the convention is inverted here
        _, attention_weights = self.attention(x, ~mask if mask is not None else None)

        # Pool and classify (use CLS-like approach with first token or mean)
        pooled = x.mean(dim=1)
        output = self.classifier(pooled)

        return output, attention_weights

# Example: Interpret which events matter
def interpret_attention(model, event_sequence, event_names):
    """Visualize attention weights to understand model focus."""

    model.eval()
    with torch.no_grad():
        output, attention = model(event_sequence.unsqueeze(0))

    # Average attention across heads
    avg_attention = attention.mean(dim=1).squeeze()

    # Get attention to each event
    event_importance = avg_attention.mean(dim=0)

    print("Event Importance (Attention Weights):")
    for name, weight in zip(event_names, event_importance):
        bar = "█" * int(weight * 50)
        print(f"  {name:15s} {weight:.3f} {bar}")

# Create model
transformer = EventTransformer(num_event_types=20)
print(f"Transformer parameters: {sum(p.numel() for p in transformer.parameters()):,}")
# R: Attention concept
library(tidyverse)
# Attention weights show which events matter most

# Self-attention formula:
# Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

# For football sequences:
# - Q, K, V derived from event embeddings
# - Attention weights reveal important events
# - High weights on shots, key passes, etc.

attention_weights_example <- function() {
  # Example attention weights for a possession
  events <- c("Pass", "Pass", "Dribble", "Pass", "Shot")
  weights <- c(0.05, 0.08, 0.15, 0.22, 0.50)

  tibble(event = events, weight = weights) %>%
    ggplot(aes(x = seq_along(event), y = weight, fill = event)) +
    geom_col() +
    labs(title = "Attention Weights in Possession",
         x = "Event Order", y = "Attention Weight")
}
Output
Transformer parameters: 67,937

Event Importance (Attention Weights):
  Pass            0.082 ████
  Pass            0.095 ████
  Carry           0.124 ██████
  Dribble         0.189 █████████
  Shot            0.510 █████████████████████████
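
The self-attention formula in the R comments above can be checked numerically. The sketch below (NumPy, with made-up embeddings for a three-event sequence) computes softmax(QK^T / sqrt(d_k)) V and confirms that each row of the attention matrix is a proper probability distribution; all values are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key axis
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy embeddings for a 3-event sequence (e.g., pass, dribble, shot)
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(3))   # each row sums to 1
print(output.shape)       # same shape as V
```

In the transformer above, Q, K, and V are learned projections of the event embeddings rather than the embeddings themselves, but the weight computation is identical.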

Deep Learning for Expected Goals

Traditional xG models use logistic regression or gradient boosting. Deep learning can capture more complex spatial patterns and context from the sequence of events leading to a shot.

deep_xg
# Python: Deep xG model with context
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, brier_score_loss

class DeepXGModel(nn.Module):
    """
    Neural network xG model that considers:
    1. Shot location and characteristics
    2. Sequence of preceding events
    3. Game state context
    """

    def __init__(self, num_shot_features=12, num_event_types=20,
                 event_embed_dim=16, lstm_hidden=32):
        super().__init__()

        # Event embedding for sequence
        self.event_embedding = nn.Embedding(num_event_types, event_embed_dim)

        # LSTM for event sequence
        self.lstm = nn.LSTM(
            input_size=event_embed_dim + 4,  # embed + x,y,dx,dy
            hidden_size=lstm_hidden,
            num_layers=2,
            batch_first=True,
            dropout=0.3
        )

        # Shot feature processing
        self.shot_encoder = nn.Sequential(
            nn.Linear(num_shot_features, 64),
            nn.ReLU(),
            nn.BatchNorm1d(64),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU()
        )

        # Combined classifier
        combined_dim = 32 + lstm_hidden  # shot + sequence
        self.classifier = nn.Sequential(
            nn.Linear(combined_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, shot_features, event_types, event_locations, lengths):
        """
        Args:
            shot_features: (batch, num_shot_features) shot characteristics
            event_types: (batch, max_seq_len) event type indices
            event_locations: (batch, max_seq_len, 4) x,y,dx,dy
            lengths: (batch,) actual sequence lengths
        """
        # Encode shot
        shot_encoded = self.shot_encoder(shot_features)

        # Encode event sequence
        event_embeds = self.event_embedding(event_types)
        sequence_input = torch.cat([event_embeds, event_locations], dim=-1)

        # Pack and process
        packed = nn.utils.rnn.pack_padded_sequence(
            sequence_input, lengths.cpu(),
            batch_first=True, enforce_sorted=False
        )
        _, (hidden, _) = self.lstm(packed)

        # Use final hidden state
        sequence_encoded = hidden[-1]

        # Combine and classify
        combined = torch.cat([shot_encoded, sequence_encoded], dim=1)
        return self.classifier(combined)

def prepare_shot_features(shots_df):
    """Extract features for each shot."""

    # Goal location (center of goal)
    goal_x, goal_y = 100, 50

    features = pd.DataFrame({
        # Location (normalized)
        "x": shots_df["x"] / 100,
        "y": shots_df["y"] / 100,

        # Distance and angle to goal centre (a simplification; not the goal-mouth opening angle)
        "distance": np.sqrt(
            (goal_x - shots_df["x"])**2 + (goal_y - shots_df["y"])**2
        ) / 100,
        "angle": np.abs(np.arctan2(
            goal_y - shots_df["y"],
            goal_x - shots_df["x"]
        )),

        # Shot type (one-hot)
        "header": (shots_df["body_part"] == "Head").astype(int),
        "right_foot": (shots_df["body_part"] == "Right Foot").astype(int),
        "left_foot": (shots_df["body_part"] == "Left Foot").astype(int),

        # Context
        "under_pressure": shots_df["under_pressure"].fillna(0).astype(int),
        "first_time": shots_df["first_time"].fillna(0).astype(int),
        "counter": (shots_df["play_pattern"] == "From Counter").astype(int),
        "set_piece": shots_df["play_pattern"].str.contains("Set").fillna(False).astype(int),

        # Preceding event distance
        "prev_event_dist": shots_df.get("prev_distance", 0) / 100
    })

    return features.values

def evaluate_xg_model(model, test_loader, device="cpu"):
    """Evaluate xG model performance."""
    model.eval()
    predictions = []
    actuals = []

    with torch.no_grad():
        for batch in test_loader:
            shot_feat, event_types, event_locs, lengths, target = batch
            shot_feat = shot_feat.to(device)
            event_types = event_types.to(device)
            event_locs = event_locs.to(device)

            pred = model(shot_feat, event_types, event_locs, lengths)
            predictions.extend(pred.cpu().numpy().flatten())
            actuals.extend(target.numpy().flatten())

    predictions = np.array(predictions)
    actuals = np.array(actuals)

    return {
        "auc": roc_auc_score(actuals, predictions),
        "brier": brier_score_loss(actuals, predictions),
        "log_loss": -np.mean(
            actuals * np.log(predictions + 1e-7) +
            (1 - actuals) * np.log(1 - predictions + 1e-7)
        )
    }

# Example
model = DeepXGModel()
print(f"Deep xG model parameters: {sum(p.numel() for p in model.parameters()):,}")
# R: Deep xG model with keras
library(keras)
library(tidyverse)

# Build deep xG model
build_deep_xg_model <- function() {
    # Shot features input
    shot_input <- layer_input(shape = 10, name = "shot_features")

    # Sequence of preceding events (LSTM)
    sequence_input <- layer_input(shape = c(10, 8), name = "event_sequence")

    # Process sequence
    sequence_processed <- sequence_input %>%
        layer_lstm(32, return_sequences = FALSE) %>%
        layer_dropout(0.3)

    # Process shot features
    shot_processed <- shot_input %>%
        layer_dense(32, activation = "relu") %>%
        layer_dropout(0.2)

    # Combine
    combined <- layer_concatenate(list(shot_processed, sequence_processed))

    # Output probability
    output <- combined %>%
        layer_dense(32, activation = "relu") %>%
        layer_dropout(0.2) %>%
        layer_dense(1, activation = "sigmoid")

    model <- keras_model(
        inputs = list(shot_input, sequence_input),
        outputs = output
    )

    model %>% compile(
        optimizer = "adam",
        loss = "binary_crossentropy",
        metrics = c("AUC")
    )

    model
}

# Prepare xG features
prepare_xg_features <- function(shots_df) {
    shots_df %>%
        mutate(
            # Distance and angle
            distance_to_goal = sqrt((100 - x)^2 + (50 - y)^2),
            angle_to_goal = atan2(50 - y, 100 - x) * 180 / pi,

            # Shot characteristics (normalized)
            x_norm = x / 100,
            y_norm = y / 100,
            distance_norm = distance_to_goal / 100,

            # Categorical encodings
            body_part_head = as.integer(body_part == "Head"),
            shot_type_penalty = as.integer(shot_type == "Penalty"),

            # Context
            under_pressure = as.integer(under_pressure),
            counter_attack = as.integer(play_pattern == "From Counter")
        )
}
Output
Deep xG model parameters: 26,177
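
Beyond AUC and Brier score, an xG model should be calibrated: shots with a predicted xG of about 0.2 should score roughly 20% of the time. A minimal calibration check is sketched below; the predictions and outcomes are synthetic (drawn so that outcomes match the stated probabilities) purely for illustration.

```python
import numpy as np

def calibration_table(predictions, outcomes, n_bins=5):
    """Mean predicted xG vs observed goal rate per probability bin."""
    bins = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(predictions, bins) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((b, int(mask.sum()),
                         predictions[mask].mean(),
                         outcomes[mask].mean()))
    return rows

# Synthetic, roughly calibrated data: xG-like predictions are mostly low
rng = np.random.default_rng(42)
preds = rng.beta(2, 8, size=5000)
goals = (rng.random(5000) < preds).astype(int)

for b, n, mean_pred, goal_rate in calibration_table(preds, goals):
    print(f"bin {b}: n={n:4d}  mean xG={mean_pred:.3f}  goal rate={goal_rate:.3f}")
```

For a well-calibrated model the two right-hand columns track each other closely in every well-populated bin.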

Transfer Learning for Football

Transfer learning uses pre-trained models as starting points. For football, this includes pre-trained vision models for video analysis and language models for text data. We can also transfer embeddings between leagues or seasons.

transfer_learning
# Python: Transfer learning strategies
import torch
import torch.nn as nn
from torchvision import models
from torch.utils.data import DataLoader

class TransferLearningPlayer(nn.Module):
    """
    Transfer learning for player classification from jersey images.
    Uses pre-trained ResNet as feature extractor.
    """

    def __init__(self, num_players, freeze_backbone=True):
        super().__init__()

        # Load pre-trained ResNet (torchvision >= 0.13 weights API)
        self.backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

        # Freeze backbone if specified
        if freeze_backbone:
            for param in self.backbone.parameters():
                param.requires_grad = False

        # Replace final layer
        num_features = self.backbone.fc.in_features
        self.backbone.fc = nn.Identity()

        # Custom head for player classification
        self.classifier = nn.Sequential(
            nn.Linear(num_features, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_players)
        )

    def forward(self, x):
        features = self.backbone(x)
        return self.classifier(features)

    def unfreeze_backbone(self, num_layers=0):
        """Unfreeze last n layers for fine-tuning."""
        layers = list(self.backbone.children())

        if num_layers == 0:
            # Unfreeze all
            for param in self.backbone.parameters():
                param.requires_grad = True
        else:
            # Unfreeze last n layers
            for layer in layers[-num_layers:]:
                for param in layer.parameters():
                    param.requires_grad = True

class EmbeddingTransfer:
    """
    Transfer player embeddings between leagues or seasons.
    Uses common players to learn a mapping.
    """

    def __init__(self, source_embeddings, target_embeddings):
        self.source = source_embeddings  # dict: player_id -> embedding
        self.target = target_embeddings

    def find_common_players(self):
        """Find players present in both source and target."""
        source_ids = set(self.source.keys())
        target_ids = set(self.target.keys())
        return source_ids.intersection(target_ids)

    def learn_mapping(self, common_players):
        """Learn linear mapping from source to target space."""

        # Gather paired embeddings
        X_source = torch.stack([self.source[p] for p in common_players])
        X_target = torch.stack([self.target[p] for p in common_players])

        # Learn linear transformation W: source -> target
        # Using least squares: W = (X_s^T X_s)^-1 X_s^T X_t
        self.W = torch.linalg.lstsq(X_source, X_target).solution

        return self.W

    def transfer_embedding(self, source_embedding):
        """Map source embedding to target space."""
        return source_embedding @ self.W

class DomainAdaptation(nn.Module):
    """
    Domain adaptation for transferring models between leagues.
    Uses gradient reversal for domain-invariant features.
    """

    def __init__(self, input_dim, hidden_dim=64, output_dim=32):
        super().__init__()

        # Shared feature extractor
        self.feature_extractor = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, output_dim),
            nn.ReLU()
        )

        # Task classifier (e.g., player position)
        self.task_classifier = nn.Sequential(
            nn.Linear(output_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 4)  # 4 positions
        )

        # Domain classifier (which league)
        self.domain_classifier = nn.Sequential(
            nn.Linear(output_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 2)  # 2 leagues
        )

    def forward(self, x, alpha=1.0):
        """
        Args:
            x: Input features
            alpha: Gradient reversal scale (0 = no reversal, 1 = full)
        """
        features = self.feature_extractor(x)

        # Task prediction (normal gradient)
        task_output = self.task_classifier(features)

        # Domain prediction (reversed gradient for adversarial training)
        reversed_features = GradientReversal.apply(features, alpha)
        domain_output = self.domain_classifier(reversed_features)

        return task_output, domain_output

class GradientReversal(torch.autograd.Function):
    """Gradient reversal layer for domain adaptation."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg() * ctx.alpha, None

# Example fine-tuning schedule
def fine_tune_schedule(model, train_loader, epochs_frozen=10, epochs_unfrozen=20):
    """Two-stage fine-tuning: frozen backbone then unfrozen.

    Assumes a train_epoch(model, loader, optimizer) helper is defined elsewhere.
    """

    optimizer = torch.optim.Adam(
        filter(lambda p: p.requires_grad, model.parameters()),
        lr=0.001
    )

    # Stage 1: Train with frozen backbone
    print("Stage 1: Training classifier with frozen backbone...")
    for epoch in range(epochs_frozen):
        train_epoch(model, train_loader, optimizer)

    # Stage 2: Unfreeze and fine-tune
    print("\nStage 2: Fine-tuning full model...")
    model.unfreeze_backbone(num_layers=2)

    optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

    for epoch in range(epochs_unfrozen):
        train_epoch(model, train_loader, optimizer)

print("Transfer learning example initialized")
# R: Transfer learning concepts
library(keras)

# Transfer learning for player identification from images
build_player_id_model <- function(num_players, freeze_base = TRUE) {
    # Load pre-trained ResNet50
    base_model <- application_resnet50(
        weights = "imagenet",
        include_top = FALSE,
        input_shape = c(224, 224, 3)
    )

    # Freeze base layers
    if (freeze_base) {
        freeze_weights(base_model)
    }

    # Add custom classification head
    model <- keras_model_sequential() %>%
        base_model %>%
        layer_global_average_pooling_2d() %>%
        layer_dense(512, activation = "relu") %>%
        layer_dropout(0.5) %>%
        layer_dense(num_players, activation = "softmax")

    model %>% compile(
        optimizer = optimizer_adam(learning_rate = 0.001),
        loss = "categorical_crossentropy",
        metrics = "accuracy"
    )

    model
}

# Transfer embeddings between leagues
transfer_embeddings <- function(source_embeddings, target_data,
                                common_players) {
    # Find common players between leagues
    common_indices <- which(rownames(source_embeddings) %in% common_players)

    # Use common players to learn mapping
    # Would train a linear transformation here

    cat("Found", length(common_indices), "common players for transfer\n")
}
Output
Transfer learning example initialized
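
The linear mapping learned by EmbeddingTransfer above can be sketched without PyTorch. The NumPy version below uses synthetic embeddings with hypothetical dimensions: it fits W by least squares on the "common players" and then maps a new source-league embedding into the target space.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic 8-dim source embeddings for 50 players common to both leagues
X_source = rng.normal(size=(50, 8))

# Pretend the target league's embeddings are a (slightly noisy) linear map
W_true = rng.normal(size=(8, 8))
X_target = X_source @ W_true + rng.normal(scale=0.01, size=(50, 8))

# Least squares: W minimizes ||X_source @ W - X_target||
W, *_ = np.linalg.lstsq(X_source, X_target, rcond=None)

# Map a new player's source embedding into the target space
new_player = rng.normal(size=8)
mapped = new_player @ W
print(np.abs(W - W_true).max())   # small: the mapping is recovered
```

With enough overlapping players relative to the embedding dimension, the least-squares fit recovers the transformation well even with moderate noise.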

Training Best Practices

Deep learning requires careful attention to training dynamics. Here are key practices for football analytics applications.

Data Practices
  • Temporal train/val/test splits (no leakage)
  • Class balancing for rare events (goals)
  • Feature normalization (StandardScaler)
  • Data augmentation where applicable
  • Cross-validation for small datasets
Training Practices
  • Learning rate scheduling (warmup + decay)
  • Early stopping on validation loss
  • Gradient clipping for RNNs
  • Dropout and regularization
  • Mixed precision for speed
training_best_practices
# Python: Training best practices
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler
from sklearn.preprocessing import StandardScaler
import numpy as np
from typing import Dict, List

class TrainingConfig:
    """Configuration for training deep learning models."""

    def __init__(self):
        self.learning_rate = 0.001
        self.batch_size = 32
        self.epochs = 100
        self.early_stopping_patience = 10
        self.lr_scheduler_patience = 5
        self.weight_decay = 1e-5
        self.gradient_clip = 1.0
        self.dropout = 0.3

class EarlyStopping:
    """Stop training when validation loss stops improving."""

    def __init__(self, patience=10, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = float("inf")
        self.should_stop = False

    def __call__(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True

        return self.should_stop

class FootballTrainer:
    """Complete training pipeline for football DL models."""

    def __init__(self, model, config: TrainingConfig, device="cuda"):
        self.model = model.to(device)
        self.config = config
        self.device = device

        self.optimizer = torch.optim.AdamW(
            model.parameters(),
            lr=config.learning_rate,
            weight_decay=config.weight_decay
        )

        self.scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            self.optimizer,
            patience=config.lr_scheduler_patience,
            factor=0.5
        )

        self.early_stopping = EarlyStopping(config.early_stopping_patience)
        self.history = {"train_loss": [], "val_loss": [], "val_metric": []}

    def get_class_weights(self, labels) -> torch.Tensor:
        """Calculate weights for imbalanced classes."""
        class_counts = np.bincount(labels)
        total = len(labels)
        weights = total / (len(class_counts) * class_counts)
        return torch.tensor(weights, dtype=torch.float32, device=self.device)

    def get_weighted_sampler(self, labels):
        """Create weighted sampler for imbalanced datasets."""
        class_weights = self.get_class_weights(labels)
        sample_weights = class_weights[labels]
        return WeightedRandomSampler(
            sample_weights, len(sample_weights), replacement=True
        )

    def train_epoch(self, train_loader, criterion):
        """Run one training epoch."""
        self.model.train()
        total_loss = 0
        num_batches = 0

        for batch in train_loader:
            self.optimizer.zero_grad()

            # Forward pass (adapt based on your model)
            inputs, targets = batch[0].to(self.device), batch[1].to(self.device)
            outputs = self.model(inputs)
            loss = criterion(outputs, targets)

            # Backward pass
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(
                self.model.parameters(), self.config.gradient_clip
            )

            self.optimizer.step()
            total_loss += loss.item()
            num_batches += 1

        return total_loss / num_batches

    def validate(self, val_loader, criterion, metric_fn=None):
        """Evaluate on validation set."""
        self.model.eval()
        total_loss = 0
        all_preds = []
        all_targets = []

        with torch.no_grad():
            for batch in val_loader:
                inputs, targets = batch[0].to(self.device), batch[1].to(self.device)
                outputs = self.model(inputs)
                loss = criterion(outputs, targets)
                total_loss += loss.item()

                all_preds.append(outputs.cpu())
                all_targets.append(targets.cpu())

        avg_loss = total_loss / len(val_loader)

        metric = None
        if metric_fn:
            all_preds = torch.cat(all_preds)
            all_targets = torch.cat(all_targets)
            metric = metric_fn(all_preds, all_targets)

        return avg_loss, metric

    def fit(self, train_loader, val_loader, criterion, metric_fn=None):
        """Full training loop with best practices."""

        best_model_state = None
        best_val_loss = float("inf")

        for epoch in range(self.config.epochs):
            # Training
            train_loss = self.train_epoch(train_loader, criterion)

            # Validation
            val_loss, val_metric = self.validate(val_loader, criterion, metric_fn)

            # Learning rate scheduling
            self.scheduler.step(val_loss)

            # Track history
            self.history["train_loss"].append(train_loss)
            self.history["val_loss"].append(val_loss)
            if val_metric is not None:
                self.history["val_metric"].append(val_metric)

            # Save best model (clone tensors; .copy() would keep references
            # that later optimizer steps overwrite)
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_model_state = {
                    k: v.detach().clone()
                    for k, v in self.model.state_dict().items()
                }

            # Logging
            if (epoch + 1) % 5 == 0:
                metric_str = f", Metric: {val_metric:.4f}" if val_metric is not None else ""
                print(f"Epoch {epoch+1}: Train={train_loss:.4f}, "
                      f"Val={val_loss:.4f}{metric_str}")

            # Early stopping
            if self.early_stopping(val_loss):
                print(f"\nEarly stopping at epoch {epoch+1}")
                break

        # Load best model
        if best_model_state:
            self.model.load_state_dict(best_model_state)

        return self.history

def temporal_train_test_split(df, date_col, train_ratio=0.7, val_ratio=0.15):
    """Split data temporally to avoid leakage."""
    df = df.sort_values(date_col)

    n = len(df)
    train_end = int(n * train_ratio)
    val_end = int(n * (train_ratio + val_ratio))

    return {
        "train": df.iloc[:train_end],
        "val": df.iloc[train_end:val_end],
        "test": df.iloc[val_end:]
    }

# Example usage
config = TrainingConfig()
print(f"Training config: lr={config.learning_rate}, batch={config.batch_size}")
# R: Training best practices
library(keras)
library(tidyverse)

# Temporal split for football data
temporal_split <- function(data, train_end, val_end) {
    list(
        train = data %>% filter(date < train_end),
        val = data %>% filter(date >= train_end, date < val_end),
        test = data %>% filter(date >= val_end)
    )
}

# Class weighting for imbalanced data
calculate_class_weights <- function(y) {
    counts <- table(y)
    total <- sum(counts)
    n_classes <- length(counts)

    weights <- total / (n_classes * counts)
    as.list(weights)
}

# Callbacks for training
training_callbacks <- function(model_path = "best_model.h5") {
    list(
        callback_early_stopping(
            monitor = "val_loss",
            patience = 10,
            restore_best_weights = TRUE
        ),
        callback_reduce_lr_on_plateau(
            monitor = "val_loss",
            factor = 0.5,
            patience = 5
        ),
        callback_model_checkpoint(
            filepath = model_path,
            save_best_only = TRUE,
            monitor = "val_loss"
        )
    )
}
Output
Training config: lr=0.001, batch=32
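
Two of the practices above, temporal splitting and inverse-frequency class weighting, can be sketched end to end. The example below uses synthetic match data with hypothetical column names: it splits by date so that no training match postdates a validation match, then derives class weights for an imbalanced goal label from the training set only.

```python
import numpy as np
import pandas as pd

# Synthetic matches: 100 dates and an imbalanced label (20% positives)
df = pd.DataFrame({
    "date": pd.date_range("2023-08-01", periods=100, freq="D"),
    "goal": [1, 0, 0, 0, 0] * 20,
}).sample(frac=1, random_state=1)   # shuffle to show that sorting matters

# Temporal split: sort by date, then slice 70/15/15
df = df.sort_values("date")
n = len(df)
train = df.iloc[:int(n * 0.70)]
val = df.iloc[int(n * 0.70):int(n * 0.85)]
test = df.iloc[int(n * 0.85):]

# No leakage: every training match precedes every validation/test match
assert train["date"].max() < val["date"].min() < test["date"].min()

# Inverse-frequency class weights, computed on the training labels only
counts = np.bincount(train["goal"], minlength=2)
weights = len(train) / (2 * counts)
print("class counts:", counts, "weights:", weights)
```

Computing the weights on the training split alone matters: using the full dataset would leak label statistics from the validation and test periods.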

Model Deployment

Deploying deep learning models for football analytics requires consideration of latency, scalability, and model updates. Here are patterns for production deployment.

model_deployment
# Python: Model deployment with FastAPI
import torch
import torch.nn as nn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import numpy as np
import pickle
from typing import List, Optional

app = FastAPI(title="Football xG API")

# Global model cache
model_cache = {}

class ShotInput(BaseModel):
    x: float
    y: float
    body_part: str = "Right Foot"
    under_pressure: bool = False

class PredictionOutput(BaseModel):
    xg: float
    confidence_interval: List[float]

class ModelServer:
    """Serve deep learning models for predictions."""

    def __init__(self, model_path: str, scaler_path: str):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Load model
        self.model = self._load_model(model_path)
        self.model.eval()

        # Load preprocessing
        with open(scaler_path, "rb") as f:
            self.scaler = pickle.load(f)

    def _load_model(self, path):
        """Load model with proper device mapping."""
        model = torch.load(path, map_location=self.device)
        return model.to(self.device)

    def preprocess(self, shot: ShotInput) -> torch.Tensor:
        """Convert input to model-ready tensor."""
        features = np.array([
            shot.x / 100,
            shot.y / 100,
            np.sqrt((100 - shot.x)**2 + (50 - shot.y)**2) / 100,
            1 if shot.body_part == "Head" else 0,
            1 if shot.under_pressure else 0
        ]).reshape(1, -1)

        features_scaled = self.scaler.transform(features)
        return torch.tensor(features_scaled, dtype=torch.float32, device=self.device)

    @torch.no_grad()
    def predict(self, shot: ShotInput) -> PredictionOutput:
        """Generate xG prediction."""
        x = self.preprocess(shot)

        # Get prediction
        xg = self.model(x).item()

        # Estimate confidence via MC dropout: enable only Dropout layers
        # (leaving BatchNorm in eval mode, which would misbehave on a batch of 1)
        predictions = []
        for module in self.model.modules():
            if isinstance(module, nn.Dropout):
                module.train()

        for _ in range(100):
            predictions.append(self.model(x).item())

        self.model.eval()

        ci_low = np.percentile(predictions, 2.5)
        ci_high = np.percentile(predictions, 97.5)

        return PredictionOutput(
            xg=xg,
            confidence_interval=[ci_low, ci_high]
        )

# Batch prediction for efficiency
class BatchPredictor:
    """Efficient batch prediction for multiple shots."""

    def __init__(self, model, scaler, batch_size=32):
        self.model = model
        self.scaler = scaler
        self.batch_size = batch_size
        self.device = next(model.parameters()).device

    @torch.no_grad()
    def predict_batch(self, shots: List[ShotInput]) -> List[float]:
        """Predict xG for multiple shots efficiently."""
        self.model.eval()

        # Preprocess all shots
        features = []
        for shot in shots:
            feat = [
                shot.x / 100,
                shot.y / 100,
                np.sqrt((100 - shot.x)**2 + (50 - shot.y)**2) / 100,
                1 if shot.body_part == "Head" else 0,
                1 if shot.under_pressure else 0
            ]
            features.append(feat)

        features = np.array(features)
        features_scaled = self.scaler.transform(features)
        x = torch.tensor(features_scaled, dtype=torch.float32, device=self.device)

        # Batch prediction
        predictions = []
        for i in range(0, len(x), self.batch_size):
            batch = x[i:i + self.batch_size]
            pred = self.model(batch)
            predictions.extend(pred.cpu().numpy().flatten())

        return predictions

# ONNX export for production
def export_to_onnx(model, sample_input, output_path):
    """Export PyTorch model to ONNX for deployment."""
    torch.onnx.export(
        model,
        sample_input,
        output_path,
        input_names=["features"],
        output_names=["xg"],
        dynamic_axes={
            "features": {0: "batch_size"},
            "xg": {0: "batch_size"}
        }
    )
    print(f"Model exported to {output_path}")

# TorchScript for production
def export_to_torchscript(model, sample_input, output_path):
    """Export to TorchScript for C++ deployment."""
    traced = torch.jit.trace(model, sample_input)
    traced.save(output_path)
    print(f"TorchScript model saved to {output_path}")

print("Deployment utilities ready")
# R: Model deployment considerations
library(plumber)
library(keras)

# Save model for deployment
save_for_deployment <- function(model, path) {
    # Save keras model
    save_model_hdf5(model, paste0(path, "/model.h5"))

    # Save preprocessing objects (assumes scaler and label_encoder exist in scope)
    saveRDS(scaler, paste0(path, "/scaler.rds"))
    saveRDS(label_encoder, paste0(path, "/label_encoder.rds"))

    cat("Model saved to:", path, "\n")
}

# Plumber API endpoint example
#* @post /predict
#* @param shot_x Shot x coordinate
#* @param shot_y Shot y coordinate
function(shot_x, shot_y) {
    # Load model (would be cached in production)
    # model <- load_model_hdf5("model.h5")

    # Prepare input
    features <- c(as.numeric(shot_x), as.numeric(shot_y))

    # Predict
    # xg <- predict(model, matrix(features, nrow = 1))

    list(xg = 0.15)  # Placeholder
}
Output
Deployment utilities ready
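
The percentile-based confidence interval used in ModelServer.predict above reduces to a NumPy one-liner. The sketch below draws synthetic Monte Carlo xG samples (stand-ins for the 100 dropout-enabled forward passes) and extracts a 95% interval.

```python
import numpy as np

# Stand-in for 100 stochastic forward passes of an xG model with dropout on
rng = np.random.default_rng(3)
mc_samples = np.clip(rng.normal(loc=0.15, scale=0.03, size=100), 0, 1)

point_estimate = mc_samples.mean()
ci_low, ci_high = np.percentile(mc_samples, [2.5, 97.5])

print(f"xG = {point_estimate:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

A wide interval flags shots where the model is uncertain (unusual locations or contexts), which is often as useful to an analyst as the point estimate itself.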

Advanced Architectures

Beyond basic architectures, specialized designs can better capture football-specific patterns.

advanced_architectures
# Python: Advanced football-specific architectures
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalMatchModel(nn.Module):
    """
    Hierarchical model: Match -> Possessions -> Events
    Learns representations at each level.
    """

    def __init__(self, event_vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()

        # Event-level encoding
        self.event_encoder = nn.Sequential(
            nn.Embedding(event_vocab_size, embed_dim),
            nn.LSTM(embed_dim + 4, hidden_dim, batch_first=True)
        )

        # Possession-level encoding (over event summaries)
        self.possession_encoder = nn.LSTM(
            hidden_dim, hidden_dim, batch_first=True
        )

        # Match-level encoding (over possession summaries)
        self.match_encoder = nn.LSTM(
            hidden_dim, hidden_dim, batch_first=True
        )

        # Output heads
        self.xg_head = nn.Linear(hidden_dim, 1)
        self.possession_outcome_head = nn.Linear(hidden_dim, 4)
        self.match_outcome_head = nn.Linear(hidden_dim, 3)

    def encode_events(self, events, locations):
        """Encode a sequence of events."""
        embeds = self.event_encoder[0](events)
        x = torch.cat([embeds, locations], dim=-1)
        _, (h, _) = self.event_encoder[1](x)
        return h.squeeze(0)

class MultiTaskFootballModel(nn.Module):
    """
    Multi-task model for football predictions.
    Shared backbone, task-specific heads.
    """

    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()

        # Shared backbone
        self.backbone = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.BatchNorm1d(hidden_dim),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.BatchNorm1d(hidden_dim),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, 64),
            nn.ReLU()
        )

        # Task-specific heads
        self.xg_head = nn.Sequential(
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

        self.pass_success_head = nn.Sequential(
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

        self.event_type_head = nn.Sequential(
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 10)  # 10 event types
        )

    def forward(self, x, task="all"):
        """
        Forward pass for specified task(s).

        Args:
            x: Input features
            task: "xg", "pass", "event", or "all"
        """
        features = self.backbone(x)

        if task == "xg":
            return self.xg_head(features)
        elif task == "pass":
            return self.pass_success_head(features)
        elif task == "event":
            return self.event_type_head(features)
        else:
            return {
                "xg": self.xg_head(features),
                "pass_success": self.pass_success_head(features),
                "event_type": self.event_type_head(features)
            }
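
```python
# A minimal multi-task training step for a model like the one above:
# per-task losses are combined into one weighted objective so gradients
# from every task update the shared backbone. Dimensions, task weights,
# and the synthetic data below are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Sequential(nn.Linear(20, 64), nn.ReLU())   # shared layers
xg_head = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())  # binary task
event_head = nn.Linear(64, 10)                           # 10-class task

x = torch.randn(32, 20)
y_xg = torch.randint(0, 2, (32, 1)).float()   # synthetic shot outcomes
y_event = torch.randint(0, 10, (32,))         # synthetic event labels

features = backbone(x)
loss_xg = nn.BCELoss()(xg_head(features), y_xg)
loss_event = nn.CrossEntropyLoss()(event_head(features), y_event)

# Task weights are tuning knobs, not fixed values
total_loss = 1.0 * loss_xg + 0.5 * loss_event
total_loss.backward()  # both tasks contribute gradients to the backbone
```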

class SpatialAttentionModel(nn.Module):
    """
    Model with spatial attention for pitch-aware predictions.
    """

    def __init__(self, hidden_dim=64):
        super().__init__()

        # Spatial encoder (pitch as 2D grid)
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU()
        )

        # Spatial attention
        self.spatial_attention = nn.Sequential(
            nn.Conv2d(64, 1, kernel_size=1),
            nn.Sigmoid()
        )

        # Output
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 20 * 14, hidden_dim),  # two 2x2 max-pools: 80x56 -> 20x14
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

    def forward(self, pitch_heatmap):
        """
        Args:
            pitch_heatmap: (batch, 1, 80, 56) spatial distribution
        """
        features = self.spatial_conv(pitch_heatmap)

        # Apply spatial attention
        attention = self.spatial_attention(features)
        attended = features * attention

        return self.classifier(attended), attention

# Variational model for uncertainty
class VariationalXGModel(nn.Module):
    """
    Variational xG model that provides uncertainty estimates.
    """

    def __init__(self, input_dim, latent_dim=16, hidden_dim=64):
        super().__init__()

        # Encoder to latent space
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        self.mu_layer = nn.Linear(hidden_dim, latent_dim)
        self.logvar_layer = nn.Linear(hidden_dim, latent_dim)

        # Decoder/predictor
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

    def reparameterize(self, mu, logvar):
        """Sample from latent distribution."""
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x, num_samples=1):
        """
        Forward pass with optional multiple samples for uncertainty.
        """
        h = self.encoder(x)
        mu = self.mu_layer(h)
        logvar = self.logvar_layer(h)

        if num_samples == 1:
            z = self.reparameterize(mu, logvar)
            return self.decoder(z), mu, logvar
        else:
            predictions = []
            for _ in range(num_samples):
                z = self.reparameterize(mu, logvar)
                pred = self.decoder(z)
                predictions.append(pred)
            return torch.stack(predictions), mu, logvar

print("Advanced architectures defined")
# R: Advanced architecture concepts
# Hierarchical model: match -> possession -> event
# These architectures are typically implemented in Python rather than R

# Concept: multi-task learning
# Simultaneously predict:
# 1. xG (shot outcome)
# 2. Pass success probability
# 3. Player value
# Shared lower layers, task-specific heads
Output
Advanced architectures defined

Practice Exercises

Exercise 39.1: Match Outcome Predictor

Build a feedforward neural network to predict match outcomes (H/D/A) from team statistics. Compare performance against logistic regression. Use proper train/validation/test splits.

Exercise 39.2: Possession Sequence Model

Implement an LSTM that predicts whether a possession will end in a shot. Use StatsBomb event data to create sequences. Evaluate with ROC-AUC.

Exercise 39.3: Player Embeddings

Train an autoencoder on player statistics to create 16-dimensional embeddings. Find the 5 most similar players to a target player. Visualize embeddings with t-SNE.

Exercise 39.4: Attention Interpretation

Build a transformer model for event sequences. Analyze the attention weights: do they align with intuition about important events (shots, key passes)?

Exercise 39.5: Deep xG Model

Implement a deep learning xG model that incorporates the sequence of events leading to a shot. Compare performance (AUC, Brier score) against a traditional logistic regression model.

Hint

Use an LSTM to encode the preceding 5-10 events, concatenate with shot location features, and pass through dense layers. The sequence context should improve predictions for open-play shots.
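
One way to realize this hint is a small LSTM-plus-dense model. The sketch below is a minimal version with illustrative dimensions and random stand-in data, not a tuned implementation:

```python
import torch
import torch.nn as nn

class SequenceXG(nn.Module):
    """Sketch: encode the events before a shot with an LSTM, then
    combine with shot-location features. Dimensions are illustrative."""
    def __init__(self, event_dim=8, loc_dim=4, hidden_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(event_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim + loc_dim, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, events, shot_features):
        _, (h, _) = self.lstm(events)               # h: (1, batch, hidden)
        x = torch.cat([h.squeeze(0), shot_features], dim=-1)
        return self.head(x)

model = SequenceXG()
events = torch.randn(16, 8, 8)   # 16 shots, 8 preceding events each
shots = torch.randn(16, 4)       # e.g. x, y, distance, angle
xg = model(events, shots)        # per-shot probabilities in (0, 1)
```

Training would use binary cross-entropy against goal/no-goal labels, exactly as for a tabular xG model.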

Exercise 39.6: GNN for Team Analysis

Build a Graph Neural Network where nodes are players and edges are passes. Use the GNN to generate a team-level embedding and predict match outcomes from two team graphs.

Hint

Start with GCNConv layers for message passing. Use global_mean_pool to aggregate node embeddings into a single team vector. Combine home and away team vectors for match prediction.
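
If torch_geometric is not available, the core idea can still be sketched with a dense adjacency matrix: row-normalized pass counts average neighbor messages, and mean pooling stands in for global_mean_pool. A hypothetical minimal version with illustrative dimensions:

```python
import torch
import torch.nn as nn

class TinyTeamGNN(nn.Module):
    """Sketch of one round of message passing over a pass network,
    followed by mean pooling into a team vector. A dense-adjacency
    stand-in for GCNConv + global_mean_pool; dims are illustrative."""
    def __init__(self, node_dim=6, hidden_dim=16):
        super().__init__()
        self.msg = nn.Linear(node_dim, hidden_dim)

    def forward(self, node_feats, adj):
        # Row-normalize pass counts so each player averages neighbor messages
        norm = adj / adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = torch.relu(self.msg(norm @ node_feats))  # (players, hidden)
        return h.mean(dim=0)                         # team embedding

gnn = TinyTeamGNN()
feats = torch.randn(11, 6)                       # per-player features
passes = torch.randint(0, 20, (11, 11)).float()  # pass-count adjacency
team_vec = gnn(feats, passes)                    # single team vector
```

Concatenating the home and away team vectors and feeding them to a small classifier completes the match-prediction part of the exercise.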

Exercise 39.7: Transfer Learning Between Leagues

Train a player embedding model on Premier League data. Transfer the embeddings to La Liga using players who played in both leagues as anchors. Evaluate whether the transferred embeddings capture similar player roles.

Exercise 39.8: Production xG API

Deploy your trained xG model as a REST API using FastAPI. Implement batch prediction, uncertainty estimates via MC Dropout, and model versioning. Benchmark latency for single and batch predictions.

Hint

Export your model to TorchScript or ONNX for faster inference. Use caching for the model and preprocessing objects. Implement async endpoints for high throughput.
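
The TorchScript part of this hint can be sketched as follows; the model here is an untrained stand-in and the file name is arbitrary:

```python
import torch
import torch.nn as nn

# Stand-in for a trained xG model; any nn.Module exports the same way.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(),
                      nn.Linear(32, 1), nn.Sigmoid())
model.eval()

scripted = torch.jit.script(model)  # compile to TorchScript
scripted.save("xg_model.pt")        # versioned artifact the API can load

loaded = torch.jit.load("xg_model.pt")
with torch.no_grad():
    pred = loaded(torch.randn(1, 10))  # inference without Python autograd overhead
```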

Summary

Model Complexity vs. Performance

Model Type            Params   Training Time   Inference   Best For
Logistic Regression   ~100     Seconds         <1ms        Baselines, interpretability
Feedforward NN        ~10K     Minutes         ~1ms        Tabular features
LSTM/GRU              ~50K     Hours           ~5ms        Event sequences
Transformer           ~100K+   Hours           ~10ms       Long sequences, attention
GNN                   ~20K     Hours           ~5ms        Pass networks, team analysis

When to Use Deep Learning

Deep learning isn't always the answer. Consider using it when:

  • Large datasets: You have thousands of samples (matches, possessions)
  • Sequential data: Event sequences benefit from memory (LSTM/GRU)
  • Graph structure: Pass networks are naturally suited to GNNs
  • Feature learning: When manual features are insufficient
  • Transfer needed: Pre-trained models accelerate development

For smaller datasets or when interpretability is critical, start with simpler models (logistic regression, gradient boosting) and only move to deep learning if needed.

Deep learning opens up new possibilities for football analytics, from better predictions to richer player representations. In the next chapter, we'll explore simulation and agent-based modeling for tactical analysis.