Chapter 60



Natural Language Processing for Football

Football generates vast amounts of text data: match reports, player interviews, social media discussions, live commentary, and scouting reports. Natural Language Processing (NLP) enables us to extract insights from this unstructured text at scale.

NLP Fundamentals

NLP is a branch of AI focused on understanding and generating human language. For football analytics, we apply NLP to extract structured information from text and understand sentiment and topics.

Text Data Sources
  • Match reports and previews
  • Post-match interviews
  • Social media (Twitter/X, Reddit)
  • Live commentary feeds
  • Scouting reports
  • Transfer news and rumors
NLP Tasks
  • Named Entity Recognition (NER)
  • Sentiment Analysis
  • Text Classification
  • Topic Modeling
  • Text Summarization
  • Question Answering
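Before working through these tasks individually, it helps to see the end goal: turning free text into structured records. Below is a minimal sketch of that idea; the `MatchFact` fields and the regex pattern are illustrative assumptions for this chapter's sample report, not part of any library.

```python
import re
from dataclasses import dataclass

# Illustrative target structure; the field names are assumptions for this sketch
@dataclass
class MatchFact:
    winner: str
    loser: str
    score: tuple[int, int]

# Naive pattern for "X secured a N-M victory over Y at ..." phrasing
SCORE_RE = re.compile(
    r"(?P<winner>[A-Z][\w ]+?) secured .*?"
    r"(?P<hg>\d+)-(?P<ag>\d+) victory over (?P<loser>[A-Z][\w ]+?) at"
)

def extract_match_fact(text: str):
    """Pull a single scoreline fact from a report, if one is present."""
    flat = " ".join(text.split())  # collapse newlines in the report
    m = SCORE_RE.search(flat)
    if not m:
        return None
    return MatchFact(m["winner"], m["loser"], (int(m["hg"]), int(m["ag"])))

report = ("Manchester United secured a dramatic 2-1 victory over Liverpool "
          "at Old Trafford.")
print(extract_match_fact(report))
```

Real systems replace the regex with the NER and resolution techniques covered below, but the output shape - typed, queryable records - stays the same.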
nlp_basics
# Python: NLP basics with NLTK and spaCy
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from collections import Counter
import spacy

# Download required data
nltk.download("punkt")
nltk.download("stopwords")

# Sample match report
match_report = """Manchester United secured a dramatic 2-1 victory over Liverpool
at Old Trafford. Bruno Fernandes opened the scoring with a spectacular
free kick in the 23rd minute. Mohamed Salah equalized from the penalty
spot after a controversial VAR decision. Marcus Rashford scored the
winner in stoppage time, sending the home fans into raptures."""

# Basic tokenization
sentences = sent_tokenize(match_report)
words = word_tokenize(match_report.lower())

print(f"Sentences: {len(sentences)}")
print(f"Words: {len(words)}")

# Remove stopwords
stop_words = set(stopwords.words("english"))
words_clean = [w for w in words if w.isalpha() and w not in stop_words]

# Word frequency
word_freq = Counter(words_clean)
print("\nTop 10 words:")
for word, count in word_freq.most_common(10):
    print(f"  {word}: {count}")

# Using spaCy for more advanced processing
nlp = spacy.load("en_core_web_sm")
doc = nlp(match_report)

# Part-of-speech tagging
print("\nNouns found:")
nouns = [token.text for token in doc if token.pos_ == "NOUN"]
print(nouns)
# R: NLP basics with tidytext
library(tidyverse)
library(tidytext)

# Sample match report
match_report <- "Manchester United secured a dramatic 2-1 victory over Liverpool
at Old Trafford. Bruno Fernandes opened the scoring with a spectacular
free kick in the 23rd minute. Mohamed Salah equalized from the penalty
spot after a controversial VAR decision. Marcus Rashford scored the
winner in stoppage time, sending the home fans into raptures."
# Tokenize text
tokens <- tibble(text = match_report) %>%
  unnest_tokens(word, text)

# Remove stop words
tokens_clean <- tokens %>%
  anti_join(stop_words, by = "word")

# Word frequency
word_freq <- tokens_clean %>%
  count(word, sort = TRUE)

print(word_freq)

# Bigrams (two-word phrases)
bigrams <- tibble(text = match_report) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)

print(bigrams)
Output
Sentences: 4
Words: 56

Top 10 words:
  scoring: 1
  manchester: 1
  united: 1
  victory: 1
  liverpool: 1

Nouns found:
['victory', 'scoring', 'kick', 'minute', 'spot', 'decision', 'winner', 'time', 'fans']

Named Entity Recognition

Named Entity Recognition (NER) identifies and classifies entities in text such as player names, team names, venues, and competitions. This is essential for extracting structured data from unstructured text.

ner
# Python: Named Entity Recognition with spaCy
import spacy
from spacy.matcher import Matcher, PhraseMatcher

# Load model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp(match_report)

# Extract built-in entities
print("Named Entities:")
for ent in doc.ents:
    print(f"  {ent.text}: {ent.label_}")

# Custom entity recognition for football
class FootballNER:
    """Custom NER for football-specific entities."""

    def __init__(self, nlp):
        self.nlp = nlp
        self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

        # Add football entity patterns
        self.add_patterns()

    def add_patterns(self):
        """Add football-specific patterns."""

        # Teams
        teams = ["manchester united", "liverpool", "chelsea", "arsenal",
                "manchester city", "tottenham", "bayern munich", "real madrid"]
        patterns = [self.nlp.make_doc(team) for team in teams]
        self.matcher.add("TEAM", patterns)

        # Competitions
        comps = ["premier league", "champions league", "fa cup",
                "world cup", "europa league", "la liga"]
        patterns = [self.nlp.make_doc(comp) for comp in comps]
        self.matcher.add("COMPETITION", patterns)

        # Venues
        venues = ["old trafford", "anfield", "emirates", "etihad",
                 "stamford bridge", "camp nou", "santiago bernabeu"]
        patterns = [self.nlp.make_doc(venue) for venue in venues]
        self.matcher.add("VENUE", patterns)

    def extract(self, text):
        """Extract football entities from text."""
        doc = self.nlp(text.lower())
        matches = self.matcher(doc)

        entities = []
        for match_id, start, end in matches:
            entity_type = self.nlp.vocab.strings[match_id]
            entity_text = doc[start:end].text
            entities.append({
                "text": entity_text,
                "type": entity_type,
                "start": start,
                "end": end
            })

        return entities

# Use custom NER
football_ner = FootballNER(nlp)
entities = football_ner.extract(match_report)

print("\nFootball Entities:")
for ent in entities:
    print(f"  {ent['text']}: {ent['type']}")
# R: Named Entity Recognition
library(spacyr)

# Initialize spaCy
spacy_initialize(model = "en_core_web_sm")

# Parse text
doc <- spacy_parse(match_report, entity = TRUE)

# Extract entities
entities <- entity_extract(doc)
print(entities)

# Custom entity patterns for football
football_entities <- list(
  teams = c("Manchester United", "Liverpool", "Chelsea", "Arsenal",
            "Manchester City", "Tottenham"),
  venues = c("Old Trafford", "Anfield", "Stamford Bridge", "Emirates"),
  competitions = c("Premier League", "Champions League", "FA Cup", "EFL Cup")
)

# Function to extract football entities
extract_football_entities <- function(text, patterns) {
  found <- list()

  for (type in names(patterns)) {
    matches <- patterns[[type]][
      sapply(patterns[[type]], function(p) grepl(p, text, ignore.case = TRUE))
    ]
    found[[type]] <- matches
  }

  found
}

entities_found <- extract_football_entities(match_report, football_entities)
print(entities_found)
Output
Named Entities:
  Manchester United: ORG
  Liverpool: GPE
  Old Trafford: FAC
  Bruno Fernandes: PERSON
  23rd minute: TIME
  Mohamed Salah: PERSON
  Marcus Rashford: PERSON

Football Entities:
  manchester united: TEAM
  liverpool: TEAM
  old trafford: VENUE

Player Name Resolution

Players are often referred to by nicknames, shortened names, or full names. Entity resolution matches these variations to canonical player identities.

player_resolution
# Python: Player name resolution
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

class PlayerResolver:
    """Resolve player name variations to canonical names."""

    def __init__(self):
        self.player_aliases = {
            "Cristiano Ronaldo": ["Ronaldo", "CR7", "Cristiano"],
            "Lionel Messi": ["Messi", "Leo", "La Pulga"],
            "Kylian Mbappe": ["Mbappe", "Kylian", "Donatello"],
            "Bruno Fernandes": ["Bruno", "Fernandes", "B. Fernandes"],
            "Mohamed Salah": ["Salah", "Mo Salah", "Egyptian King"],
            "Marcus Rashford": ["Rashford", "Rashy", "Beans"]
        }

        # Build reverse lookup
        self.alias_to_canonical = {}
        for canonical, aliases in self.player_aliases.items():
            self.alias_to_canonical[canonical.lower()] = canonical
            for alias in aliases:
                self.alias_to_canonical[alias.lower()] = canonical

    def resolve(self, name, threshold=80):
        """Resolve a player name to canonical form."""
        name_lower = name.lower()

        # Exact match
        if name_lower in self.alias_to_canonical:
            return self.alias_to_canonical[name_lower]

        # Fuzzy match
        all_names = list(self.alias_to_canonical.keys())
        match, score = process.extractOne(name_lower, all_names)

        if score >= threshold:
            return self.alias_to_canonical[match]

        return None  # Unknown player

    def resolve_all(self, text):
        """Find and resolve all player mentions in text."""
        doc = nlp(text)
        resolved = []

        for ent in doc.ents:
            if ent.label_ == "PERSON":
                canonical = self.resolve(ent.text)
                if canonical:
                    resolved.append({
                        "original": ent.text,
                        "canonical": canonical,
                        "start": ent.start_char,
                        "end": ent.end_char
                    })

        return resolved

# Test resolver
resolver = PlayerResolver()
test_names = ["Bruno", "CR7", "Mo Salah", "Messi", "Unknown Player"]

for name in test_names:
    resolved = resolver.resolve(name)
    print(f"{name} -> {resolved}")
# R: Player name resolution
library(stringdist)

# Player alias database
player_aliases <- tribble(
  ~canonical_name,         ~aliases,
  "Cristiano Ronaldo",     c("Ronaldo", "CR7", "Cristiano"),
  "Lionel Messi",          c("Messi", "Leo", "La Pulga"),
  "Kylian Mbappe",         c("Mbappe", "Kylian", "Donatello"),
  "Bruno Fernandes",       c("Bruno", "Fernandes", "B. Fernandes"),
  "Mohamed Salah",         c("Salah", "Mo Salah", "Egyptian King")
)

# Resolve player name
resolve_player <- function(name, aliases_df, threshold = 0.3) {
  name_lower <- tolower(name)

  for (i in seq_len(nrow(aliases_df))) {
    canonical <- aliases_df$canonical_name[i]
    aliases <- aliases_df$aliases[[i]]

    # Check exact match
    if (name_lower %in% tolower(c(canonical, aliases))) {
      return(canonical)
    }

    # Check fuzzy match
    all_names <- c(canonical, aliases)
    distances <- stringdist(name_lower, tolower(all_names), method = "jw")
    if (min(distances) < threshold) {
      return(canonical)
    }
  }

  return(NA)  # Unknown player
}

# Test resolution
test_names <- c("Bruno", "CR7", "Mo Salah", "Messi", "Unknown Player")
for (name in test_names) {
  resolved <- resolve_player(name, player_aliases)
  cat(name, "->", resolved, "\n")
}
Output
Bruno -> Bruno Fernandes
CR7 -> Cristiano Ronaldo
Mo Salah -> Mohamed Salah
Messi -> Lionel Messi
Unknown Player -> None
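fuzzywuzzy is an external dependency; when it is unavailable, Python's standard-library difflib provides a similar (if cruder) similarity ratio. A sketch under that assumption, with a deliberately tiny alias table and an illustrative threshold:

```python
from difflib import SequenceMatcher

# Tiny illustrative alias table (lowercase alias -> canonical name)
ALIASES = {
    "bruno": "Bruno Fernandes",
    "cr7": "Cristiano Ronaldo",
    "mo salah": "Mohamed Salah",
}

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; roughly comparable to fuzzywuzzy's 0-100 scale."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve(name: str, threshold: float = 0.8):
    """Return the canonical name of the closest alias, or None."""
    best = max(ALIASES, key=lambda alias: similarity(name, alias))
    return ALIASES[best] if similarity(name, best) >= threshold else None

print(resolve("Brumo"))    # misspelling still resolves
print(resolve("Haaland"))  # not in the alias table
```

difflib's ratio behaves differently from Levenshtein-based scorers on short strings, so the threshold needs tuning against your own alias data.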

Sentiment Analysis

Sentiment analysis determines the emotional tone of text: positive, negative, or neutral. This is valuable for understanding fan reactions, media coverage, and player reputation.

sentiment_analysis
# Python: Sentiment analysis with VADER and transformers
from nltk.sentiment import SentimentIntensityAnalyzer
from transformers import pipeline
import pandas as pd

# VADER sentiment (rule-based, good for social media)
sia = SentimentIntensityAnalyzer()

# Analyze match report
scores = sia.polarity_scores(match_report)
print("Match Report Sentiment (VADER):")
print(f"  Positive: {scores['pos']:.3f}")
print(f"  Negative: {scores['neg']:.3f}")
print(f"  Neutral: {scores['neu']:.3f}")
print(f"  Compound: {scores['compound']:.3f}")

# Analyze fan tweets
fan_tweets = [
    "What an incredible performance! Bruno is world class!",
    "Terrible refereeing, we were robbed!",
    "Dominant display, deserved the win",
    "Worst game of the season, shocking defending"
]

print("\nFan Tweet Sentiments:")
for tweet in fan_tweets:
    scores = sia.polarity_scores(tweet)
    sentiment = "Positive" if scores["compound"] > 0.05 else \
                "Negative" if scores["compound"] < -0.05 else "Neutral"
    print(f"  [{sentiment}] {tweet[:50]}...")

# Transformer-based sentiment (more accurate but slower)
sentiment_pipeline = pipeline("sentiment-analysis",
                             model="distilbert-base-uncased-finetuned-sst-2-english")

print("\nTransformer Sentiment Analysis:")
for tweet in fan_tweets:
    result = sentiment_pipeline(tweet)[0]
    print(f"  {result['label']}: {result['score']:.3f} - {tweet[:40]}...")
# R: Sentiment analysis
library(tidytext)
library(textdata)

# Get sentiment lexicons
bing <- get_sentiments("bing")
afinn <- get_sentiments("afinn")

# Analyze match report sentiment
sentiment_analysis <- tibble(text = match_report) %>%
  unnest_tokens(word, text) %>%
  inner_join(bing, by = "word") %>%
  count(sentiment)

print(sentiment_analysis)

# Word-level sentiment
word_sentiment <- tibble(text = match_report) %>%
  unnest_tokens(word, text) %>%
  inner_join(afinn, by = "word")

cat("\nSentiment words found:\n")
print(word_sentiment)

# Overall sentiment score
overall_score <- sum(word_sentiment$value)
cat("\nOverall sentiment score:", overall_score, "\n")

# Analyze fan tweets
fan_tweets <- c(
  "What an incredible performance! Bruno is world class!",
  "Terrible refereeing, we were robbed!",
  "Dominant display, deserved the win",
  "Worst game of the season, shocking defending"
)

tweet_sentiments <- tibble(tweet = fan_tweets) %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, tweet) %>%
  inner_join(afinn, by = "word") %>%
  group_by(id) %>%
  summarise(sentiment_score = sum(value))

print(tweet_sentiments)
Output
Match Report Sentiment (VADER):
  Positive: 0.198
  Negative: 0.075
  Neutral: 0.727
  Compound: 0.743

Fan Tweet Sentiments:
  [Positive] What an incredible performance! Bruno is world...
  [Negative] Terrible refereeing, we were robbed!...
  [Positive] Dominant display, deserved the win...
  [Negative] Worst game of the season, shocking defending...

Aspect-Based Sentiment

Aspect-based sentiment analysis identifies sentiment toward specific aspects (e.g., defense, attack, referee, specific players).

aspect_sentiment
# Python: Aspect-based sentiment analysis
import spacy
from nltk.sentiment import SentimentIntensityAnalyzer

class AspectSentimentAnalyzer:
    """Analyze sentiment toward specific aspects in football text."""

    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")
        self.sia = SentimentIntensityAnalyzer()

        self.aspects = {
            "attacking": ["attack", "forward", "goal", "score", "shot",
                         "chance", "striker", "winger"],
            "defending": ["defense", "defend", "tackle", "block",
                         "clearance", "backline", "centerback"],
            "referee": ["referee", "ref", "var", "decision", "penalty",
                       "foul", "offside", "card"],
            "goalkeeper": ["goalkeeper", "keeper", "save", "clean sheet",
                          "distribution"],
            "midfield": ["midfield", "passing", "possession", "control",
                        "creativity"]
        }

    def analyze(self, text):
        """Analyze aspect-based sentiment."""
        doc = self.nlp(text)
        sentences = list(doc.sents)

        results = {}

        for aspect, keywords in self.aspects.items():
            aspect_sentences = []

            for sent in sentences:
                sent_text = sent.text.lower()
                if any(kw in sent_text for kw in keywords):
                    aspect_sentences.append(sent.text)

            if aspect_sentences:
                # Average sentiment of aspect-related sentences
                sentiments = [self.sia.polarity_scores(s)["compound"]
                             for s in aspect_sentences]
                results[aspect] = {
                    "mention_count": len(aspect_sentences),
                    "avg_sentiment": sum(sentiments) / len(sentiments),
                    "sentences": aspect_sentences
                }

        return results

# Analyze match report
analyzer = AspectSentimentAnalyzer()
aspect_results = analyzer.analyze(match_report)

print("Aspect-Based Sentiment:")
for aspect, data in aspect_results.items():
    sent_label = "Positive" if data["avg_sentiment"] > 0.05 else \
                 "Negative" if data["avg_sentiment"] < -0.05 else "Neutral"
    print(f"\n{aspect.upper()}:")
    print(f"  Mentions: {data['mention_count']}")
    print(f"  Sentiment: {sent_label} ({data['avg_sentiment']:.3f})")
# R: Aspect-based sentiment
library(tidyverse)
library(tidytext)

# Define aspects
aspects <- list(
  attacking = c("attack", "forward", "goal", "score", "shot", "chance"),
  defending = c("defense", "defend", "tackle", "block", "clearance"),
  referee = c("referee", "ref", "var", "decision", "penalty", "foul"),
  goalkeeper = c("goalkeeper", "keeper", "save", "clean sheet")
)

# Extract aspect sentiment
extract_aspect_sentiment <- function(text, aspects, lexicon) {
  tokens <- tibble(text = text) %>%
    unnest_tokens(word, text)

  results <- map_dfr(names(aspects), function(aspect) {
    aspect_words <- aspects[[aspect]]

    # Count aspect mentions before joining: aspect keywords such as
    # "goal" rarely appear in the sentiment lexicon itself
    mentions <- sum(tokens$word %in% aspect_words)

    # Document-level sentiment (simplified - a fuller version would
    # restrict scoring to sentences containing the aspect words)
    tokens %>%
      inner_join(lexicon, by = "word") %>%
      summarise(
        aspect = aspect,
        mentions = mentions,
        avg_sentiment = mean(value, na.rm = TRUE),
        total_sentiment = sum(value, na.rm = TRUE)
      )
  })

  results
}

aspect_sentiment <- extract_aspect_sentiment(match_report, aspects,
                                            get_sentiments("afinn"))
print(aspect_sentiment)
Output
Aspect-Based Sentiment:

ATTACKING:
  Mentions: 3
  Sentiment: Positive (0.542)

REFEREE:
  Mentions: 1
  Sentiment: Negative (-0.234)

Text Classification

Text classification assigns categories to text. For football, we can classify match reports by outcome, tweets by topic, or articles by sentiment category.

text_classification
# Python: Text classification with scikit-learn and transformers
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import pandas as pd

# Training data
training_data = pd.DataFrame({
    "text": [
        "Dominant performance, clinical finishing, well-deserved win",
        "Disappointing result, missed chances, defensive errors",
        "Hard-fought draw, both teams had chances",
        "Comprehensive victory, outstanding team performance",
        "Embarrassing defeat, poor discipline, manager under pressure",
        "Stalemate in tight encounter, point each",
        "Brilliant attacking display, ruthless in front of goal",
        "Disappointing loss, lacked creativity and cutting edge",
        "Even contest, honors shared in entertaining draw",
        "Emphatic win, dominated from start to finish"
    ],
    "outcome": ["win", "loss", "draw", "win", "loss",
                "draw", "win", "loss", "draw", "win"]
})

# Create TF-IDF + Naive Bayes pipeline
classifier = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=500, stop_words="english")),
    ("clf", MultinomialNB())
])

# Train model
classifier.fit(training_data["text"], training_data["outcome"])

# Predict new reports
new_reports = [
    "Brilliant attacking display, five goals scored",
    "Defensive collapse, humiliating defeat",
    "Tight game, neither team could break the deadlock"
]

predictions = classifier.predict(new_reports)
probabilities = classifier.predict_proba(new_reports)

for text, pred, probs in zip(new_reports, predictions, probabilities):
    print(f"Text: {text[:50]}...")
    print(f"  Prediction: {pred}")
    print(f"  Confidence: {max(probs):.2%}")
    print()

# Using transformers for better accuracy
from transformers import pipeline

# Zero-shot classification (no training needed!)
zero_shot = pipeline("zero-shot-classification")

labels = ["win", "loss", "draw"]
for text in new_reports:
    result = zero_shot(text, labels)
    print(f"Zero-shot: {text[:40]}... -> {result['labels'][0]}")
# R: Text classification with tidymodels
library(tidymodels)
library(textrecipes)

# Sample training data
training_data <- tribble(
  ~text, ~outcome,
  "Dominant performance, clinical finishing, well-deserved win", "win",
  "Disappointing result, missed chances, defensive errors", "loss",
  "Hard-fought draw, both teams had chances", "draw",
  "Comprehensive victory, outstanding team performance", "win",
  "Embarrassing defeat, poor discipline, manager under pressure", "loss",
  "Stalemate in tight encounter, point each", "draw"
)

# Create text features
text_recipe <- recipe(outcome ~ text, data = training_data) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = 100) %>%
  step_tfidf(text)

# Specify model (multinomial, since the outcome has three classes)
text_spec <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification")

text_workflow <- workflow() %>%
  add_recipe(text_recipe) %>%
  add_model(text_spec)

# Fit model
text_fit <- fit(text_workflow, data = training_data)

# Predict on new text
new_reports <- tibble(
  text = c("Brilliant attacking display, five goals scored",
           "Defensive collapse, humiliating defeat")
)

predictions <- predict(text_fit, new_reports)
print(predictions)
Output
Text: Brilliant attacking display, five goals scored...
  Prediction: win
  Confidence: 87.3%

Text: Defensive collapse, humiliating defeat...
  Prediction: loss
  Confidence: 92.1%

Zero-shot: Brilliant attacking display, five goals... -> win

Topic Modeling

Topic modeling discovers themes in large collections of text. For football, this reveals what aspects of matches or players are most discussed.

topic_modeling
# Python: Topic modeling with LDA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

# Sample corpus of match reports
reports = [
    "Manchester United dominated possession with intricate passing movements. The midfield controlled the tempo and created numerous chances.",
    "Liverpool high press caused problems early. Counter-attacks were devastating and the front three linked up brilliantly.",
    "Defensive masterclass from Chelsea. The back four was impenetrable and goalkeeper made several crucial saves.",
    "Tactical battle between the managers. Formation changes mid-game shifted the balance. Set pieces proved decisive.",
    "High-intensity pressing from both teams. Midfield battle was key. Neither side could establish rhythm.",
    "Clinical finishing in the final third. Striker was lethal. Support from wingers was outstanding."
]

# Create document-term matrix
vectorizer = CountVectorizer(max_features=100, stop_words="english")
dtm = vectorizer.fit_transform(reports)

# Fit LDA model
n_topics = 3
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda.fit(dtm)

# Display topics
feature_names = vectorizer.get_feature_names_out()

print("Discovered Topics:")
for topic_idx, topic in enumerate(lda.components_):
    top_words_idx = topic.argsort()[:-6:-1]
    top_words = [feature_names[i] for i in top_words_idx]
    print(f"\nTopic {topic_idx + 1}:")
    print(f"  Keywords: {', '.join(top_words)}")

# Assign topics to documents
topic_assignments = lda.transform(dtm)
print("\nDocument-Topic Assignments:")
for i, probs in enumerate(topic_assignments):
    dominant_topic = np.argmax(probs) + 1
    print(f"  Report {i+1}: Topic {dominant_topic} ({max(probs):.1%})")
# R: Topic modeling with LDA
library(topicmodels)
library(tidytext)
library(tm)

# Sample match reports corpus
reports <- c(
  "Manchester United dominated possession with intricate passing movements. The midfield controlled the tempo and created numerous chances.",
  "Liverpool high press caused problems early. Counter-attacks were devastating and the front three linked up brilliantly.",
  "Defensive masterclass from Chelsea. The back four was impenetrable and goalkeeper made several crucial saves.",
  "Tactical battle between the managers. Formation changes mid-game shifted the balance. Set pieces proved decisive."
)

# Create document-term matrix
corpus <- Corpus(VectorSource(reports))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corpus)

# Fit LDA model
lda_model <- LDA(dtm, k = 3, control = list(seed = 42))

# Extract topics
topics <- tidy(lda_model, matrix = "beta")

# Top words per topic
top_terms <- topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  arrange(topic, desc(beta))

print(top_terms)

Text Summarization

Automatic summarization condenses long texts into key points. This is useful for generating match summaries from detailed reports or social media streams.

summarization
# Python: Text summarization
from transformers import pipeline
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

# Long match report
long_report = """Manchester United secured their place in the FA Cup semi-finals
with a hard-fought 2-1 victory over Liverpool at Old Trafford on Sunday.
Bruno Fernandes opened the scoring in the 23rd minute with a spectacular
free-kick that left Alisson rooted to the spot. Liverpool responded well
and dominated possession before half-time. Mohamed Salah converted from
the penalty spot in the 58th minute after Marcus Rashford was adjudged to
have handled in the area following a VAR review. The decision proved
controversial with replays showing minimal contact. United regrouped and
pushed for a winner in the closing stages. Marcus Rashford completed his
redemption arc with a clinical finish in the 89th minute to send Old
Trafford into raptures and book United a Wembley date."""

# Extractive summarization with TextRank
parser = PlaintextParser.from_string(long_report, Tokenizer("english"))
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, sentences_count=2)

print("Extractive Summary (TextRank):")
for sentence in summary:
    print(f"  - {sentence}")

# Abstractive summarization with transformers
summarizer_t5 = pipeline("summarization", model="t5-small")

abstractive_summary = summarizer_t5(long_report,
                                    max_length=80,
                                    min_length=30,
                                    do_sample=False)

print("\nAbstractive Summary (T5):")
print(f"  {abstractive_summary[0]['summary_text']}")
# R: Extractive summarization
library(textrank)
library(udpipe)

# Load language model
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

# Long match report
long_report <- "Manchester United secured their place in the FA Cup semi-finals
with a hard-fought 2-1 victory over Liverpool at Old Trafford on Sunday.
Bruno Fernandes opened the scoring in the 23rd minute with a spectacular
free-kick that left Alisson rooted to the spot. Liverpool responded well
and dominated possession before half-time. Mohamed Salah converted from
the penalty spot in the 58th minute after Marcus Rashford was adjudged to
have handled in the area following a VAR review. The decision proved
controversial with replays showing minimal contact. United regrouped and
pushed for a winner in the closing stages. Marcus Rashford completed his
redemption arc with a clinical finish in the 89th minute to send Old
Trafford into raptures and book United a Wembley date."

# Annotate text
annotated <- udpipe_annotate(ud_model, long_report)
annotated_df <- as.data.frame(annotated)

# TextRank for extractive summarization: textrank_sentences() expects
# one row per sentence (textrank_id, sentence) plus a terminology
# table keyed by the same id
sentences_df <- unique(annotated_df[, c("sentence_id", "sentence")])
colnames(sentences_df) <- c("textrank_id", "sentence")

terminology <- annotated_df %>%
  filter(upos %in% c("NOUN", "VERB", "ADJ")) %>%
  select(textrank_id = sentence_id, lemma)

summary_model <- textrank_sentences(data = sentences_df,
                                    terminology = terminology)

# Get top-ranked sentences
top_sentences <- summary_model$sentences %>%
  arrange(desc(textrank)) %>%
  head(3)

cat("Summary:\n")
cat(paste(top_sentences$sentence, collapse = " "))
Output
Extractive Summary (TextRank):
  - Manchester United secured their place in the FA Cup semi-finals with a hard-fought 2-1 victory over Liverpool at Old Trafford.
  - Marcus Rashford completed his redemption arc with a clinical finish in the 89th minute.

Abstractive Summary (T5):
  Manchester United beat Liverpool 2-1 to reach the FA Cup semi-finals. Bruno Fernandes and Marcus Rashford scored for the hosts.

Social Media Analysis

Social media provides real-time fan reactions and discourse. Analyzing this data reveals public sentiment, trending topics, and emerging narratives around football events.

social_media
# Python: Social media analysis
import pandas as pd
import re
from collections import Counter
from nltk.sentiment import SentimentIntensityAnalyzer

# Simulated tweet data
tweets = pd.DataFrame({
    "text": [
        "WHAT A GOAL!!! Bruno you absolute legend! #MUFC",
        "VAR is ruining football. That was never a penalty. #robbery",
        "Rashford redemption arc complete. Never doubted him!",
        "We need a new manager ASAP. Tactics were terrible today",
        "3 points! Top 4 still alive! Come on United!",
        "Liverpool robbed. This is corruption. #YNWA",
        "Ten Hag masterclass! The tactics were perfect today #MUFC",
        "Another disappointing result. When will it end? #LFC"
    ],
    "team": ["Man United", "Liverpool", "Man United", "Man United",
             "Man United", "Liverpool", "Man United", "Liverpool"]
})

# Extract hashtags
def extract_hashtags(text):
    return re.findall(r"#\w+", text)

all_hashtags = []
for text in tweets["text"]:
    all_hashtags.extend(extract_hashtags(text))

hashtag_freq = Counter(all_hashtags)
print("Top Hashtags:")
for tag, count in hashtag_freq.most_common(5):
    print(f"  {tag}: {count}")

# Sentiment analysis
sia = SentimentIntensityAnalyzer()
tweets["sentiment"] = tweets["text"].apply(
    lambda x: sia.polarity_scores(x)["compound"]
)

# Aggregate by team
team_sentiment = tweets.groupby("team").agg({
    "sentiment": ["mean", "std", "count"]
}).round(3)

print("\nSentiment by Team:")
print(team_sentiment)

# Identify key themes
def identify_themes(texts, n_keywords=5):
    """Extract most common non-stopword terms."""
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(max_features=20, stop_words="english")
    tfidf = vectorizer.fit_transform(texts)

    # Get average TF-IDF scores
    avg_scores = tfidf.mean(axis=0).A1
    feature_names = vectorizer.get_feature_names_out()

    top_idx = avg_scores.argsort()[-n_keywords:][::-1]
    return [(feature_names[i], avg_scores[i]) for i in top_idx]

themes = identify_themes(tweets["text"])
print("\nKey Themes:")
for word, score in themes:
    print(f"  {word}: {score:.3f}")
# R: Social media analysis
library(tidyverse)
library(tidytext)
library(rtweet)

# Note: Twitter API access requires authentication
# auth <- rtweet_app()

# Simulated tweet data
tweets <- tibble(
  text = c(
    "WHAT A GOAL!!! Bruno you absolute legend! #MUFC",
    "VAR is ruining football. That was never a penalty. #robbery",
    "Rashford redemption arc complete. Never doubted him!",
    "We need a new manager ASAP. Tactics were terrible today",
    "3 points! Top 4 still alive! Come on United!",
    "Liverpool robbed. This is corruption. #YNWA"
  ),
  team = c("Man United", "Liverpool", "Man United",
           "Man United", "Man United", "Liverpool"),
  timestamp = Sys.time() - (1:6) * 60
)

# Hashtag extraction
extract_hashtags <- function(text) {
  hashtags <- str_extract_all(text, "#\\w+")
  unlist(hashtags)
}

all_hashtags <- unlist(sapply(tweets$text, extract_hashtags))
hashtag_freq <- table(all_hashtags) %>% sort(decreasing = TRUE)
print(head(hashtag_freq))

# Sentiment by team
team_sentiment <- tweets %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(team) %>%
  summarise(
    avg_sentiment = mean(value),
    tweet_count = n()
  )

print(team_sentiment)
Output
Top Hashtags:
  #MUFC: 2
  #robbery: 1
  #YNWA: 1
  #LFC: 1

Sentiment by Team:
              sentiment
              mean   std count
Man United   0.312 0.456    5
Liverpool   -0.189 0.312    3

Key Themes:
  goal: 0.234
  var: 0.198
  tactics: 0.167

Large Language Models for Football

Large Language Models (LLMs) like GPT-4, Claude, and LLaMA have revolutionized NLP capabilities. For football analytics, they enable advanced question answering, report generation, and conversational interfaces.

LLM Applications
  • Automated match report generation
  • Tactical analysis from text descriptions
  • Player comparison narratives
  • Scouting report summarization
  • Conversational analytics interfaces
  • Multi-language translation of reports
Considerations
  • Hallucination risk with statistics
  • API costs for high-volume usage
  • Latency for real-time applications
  • Model knowledge cutoff dates
  • Need for fact-checking outputs
llm_football
# Python: LLM integration with OpenAI
import openai
from dataclasses import dataclass
from typing import List
import pandas as pd

@dataclass
class MatchEvent:
    minute: int
    event_type: str
    description: str

class FootballReportGenerator:
    """Generate football reports using LLMs."""

    def __init__(self, api_key: str):
        openai.api_key = api_key
        self.model = "gpt-4"

    def generate_match_report(self, events: List[MatchEvent],
                             home_team: str, away_team: str,
                             score: str) -> str:
        """Generate a professional match report."""

        event_text = "\n".join([
            f"{e.minute}' - {e.event_type}: {e.description}"
            for e in sorted(events, key=lambda x: x.minute)
        ])

        prompt = f"""Write a professional 150-word match report for
{home_team} vs {away_team} ({score}).

Key events:
{event_text}

Focus on narrative flow, key moments, and tactical observations.
Write in present tense for immediacy."""

        response = openai.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
            temperature=0.7
        )

        return response.choices[0].message.content

    def analyze_tactical_description(self, text: str) -> dict:
        """Extract tactical insights from text description."""

        prompt = f"""Analyze this tactical description and extract:
1. Formation mentioned
2. Key tactical patterns (pressing, counter-attack, etc.)
3. Player roles highlighted
4. Strengths and weaknesses identified

Text: {text}

Return as structured JSON."""

        response = openai.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=400,
            response_format={"type": "json_object"}
        )

        import json
        return json.loads(response.choices[0].message.content)

    def generate_scouting_summary(self, player_stats: dict,
                                  match_reports: List[str]) -> str:
        """Generate a scouting summary from stats and reports."""

        prompt = f"""Create a scouting summary for this player.

Statistics:
- Goals: {player_stats.get("goals", 0)}
- Assists: {player_stats.get("assists", 0)}
- Pass completion: {player_stats.get("pass_pct", 0)}%
- Minutes played: {player_stats.get("minutes", 0)}

Match report excerpts:
{chr(10).join(match_reports[:3])}

Write a 100-word assessment covering:
1. Key strengths
2. Areas for improvement
3. Potential fit in different systems"""

        response = openai.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=200
        )

        return response.choices[0].message.content

# Example usage (requires API key)
# generator = FootballReportGenerator("your-api-key")
#
# events = [
#     MatchEvent(23, "GOAL", "Bruno Fernandes free kick"),
#     MatchEvent(58, "GOAL", "Salah penalty"),
#     MatchEvent(89, "GOAL", "Rashford header")
# ]
#
# report = generator.generate_match_report(
#     events, "Man United", "Liverpool", "2-1"
# )
# print(report)
# R: LLM integration with ellmer package
library(ellmer)
library(tidyverse)

# Configure LLM client (using OpenAI API)
# Sys.setenv(OPENAI_API_KEY = "your-api-key")

# Generate match report from event data
generate_match_report <- function(events_df, home_team, away_team, score) {
    # Prepare event summary
    event_text <- events_df %>%
        arrange(minute) %>%
        mutate(event_str = paste0(minute, "' - ", event_type, ": ", description)) %>%
        pull(event_str) %>%
        paste(collapse = "\n")

    prompt <- paste0(
        "Write a professional 150-word match report for ",
        home_team, " vs ", away_team, " (", score, ").\n\n",
        "Key events:\n", event_text, "\n\n",
        "Focus on the narrative flow, key moments, and tactical observations."
    )

    # Call the LLM (ellmer uses provider-specific constructors)
    llm <- chat_openai(model = "gpt-4")
    llm$chat(prompt)
}

# Example events
events <- tribble(
    ~minute, ~event_type, ~description,
    23, "GOAL", "Bruno Fernandes free kick (1-0)",
    45, "YELLOW", "Casemiro foul on Henderson",
    58, "GOAL", "Salah penalty (1-1)",
    78, "SUB", "Rashford on for Antony",
    89, "GOAL", "Rashford header (2-1)"
)

# Generate report (commented out - requires API key)
# report <- generate_match_report(events, "Man United", "Liverpool", "2-1")
# cat(report)
Output
Generated Match Report:
Manchester United secured a dramatic 2-1 victory over Liverpool in a
pulsating encounter at Old Trafford. Bruno Fernandes set the tone
early, curling a magnificent free-kick into the top corner on 23 minutes
to give the hosts the lead. Liverpool pushed for an equalizer and
found it through Mohamed Salah from the penalty spot just before the
hour mark. With the game seemingly heading for a draw, Marcus Rashford
rose highest to power home a header in the 89th minute, sending the
home fans into raptures and condemning Liverpool to defeat.
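One of the considerations listed above is hallucination risk with statistics: an LLM may invent minutes or scorelines not present in the event data. A minimal sketch of a post-generation check (the `check_report_numbers` helper is illustrative, not part of any library) verifies that minute references in a generated report are backed by the source events:

```python
import re

def check_report_numbers(report: str, events: list) -> dict:
    """Flag minute references in a generated report with no source event."""
    known_minutes = {e["minute"] for e in events}
    # Match "23 minutes", "89th minute", etc.
    mentioned = {int(m) for m in
                 re.findall(r"\b(\d{1,3})(?:st|nd|rd|th)? minute", report)}
    return {
        "supported": sorted(mentioned & known_minutes),
        "unsupported": sorted(mentioned - known_minutes),
    }

events = [{"minute": 23}, {"minute": 58}, {"minute": 89}]
report = ("Fernandes scored on 23 minutes and Rashford struck in the "
          "89th minute, after Salah's 75th minute penalty.")
print(check_report_numbers(report, events))  # unsupported: [75]
```

Any "unsupported" minute is a candidate hallucination to reject or regenerate; richer checks would also compare player names and scorelines.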

RAG for Football Q&A

Retrieval-Augmented Generation (RAG) combines LLMs with document retrieval for accurate, grounded responses. This is essential for football Q&A systems that need factual accuracy.

rag_football
# Python: RAG system for football Q&A
from sentence_transformers import SentenceTransformer
import numpy as np
import openai
from typing import List

class FootballRAG:
    """RAG system for football question answering."""

    def __init__(self, api_key: str):
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")
        openai.api_key = api_key
        self.documents = []
        self.embeddings = None

    def add_documents(self, documents: List[str]):
        """Add documents to the knowledge base."""
        self.documents.extend(documents)
        self.embeddings = self.embedder.encode(self.documents)

    def retrieve(self, query: str, top_k: int = 3) -> List[str]:
        """Retrieve relevant documents for a query."""
        query_emb = self.embedder.encode([query])[0]

        # Cosine similarity
        similarities = np.dot(self.embeddings, query_emb) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query_emb)
        )

        top_idx = np.argsort(similarities)[-top_k:][::-1]
        return [self.documents[i] for i in top_idx]

    def answer(self, question: str) -> str:
        """Answer a question using RAG."""
        # Retrieve relevant documents
        context = self.retrieve(question, top_k=3)
        context_text = "\n".join(f"- {doc}" for doc in context)

        prompt = f"""Answer the following question using ONLY the provided context.
If the answer is not in the context, say "I do not have that information."

Context:
{context_text}

Question: {question}

Answer:"""

        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=150
        )

        return response.choices[0].message.content

# Example usage
football_docs = [
    "Manchester United uses a 4-2-3-1 formation under Erik ten Hag.",
    "Bruno Fernandes leads the team in assists with 12 this season.",
    "Old Trafford has a capacity of 74,310 making it the largest club stadium.",
    "Marcus Rashford has scored 15 goals in all competitions.",
    "The current captain is Bruno Fernandes, appointed in 2023.",
    "Luke Shaw and Lisandro Martinez form a solid defensive partnership.",
    "Andre Onana joined from Inter Milan as the new goalkeeper."
]

# rag = FootballRAG("your-api-key")
# rag.add_documents(football_docs)
# answer = rag.answer("Who is the captain and how many assists do they have?")
# print(answer)
# R: RAG system for football Q&A
library(text)
library(tidyverse)

# Simple vector store implementation
create_football_kb <- function(documents) {
    # Embed documents
    embeddings <- textEmbed(documents, model = "all-MiniLM-L6-v2")

    list(
        documents = documents,
        embeddings = embeddings$text$texts
    )
}

# Retrieve relevant documents
retrieve_docs <- function(query, kb, top_k = 3) {
    # Embed query
    query_emb <- textEmbed(query, model = "all-MiniLM-L6-v2")

    # Calculate similarities
    similarities <- sapply(1:nrow(kb$embeddings), function(i) {
        sum(query_emb$text$texts * kb$embeddings[i,]) /
            (sqrt(sum(query_emb$text$texts^2)) * sqrt(sum(kb$embeddings[i,]^2)))
    })

    # Return top documents
    top_idx <- order(similarities, decreasing = TRUE)[1:top_k]
    kb$documents[top_idx]
}

# Example knowledge base
football_docs <- c(
    "Manchester United uses a 4-2-3-1 formation under Erik ten Hag.",
    "Bruno Fernandes leads the team in assists with 12 this season.",
    "Old Trafford has a capacity of 74,310 making it the largest club stadium in England.",
    "Marcus Rashford has scored 15 goals in all competitions.",
    "The current captain is Bruno Fernandes, appointed in 2023."
)

# Query the system (conceptual - requires full setup)
# kb <- create_football_kb(football_docs)
# relevant <- retrieve_docs("Who is the captain?", kb)
Output
Question: Who is the captain and how many assists do they have?

Answer: The captain is Bruno Fernandes, who was appointed in 2023.
He leads the team in assists with 12 this season.
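The retriever above embeds each document whole, which works for short facts but not for full-length scouting reports or match reports. A common preprocessing step is to split long documents into overlapping chunks before embedding; here is a minimal sketch (the `chunk_document` helper and its word-window parameters are illustrative, not from any library):

```python
def chunk_document(text: str, max_words: int = 50, overlap: int = 10) -> list:
    """Split a document into overlapping word windows for embedding."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks

# Tiny windows to make the overlap visible
print(chunk_document("a b c d e f g h", max_words=4, overlap=2))
# → ['a b c d', 'c d e f', 'e f g h']
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk; each chunk is then embedded and stored exactly like a document in the `FootballRAG` class above.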

Live Commentary Analysis

Live commentary data provides real-time text descriptions of match events. Analyzing this stream enables automatic event detection, excitement measurement, and narrative tracking.

commentary_analysis
# Python: Live commentary analysis
import pandas as pd
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CommentaryEvent:
    minute: int
    text: str
    event_type: str
    excitement: float
    entities: List[str]

class CommentaryAnalyzer:
    """Analyze live match commentary."""

    def __init__(self):
        # More specific patterns first: detection returns the first match
        self.event_patterns = {
            "penalty": r"penalty|spot kick|VAR.*penalty",
            "goal": r"GO+A+L|scores|header|finish|nets|taps in",
            "save": r"save|keeps it out|denied|parries",
            "substitution": r"comes on|replaces|substitution|off for",
            "card": r"yellow card|red card|booked|sent off|caution",
            "chance": r"chance|close|almost|nearly|wide|over the bar",
            "corner": r"corner|flag kick",
            "foul": r"foul|brings down|trips"
        }

    def detect_event(self, text: str) -> str:
        """Detect event type from commentary text."""
        for event_type, pattern in self.event_patterns.items():
            if re.search(pattern, text, re.IGNORECASE):
                return event_type

        return "passage"

    def calculate_excitement(self, text: str) -> float:
        """Calculate excitement level of commentary."""
        # Count excitement indicators
        exclamations = text.count("!")
        caps_ratio = sum(1 for c in text if c.isupper()) / max(len(text), 1)
        extended_vowels = len(re.findall(r"[aeiouAEIOU]{2,}", text))

        # Excitement words
        excitement_words = ["brilliant", "magnificent", "incredible",
                          "stunning", "amazing", "unbelievable"]
        excitement_count = sum(1 for w in excitement_words if w in text.lower())

        score = (exclamations * 0.2 +
                caps_ratio * 3 +
                extended_vowels * 0.15 +
                excitement_count * 0.3)

        return min(score, 1.0)

    def extract_entities(self, text: str) -> List[str]:
        """Extract player and team names from commentary."""
        # Simple pattern matching (would use NER in production)
        words = text.split()
        # Assume capitalized words in middle of sentence are entities
        entities = [w for w in words if w[0].isupper() and
                   not w.endswith(".") and not w.endswith(",")]
        return entities

    def analyze_stream(self, commentary: List[dict]) -> List[CommentaryEvent]:
        """Analyze a stream of commentary."""
        events = []

        for item in commentary:
            event = CommentaryEvent(
                minute=item["minute"],
                text=item["text"],
                event_type=self.detect_event(item["text"]),
                excitement=self.calculate_excitement(item["text"]),
                entities=self.extract_entities(item["text"])
            )
            events.append(event)

        return events

    def get_match_narrative(self, events: List[CommentaryEvent]) -> dict:
        """Extract match narrative from commentary events."""
        key_moments = [e for e in events if e.excitement > 0.3]
        goals = [e for e in events if e.event_type == "goal"]

        return {
            "key_moments": [(e.minute, e.text) for e in key_moments],
            "goals": [(e.minute, e.text) for e in goals],
            "avg_excitement": sum(e.excitement for e in events) / len(events),
            "peak_minute": max(events, key=lambda e: e.excitement).minute
        }

# Example usage
commentary_data = [
    {"minute": 1, "text": "Kick-off! United get us underway at Old Trafford."},
    {"minute": 12, "text": "Good pressing from Liverpool, forcing United back."},
    {"minute": 23, "text": "GOOOAAAAL! Fernandes with a magnificent free kick! 1-0!"},
    {"minute": 35, "text": "Chance for Salah but Onana makes the save."},
    {"minute": 58, "text": "PENALTY! VAR checking... Salah scores! 1-1!"},
    {"minute": 89, "text": "GOOOOAAAAL! Rashford heads home! 2-1! What a finish!"}
]

analyzer = CommentaryAnalyzer()
events = analyzer.analyze_stream(commentary_data)

print("Commentary Analysis:")
for e in events:
    if e.excitement > 0.2:
        print(f"{e.minute}' [{e.event_type}] (excitement: {e.excitement:.2f})")
        print(f"   {e.text}")

narrative = analyzer.get_match_narrative(events)
print(f"\nPeak excitement at minute: {narrative['peak_minute']}")
# R: Live commentary analysis
library(tidyverse)
library(tidytext)

# Sample commentary data
commentary <- tribble(
    ~minute, ~text,
    1, "Kick-off! United get us underway at Old Trafford.",
    12, "Good pressing from Liverpool, forcing United back.",
    23, "GOOOAAAAL! Fernandes with a magnificent free kick! 1-0!",
    35, "Chance for Salah but Onana makes the save.",
    45, "Half-time: Manchester United 1-0 Liverpool",
    58, "PENALTY! VAR checking... and it stands. Salah scores. 1-1!",
    78, "Rashford comes on for Antony. Fresh legs in attack.",
    89, "GOOOOAAAAL! Rashford heads home! 2-1! What a finish!",
    90, "Full-time: Manchester United 2-1 Liverpool"
)

# Detect events from commentary
detect_events <- function(text) {
    patterns <- list(
        goal = "GO+A+L|scores|header|finish|heads home",
        penalty = "penalty|spot kick|VAR.*penalty",
        save = "save|keeps it out|denied",
        substitution = "comes on|replaces|substitution",
        card = "yellow card|red card|booked|sent off",
        chance = "chance|close|almost|nearly"
    )

    events <- names(patterns)[sapply(patterns, function(p) {
        grepl(p, text, ignore.case = TRUE)
    })]

    if (length(events) == 0) "passage" else events
}

# Calculate excitement level
excitement_score <- function(text) {
    # Indicators of excitement
    exclamation_count <- str_count(text, "!")
    caps_ratio <- sum(str_count(text, "[A-Z]")) / nchar(text)
    extended_vowels <- str_count(text, "[aeiouAEIOU]{2,}")

    # Weighted score
    score <- exclamation_count * 0.3 + caps_ratio * 5 + extended_vowels * 0.2
    min(score, 1)  # Cap at 1
}

# Analyze commentary
commentary_analysis <- commentary %>%
    rowwise() %>%
    mutate(
        event_type = list(detect_events(text)),
        excitement = excitement_score(text)
    ) %>%
    ungroup()

# Key moments (high excitement)
key_moments <- commentary_analysis %>%
    filter(excitement > 0.3)

print(key_moments %>% select(minute, excitement, text))
Output
Commentary Analysis:
23' [goal] (excitement: 0.72)
   GOOOAAAAL! Fernandes with a magnificent free kick! 1-0!
58' [penalty] (excitement: 0.45)
   PENALTY! VAR checking... Salah scores! 1-1!
89' [goal] (excitement: 0.68)
   GOOOOAAAAL! Rashford heads home! 2-1! What a finish!

Peak excitement at minute: 23
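Beyond single key moments, the per-event excitement scores can be aggregated into a timeline to track match momentum. A minimal sketch (the `excitement_timeline` helper and its 15-minute window are illustrative choices, not from any library):

```python
def excitement_timeline(scores, window=15):
    """Average excitement per fixed-width minute window.

    `scores` is a list of (minute, excitement) pairs, e.g. from a
    commentary analyzer like the one above.
    """
    if not scores:
        return {}
    max_minute = max(m for m, _ in scores)
    timeline = {}
    for start in range(0, max_minute + 1, window):
        in_window = [e for m, e in scores if start <= m < start + window]
        if in_window:
            timeline[start] = round(sum(in_window) / len(in_window), 2)
    return timeline

scores = [(1, 0.1), (12, 0.1), (23, 0.7), (35, 0.2), (58, 0.5), (89, 0.7)]
print(excitement_timeline(scores))
```

Plotting these windowed averages gives a momentum curve of the match, with spikes around goals and late drama.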

Transfer Rumor Analysis

Transfer rumors generate enormous amounts of text across news sites, social media, and forums. NLP helps track rumor reliability, gauge sentiment around potential transfers, and aggregate information from multiple sources.

transfer_rumors
# Python: Transfer rumor analysis
import pandas as pd
import re
from collections import defaultdict
from datetime import datetime

class TransferRumorTracker:
    """Track and analyze transfer rumors."""

    # Source reliability tiers (1 = most reliable)
    SOURCE_TIERS = {
        "Fabrizio Romano": 1,
        "BBC Sport": 1,
        "The Athletic": 1,
        "Sky Sports": 2,
        "ESPN": 2,
        "Daily Mail": 3,
        "The Sun": 4,
        "Random Twitter": 5
    }

    def __init__(self):
        self.rumors = []

    def add_rumor(self, source: str, text: str, timestamp: datetime = None):
        """Add a transfer rumor."""
        tier = self.SOURCE_TIERS.get(source, 4)
        details = self._extract_details(text)

        self.rumors.append({
            "source": source,
            "text": text,
            "tier": tier,
            "reliability": 1 / tier,
            "timestamp": timestamp or datetime.now(),
            **details
        })

    def _extract_details(self, text: str) -> dict:
        """Extract transfer details from text."""
        # Player name (simple heuristic)
        name_match = re.search(r"([A-Z][a-z]+ [A-Z][a-z]+)", text)
        player = name_match.group(1) if name_match else None

        # Fee extraction
        fee_match = re.search(r"[£€$]?(\d+)m|(\d+) million", text, re.I)
        fee = int(fee_match.group(1) or fee_match.group(2)) if fee_match else None

        # Status keywords
        status_map = {
            "done": ["done", "complete", "agreed", "confirmed"],
            "close": ["close", "imminent", "finalizing", "medical"],
            "advanced": ["advanced", "talks", "negotiating"],
            "interest": ["interest", "considering", "monitoring"],
            "contact": ["contact", "enquiry", "initial"],
            "rejected": ["rejected", "turned down", "failed"]
        }

        text_lower = text.lower()
        status = "unknown"
        for s, keywords in status_map.items():
            if any(kw in text_lower for kw in keywords):
                status = s
                break

        # Clubs
        clubs = re.findall(r"(?:Manchester|Real|Bayern|Barcelona|Chelsea|"
                          r"Arsenal|Liverpool|Juventus)[^,]*", text)

        return {
            "player": player,
            "fee": fee,
            "status": status,
            "clubs_mentioned": clubs
        }

    def get_player_summary(self, player_name: str) -> dict:
        """Get aggregated summary for a player."""
        # Match on the rumor text itself: the extracted "player" field is
        # unreliable (e.g. "Manchester United" can be captured as a name)
        surname = player_name.split()[-1].lower()
        player_rumors = [r for r in self.rumors if surname in r["text"].lower()]

        if not player_rumors:
            return {"player": player_name, "rumors": 0}

        # Calculate weighted confidence
        total_reliability = sum(r["reliability"] for r in player_rumors)
        tier1_count = sum(1 for r in player_rumors if r["tier"] == 1)

        # Most common status weighted by reliability
        status_scores = defaultdict(float)
        for r in player_rumors:
            status_scores[r["status"]] += r["reliability"]
        likely_status = max(status_scores, key=status_scores.get)

        # Fee range
        fees = [r["fee"] for r in player_rumors if r["fee"]]
        fee_range = (min(fees), max(fees)) if fees else None

        return {
            "player": player_name,
            "total_rumors": len(player_rumors),
            "tier1_sources": tier1_count,
            "weighted_confidence": total_reliability / len(player_rumors),
            "likely_status": likely_status,
            "fee_range": fee_range,
            "sources": list(set(r["source"] for r in player_rumors))
        }

    def credibility_score(self, player_name: str) -> float:
        """Calculate overall credibility of transfer rumors."""
        summary = self.get_player_summary(player_name)

        if summary["total_rumors"] == 0:
            return 0.0

        # Factors: tier1 sources, consistency, volume
        tier1_factor = min(summary["tier1_sources"] / 2, 1.0) * 0.5
        volume_factor = min(summary["total_rumors"] / 5, 1.0) * 0.3
        confidence_factor = summary["weighted_confidence"] * 0.2

        return tier1_factor + volume_factor + confidence_factor

# Example usage
tracker = TransferRumorTracker()

# Add rumors
tracker.add_rumor("BBC Sport",
    "Manchester United in advanced talks with Bayern Munich for Joshua Kimmich")
tracker.add_rumor("The Athletic",
    "United considering move for Kimmich as midfield priority")
tracker.add_rumor("Daily Mail",
    "EXCLUSIVE: Ten Hag demands £80m for Kimmich deal")
tracker.add_rumor("Fabrizio Romano",
    "Joshua Kimmich situation: United have made initial contact. Long way to go.")

# Get summary
summary = tracker.get_player_summary("Joshua Kimmich")
credibility = tracker.credibility_score("Joshua Kimmich")

print("Transfer Rumor Summary:")
print(f"  Player: {summary['player']}")
print(f"  Total rumors: {summary['total_rumors']}")
print(f"  Tier 1 sources: {summary['tier1_sources']}")
print(f"  Likely status: {summary['likely_status']}")
print(f"  Credibility score: {credibility:.2f}")
# R: Transfer rumor analysis
library(tidyverse)
library(tidytext)

# Sample transfer rumors
rumors <- tribble(
    ~source, ~text, ~reliability_tier,
    "BBC Sport", "Manchester United in advanced talks with Bayern Munich for Joshua Kimmich", 1,
    "The Athletic", "United considering move for Kimmich as midfield priority", 1,
    "Daily Mail", "EXCLUSIVE: Ten Hag demands £80m for Kimmich deal", 3,
    "Random Twitter", "Hearing Kimmich to United is DONE. Medical tomorrow!", 5,
    "Romano", "Joshua Kimmich situation: United have made initial contact. Long way to go.", 1
)

# Extract transfer components
extract_transfer_details <- function(text) {
    # Player pattern (capitalized names)
    player <- str_extract(text, "[A-Z][a-z]+ [A-Z][a-z]+")

    # Fee pattern
    fee <- str_extract(text, "£[0-9]+m|[0-9]+m|[0-9]+ million")

    # Status indicators
    status_words <- c("done", "close", "considering", "interest",
                     "contact", "talks", "bid", "rejected")
    status <- status_words[sapply(status_words, function(w)
        grepl(w, text, ignore.case = TRUE))][1]

    list(player = player, fee = fee, status = status)
}

# Reliability-weighted aggregation
aggregate_rumors <- function(rumors_df) {
    rumors_df %>%
        rowwise() %>%
        mutate(details = list(extract_transfer_details(text))) %>%
        unnest_wider(details) %>%
        mutate(
            reliability_score = 1 / reliability_tier,
            status_weight = reliability_score
        ) %>%
        group_by(player) %>%
        summarise(
            source_count = n(),
            tier1_sources = sum(reliability_tier == 1),
            weighted_confidence = sum(reliability_score) / n(),
            most_likely_status = first(status[which.max(reliability_score)])
        )
}

result <- aggregate_rumors(rumors)
print(result)
Output
Transfer Rumor Summary:
  Player: Joshua Kimmich
  Total rumors: 4
  Tier 1 sources: 3
  Likely status: advanced
  Credibility score: 0.91
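Rumor reliability also decays with time: an unconfirmed tier-1 report from two weeks ago deserves less weight than a fresh one. One simple approach, sketched here with an illustrative half-life parameter (not part of the tracker above), applies exponential decay to the reliability score:

```python
def decayed_weight(reliability: float, age_days: float,
                   half_life_days: float = 7.0) -> float:
    """Source reliability discounted by rumor age.

    The weight halves every `half_life_days`, so stale reports fade out
    of the aggregate instead of being dropped abruptly.
    """
    return reliability * 0.5 ** (age_days / half_life_days)

# A two-week-old tier-1 report (reliability 1.0) ends up worth the same
# as a fresh tier-4 report (reliability 0.25)
print(decayed_weight(1.0, 14))   # 0.25
print(decayed_weight(0.25, 0))   # 0.25
```

Swapping this weight in for the raw `reliability` field when aggregating would make credibility scores respond to how recently the story has been corroborated.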

Press Conference Analysis

Manager press conferences provide insights into team news, tactics, and sentiment. NLP can extract key information, detect emotional states, and identify newsworthy quotes.

press_conference
# Python: Press conference analysis
import re
from dataclasses import dataclass
from typing import List, Tuple
from nltk.sentiment import SentimentIntensityAnalyzer
import spacy

@dataclass
class QAPair:
    question: str
    answer: str
    topic: str
    sentiment: float
    key_quotes: List[str]

class PressConferenceAnalyzer:
    """Analyze manager press conference transcripts."""

    TOPIC_KEYWORDS = {
        "performance": ["performance", "played", "result", "match", "game"],
        "injury": ["injury", "fit", "available", "doubt", "miss", "training"],
        "transfer": ["transfer", "sign", "interest", "target", "move", "deal"],
        "tactics": ["formation", "tactics", "system", "style", "approach"],
        "opponent": ["opponent", "they", "against", "prepare", "respect"],
        "squad": ["squad", "players", "team", "rotation", "selection"]
    }

    def __init__(self):
        self.sia = SentimentIntensityAnalyzer()
        self.nlp = spacy.load("en_core_web_sm")

    def parse_transcript(self, transcript: str, speaker_name: str = "Manager") -> List[QAPair]:
        """Parse transcript into Q&A pairs."""
        qa_pairs = []

        # Walk the transcript line by line, pairing questions with answers
        lines = transcript.strip().split("\n")
        current_q = None

        for line in lines:
            line = line.strip()
            if not line:
                continue

            if line.startswith("Reporter:") or line.startswith("Question:"):
                current_q = line.split(":", 1)[1].strip()
            elif line.startswith(speaker_name + ":") and current_q:
                answer = line.split(":", 1)[1].strip()
                topic = self._classify_topic(current_q + " " + answer)
                sentiment = self.sia.polarity_scores(answer)["compound"]
                quotes = self._extract_quotes(answer)

                qa_pairs.append(QAPair(
                    question=current_q,
                    answer=answer,
                    topic=topic,
                    sentiment=sentiment,
                    key_quotes=quotes
                ))
                current_q = None

        return qa_pairs

    def _classify_topic(self, text: str) -> str:
        """Classify the topic of a Q&A pair."""
        text_lower = text.lower()

        for topic, keywords in self.TOPIC_KEYWORDS.items():
            if any(kw in text_lower for kw in keywords):
                return topic

        return "general"

    def _extract_quotes(self, text: str) -> List[str]:
        """Extract quotable phrases from answer."""
        doc = self.nlp(text)
        quotes = []

        # Look for emphatic statements
        for sent in doc.sents:
            sent_text = sent.text.strip()
            # Criteria for quotable: contains superlatives or strong opinions
            if any(word in sent_text.lower() for word in
                   ["world class", "exceptional", "brilliant", "disappointed",
                    "unacceptable", "proud", "important", "crucial"]):
                quotes.append(sent_text)

        return quotes

    def extract_team_news(self, qa_pairs: List[QAPair]) -> dict:
        """Extract team news (injuries, availability) from conference."""
        injury_qa = [qa for qa in qa_pairs if qa.topic == "injury"]

        news = {
            "available": [],
            "doubtful": [],
            "out": []
        }

        for qa in injury_qa:
            # Extract player names with status
            doc = self.nlp(qa.answer)
            for ent in doc.ents:
                if ent.label_ == "PERSON":
                    context = qa.answer[max(0, ent.start_char-20):ent.end_char+30].lower()

                    if any(w in context for w in ["available", "fit", "ready", "trained"]):
                        news["available"].append(ent.text)
                    elif any(w in context for w in ["doubt", "assessment", "see"]):
                        news["doubtful"].append(ent.text)
                    elif any(w in context for w in ["out", "miss", "injury", "ruled"]):
                        news["out"].append(ent.text)

        return news

    def generate_summary(self, qa_pairs: List[QAPair]) -> str:
        """Generate a summary of key points from press conference."""
        topics = {}
        for qa in qa_pairs:
            if qa.topic not in topics:
                topics[qa.topic] = []
            topics[qa.topic].append(qa)

        summary_lines = []

        for topic, qas in topics.items():
            avg_sentiment = sum(qa.sentiment for qa in qas) / len(qas)
            sentiment_label = "positive" if avg_sentiment > 0.1 else \
                            "negative" if avg_sentiment < -0.1 else "neutral"

            quotes = [q for qa in qas for q in qa.key_quotes]

            summary_lines.append(f"**{topic.title()}** ({sentiment_label})")
            if quotes:
                summary_lines.append(f'  Key quote: "{quotes[0]}"')

        return "\n".join(summary_lines)

# Example usage
transcript = """
Reporter: How do you assess the performance today?

Ten Hag: I think we showed great character. The first half was not good enough, we gave the ball away too cheaply. But in the second half, we dominated. Bruno was exceptional, his quality on the ball is world class.

Reporter: Is Marcus Rashford fit for the weekend?

Ten Hag: Marcus trained fully yesterday. He is available. We need everyone fit because the schedule is demanding.
"""

analyzer = PressConferenceAnalyzer()
qa_pairs = analyzer.parse_transcript(transcript, "Ten Hag")

print("Press Conference Analysis:")
for qa in qa_pairs:
    print(f"\nTopic: {qa.topic}")
    print(f"Sentiment: {qa.sentiment:.2f}")
    if qa.key_quotes:
        print(f'Key quote: "{qa.key_quotes[0]}"')

summary = analyzer.generate_summary(qa_pairs)
print("\nSummary:")
print(summary)
# R: Press conference analysis
library(tidyverse)
library(tidytext)
library(sentimentr)

# Sample press conference transcript
transcript <- "
Reporter: How do you assess the performance today?

Ten Hag: I think we showed great character. The first half was not good enough,
we gave the ball away too cheaply. But in the second half, we dominated.
Bruno was exceptional, his quality on the ball is world class.

Reporter: Is Marcus Rashford fit for the weekend?

Ten Hag: Marcus trained fully yesterday. He is available. We need everyone
fit because the schedule is demanding. We have seven games in three weeks.

Reporter: There are reports of interest in Joshua Kimmich?

Ten Hag: I do not talk about players from other clubs. We focus on our squad.
"

# Parse Q&A pairs
parse_qa <- function(transcript) {
    lines <- str_split(transcript, "\n")[[1]]
    lines <- lines[lines != ""]

    qa_pairs <- list()
    current_q <- NULL
    current_a <- NULL

    for (line in lines) {
        if (grepl("^Reporter:", line)) {
            current_q <- str_replace(line, "^Reporter: ", "")
            current_a <- NULL
        } else if (grepl("^Ten Hag:", line) && !is.null(current_q)) {
            current_a <- str_replace(line, "^Ten Hag: ", "")
            qa_pairs <- append(qa_pairs, list(list(q = current_q, a = current_a)))
        } else if (!is.null(current_a)) {
            # Answers wrap across lines in the transcript;
            # append continuation text to the most recent pair
            n <- length(qa_pairs)
            qa_pairs[[n]]$a <- paste(qa_pairs[[n]]$a, str_trim(line))
        }
    }

    qa_pairs
}

# Analyze sentiment of answers
analyze_press_sentiment <- function(qa_pairs) {
    answers <- sapply(qa_pairs, function(x) x$a)

    # sentiment() scores individual sentences; sentiment_by() averages
    # over sentences, giving exactly one score per answer
    sentiment_scores <- sentiment_by(answers)

    data.frame(
        # Topics hard-coded to match this example's three questions
        question_topic = c("performance", "injury", "transfer"),
        answer = answers,
        sentiment = sentiment_scores$ave_sentiment
    )
}

# Extract team news
extract_team_news <- function(transcript) {
    # Injury patterns (grouped so the alternation binds as one unit;
    # backslashes must be doubled inside R string literals)
    injury_pattern <- "(injured|doubt|unavailable|miss|ruled out|fitness)"
    available_pattern <- "(fit|available|trained|ready)"

    list(
        injury_mentions = str_extract_all(transcript, paste0("\\w+ \\w+ ", injury_pattern)),
        availability = str_extract_all(transcript, paste0("\\w+ ", available_pattern))
    )
}

qa <- parse_qa(transcript)
sentiment <- analyze_press_sentiment(qa)
print(sentiment)
Output
Press Conference Analysis:

Topic: performance
Sentiment: 0.84
Key quote: "Bruno was exceptional, his quality on the ball is world class."

Topic: injury
Sentiment: 0.42

Summary:
**Performance** (positive)
  Key quote: "Bruno was exceptional, his quality on the ball is world class."
**Injury** (positive)

Multilingual Football NLP

Football is global, and text data comes in many languages. Modern NLP models support multilingual processing for cross-language analysis and translation.

multilingual
# Python: Multilingual NLP with transformers
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
from langdetect import detect
import pandas as pd

# Language detection
reports = [
    "Manchester United won 2-1 against Liverpool in an exciting match.",
    "El Real Madrid goleó 4-0 al Barcelona en el clásico.",
    "Bayern München besiegte Borussia Dortmund mit 3-1.",
    "La Juventus ha vinto 2-0 contro il Milan nella Serie A."
]

print("Language Detection:")
for report in reports:
    lang = detect(report)
    print(f"  [{lang}] {report[:50]}...")

# Multilingual sentiment analysis
# Using multilingual BERT
multilingual_sentiment = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment"
)

print("\nMultilingual Sentiment:")
for report in reports:
    result = multilingual_sentiment(report)[0]
    stars = int(result["label"][0])  # "1 star" to "5 stars"
    print(f"  {stars}/5 stars: {report[:40]}...")

# Translation for unified analysis
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-mul-en")

print("\nTranslated to English:")
for report in reports[1:]:  # Skip English
    translated = translator(report, max_length=100)[0]["translation_text"]
    print(f"  Original: {report[:40]}...")
    print(f"  English:  {translated[:40]}...")
    print()

# Cross-lingual entity extraction
# XLM-RoBERTa for multilingual NER (reusing the pipeline import from above)
multilingual_ner = pipeline(
    "ner",
    model="Davlan/xlm-roberta-base-ner-hrl",
    aggregation_strategy="simple"
)

print("Multilingual Entity Extraction:")
for report in reports:
    entities = multilingual_ner(report)
    teams = [e["word"] for e in entities if e["entity_group"] == "ORG"]
    print(f"  Teams found: {teams}")
# R: Multilingual NLP concepts
library(tidyverse)

# Language detection
detect_language <- function(text) {
    # Using textcat package
    # install.packages("textcat")
    # library(textcat)
    # textcat(text)

    # Simple heuristic based on common words
    lang_markers <- list(
        english = c("the", "and", "is", "was", "with"),
        spanish = c("el", "la", "los", "del", "con"),
        german = c("der", "die", "das", "und", "mit"),
        french = c("le", "la", "les", "de", "avec"),
        italian = c("il", "la", "del", "con", "che")
    )

    words <- tolower(str_split(text, " ")[[1]])

    scores <- sapply(names(lang_markers), function(lang) {
        sum(words %in% lang_markers[[lang]])
    })

    names(which.max(scores))
}

# Sample multilingual reports
reports <- c(
    "Manchester United won 2-1 against Liverpool in an exciting match.",
    "El Real Madrid goleó 4-0 al Barcelona en el clásico.",
    "Bayern München besiegte Borussia Dortmund mit 3-1.",
    "La Juventus ha vinto 2-0 contro il Milan nella Serie A."
)

# Detect languages
for (report in reports) {
    lang <- detect_language(report)
    cat(lang, ":", substr(report, 1, 40), "...\n")
}
Output
Language Detection:
  [en] Manchester United won 2-1 against Liverpool...
  [es] El Real Madrid goleó 4-0 al Barcelona en el cl...
  [de] Bayern München besiegte Borussia Dortmund mit...
  [it] La Juventus ha vinto 2-0 contro il Milan nell...

Multilingual Sentiment:
  4/5 stars: Manchester United won 2-1 against Liver...
  5/5 stars: El Real Madrid goleó 4-0 al Barcelona...
  4/5 stars: Bayern München besiegte Borussia Dortmu...
  4/5 stars: La Juventus ha vinto 2-0 contro il Mila...

Multilingual Entity Extraction:
  Teams found: [Manchester United, Liverpool]
  Teams found: [Real Madrid, Barcelona]
  Teams found: [Bayern München, Borussia Dortmund]
  Teams found: [Juventus, Milan]

Practice Exercises

Exercise 38.1: Entity Extraction

Build a custom NER system that extracts players, teams, and competitions from a collection of match reports. Evaluate accuracy by comparing against manually labeled data.
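One possible starting point is a gazetteer lookup: match known names against the text, then score predictions against a hand-labeled gold set. The `TEAMS` and `PLAYERS` sets and the helper names below are illustrative placeholders, not a real dataset; a full solution would load them from a player database and add pattern-based matching for unseen names.

```python
import re

# Hypothetical gazetteers -- a real system would load these from a database
TEAMS = {"Manchester United", "Liverpool", "Real Madrid", "Barcelona"}
PLAYERS = {"Bruno Fernandes", "Mohamed Salah", "Marcus Rashford"}

def extract_entities(text):
    """Gazetteer lookup: find known team and player names in the text."""
    found = {"TEAM": [], "PLAYER": []}
    for name in TEAMS:
        if re.search(re.escape(name), text):
            found["TEAM"].append(name)
    for name in PLAYERS:
        if re.search(re.escape(name), text):
            found["PLAYER"].append(name)
    return found

def evaluate(predicted, gold):
    """Entity-level precision/recall against manually labeled entities."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

report = "Bruno Fernandes inspired Manchester United against Liverpool."
ents = extract_entities(report)
p, r = evaluate(ents["TEAM"] + ents["PLAYER"],
                ["Manchester United", "Liverpool", "Bruno Fernandes"])
print(ents, p, r)
```

Gazetteers give high precision but miss nicknames and misspellings, which is where the evaluation step earns its keep.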

Exercise 38.2: Fan Sentiment Tracker

Create a sentiment analysis pipeline for fan tweets during a match. Track how sentiment changes over time (before, during, after key events like goals).
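The core of this exercise is bucketing tweets by time before scoring them. The sketch below uses a toy lexicon scorer as a stand-in (`POSITIVE`/`NEGATIVE` are made-up word lists); a real pipeline would swap in VADER or a transformer model for `score()`.

```python
from collections import defaultdict

# Stand-in lexicon scorer; replace with VADER or similar in practice
POSITIVE = {"great", "brilliant", "amazing", "win"}
NEGATIVE = {"awful", "terrible", "poor", "lose"}

def score(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def sentiment_timeline(tweets, bucket_minutes=5):
    """Average sentiment per time bucket; tweets are (minute, text) pairs."""
    buckets = defaultdict(list)
    for minute, text in tweets:
        bucket = minute // bucket_minutes * bucket_minutes
        buckets[bucket].append(score(text))
    return {b: sum(s) / len(s) for b, s in sorted(buckets.items())}

tweets = [
    (1, "What an awful start"),
    (3, "This is terrible"),
    (23, "GOAL! Brilliant free kick, amazing"),
    (24, "Great strike!"),
]
print(sentiment_timeline(tweets))
# sentiment swings negative early, positive around the 23rd-minute goal
```

Aligning the bucket boundaries with match events (goals, cards) then lets you measure sentiment shifts before and after each event.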

Exercise 38.3: Match Report Classifier

Train a text classifier to predict match outcomes (win/loss/draw) from match report text. Compare TF-IDF + traditional ML vs. transformer approaches.
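For the "traditional ML" baseline, a from-scratch multinomial Naive Bayes over word counts makes the mechanics visible before you reach for scikit-learn. The four training reports below are invented examples, and `train_nb`/`predict_nb` are hypothetical helper names:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Multinomial Naive Bayes over word counts; docs are (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict_nb(model, text):
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out a class
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

reports = [
    ("dominant display clinical finishing deserved victory", "win"),
    ("comfortable win superb performance three points", "win"),
    ("poor defending sloppy passing heavy defeat", "loss"),
    ("disappointing loss lacked creativity", "loss"),
]
model = train_nb(reports)
print(predict_nb(model, "superb clinical victory"))  # "win"
```

Compare this baseline's accuracy against a fine-tuned DistilBERT on the same held-out reports to complete the exercise.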

Exercise 38.4: Automated Match Summary

Build a system that generates a 2-3 sentence match summary from event data (goals, cards, key events). Combine structured data with NLG techniques.
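The template-based half of this exercise can be sketched as follows. `summarize_match` and the event dicts are illustrative shapes, assuming every goal event carries the scoring team, player, and minute:

```python
def summarize_match(home, away, events):
    """Fill fixed sentence templates from structured event data (rule-based NLG)."""
    goals = [e for e in events if e["type"] == "goal"]
    hs = sum(1 for g in goals if g["team"] == home)
    as_ = len(goals) - hs  # assumes every goal belongs to one of the two teams
    if hs > as_:
        headline = f"{home} beat {away} {hs}-{as_}."
    elif as_ > hs:
        headline = f"{away} won {as_}-{hs} away at {home}."
    else:
        headline = f"{home} and {away} drew {hs}-{as_}."
    detail = ""
    if goals:
        last = max(goals, key=lambda g: g["minute"])
        detail = f" {last['player']} scored the final goal in minute {last['minute']}."
    return headline + detail

events = [
    {"type": "goal", "team": "Manchester United", "player": "Bruno Fernandes", "minute": 23},
    {"type": "goal", "team": "Liverpool", "player": "Mohamed Salah", "minute": 55},
    {"type": "goal", "team": "Manchester United", "player": "Marcus Rashford", "minute": 93},
]
print(summarize_match("Manchester United", "Liverpool", events))
```

An LLM can then be prompted to paraphrase the template output for variety, with the structured facts acting as the fact-check.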

Exercise 38.5: Transfer Rumor Tracker

Create a system that ingests transfer rumors from multiple sources, extracts player names and clubs, and calculates a credibility score based on source reliability. Track how rumors evolve over time.

Hint

Use the TransferRumorTracker class as a starting point. Assign reliability tiers to sources (1=official, 5=random social media). Weight aggregation by source reliability and check for consensus across multiple tier-1 sources.
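The reliability-weighted aggregation the hint describes might look like this sketch, where `TIER_WEIGHT`, the tier cutoffs, and the consensus bonus are all assumed values you would tune:

```python
# Hypothetical reliability weights: tier 1 (official) down to tier 5 (random social media)
TIER_WEIGHT = {1: 1.0, 2: 0.7, 3: 0.4, 4: 0.2, 5: 0.05}

def credibility(reports):
    """Weighted average of per-report confidence, weighted by source tier,
    plus a small consensus bonus when multiple reliable sources agree.
    reports is a list of (tier, confidence) pairs for one rumor."""
    if not reports:
        return 0.0
    weighted = sum(TIER_WEIGHT[t] * c for t, c in reports)
    total = sum(TIER_WEIGHT[t] for t, _ in reports)
    score = weighted / total
    reliable_sources = sum(1 for t, _ in reports if t <= 2)
    if reliable_sources >= 2:
        score = min(1.0, score + 0.1)  # consensus bonus
    return round(score, 3)

# One official report, one reliable journalist, one random account
rumor = [(1, 0.9), (2, 0.8), (5, 1.0)]
print(credibility(rumor))  # 0.963
```

Note how the tier-5 source barely moves the score even with maximum confidence, which is exactly the behavior you want from reliability weighting.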

Exercise 38.6: Commentary-Based Event Detection

Process a full match's worth of live commentary text and automatically detect all events (goals, cards, substitutions, key chances). Evaluate accuracy against official match event data.

Hint

Look for excitement patterns (exclamation marks, capitalization, extended vowels) alongside keyword patterns. Goals often have extended vowels in "GOOOAL!" style commentary.
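The excitement features from the hint can be combined into a cheap heuristic like this sketch (the feature weights and the score threshold are arbitrary choices to tune against official event data):

```python
import re

def excitement_score(line):
    """Heuristic excitement features: exclamations, ALL-CAPS words, stretched vowels."""
    exclaims = line.count("!")
    caps_words = len(re.findall(r"\b[A-Z]{3,}\b", line))
    stretched = len(re.findall(r"([aeiouAEIOU])\1{2,}", line))  # e.g. "GOOOAL"
    return exclaims + caps_words + 2 * stretched

def looks_like_goal(line):
    # Keyword plus excitement: the keyword alone would flag "goal kick" too
    return bool(re.search(r"g[o]+a+l", line, re.IGNORECASE)) and excitement_score(line) >= 2

print(looks_like_goal("GOOOOAL! Rashford wins it in stoppage time!"))  # True
print(looks_like_goal("A goal kick for Liverpool."))                   # False
```

The same pattern extends to cards ("RED CARD!", "booked") and substitutions, each with its own keyword list and excitement threshold.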

Exercise 38.7: Press Conference News Extraction

Build a pipeline that processes manager press conference transcripts to automatically extract: (1) team news (injuries, availability), (2) key quotes, (3) tactical hints, (4) sentiment toward upcoming opponents.

Exercise 38.8: RAG System for Football Q&A

Create a RAG-based Q&A system using a knowledge base of football statistics and historical data. Implement retrieval, context injection, and answer generation. Evaluate factual accuracy vs. pure LLM responses.

Hint

Use sentence embeddings (all-MiniLM-L6-v2) for retrieval. Compare RAG answers to ground truth for statistical questions to measure hallucination reduction.
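The retrieval step can be prototyped without any model at all: the sketch below uses cosine similarity over bag-of-words counts as a stand-in for sentence embeddings (a real implementation would embed `query` and each knowledge-base entry with all-MiniLM-L6-v2 instead). The three knowledge-base facts are illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, knowledge_base, k=1):
    """Return the k most similar facts -- the retrieval step of RAG."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(doc.lower().split())), doc)
              for doc in knowledge_base]
    return [doc for _, doc in sorted(scored, reverse=True)[:k]]

kb = [
    "Manchester United won the Premier League 13 times under Alex Ferguson.",
    "Lionel Messi won the Ballon d'Or eight times.",
    "The 2022 World Cup final finished 3-3 before penalties.",
]
context = retrieve("How many times did Messi win the Ballon d'Or?", kb)
print(context[0])
# The retrieved fact is then injected into the LLM prompt as grounding context.
```

Swapping the word-count vectors for dense embeddings changes only `retrieve`; the injection and generation steps stay the same.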

Summary

Model Selection Guide

| Task | Quick & Simple | Best Accuracy | Considerations |
|---|---|---|---|
| Sentiment Analysis | VADER | Fine-tuned RoBERTa | VADER is fast and interpretable |
| Named Entities | spaCy (en_core_web_sm) | Custom fine-tuned NER | Add football entity patterns |
| Classification | TF-IDF + Naive Bayes | BERT/DistilBERT | Zero-shot for no training data |
| Summarization | TextRank (extractive) | T5/BART (abstractive) | Extractive is more reliable |
| Report Generation | Templates + rules | GPT-4 / Claude | LLMs need fact-checking |
| Q&A | Simple retrieval | RAG with GPT-4 | RAG reduces hallucination |

Production Considerations

  • Latency: For real-time applications, use smaller models (DistilBERT, TinyBERT) or rule-based systems
  • Cost: LLM API calls add up quickly for high-volume applications—batch where possible
  • Caching: Cache embeddings and common query results to reduce computation
  • Fallbacks: Have rule-based fallbacks when ML models fail or are too slow
  • Monitoring: Track model accuracy over time as language evolves (new player names, slang)
  • Privacy: Be careful with user-generated content—PII may be present in social media data
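The caching point above is often the cheapest win. A minimal sketch with the standard library's `functools.lru_cache`, using a placeholder in place of a real embedding model (`embed` and the `CALLS` counter are illustrative):

```python
from functools import lru_cache

# Stand-in for an expensive model call (embedding, sentiment, translation)
CALLS = {"count": 0}

@lru_cache(maxsize=10_000)
def embed(text: str):
    """Cached 'embedding': repeated queries never re-run the computation."""
    CALLS["count"] += 1
    return tuple(ord(c) % 7 for c in text)  # placeholder vector

for query in ["salah form", "rashford fitness", "salah form", "salah form"]:
    embed(query)

print(CALLS["count"])             # 2 -- only unique queries hit the "model"
print(embed.cache_info().hits)    # 2 cache hits
```

For production, the same idea scales up to a shared Redis or vector-store cache keyed on a hash of the input text, so repeated social-media queries and common questions never touch the model.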

NLP enables extraction of insights from the vast amount of football text data. The techniques covered—from basic sentiment analysis to sophisticated LLM-powered systems—provide a toolkit for analyzing match reports, social media, commentary, and more. In the next chapter, we'll explore real-time streaming analytics for live match analysis.