Capstone - Complete Analytics System
Natural Language Processing for Football
Football generates vast amounts of text data: match reports, player interviews, social media discussions, live commentary, and scouting reports. Natural Language Processing (NLP) enables us to extract insights from this unstructured text at scale.
Learning Objectives
- Understand NLP fundamentals for sports analytics
- Extract named entities (players, teams, competitions) from text
- Perform sentiment analysis on football content
- Build text classification models for match reports
- Generate automated match summaries
- Analyze social media discourse around football
NLP Fundamentals
NLP is a branch of AI focused on understanding and generating human language. For football analytics, we apply NLP to extract structured information from text and to understand sentiment and topics. Key text sources include:
- Match reports and previews
- Post-match interviews
- Social media (Twitter/X, Reddit)
- Live commentary feeds
- Scouting reports
- Transfer news and rumors
Core NLP tasks applied in this chapter:
- Named Entity Recognition (NER)
- Sentiment Analysis
- Text Classification
- Topic Modeling
- Text Summarization
- Question Answering
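Many of these tasks reduce, at their core, to tokenizing text and counting things. Before the library-based examples below, here is that bare loop in plain Python (a deliberate simplification; real tokenizers handle punctuation, contractions, and sentence boundaries, and the stop-word set here is a tiny illustrative subset):

```python
import re
from collections import Counter

# Toy commentary line; the stop-word set is a tiny illustrative subset
commentary = "Salah shoots, Salah scores, and the Kop erupts as Salah celebrates"
stop_words = {"and", "the", "as", "a", "an", "of"}

# Lowercase, keep alphabetic tokens, drop stop words
tokens = re.findall(r"[a-z]+", commentary.lower())
content_words = [t for t in tokens if t not in stop_words]

# Frequency counts are the raw material for everything that follows
freq = Counter(content_words)
print(freq.most_common(3))
```

NLTK and spaCy do the same job with far more linguistic care, as the next example shows.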
# Python: NLP basics with NLTK and spaCy
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from collections import Counter
import spacy
# Download required data
nltk.download("punkt")
nltk.download("stopwords")
# Sample match report
match_report = """Manchester United secured a dramatic 2-1 victory over Liverpool
at Old Trafford. Bruno Fernandes opened the scoring with a spectacular
free kick in the 23rd minute. Mohamed Salah equalized from the penalty
spot after a controversial VAR decision. Marcus Rashford scored the
winner in stoppage time, sending the home fans into raptures."""
# Basic tokenization
sentences = sent_tokenize(match_report)
words = word_tokenize(match_report.lower())
print(f"Sentences: {len(sentences)}")
print(f"Words: {len(words)}")
# Remove stopwords
stop_words = set(stopwords.words("english"))
words_clean = [w for w in words if w.isalpha() and w not in stop_words]
# Word frequency
word_freq = Counter(words_clean)
print("\nTop 10 words:")
for word, count in word_freq.most_common(10):
print(f" {word}: {count}")
# Using spaCy for more advanced processing
nlp = spacy.load("en_core_web_sm")
doc = nlp(match_report)
# Part-of-speech tagging
print("\nNouns found:")
nouns = [token.text for token in doc if token.pos_ == "NOUN"]
print(nouns)
# R: NLP basics with tidytext
library(tidyverse)
library(tidytext)
# Sample match report
match_report <- "Manchester United secured a dramatic 2-1 victory over Liverpool
at Old Trafford. Bruno Fernandes opened the scoring with a spectacular
free kick in the 23rd minute. Mohamed Salah equalized from the penalty
spot after a controversial VAR decision. Marcus Rashford scored the
winner in stoppage time, sending the home fans into raptures."
# Tokenize text
tokens <- tibble(text = match_report) %>%
unnest_tokens(word, text)
# Remove stop words
tokens_clean <- tokens %>%
anti_join(stop_words, by = "word")
# Word frequency
word_freq <- tokens_clean %>%
count(word, sort = TRUE)
print(word_freq)
# Bigrams (two-word phrases)
bigrams <- tibble(text = match_report) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word)
print(bigrams)
Sentences: 4
Words: 56
Top 10 words:
scoring: 1
manchester: 1
united: 1
victory: 1
liverpool: 1
Nouns found:
['victory', 'scoring', 'kick', 'minute', 'spot', 'decision', 'winner', 'time', 'fans']
Named Entity Recognition
Named Entity Recognition (NER) identifies and classifies entities in text such as player names, team names, venues, and competitions. This is essential for extracting structured data from unstructured text.
# Python: Named Entity Recognition with spaCy
import spacy
from spacy.matcher import Matcher, PhraseMatcher
# Load model
nlp = spacy.load("en_core_web_sm")
# Process text
doc = nlp(match_report)
# Extract built-in entities
print("Named Entities:")
for ent in doc.ents:
print(f" {ent.text}: {ent.label_}")
# Custom entity recognition for football
class FootballNER:
"""Custom NER for football-specific entities."""
def __init__(self, nlp):
self.nlp = nlp
self.matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
# Add football entity patterns
self.add_patterns()
def add_patterns(self):
"""Add football-specific patterns."""
# Teams
teams = ["manchester united", "liverpool", "chelsea", "arsenal",
"manchester city", "tottenham", "bayern munich", "real madrid"]
patterns = [self.nlp.make_doc(team) for team in teams]
self.matcher.add("TEAM", patterns)
# Competitions
comps = ["premier league", "champions league", "fa cup",
"world cup", "europa league", "la liga"]
patterns = [self.nlp.make_doc(comp) for comp in comps]
self.matcher.add("COMPETITION", patterns)
# Venues
venues = ["old trafford", "anfield", "emirates", "etihad",
"stamford bridge", "camp nou", "santiago bernabeu"]
patterns = [self.nlp.make_doc(venue) for venue in venues]
self.matcher.add("VENUE", patterns)
def extract(self, text):
"""Extract football entities from text."""
doc = self.nlp(text.lower())
matches = self.matcher(doc)
entities = []
for match_id, start, end in matches:
entity_type = self.nlp.vocab.strings[match_id]
entity_text = doc[start:end].text
entities.append({
"text": entity_text,
"type": entity_type,
"start": start,
"end": end
})
return entities
# Use custom NER
football_ner = FootballNER(nlp)
entities = football_ner.extract(match_report)
print("\nFootball Entities:")
for ent in entities:
print(f" {ent['text']}: {ent['type']}")
# R: Named Entity Recognition
library(spacyr)
# Initialize spaCy
spacy_initialize(model = "en_core_web_sm")
# Parse text
doc <- spacy_parse(match_report, entity = TRUE)
# Extract entities
entities <- entity_extract(doc)
print(entities)
# Custom entity patterns for football
football_entities <- list(
teams = c("Manchester United", "Liverpool", "Chelsea", "Arsenal",
"Manchester City", "Tottenham"),
venues = c("Old Trafford", "Anfield", "Stamford Bridge", "Emirates"),
competitions = c("Premier League", "Champions League", "FA Cup", "EFL Cup")
)
# Function to extract football entities
extract_football_entities <- function(text, patterns) {
found <- list()
for (type in names(patterns)) {
matches <- patterns[[type]][
sapply(patterns[[type]], function(p) grepl(p, text, ignore.case = TRUE))
]
found[[type]] <- matches
}
found
}
entities_found <- extract_football_entities(match_report, football_entities)
print(entities_found)
Named Entities:
Manchester United: ORG
Liverpool: GPE
Old Trafford: FAC
Bruno Fernandes: PERSON
23rd minute: TIME
Mohamed Salah: PERSON
Marcus Rashford: PERSON
Football Entities:
manchester united: TEAM
liverpool: TEAM
old trafford: VENUE
Player Name Resolution
Players are often referred to by nicknames, shortened names, or full names. Entity resolution matches these variations to canonical player identities.
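The fuzzy-matching step can be sketched with nothing but the standard library: difflib ranks candidates by a similarity ratio, much as the fuzzywuzzy-based resolver below does (the alias list here is a small illustrative subset, not a real player database):

```python
import difflib

# Lowercased alias list (illustrative subset of a real player database)
known_names = ["mo salah", "bruno fernandes", "marcus rashford", "cr7"]

# A misspelled mention scraped from a fan post
mention = "mo sallah"

# get_close_matches ranks candidates by SequenceMatcher ratio (0..1)
match = difflib.get_close_matches(mention, known_names, n=1, cutoff=0.6)
print(match)  # -> ['mo salah']
```

The cutoff plays the same role as the `threshold` parameter in the resolver below: too low and unrelated names match, too high and genuine typos are missed.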
# Python: Player name resolution
# fuzzywuzzy (maintained today as "thefuzz") provides fuzzy string matching
from fuzzywuzzy import process
class PlayerResolver:
"""Resolve player name variations to canonical names."""
def __init__(self):
self.player_aliases = {
"Cristiano Ronaldo": ["Ronaldo", "CR7", "Cristiano"],
"Lionel Messi": ["Messi", "Leo", "La Pulga"],
"Kylian Mbappe": ["Mbappe", "Kylian", "Donatello"],
"Bruno Fernandes": ["Bruno", "Fernandes", "B. Fernandes"],
"Mohamed Salah": ["Salah", "Mo Salah", "Egyptian King"],
"Marcus Rashford": ["Rashford", "Rashy", "Beans"]
}
# Build reverse lookup
self.alias_to_canonical = {}
for canonical, aliases in self.player_aliases.items():
self.alias_to_canonical[canonical.lower()] = canonical
for alias in aliases:
self.alias_to_canonical[alias.lower()] = canonical
def resolve(self, name, threshold=80):
"""Resolve a player name to canonical form."""
name_lower = name.lower()
# Exact match
if name_lower in self.alias_to_canonical:
return self.alias_to_canonical[name_lower]
# Fuzzy match
all_names = list(self.alias_to_canonical.keys())
match, score = process.extractOne(name_lower, all_names)
if score >= threshold:
return self.alias_to_canonical[match]
return None # Unknown player
def resolve_all(self, text):
"""Find and resolve all player mentions in text."""
doc = nlp(text)
resolved = []
for ent in doc.ents:
if ent.label_ == "PERSON":
canonical = self.resolve(ent.text)
if canonical:
resolved.append({
"original": ent.text,
"canonical": canonical,
"start": ent.start_char,
"end": ent.end_char
})
return resolved
# Test resolver
resolver = PlayerResolver()
test_names = ["Bruno", "CR7", "Mo Salah", "Messi", "Unknown Player"]
for name in test_names:
resolved = resolver.resolve(name)
print(f"{name} -> {resolved}")
# R: Player name resolution
library(stringdist)
# Player alias database
player_aliases <- tribble(
~canonical_name, ~aliases,
"Cristiano Ronaldo", c("Ronaldo", "CR7", "Cristiano"),
"Lionel Messi", c("Messi", "Leo", "La Pulga"),
"Kylian Mbappe", c("Mbappe", "Kylian", "Donatello"),
"Bruno Fernandes", c("Bruno", "Fernandes", "B. Fernandes"),
"Mohamed Salah", c("Salah", "Mo Salah", "Egyptian King")
)
# Resolve player name
resolve_player <- function(name, aliases_df, threshold = 0.3) {
name_lower <- tolower(name)
for (i in seq_len(nrow(aliases_df))) {
canonical <- aliases_df$canonical_name[i]
aliases <- aliases_df$aliases[[i]]
# Check exact match
if (name_lower %in% tolower(c(canonical, aliases))) {
return(canonical)
}
# Check fuzzy match
all_names <- c(canonical, aliases)
distances <- stringdist(name_lower, tolower(all_names), method = "jw")
if (min(distances) < threshold) {
return(canonical)
}
}
return(NA) # Unknown player
}
# Test resolution
test_names <- c("Bruno", "CR7", "Mo Salah", "Messi", "Unknown Player")
for (name in test_names) {
resolved <- resolve_player(name, player_aliases)
cat(name, "->", resolved, "\n")
}
Bruno -> Bruno Fernandes
CR7 -> Cristiano Ronaldo
Mo Salah -> Mohamed Salah
Messi -> Lionel Messi
Unknown Player -> None
Sentiment Analysis
Sentiment analysis determines the emotional tone of text: positive, negative, or neutral. This is valuable for understanding fan reactions, media coverage, and player reputation.
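At its simplest, lexicon-based sentiment is a lookup-and-sum over scored words. The toy AFINN-style fragment below shows only the mechanism; the VADER and transformer pipelines that follow additionally handle negation, intensifiers, and context:

```python
# Tiny illustrative lexicon; real lexicons such as AFINN score thousands of words
lexicon = {"incredible": 3, "dramatic": 1, "terrible": -3, "robbed": -2}

def score_text(text):
    # Strip basic punctuation, then sum per-word scores (unknown words count 0)
    words = text.lower().replace("!", "").replace(",", "").split()
    return sum(lexicon.get(w, 0) for w in words)

print(score_text("What an incredible performance!"))      # 3
print(score_text("Terrible refereeing, we were robbed!")) # -5
```

The obvious failure mode: "not incredible" still scores +3, which is exactly why VADER adds negation and intensifier rules.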
# Python: Sentiment analysis with VADER and transformers
from nltk.sentiment import SentimentIntensityAnalyzer
from transformers import pipeline
import pandas as pd
# VADER sentiment (rule-based, good for social media)
nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()
# Analyze match report
scores = sia.polarity_scores(match_report)
print("Match Report Sentiment (VADER):")
print(f" Positive: {scores['pos']:.3f}")
print(f" Negative: {scores['neg']:.3f}")
print(f" Neutral: {scores['neu']:.3f}")
print(f" Compound: {scores['compound']:.3f}")
# Analyze fan tweets
fan_tweets = [
"What an incredible performance! Bruno is world class!",
"Terrible refereeing, we were robbed!",
"Dominant display, deserved the win",
"Worst game of the season, shocking defending"
]
print("\nFan Tweet Sentiments:")
for tweet in fan_tweets:
scores = sia.polarity_scores(tweet)
sentiment = "Positive" if scores["compound"] > 0.05 else \
"Negative" if scores["compound"] < -0.05 else "Neutral"
print(f" [{sentiment}] {tweet[:50]}...")
# Transformer-based sentiment (more accurate but slower)
sentiment_pipeline = pipeline("sentiment-analysis",
model="distilbert-base-uncased-finetuned-sst-2-english")
print("\nTransformer Sentiment Analysis:")
for tweet in fan_tweets:
result = sentiment_pipeline(tweet)[0]
print(f" {result['label']}: {result['score']:.3f} - {tweet[:40]}...")
# R: Sentiment analysis
library(tidytext)
library(textdata)
# Get sentiment lexicons
bing <- get_sentiments("bing")
afinn <- get_sentiments("afinn")
# Analyze match report sentiment
sentiment_analysis <- tibble(text = match_report) %>%
unnest_tokens(word, text) %>%
inner_join(bing, by = "word") %>%
count(sentiment)
print(sentiment_analysis)
# Word-level sentiment
word_sentiment <- tibble(text = match_report) %>%
unnest_tokens(word, text) %>%
inner_join(afinn, by = "word")
cat("\nSentiment words found:\n")
print(word_sentiment)
# Overall sentiment score
overall_score <- sum(word_sentiment$value)
cat("\nOverall sentiment score:", overall_score, "\n")
# Analyze fan tweets
fan_tweets <- c(
"What an incredible performance! Bruno is world class!",
"Terrible refereeing, we were robbed!",
"Dominant display, deserved the win",
"Worst game of the season, shocking defending"
)
tweet_sentiments <- tibble(tweet = fan_tweets) %>%
mutate(id = row_number()) %>%
unnest_tokens(word, tweet) %>%
inner_join(afinn, by = "word") %>%
group_by(id) %>%
summarise(sentiment_score = sum(value))
print(tweet_sentiments)
Match Report Sentiment (VADER):
Positive: 0.198
Negative: 0.075
Neutral: 0.727
Compound: 0.743
Fan Tweet Sentiments:
[Positive] What an incredible performance! Bruno is world...
[Negative] Terrible refereeing, we were robbed!...
[Positive] Dominant display, deserved the win...
[Negative] Worst game of the season, shocking defending...
Aspect-Based Sentiment
Aspect-based sentiment analysis identifies sentiment toward specific aspects (e.g., defense, attack, referee, specific players).
# Python: Aspect-based sentiment analysis
import spacy
from nltk.sentiment import SentimentIntensityAnalyzer
class AspectSentimentAnalyzer:
"""Analyze sentiment toward specific aspects in football text."""
def __init__(self):
self.nlp = spacy.load("en_core_web_sm")
self.sia = SentimentIntensityAnalyzer()
self.aspects = {
"attacking": ["attack", "forward", "goal", "score", "shot",
"chance", "striker", "winger"],
"defending": ["defense", "defend", "tackle", "block",
"clearance", "backline", "centerback"],
"referee": ["referee", "ref", "var", "decision", "penalty",
"foul", "offside", "card"],
"goalkeeper": ["goalkeeper", "keeper", "save", "clean sheet",
"distribution"],
"midfield": ["midfield", "passing", "possession", "control",
"creativity"]
}
def analyze(self, text):
"""Analyze aspect-based sentiment."""
doc = self.nlp(text)
sentences = list(doc.sents)
results = {}
for aspect, keywords in self.aspects.items():
aspect_sentences = []
for sent in sentences:
sent_text = sent.text.lower()
if any(kw in sent_text for kw in keywords):
aspect_sentences.append(sent.text)
if aspect_sentences:
# Average sentiment of aspect-related sentences
sentiments = [self.sia.polarity_scores(s)["compound"]
for s in aspect_sentences]
results[aspect] = {
"mention_count": len(aspect_sentences),
"avg_sentiment": sum(sentiments) / len(sentiments),
"sentences": aspect_sentences
}
return results
# Analyze match report
analyzer = AspectSentimentAnalyzer()
aspect_results = analyzer.analyze(match_report)
print("Aspect-Based Sentiment:")
for aspect, data in aspect_results.items():
sent_label = "Positive" if data["avg_sentiment"] > 0.05 else \
"Negative" if data["avg_sentiment"] < -0.05 else "Neutral"
print(f"\n{aspect.upper()}:")
print(f" Mentions: {data['mention_count']}")
print(f" Sentiment: {sent_label} ({data['avg_sentiment']:.3f})")
# R: Aspect-based sentiment
library(tidyverse)
library(tidytext)
# Define aspects
aspects <- list(
attacking = c("attack", "forward", "goal", "score", "shot", "chance"),
defending = c("defense", "defend", "tackle", "block", "clearance"),
referee = c("referee", "ref", "var", "decision", "penalty", "foul"),
goalkeeper = c("goalkeeper", "keeper", "save", "clean sheet")
)
# Extract aspect sentiment
extract_aspect_sentiment <- function(text, aspects, lexicon) {
tokens <- tibble(text = text) %>%
unnest_tokens(word, text)
results <- map_dfr(names(aspects), function(aspect) {
aspect_words <- aspects[[aspect]]
# Find sentences containing aspect words
# (simplified - using word proximity)
# Count aspect mentions before the sentiment join, otherwise aspect words
# without a lexicon score would be dropped from the count
aspect_mentions <- sum(tokens$word %in% aspect_words)
tokens %>%
inner_join(lexicon, by = "word") %>%
summarise(
aspect = aspect,
mentions = aspect_mentions,
avg_sentiment = mean(value, na.rm = TRUE),
total_sentiment = sum(value, na.rm = TRUE)
)
})
results
}
aspect_sentiment <- extract_aspect_sentiment(match_report, aspects,
get_sentiments("afinn"))
print(aspect_sentiment)
Aspect-Based Sentiment:
ATTACKING:
Mentions: 3
Sentiment: Positive (0.542)
REFEREE:
Mentions: 1
Sentiment: Negative (-0.234)
Text Classification
Text classification assigns categories to text. For football, we can classify match reports by outcome, tweets by topic, or articles by sentiment category.
# Python: Text classification with scikit-learn and transformers
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import pandas as pd
# Training data
training_data = pd.DataFrame({
"text": [
"Dominant performance, clinical finishing, well-deserved win",
"Disappointing result, missed chances, defensive errors",
"Hard-fought draw, both teams had chances",
"Comprehensive victory, outstanding team performance",
"Embarrassing defeat, poor discipline, manager under pressure",
"Stalemate in tight encounter, point each",
"Brilliant attacking display, ruthless in front of goal",
"Disappointing loss, lacked creativity and cutting edge",
"Even contest, honors shared in entertaining draw",
"Emphatic win, dominated from start to finish"
],
"outcome": ["win", "loss", "draw", "win", "loss",
"draw", "win", "loss", "draw", "win"]
})
# Create TF-IDF + Naive Bayes pipeline
classifier = Pipeline([
("tfidf", TfidfVectorizer(max_features=500, stop_words="english")),
("clf", MultinomialNB())
])
# Train model
classifier.fit(training_data["text"], training_data["outcome"])
# Predict new reports
new_reports = [
"Brilliant attacking display, five goals scored",
"Defensive collapse, humiliating defeat",
"Tight game, neither team could break the deadlock"
]
predictions = classifier.predict(new_reports)
probabilities = classifier.predict_proba(new_reports)
for text, pred, probs in zip(new_reports, predictions, probabilities):
print(f"Text: {text[:50]}...")
print(f" Prediction: {pred}")
print(f" Confidence: {max(probs):.2%}")
print()
# Using transformers for better accuracy
from transformers import pipeline
# Zero-shot classification (no training needed!)
zero_shot = pipeline("zero-shot-classification")
labels = ["win", "loss", "draw"]
for text in new_reports:
result = zero_shot(text, labels)
print(f"Zero-shot: {text[:40]}... -> {result['labels'][0]}")
# R: Text classification with tidymodels
library(tidymodels)
library(textrecipes)
# Sample training data
training_data <- tribble(
~text, ~outcome,
"Dominant performance, clinical finishing, well-deserved win", "win",
"Disappointing result, missed chances, defensive errors", "loss",
"Hard-fought draw, both teams had chances", "draw",
"Comprehensive victory, outstanding team performance", "win",
"Embarrassing defeat, poor discipline, manager under pressure", "loss",
"Stalemate in tight encounter, point each", "draw"
)
# Create text features
text_recipe <- recipe(outcome ~ text, data = training_data) %>%
step_tokenize(text) %>%
step_stopwords(text) %>%
step_tokenfilter(text, max_tokens = 100) %>%
step_tfidf(text)
# Train model
# Three outcome classes (win/loss/draw), so use multinomial regression
text_spec <- multinom_reg() %>%
set_engine("nnet") %>%
set_mode("classification")
text_workflow <- workflow() %>%
add_recipe(text_recipe) %>%
add_model(text_spec)
# Fit model
text_fit <- fit(text_workflow, data = training_data)
# Predict on new text
new_reports <- tibble(
text = c("Brilliant attacking display, five goals scored",
"Defensive collapse, humiliating defeat")
)
predictions <- predict(text_fit, new_reports)
print(predictions)
Text: Brilliant attacking display, five goals scored...
Prediction: win
Confidence: 87.3%
Text: Defensive collapse, humiliating defeat...
Prediction: loss
Confidence: 92.1%
Zero-shot: Brilliant attacking display, five goals... -> win
Topic Modeling
Topic modeling discovers themes in large collections of text. For football, this reveals what aspects of matches or players are most discussed.
# Python: Topic modeling with LDA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
# Sample corpus of match reports
reports = [
"Manchester United dominated possession with intricate passing movements. The midfield controlled the tempo and created numerous chances.",
"Liverpool high press caused problems early. Counter-attacks were devastating and the front three linked up brilliantly.",
"Defensive masterclass from Chelsea. The back four was impenetrable and goalkeeper made several crucial saves.",
"Tactical battle between the managers. Formation changes mid-game shifted the balance. Set pieces proved decisive.",
"High-intensity pressing from both teams. Midfield battle was key. Neither side could establish rhythm.",
"Clinical finishing in the final third. Striker was lethal. Support from wingers was outstanding."
]
# Create document-term matrix
vectorizer = CountVectorizer(max_features=100, stop_words="english")
dtm = vectorizer.fit_transform(reports)
# Fit LDA model
n_topics = 3
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda.fit(dtm)
# Display topics
feature_names = vectorizer.get_feature_names_out()
print("Discovered Topics:")
for topic_idx, topic in enumerate(lda.components_):
top_words_idx = topic.argsort()[:-6:-1]
top_words = [feature_names[i] for i in top_words_idx]
print(f"\nTopic {topic_idx + 1}:")
print(f" Keywords: {', '.join(top_words)}")
# R: Topic modeling with LDA
library(topicmodels)
library(tidytext)
library(tm)
# Sample match reports corpus
reports <- c(
"Manchester United dominated possession with intricate passing movements. The midfield controlled the tempo and created numerous chances.",
"Liverpool high press caused problems early. Counter-attacks were devastating and the front three linked up brilliantly.",
"Defensive masterclass from Chelsea. The back four was impenetrable and goalkeeper made several crucial saves.",
"Tactical battle between the managers. Formation changes mid-game shifted the balance. Set pieces proved decisive."
)
# Create document-term matrix
corpus <- Corpus(VectorSource(reports))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)
# Fit LDA model
lda_model <- LDA(dtm, k = 3, control = list(seed = 42))
# Extract topics
topics <- tidy(lda_model, matrix = "beta")
# Top words per topic
top_terms <- topics %>%
group_by(topic) %>%
top_n(5, beta) %>%
arrange(topic, desc(beta))
print(top_terms)
Text Summarization
Automatic summarization condenses long texts into key points. This is useful for generating match summaries from detailed reports or social media streams.
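The intuition behind extractive summarization can be shown without sumy: score each sentence by the frequency of its content words and keep the top scorers. This is a crude Luhn-style heuristic rather than TextRank's graph algorithm, but the flavor is the same (the report and stop-word set are illustrative):

```python
import re
from collections import Counter

report = ("United dominated the first half. "
          "Rashford scored twice and Rashford pressed relentlessly. "
          "The referee booked two players.")

sentences = [s.strip() for s in report.split(".") if s.strip()]
stop = {"the", "and", "a", "of", "in"}
freq = Counter(w for w in re.findall(r"[a-z]+", report.lower()) if w not in stop)

def score(sentence):
    # Sentences about recurring topics accumulate the highest totals
    return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))

best = max(sentences, key=score)
print(best)  # the Rashford sentence wins on repeated mentions
```

TextRank refines this by building a sentence-similarity graph and ranking nodes, which is what both the sumy and textrank examples below compute.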
# Python: Text summarization
from transformers import pipeline
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
# Long match report
long_report = """Manchester United secured their place in the FA Cup semi-finals
with a hard-fought 2-1 victory over Liverpool at Old Trafford on Sunday.
Bruno Fernandes opened the scoring in the 23rd minute with a spectacular
free-kick that left Alisson rooted to the spot. Liverpool responded well
and dominated possession before half-time. Mohamed Salah converted from
the penalty spot in the 58th minute after Marcus Rashford was adjudged to
have handled in the area following a VAR review. The decision proved
controversial with replays showing minimal contact. United regrouped and
pushed for a winner in the closing stages. Marcus Rashford completed his
redemption arc with a clinical finish in the 89th minute to send Old
Trafford into raptures and book United a Wembley date."""
# Extractive summarization with TextRank
parser = PlaintextParser.from_string(long_report, Tokenizer("english"))
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, sentences_count=2)
print("Extractive Summary (TextRank):")
for sentence in summary:
print(f" - {sentence}")
# Abstractive summarization with transformers
summarizer_t5 = pipeline("summarization", model="t5-small")
abstractive_summary = summarizer_t5(long_report,
max_length=80,
min_length=30,
do_sample=False)
print("\nAbstractive Summary (T5):")
print(f" {abstractive_summary[0]['summary_text']}")
# R: Extractive summarization
library(textrank)
library(udpipe)
# Load language model
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
# Long match report
long_report <- "Manchester United secured their place in the FA Cup semi-finals
with a hard-fought 2-1 victory over Liverpool at Old Trafford on Sunday.
Bruno Fernandes opened the scoring in the 23rd minute with a spectacular
free-kick that left Alisson rooted to the spot. Liverpool responded well
and dominated possession before half-time. Mohamed Salah converted from
the penalty spot in the 58th minute after Marcus Rashford was adjudged to
have handled in the area following a VAR review. The decision proved
controversial with replays showing minimal contact. United regrouped and
pushed for a winner in the closing stages. Marcus Rashford completed his
redemption arc with a clinical finish in the 89th minute to send Old
Trafford into raptures and book United a Wembley date."
# Annotate text
annotated <- udpipe_annotate(ud_model, long_report)
annotated_df <- as.data.frame(annotated)
# TextRank for extractive summarization
# textrank_sentences() expects a sentence table (textrank_id, sentence)
# and a terminology table (textrank_id, term)
sentences_df <- annotated_df %>%
distinct(sentence_id, sentence) %>%
rename(textrank_id = sentence_id)
summary_sentences <- textrank_sentences(
data = sentences_df,
terminology = annotated_df %>%
filter(upos %in% c("NOUN", "VERB", "ADJ")) %>%
select(textrank_id = sentence_id, lemma)
)
# Get top sentences
top_sentences <- summary_sentences$sentences %>%
arrange(desc(textrank)) %>%
head(3)
cat("Summary:\n")
cat(paste(top_sentences$sentence, collapse = " "))
Extractive Summary (TextRank):
- Manchester United secured their place in the FA Cup semi-finals with a hard-fought 2-1 victory over Liverpool at Old Trafford.
- Marcus Rashford completed his redemption arc with a clinical finish in the 89th minute.
Abstractive Summary (T5):
Manchester United beat Liverpool 2-1 to reach the FA Cup semi-finals. Bruno Fernandes and Marcus Rashford scored for the hosts.
Large Language Models for Football
Large Language Models (LLMs) like GPT-4, Claude, and LLaMA have revolutionized NLP capabilities. For football analytics, they enable advanced question answering, report generation, and conversational interfaces.
- Automated match report generation
- Tactical analysis from text descriptions
- Player comparison narratives
- Scouting report summarization
- Conversational analytics interfaces
- Multi-language translation of reports
Key limitations to keep in mind:
- Hallucination risk with statistics
- API costs for high-volume usage
- Latency for real-time applications
- Model knowledge cutoff dates
- Need for fact-checking outputs
# Python: LLM integration with OpenAI
import openai
from dataclasses import dataclass
from typing import List
import pandas as pd
@dataclass
class MatchEvent:
minute: int
event_type: str
description: str
class FootballReportGenerator:
"""Generate football reports using LLMs."""
def __init__(self, api_key: str):
openai.api_key = api_key
self.model = "gpt-4"
def generate_match_report(self, events: List[MatchEvent],
home_team: str, away_team: str,
score: str) -> str:
"""Generate a professional match report."""
event_text = "\n".join([
f"{e.minute}' - {e.event_type}: {e.description}"
for e in sorted(events, key=lambda x: x.minute)
])
prompt = f"""Write a professional 150-word match report for
{home_team} vs {away_team} ({score}).
Key events:
{event_text}
Focus on narrative flow, key moments, and tactical observations.
Write in present tense for immediacy."""
response = openai.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
max_tokens=300,
temperature=0.7
)
return response.choices[0].message.content
def analyze_tactical_description(self, text: str) -> dict:
"""Extract tactical insights from text description."""
prompt = f"""Analyze this tactical description and extract:
1. Formation mentioned
2. Key tactical patterns (pressing, counter-attack, etc.)
3. Player roles highlighted
4. Strengths and weaknesses identified
Text: {text}
Return as structured JSON."""
response = openai.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
max_tokens=400,
response_format={"type": "json_object"}
)
import json
return json.loads(response.choices[0].message.content)
def generate_scouting_summary(self, player_stats: dict,
match_reports: List[str]) -> str:
"""Generate a scouting summary from stats and reports."""
prompt = f"""Create a scouting summary for this player.
Statistics:
- Goals: {player_stats.get("goals", 0)}
- Assists: {player_stats.get("assists", 0)}
- Pass completion: {player_stats.get("pass_pct", 0)}%
- Minutes played: {player_stats.get("minutes", 0)}
Match report excerpts:
{chr(10).join(match_reports[:3])}
Write a 100-word assessment covering:
1. Key strengths
2. Areas for improvement
3. Potential fit in different systems"""
response = openai.chat.completions.create(
model=self.model,
messages=[{"role": "user", "content": prompt}],
max_tokens=200
)
return response.choices[0].message.content
# Example usage (requires API key)
# generator = FootballReportGenerator("your-api-key")
#
# events = [
# MatchEvent(23, "GOAL", "Bruno Fernandes free kick"),
# MatchEvent(58, "GOAL", "Salah penalty"),
# MatchEvent(89, "GOAL", "Rashford header")
# ]
#
# report = generator.generate_match_report(
# events, "Man United", "Liverpool", "2-1"
# )
# print(report)
# R: LLM integration with ellmer package
library(ellmer)
library(tidyverse)
# Configure LLM client (using OpenAI API)
# Sys.setenv(OPENAI_API_KEY = "your-api-key")
# Generate match report from event data
generate_match_report <- function(events_df, home_team, away_team, score) {
# Prepare event summary
event_text <- events_df %>%
arrange(minute) %>%
mutate(event_str = paste0(minute, "' - ", event_type, ": ", description)) %>%
pull(event_str) %>%
paste(collapse = "\n")
prompt <- paste0(
"Write a professional 150-word match report for ",
home_team, " vs ", away_team, " (", score, ").\n\n",
"Key events:\n", event_text, "\n\n",
"Focus on the narrative flow, key moments, and tactical observations."
)
# Call the LLM (ellmer exposes provider constructors such as chat_openai())
chat <- chat_openai(model = "gpt-4")
chat$chat(prompt)
}
# Example events
events <- tribble(
~minute, ~event_type, ~description,
23, "GOAL", "Bruno Fernandes free kick (1-0)",
45, "YELLOW", "Casemiro foul on Henderson",
58, "GOAL", "Salah penalty (1-1)",
78, "SUB", "Rashford on for Antony",
89, "GOAL", "Rashford header (2-1)"
)
# Generate report (commented out - requires API key)
# report <- generate_match_report(events, "Man United", "Liverpool", "2-1")
# cat(report)
Generated Match Report:
Manchester United secured a dramatic 2-1 victory over Liverpool in a
pulsating encounter at Old Trafford. Bruno Fernandes set the tone
early, curling a magnificent free-kick into the top corner on 23 minutes
to give the hosts the lead. Liverpool pushed for an equalizer and
found it through Mohamed Salah from the penalty spot just before the
hour mark. With the game seemingly heading for a draw, Marcus Rashford
rose highest to power home a header in the 89th minute, sending the
home fans into raptures and condemning Liverpool to defeat.
RAG for Football Q&A
Retrieval-Augmented Generation (RAG) combines LLMs with document retrieval for accurate, grounded responses. This is essential for football Q&A systems that need factual accuracy.
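The retrieval half of RAG can be prototyped with plain bag-of-words vectors and cosine similarity before bringing in an embedding model; the FootballRAG class below swaps this for sentence-transformer embeddings, which also catch paraphrases that exact word overlap misses:

```python
import math
import re
from collections import Counter

docs = [
    "Bruno Fernandes leads the team in assists with 12 this season.",
    "Andre Onana joined from Inter Milan as the new goalkeeper.",
    "Old Trafford has a capacity of 74,310 making it the largest club stadium.",
]

def vectorize(text):
    # Bag-of-words term counts over lowercase alphanumeric tokens
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = vectorize("who is the goalkeeper")
best_doc = max(docs, key=lambda d: cosine(query, vectorize(d)))
print(best_doc)  # the Onana document shares the rarest query term
```

Word overlap fails on "who guards the net?", which shares no tokens with the Onana document; that gap is precisely what dense embeddings close.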
# Python: RAG system for football Q&A
from sentence_transformers import SentenceTransformer
import numpy as np
import openai
from typing import List
class FootballRAG:
"""RAG system for football question answering."""
def __init__(self, api_key: str):
self.embedder = SentenceTransformer("all-MiniLM-L6-v2")
openai.api_key = api_key
self.documents = []
self.embeddings = None
def add_documents(self, documents: List[str]):
"""Add documents to the knowledge base."""
self.documents.extend(documents)
self.embeddings = self.embedder.encode(self.documents)
def retrieve(self, query: str, top_k: int = 3) -> List[str]:
"""Retrieve relevant documents for a query."""
query_emb = self.embedder.encode([query])[0]
# Cosine similarity
similarities = np.dot(self.embeddings, query_emb) / (
np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query_emb)
)
top_idx = np.argsort(similarities)[-top_k:][::-1]
return [self.documents[i] for i in top_idx]
def answer(self, question: str) -> str:
"""Answer a question using RAG."""
# Retrieve relevant documents
context = self.retrieve(question, top_k=3)
context_text = "\n".join(f"- {doc}" for doc in context)
prompt = f"""Answer the following question using ONLY the provided context.
If the answer is not in the context, say "I do not have that information."
Context:
{context_text}
Question: {question}
Answer:"""
response = openai.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
max_tokens=150
)
return response.choices[0].message.content
# Example usage
football_docs = [
"Manchester United uses a 4-2-3-1 formation under Erik ten Hag.",
"Bruno Fernandes leads the team in assists with 12 this season.",
"Old Trafford has a capacity of 74,310, making it the largest club stadium in England.",
"Marcus Rashford has scored 15 goals in all competitions.",
"The current captain is Bruno Fernandes, appointed in 2023.",
"Luke Shaw and Lisandro Martinez form a solid defensive partnership.",
"Andre Onana joined from Inter Milan as the new goalkeeper."
]
# rag = FootballRAG("your-api-key")
# rag.add_documents(football_docs)
# answer = rag.answer("Who is the captain and how many assists do they have?")
# print(answer)
# R: RAG system for football Q&A
library(text)
library(tidyverse)
# Simple vector store implementation
create_football_kb <- function(documents) {
# Embed documents
embeddings <- textEmbed(documents, model = "all-MiniLM-L6-v2")
list(
documents = documents,
embeddings = embeddings$text$texts
)
}
# Retrieve relevant documents
retrieve_docs <- function(query, kb, top_k = 3) {
# Embed query
query_emb <- textEmbed(query, model = "all-MiniLM-L6-v2")
# Calculate similarities
similarities <- sapply(1:nrow(kb$embeddings), function(i) {
sum(query_emb$text$texts * kb$embeddings[i,]) /
(sqrt(sum(query_emb$text$texts^2)) * sqrt(sum(kb$embeddings[i,]^2)))
})
# Return top documents
top_idx <- order(similarities, decreasing = TRUE)[1:top_k]
kb$documents[top_idx]
}
# Example knowledge base
football_docs <- c(
"Manchester United uses a 4-2-3-1 formation under Erik ten Hag.",
"Bruno Fernandes leads the team in assists with 12 this season.",
"Old Trafford has a capacity of 74,310 making it the largest club stadium in England.",
"Marcus Rashford has scored 15 goals in all competitions.",
"The current captain is Bruno Fernandes, appointed in 2023."
)
# Query the system (conceptual - requires full setup)
# kb <- create_football_kb(football_docs)
# relevant <- retrieve_docs("Who is the captain?", kb)
Question: Who is the captain and how many assists do they have?
Answer: The captain is Bruno Fernandes, who was appointed in 2023.
He leads the team in assists with 12 this season.
Live Commentary Analysis
Live commentary data provides real-time text descriptions of match events. Analyzing this stream enables automatic event detection, excitement measurement, and narrative tracking.
# Python: Live commentary analysis
import pandas as pd
import re
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class CommentaryEvent:
minute: int
text: str
event_type: str
excitement: float
entities: List[str]
class CommentaryAnalyzer:
"""Analyze live match commentary."""
def __init__(self):
self.event_patterns = {
"goal": r"G+O+A+L+|scores|header|finish|nets|taps in",
"penalty": r"penalty|spot kick|VAR.*penalty",
"save": r"save|keeps it out|denied|parries",
"substitution": r"comes on|replaces|substitution|off for",
"card": r"yellow card|red card|booked|sent off|caution",
"chance": r"chance|close|almost|nearly|wide|over the bar",
"corner": r"corner|flag kick",
"foul": r"foul|brings down|trips"
}
def detect_event(self, text: str) -> str:
"""Detect event type from commentary text."""
for event_type, pattern in self.event_patterns.items():
if re.search(pattern, text, re.IGNORECASE):
return event_type
return "passage"
def calculate_excitement(self, text: str) -> float:
"""Calculate excitement level of commentary."""
# Count excitement indicators
exclamations = text.count("!")
caps_ratio = sum(1 for c in text if c.isupper()) / max(len(text), 1)
extended_vowels = len(re.findall(r"[aeiouAEIOU]{2,}", text))
word_length = len(text.split())
# Excitement words
excitement_words = ["brilliant", "magnificent", "incredible",
"stunning", "amazing", "unbelievable"]
excitement_count = sum(1 for w in excitement_words if w in text.lower())
score = (exclamations * 0.2 +
caps_ratio * 3 +
extended_vowels * 0.15 +
excitement_count * 0.3)
return min(score, 1.0)
def extract_entities(self, text: str) -> List[str]:
"""Extract player and team names from commentary."""
# Simple pattern matching (would use NER in production)
words = text.split()
# Assume capitalized words in middle of sentence are entities
entities = [w for w in words if w[0].isupper() and
not w.endswith(".") and not w.endswith(",")]
return entities
def analyze_stream(self, commentary: List[dict]) -> List[CommentaryEvent]:
"""Analyze a stream of commentary."""
events = []
for item in commentary:
event = CommentaryEvent(
minute=item["minute"],
text=item["text"],
event_type=self.detect_event(item["text"]),
excitement=self.calculate_excitement(item["text"]),
entities=self.extract_entities(item["text"])
)
events.append(event)
return events
def get_match_narrative(self, events: List[CommentaryEvent]) -> dict:
"""Extract match narrative from commentary events."""
key_moments = [e for e in events if e.excitement > 0.3]
goals = [e for e in events if e.event_type == "goal"]
return {
"key_moments": [(e.minute, e.text) for e in key_moments],
"goals": [(e.minute, e.text) for e in goals],
"avg_excitement": sum(e.excitement for e in events) / len(events),
"peak_minute": max(events, key=lambda e: e.excitement).minute
}
# Example usage
commentary_data = [
{"minute": 1, "text": "Kick-off! United get us underway at Old Trafford."},
{"minute": 12, "text": "Good pressing from Liverpool, forcing United back."},
{"minute": 23, "text": "GOOOAAAAL! Fernandes with a magnificent free kick! 1-0!"},
{"minute": 35, "text": "Chance for Salah but Onana makes the save."},
{"minute": 58, "text": "PENALTY! VAR checking... Salah scores! 1-1!"},
{"minute": 89, "text": "GOOOOAAAAL! Rashford heads home! 2-1! What a finish!"}
]
analyzer = CommentaryAnalyzer()
events = analyzer.analyze_stream(commentary_data)
print("Commentary Analysis:")
for e in events:
if e.excitement > 0.2:
print(f"{e.minute}' [{e.event_type}] (excitement: {e.excitement:.2f})")
print(f" {e.text}")
narrative = analyzer.get_match_narrative(events)
print(f"\nPeak excitement at minute: {narrative['peak_minute']}")
# R: Live commentary analysis
library(tidyverse)
library(tidytext)
# Sample commentary data
commentary <- tribble(
~minute, ~text,
1, "Kick-off! United get us underway at Old Trafford.",
12, "Good pressing from Liverpool, forcing United back.",
23, "GOOOAAAAL! Fernandes with a magnificent free kick! 1-0!",
35, "Chance for Salah but Onana makes the save.",
45, "Half-time: Manchester United 1-0 Liverpool",
58, "PENALTY! VAR checking... and it stands. Salah scores. 1-1!",
78, "Rashford comes on for Antony. Fresh legs in attack.",
89, "GOOOOAAAAL! Rashford heads home! 2-1! What a finish!",
90, "Full-time: Manchester United 2-1 Liverpool"
)
# Detect events from commentary
detect_events <- function(text) {
patterns <- list(
goal = "G+O+A+L+|scores|header|finish",
penalty = "penalty|spot kick|VAR.*penalty",
save = "save|keeps it out|denied",
substitution = "comes on|replaces|substitution",
card = "yellow card|red card|booked|sent off",
chance = "chance|close|almost|nearly"
)
events <- names(patterns)[sapply(patterns, function(p) {
grepl(p, text, ignore.case = TRUE)
})]
if (length(events) == 0) "passage" else events
}
# Calculate excitement level
excitement_score <- function(text) {
# Indicators of excitement
exclamation_count <- str_count(text, "!")
caps_ratio <- sum(str_count(text, "[A-Z]")) / nchar(text)
extended_vowels <- str_count(text, "[aeiouAEIOU]{2,}")
# Weighted score
score <- exclamation_count * 0.3 + caps_ratio * 5 + extended_vowels * 0.2
min(score, 1) # Cap at 1
}
# Analyze commentary
commentary_analysis <- commentary %>%
rowwise() %>%
mutate(
event_type = list(detect_events(text)),
excitement = excitement_score(text)
) %>%
ungroup()
# Key moments (high excitement)
key_moments <- commentary_analysis %>%
filter(excitement > 0.3)
print(key_moments %>% select(minute, excitement, text))
Commentary Analysis:
23' [goal] (excitement: 0.72)
GOOOAAAAL! Fernandes with a magnificent free kick! 1-0!
58' [penalty] (excitement: 0.45)
PENALTY! VAR checking... Salah scores! 1-1!
89' [goal] (excitement: 0.68)
GOOOOAAAAL! Rashford heads home! 2-1! What a finish!
Peak excitement at minute: 23
Transfer Rumor Analysis
Transfer rumors generate enormous volumes of text across news sites, social media, and forums. NLP helps track rumor reliability, measure sentiment around potential transfers, and aggregate information across sources.
# Python: Transfer rumor analysis
import pandas as pd
import re
from collections import defaultdict
from datetime import datetime
class TransferRumorTracker:
"""Track and analyze transfer rumors."""
# Source reliability tiers (1 = most reliable)
SOURCE_TIERS = {
"Fabrizio Romano": 1,
"BBC Sport": 1,
"The Athletic": 1,
"Sky Sports": 2,
"ESPN": 2,
"Daily Mail": 3,
"The Sun": 4,
"Random Twitter": 5
}
def __init__(self):
self.rumors = []
def add_rumor(self, source: str, text: str, timestamp: datetime = None):
"""Add a transfer rumor."""
tier = self.SOURCE_TIERS.get(source, 4)
details = self._extract_details(text)
self.rumors.append({
"source": source,
"text": text,
"tier": tier,
"reliability": 1 / tier,
"timestamp": timestamp or datetime.now(),
**details
})
def _extract_details(self, text: str) -> dict:
"""Extract transfer details from text."""
# Player name (simple heuristic)
name_match = re.search(r"([A-Z][a-z]+ [A-Z][a-z]+)", text)
player = name_match.group(1) if name_match else None
# Fee extraction
fee_match = re.search(r"[£€$]?(\d+)m|(\d+) million", text, re.I)
fee = int(fee_match.group(1) or fee_match.group(2)) if fee_match else None
# Status keywords
status_map = {
"done": ["done", "complete", "agreed", "confirmed"],
"close": ["close", "imminent", "finalizing", "medical"],
"advanced": ["advanced", "talks", "negotiating"],
"interest": ["interest", "considering", "monitoring"],
"contact": ["contact", "enquiry", "initial"],
"rejected": ["rejected", "turned down", "failed"]
}
text_lower = text.lower()
status = "unknown"
for s, keywords in status_map.items():
if any(kw in text_lower for kw in keywords):
status = s
break
# Clubs
clubs = re.findall(r"(?:Manchester|Real|Bayern|Barcelona|Chelsea|"
r"Arsenal|Liverpool|Juventus)[^,]*", text)
return {
"player": player,
"fee": fee,
"status": status,
"clubs_mentioned": clubs
}
def get_player_summary(self, player_name: str) -> dict:
"""Get aggregated summary for a player."""
player_rumors = [r for r in self.rumors
if r["player"] and player_name.lower() in r["player"].lower()]
if not player_rumors:
return {"player": player_name, "rumors": 0}
# Calculate weighted confidence
total_reliability = sum(r["reliability"] for r in player_rumors)
tier1_count = sum(1 for r in player_rumors if r["tier"] == 1)
# Most common status weighted by reliability
status_scores = defaultdict(float)
for r in player_rumors:
status_scores[r["status"]] += r["reliability"]
likely_status = max(status_scores, key=status_scores.get)
# Fee range
fees = [r["fee"] for r in player_rumors if r["fee"]]
fee_range = (min(fees), max(fees)) if fees else None
return {
"player": player_name,
"total_rumors": len(player_rumors),
"tier1_sources": tier1_count,
"weighted_confidence": total_reliability / len(player_rumors),
"likely_status": likely_status,
"fee_range": fee_range,
"sources": list(set(r["source"] for r in player_rumors))
}
def credibility_score(self, player_name: str) -> float:
"""Calculate overall credibility of transfer rumors."""
summary = self.get_player_summary(player_name)
if summary["total_rumors"] == 0:
return 0.0
# Factors: tier1 sources, consistency, volume
tier1_factor = min(summary["tier1_sources"] / 2, 1.0) * 0.5
volume_factor = min(summary["total_rumors"] / 5, 1.0) * 0.3
confidence_factor = summary["weighted_confidence"] * 0.2
return tier1_factor + volume_factor + confidence_factor
# Example usage
tracker = TransferRumorTracker()
# Add rumors
tracker.add_rumor("BBC Sport",
"Manchester United in advanced talks with Bayern Munich for Joshua Kimmich")
tracker.add_rumor("The Athletic",
"United considering move for Kimmich as midfield priority")
tracker.add_rumor("Daily Mail",
"EXCLUSIVE: Ten Hag demands £80m for Kimmich deal")
tracker.add_rumor("Fabrizio Romano",
"Joshua Kimmich situation: United have made initial contact. Long way to go.")
# Get summary
summary = tracker.get_player_summary("Joshua Kimmich")
credibility = tracker.credibility_score("Joshua Kimmich")
print("Transfer Rumor Summary:")
print(f" Player: {summary['player']}")
print(f" Total rumors: {summary['total_rumors']}")
print(f" Tier 1 sources: {summary['tier1_sources']}")
print(f" Likely status: {summary['likely_status']}")
print(f" Credibility score: {credibility:.2f}")
# R: Transfer rumor analysis
library(tidyverse)
library(tidytext)
# Sample transfer rumors
rumors <- tribble(
~source, ~text, ~reliability_tier,
"BBC Sport", "Manchester United in advanced talks with Bayern Munich for Joshua Kimmich", 1,
"The Athletic", "United considering move for Kimmich as midfield priority", 1,
"Daily Mail", "EXCLUSIVE: Ten Hag demands £80m for Kimmich deal", 3,
"Random Twitter", "Hearing Kimmich to United is DONE. Medical tomorrow!", 5,
"Romano", "Joshua Kimmich situation: United have made initial contact. Long way to go.", 1
)
# Extract transfer components
extract_transfer_details <- function(text) {
# Player pattern (capitalized names)
player <- str_extract(text, "[A-Z][a-z]+ [A-Z][a-z]+")
# Fee pattern
fee <- str_extract(text, "£[0-9]+m|[0-9]+m|[0-9]+ million")
# Status indicators
status_words <- c("done", "close", "considering", "interest",
"contact", "talks", "bid", "rejected")
status <- status_words[sapply(status_words, function(w)
grepl(w, text, ignore.case = TRUE))][1]
list(player = player, fee = fee, status = status)
}
# Reliability-weighted aggregation
aggregate_rumors <- function(rumors_df) {
rumors_df %>%
rowwise() %>%
mutate(details = list(extract_transfer_details(text))) %>%
unnest_wider(details) %>%
mutate(
reliability_score = 1 / reliability_tier,
status_weight = reliability_score
) %>%
group_by(player) %>%
summarise(
source_count = n(),
tier1_sources = sum(reliability_tier == 1),
weighted_confidence = sum(reliability_score) / n(),
most_likely_status = first(status[which.max(reliability_score)])
)
}
result <- aggregate_rumors(rumors)
print(result)
Transfer Rumor Summary:
Player: Joshua Kimmich
Total rumors: 4
Tier 1 sources: 3
Likely status: advanced
Credibility score: 0.72
Press Conference Analysis
Manager press conferences provide insights into team news, tactics, and sentiment. NLP can extract key information, detect emotional states, and identify newsworthy quotes.
# Python: Press conference analysis
import re
from dataclasses import dataclass
from typing import List, Tuple
from nltk.sentiment import SentimentIntensityAnalyzer  # requires nltk.download("vader_lexicon")
import spacy
@dataclass
class QAPair:
question: str
answer: str
topic: str
sentiment: float
key_quotes: List[str]
class PressConferenceAnalyzer:
"""Analyze manager press conference transcripts."""
TOPIC_KEYWORDS = {
"performance": ["performance", "played", "result", "match", "game"],
"injury": ["injury", "fit", "available", "doubt", "miss", "training"],
"transfer": ["transfer", "sign", "interest", "target", "move", "deal"],
"tactics": ["formation", "tactics", "system", "style", "approach"],
"opponent": ["opponent", "they", "against", "prepare", "respect"],
"squad": ["squad", "players", "team", "rotation", "selection"]
}
def __init__(self):
self.sia = SentimentIntensityAnalyzer()
self.nlp = spacy.load("en_core_web_sm")
def parse_transcript(self, transcript: str, speaker_name: str = "Manager") -> List[QAPair]:
"""Parse transcript into Q&A pairs."""
qa_pairs = []
# Parse the transcript line by line into question/answer turns
lines = transcript.strip().split("\n")
current_q = None
for line in lines:
line = line.strip()
if not line:
continue
if line.startswith("Reporter:") or line.startswith("Question:"):
current_q = line.split(":", 1)[1].strip()
elif line.startswith(speaker_name + ":") and current_q:
answer = line.split(":", 1)[1].strip()
topic = self._classify_topic(current_q + " " + answer)
sentiment = self.sia.polarity_scores(answer)["compound"]
quotes = self._extract_quotes(answer)
qa_pairs.append(QAPair(
question=current_q,
answer=answer,
topic=topic,
sentiment=sentiment,
key_quotes=quotes
))
current_q = None
return qa_pairs
def _classify_topic(self, text: str) -> str:
"""Classify the topic of a Q&A pair."""
text_lower = text.lower()
for topic, keywords in self.TOPIC_KEYWORDS.items():
if any(kw in text_lower for kw in keywords):
return topic
return "general"
def _extract_quotes(self, text: str) -> List[str]:
"""Extract quotable phrases from answer."""
doc = self.nlp(text)
quotes = []
# Look for emphatic statements
for sent in doc.sents:
sent_text = sent.text.strip()
# Criteria for quotable: contains superlatives or strong opinions
if any(word in sent_text.lower() for word in
["world class", "exceptional", "brilliant", "disappointed",
"unacceptable", "proud", "important", "crucial"]):
quotes.append(sent_text)
return quotes
def extract_team_news(self, qa_pairs: List[QAPair]) -> dict:
"""Extract team news (injuries, availability) from conference."""
injury_qa = [qa for qa in qa_pairs if qa.topic == "injury"]
news = {
"available": [],
"doubtful": [],
"out": []
}
for qa in injury_qa:
answer_lower = qa.answer.lower()
# Extract player names with status
doc = self.nlp(qa.answer)
for ent in doc.ents:
if ent.label_ == "PERSON":
context = qa.answer[max(0, ent.start_char-20):ent.end_char+30].lower()
if any(w in context for w in ["available", "fit", "ready", "trained"]):
news["available"].append(ent.text)
elif any(w in context for w in ["doubt", "assessment", "see"]):
news["doubtful"].append(ent.text)
elif any(w in context for w in ["out", "miss", "injury", "ruled"]):
news["out"].append(ent.text)
return news
def generate_summary(self, qa_pairs: List[QAPair]) -> str:
"""Generate a summary of key points from press conference."""
topics = {}
for qa in qa_pairs:
if qa.topic not in topics:
topics[qa.topic] = []
topics[qa.topic].append(qa)
summary_lines = []
for topic, qas in topics.items():
avg_sentiment = sum(qa.sentiment for qa in qas) / len(qas)
sentiment_label = "positive" if avg_sentiment > 0.1 else \
"negative" if avg_sentiment < -0.1 else "neutral"
quotes = [q for qa in qas for q in qa.key_quotes]
summary_lines.append(f"**{topic.title()}** ({sentiment_label})")
if quotes:
summary_lines.append(f" Key quote: \"{quotes[0]}\"")
return "\n".join(summary_lines)
# Example usage
transcript = """
Reporter: How do you assess the performance today?
Ten Hag: I think we showed great character. The first half was not good enough, we gave the ball away too cheaply. But in the second half, we dominated. Bruno was exceptional, his quality on the ball is world class.
Reporter: Is Marcus Rashford fit for the weekend?
Ten Hag: Marcus trained fully yesterday. He is available. We need everyone fit because the schedule is demanding.
"""
analyzer = PressConferenceAnalyzer()
qa_pairs = analyzer.parse_transcript(transcript, "Ten Hag")
print("Press Conference Analysis:")
for qa in qa_pairs:
print(f"\nTopic: {qa.topic}")
print(f"Sentiment: {qa.sentiment:.2f}")
if qa.key_quotes:
print(f"Key quote: \"{qa.key_quotes[0]}\"")
summary = analyzer.generate_summary(qa_pairs)
print("\nSummary:")
print(summary)
# R: Press conference analysis
library(tidyverse)
library(tidytext)
library(sentimentr)
# Sample press conference transcript
transcript <- "
Reporter: How do you assess the performance today?
Ten Hag: I think we showed great character. The first half was not good enough,
we gave the ball away too cheaply. But in the second half, we dominated.
Bruno was exceptional, his quality on the ball is world class.
Reporter: Is Marcus Rashford fit for the weekend?
Ten Hag: Marcus trained fully yesterday. He is available. We need everyone
fit because the schedule is demanding. We have seven games in three weeks.
Reporter: There are reports of interest in Joshua Kimmich?
Ten Hag: I do not talk about players from other clubs. We focus on our squad.
"
# Parse Q&A pairs
parse_qa <- function(transcript) {
lines <- str_split(transcript, "\n")[[1]]
lines <- lines[lines != ""]
qa_pairs <- list()
current_q <- NULL
for (line in lines) {
if (grepl("^Reporter:", line)) {
current_q <- str_replace(line, "^Reporter: ", "")
} else if (grepl("^Ten Hag:", line) && !is.null(current_q)) {
answer <- str_replace(line, "^Ten Hag: ", "")
qa_pairs <- append(qa_pairs, list(list(q = current_q, a = answer)))
}
}
qa_pairs
}
# Analyze sentiment of answers
analyze_press_sentiment <- function(qa_pairs) {
answers <- sapply(qa_pairs, function(x) x$a)
# sentiment_by() averages sentence-level scores to one value per answer
sentiment_scores <- sentiment_by(answers)
data.frame(
question_topic = c("performance", "injury", "transfer"),
answer = answers,
sentiment = sentiment_scores$ave_sentiment
)
}
# Extract team news
extract_team_news <- function(transcript) {
# Injury patterns
injury_pattern <- "injured|doubt|unavailable|miss|ruled out|fitness"
available_pattern <- "fit|available|trained|ready"
list(
injury_mentions = str_extract_all(transcript, paste0("\\w+ \\w+ (", injury_pattern, ")")),
availability = str_extract_all(transcript, paste0("\\w+ (", available_pattern, ")"))
)
}
qa <- parse_qa(transcript)
sentiment <- analyze_press_sentiment(qa)
print(sentiment)
Press Conference Analysis:
Topic: performance
Sentiment: 0.84
Key quote: "Bruno was exceptional, his quality on the ball is world class."
Topic: injury
Sentiment: 0.42
Summary:
**Performance** (positive)
Key quote: "Bruno was exceptional, his quality on the ball is world class."
**Injury** (positive)
Multilingual Football NLP
Football is global, and text data comes in many languages. Modern NLP models support multilingual processing for cross-language analysis and translation.
# Python: Multilingual NLP with transformers
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
from langdetect import detect
import pandas as pd
# Language detection
reports = [
"Manchester United won 2-1 against Liverpool in an exciting match.",
"El Real Madrid goleó 4-0 al Barcelona en el clásico.",
"Bayern München besiegte Borussia Dortmund mit 3-1.",
"La Juventus ha vinto 2-0 contro il Milan nella Serie A."
]
print("Language Detection:")
for report in reports:
lang = detect(report)
print(f" [{lang}] {report[:50]}...")
# Multilingual sentiment analysis
# Using multilingual BERT
multilingual_sentiment = pipeline(
"sentiment-analysis",
model="nlptown/bert-base-multilingual-uncased-sentiment"
)
print("\nMultilingual Sentiment:")
for report in reports:
result = multilingual_sentiment(report)[0]
stars = int(result["label"][0]) # "1 star" to "5 stars"
print(f" {stars}/5 stars: {report[:40]}...")
# Translation for unified analysis
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-mul-en")
print("\nTranslated to English:")
for report in reports[1:]: # Skip English
translated = translator(report, max_length=100)[0]["translation_text"]
print(f" Original: {report[:40]}...")
print(f" English: {translated[:40]}...")
print()
# Cross-lingual entity extraction
from transformers import pipeline as hf_pipeline
# XLM-RoBERTa for multilingual NER
multilingual_ner = hf_pipeline(
"ner",
model="Davlan/xlm-roberta-base-ner-hrl",
aggregation_strategy="simple"
)
print("Multilingual Entity Extraction:")
for report in reports:
entities = multilingual_ner(report)
teams = [e["word"] for e in entities if e["entity_group"] == "ORG"]
print(f" Teams found: {teams}")
# R: Multilingual NLP concepts
library(tidyverse)
# Language detection
detect_language <- function(text) {
# Using textcat package
# install.packages("textcat")
# library(textcat)
# textcat(text)
# Simple heuristic based on common words
lang_markers <- list(
english = c("the", "and", "is", "was", "with"),
spanish = c("el", "la", "los", "del", "con"),
german = c("der", "die", "das", "und", "mit"),
french = c("le", "la", "les", "de", "avec"),
italian = c("il", "la", "del", "con", "che")
)
words <- tolower(str_split(text, " ")[[1]])
scores <- sapply(names(lang_markers), function(lang) {
sum(words %in% lang_markers[[lang]])
})
names(which.max(scores))
}
# Sample multilingual reports
reports <- c(
"Manchester United won 2-1 against Liverpool in an exciting match.",
"El Real Madrid goleó 4-0 al Barcelona en el clásico.",
"Bayern München besiegte Borussia Dortmund mit 3-1.",
"La Juventus ha vinto 2-0 contro il Milan nella Serie A."
)
# Detect languages
for (report in reports) {
lang <- detect_language(report)
cat(lang, ":", substr(report, 1, 40), "...\n")
}
Language Detection:
[en] Manchester United won 2-1 against Liverpool...
[es] El Real Madrid goleó 4-0 al Barcelona en el cl...
[de] Bayern München besiegte Borussia Dortmund mit...
[it] La Juventus ha vinto 2-0 contro il Milan nell...
Multilingual Sentiment:
4/5 stars: Manchester United won 2-1 against Liver...
5/5 stars: El Real Madrid goleó 4-0 al Barcelona...
4/5 stars: Bayern München besiegte Borussia Dortmu...
4/5 stars: La Juventus ha vinto 2-0 contro il Mila...
Multilingual Entity Extraction:
Teams found: [Manchester United, Liverpool]
Teams found: [Real Madrid, Barcelona]
Teams found: [Bayern München, Borussia Dortmund]
Teams found: [Juventus, Milan]
Practice Exercises
Hands-On Practice
Complete these exercises to master NLP for football:
Build a custom NER system that extracts players, teams, and competitions from a collection of match reports. Evaluate accuracy by comparing against manually labeled data.
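For the evaluation step, a common approach is span-level precision and recall against the hand-labeled entities. This is a minimal sketch: `evaluate_ner` and the sample (entity, label) pairs are illustrative, not part of the chapter's code.

```python
def evaluate_ner(predicted, gold):
    """Span-level precision/recall/F1 for entity extraction.

    predicted and gold are sets of (entity_text, label) tuples for one
    document. Exact-match scoring: partial overlaps count as misses.
    """
    tp = len(predicted & gold)  # true positives: exact span+label matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

gold = {("Bruno Fernandes", "PLAYER"), ("Liverpool", "TEAM"),
        ("Old Trafford", "VENUE")}
predicted = {("Bruno Fernandes", "PLAYER"), ("Liverpool", "TEAM"),
             ("VAR", "TEAM")}  # one spurious entity, one miss

scores = evaluate_ner(predicted, gold)
print(scores)  # precision, recall, and F1 are all 2/3 here
```

Averaging these scores over the whole labeled collection (micro or macro) gives the headline accuracy figure for the exercise.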
Create a sentiment analysis pipeline for fan tweets during a match. Track how sentiment changes over time (before, during, after key events like goals).
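The time-tracking part of this exercise can be sketched by bucketing timestamped tweets into fixed windows of match time and averaging a per-tweet score. The tiny `LEXICON` below is a stand-in for illustration only; a real pipeline would use VADER or a fine-tuned transformer as discussed above.

```python
from collections import defaultdict

# Toy lexicon for the sketch; swap in VADER scores in practice
LEXICON = {"brilliant": 1.0, "great": 0.8, "goal": 0.5,
           "terrible": -1.0, "awful": -0.9, "robbed": -0.7}

def tweet_score(text):
    """Average valence of lexicon words found in the tweet."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def sentiment_timeline(tweets, window=5):
    """Average sentiment per `window`-minute bucket of match time."""
    buckets = defaultdict(list)
    for minute, text in tweets:
        buckets[minute // window * window].append(tweet_score(text))
    return {start: sum(s) / len(s) for start, s in sorted(buckets.items())}

tweets = [
    (22, "brilliant goal what a strike"),
    (24, "great finish"),
    (57, "terrible defending we got robbed"),
]
timeline = sentiment_timeline(tweets)
print(timeline)  # {20: 0.775, 55: -0.85}
```

Aligning the bucket boundaries with known event minutes (goals, cards) makes before/after comparisons straightforward.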
Train a text classifier to predict match outcomes (win/loss/draw) from match report text. Compare TF-IDF + traditional ML vs. transformer approaches.
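The traditional-ML baseline for this exercise is usually TF-IDF plus Naive Bayes via scikit-learn; to show the mechanics without extra dependencies, here is a from-scratch bag-of-words multinomial Naive Bayes sketch (the `TinyNaiveBayes` class and the four sample reports are hypothetical).

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Bag-of-words multinomial Naive Bayes with add-one smoothing."""
    def fit(self, texts, labels):
        self.classes = set(labels)
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        best, best_lp = None, -math.inf
        n_docs = sum(self.class_counts.values())
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            lp = math.log(self.class_counts[c] / n_docs)  # class prior
            for w in words:  # smoothed per-word likelihoods
                lp += math.log((self.word_counts[c][w] + 1) /
                               (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

reports = ["dominant display superb finishing comfortable victory",
           "dismal defending heavy defeat poor performance",
           "clinical win controlled the game",
           "collapsed late conceded twice lost"]
outcomes = ["win", "loss", "win", "loss"]
clf = TinyNaiveBayes().fit(reports, outcomes)
print(clf.predict("superb clinical victory"))  # → win
```

The transformer comparison in the exercise would replace both the features and the classifier with a fine-tuned BERT/DistilBERT model, at a much higher compute cost.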
Build a system that generates a 2-3 sentence match summary from event data (goals, cards, key events). Combine structured data with NLG techniques.
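A template-and-rules baseline for this exercise can be sketched as below; `summarize_match` and its event schema are assumptions for illustration, and an LLM could polish the wording as shown earlier in the chapter.

```python
def summarize_match(home, away, score, events):
    """Two-sentence rule-based summary from structured event data."""
    goals = [e for e in events if e["type"] == "GOAL"]
    if not goals:
        return f"{home} and {away} played out a {score} draw."
    first, last = goals[0], goals[-1]
    s1 = (f"{home} beat {away} {score}, with {first['player']} "
          f"opening the scoring in minute {first['minute']}.")
    s2 = (f"{last['player']} settled it with the decisive goal "
          f"in minute {last['minute']}.")
    return s1 + " " + s2

events = [
    {"minute": 23, "type": "GOAL", "player": "Bruno Fernandes"},
    {"minute": 58, "type": "GOAL", "player": "Mohamed Salah"},
    {"minute": 89, "type": "GOAL", "player": "Marcus Rashford"},
]
summary = summarize_match("Man United", "Liverpool", "2-1", events)
print(summary)
```

Templates guarantee factual grounding in the event data; the trade-off versus LLM generation is narrative variety.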
Create a system that ingests transfer rumors from multiple sources, extracts player names and clubs, and calculates a credibility score based on source reliability. Track how rumors evolve over time.
Hint
Use the TransferRumorTracker class as a starting point. Assign reliability tiers to sources (1=official, 5=random social media). Weight aggregation by source reliability and check for consensus across multiple tier-1 sources.
Process a full match's worth of live commentary text and automatically detect all events (goals, cards, substitutions, key chances). Evaluate accuracy against official match event data.
Hint
Look for excitement patterns (exclamation marks, capitalization, extended vowels) alongside keyword patterns. Goals often have extended vowels in "GOOOAL!" style commentary.
Build a pipeline that processes manager press conference transcripts to automatically extract: (1) team news (injuries, availability), (2) key quotes, (3) tactical hints, (4) sentiment toward upcoming opponents.
Create a RAG-based Q&A system using a knowledge base of football statistics and historical data. Implement retrieval, context injection, and answer generation. Evaluate factual accuracy vs. pure LLM responses.
Hint
Use sentence embeddings (all-MiniLM-L6-v2) for retrieval. Compare RAG answers to ground truth for statistical questions to measure hallucination reduction.
Summary
Key Takeaways
- Named Entity Recognition extracts players, teams, and venues from text
- Player name resolution maps nicknames and aliases to canonical identities
- Sentiment analysis measures emotional tone in match reports and social media
- Aspect-based sentiment identifies sentiment toward specific topics (attacking, defending, referee)
- Text classification categorizes content by outcome, topic, or sentiment
- Topic modeling discovers themes in large text collections
- Summarization condenses long reports into key points
- LLMs enable advanced report generation and conversational interfaces
- RAG grounds LLM responses in factual knowledge bases
- Multilingual NLP processes football content across languages
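The player-name-resolution takeaway can be sketched with the standard library's `difflib`; a production system would use fuzzywuzzy (Python) or stringdist (R) from the library lists in this chapter. The `ALIASES` roster and `resolve_player` helper are hypothetical.

```python
import difflib

# Hypothetical canonical roster with known aliases
ALIASES = {
    "Bruno Fernandes": ["bruno", "fernandes", "b. fernandes"],
    "Marcus Rashford": ["rashford", "rashy", "m. rashford"],
    "Lisandro Martinez": ["licha", "martinez", "l. martinez"],
}

def resolve_player(mention, cutoff=0.6):
    """Map a raw text mention to a canonical player name, or None."""
    lookup = {alias: name for name, aliases in ALIASES.items()
              for alias in aliases + [name.lower()]}
    # Fuzzy match tolerates typos like a doubled letter
    match = difflib.get_close_matches(mention.lower(), lookup.keys(),
                                      n=1, cutoff=cutoff)
    return lookup[match[0]] if match else None

print(resolve_player("Rashfordd"))  # → Marcus Rashford
print(resolve_player("Haaland"))    # → None (not in this roster)
```

Context-based disambiguation (team mentioned nearby, competition, date) is still needed when an alias like "Bruno" maps to several players.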
Common Pitfalls
- Entity ambiguity: "Bruno" could refer to multiple players—use context for disambiguation
- Sarcasm detection: "What a great penalty decision!" may be sarcastic—sentiment analysis fails here
- Domain-specific language: Football jargon ("nutmeg", "rabona") may not be in general NLP models
- Real-time latency: Transformer models are slow for live commentary analysis
- LLM hallucination: LLMs may invent statistics—always verify with RAG or ground truth
- Language mixing: Multi-language tweets (code-switching) challenge standard models
- Emoji interpretation: "What a goal ⚽🔥🔥🔥" carries sentiment info often ignored
- Temporal context: "Rashford is on fire" means different things in 2019 vs. 2023
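The emoji pitfall above can be mitigated by blending a text model's score with emoji valence. This is a toy sketch: the `EMOJI_VALENCE` map and the 0.7/0.3 weights are arbitrary assumptions; a real system would use a full emoji sentiment lexicon.

```python
# Toy emoji valence map (illustrative values only)
EMOJI_VALENCE = {"⚽": 0.2, "🔥": 0.6, "😍": 0.8, "😡": -0.8, "🤡": -0.6}

def emoji_adjusted(text, base_score):
    """Blend a text sentiment score with the valence of emoji in the text."""
    hits = [v for ch, v in EMOJI_VALENCE.items() if ch in text]
    if not hits:
        return base_score
    emoji_score = sum(hits) / len(hits)
    return 0.7 * base_score + 0.3 * emoji_score  # weights are assumptions

# A flat text score gets lifted by enthusiastic emoji
adjusted = emoji_adjusted("What a goal ⚽🔥🔥🔥", base_score=0.1)
print(round(adjusted, 2))  # → 0.19
```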
Essential Libraries
Python Libraries:
- spacy - Industrial-strength NLP
- nltk - Classic NLP toolkit
- transformers - Hugging Face transformers
- sentence-transformers - Text embeddings
- fuzzywuzzy - Fuzzy string matching
- langdetect - Language detection
- openai - GPT API client
- sumy - Text summarization
R Packages:
- tidytext - Tidy text mining
- spacyr - spaCy interface for R
- textrecipes - Text preprocessing for modeling
- topicmodels - LDA topic modeling
- sentimentr - Sentence-level sentiment
- stringdist - Fuzzy string matching
- ellmer - LLM API interface
- text - Modern NLP in R
Model Selection Guide
| Task | Quick & Simple | Best Accuracy | Considerations |
|---|---|---|---|
| Sentiment Analysis | VADER | Fine-tuned RoBERTa | VADER is fast and interpretable |
| Named Entities | spaCy (en_core_web_sm) | Custom fine-tuned NER | Add football entity patterns |
| Classification | TF-IDF + Naive Bayes | BERT/DistilBERT | Zero-shot for no training data |
| Summarization | TextRank (extractive) | T5/BART (abstractive) | Extractive is more reliable |
| Report Generation | Templates + rules | GPT-4 / Claude | LLMs need fact-checking |
| Q&A | Simple retrieval | RAG with GPT-4 | RAG reduces hallucination |
Production Considerations
- Latency: For real-time applications, use smaller models (DistilBERT, TinyBERT) or rule-based systems
- Cost: LLM API calls add up quickly for high-volume applications—batch where possible
- Caching: Cache embeddings and common query results to reduce computation
- Fallbacks: Have rule-based fallbacks when ML models fail or are too slow
- Monitoring: Track model accuracy over time as language evolves (new player names, slang)
- Privacy: Be careful with user-generated content—PII may be present in social media data
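The caching point above can be sketched as a content-addressed embedding cache; `EmbeddingCache` is a hypothetical wrapper, and the lambda stands in for a real embedder such as `SentenceTransformer(...).encode`.

```python
import hashlib

class EmbeddingCache:
    """Content-addressed cache so repeated texts are embedded only once."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # e.g. SentenceTransformer(...).encode
        self.store = {}
        self.hits = 0

    def get(self, text):
        # Hash the text so the key is small and stable across runs
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.store[key] = self.embed_fn(text)
        return self.store[key]

# Stand-in embedder for the sketch (real code would call a model)
cache = EmbeddingCache(lambda t: [ord(c) % 7 for c in t])
cache.get("Who is the captain?")
cache.get("Who is the captain?")  # second call is served from cache
print(cache.hits)  # → 1
```

Persisting `store` to disk or Redis extends the same idea across processes, which matters when LLM or embedding calls are billed per request.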
NLP enables extraction of insights from the vast amount of football text data. The techniques covered—from basic sentiment analysis to sophisticated LLM-powered systems—provide a toolkit for analyzing match reports, social media, commentary, and more. In the next chapter, we'll explore real-time streaming analytics for live match analysis.
Social Media Analysis
Social media provides real-time fan reactions and discourse. Analyzing this data reveals public sentiment, trending topics, and emerging narratives around football events.