Chapter 60

Capstone - Complete Analytics System


The Football Analytics Research Landscape

Football analytics has grown from a niche interest to a vibrant research field spanning academia, industry, and the open-source community. Understanding how to consume, contribute to, and publish research is essential for advancing your analytics career.

Major Research Venues

Academic Conferences
  • MIT Sloan Sports Analytics Conference - Premier venue
  • ECML/PKDD Sports Analytics Workshop - Machine learning focus
  • KDD Sports Analytics Workshop - Data mining
  • StatsBomb Conference - Industry + research
  • Opta Forum - Industry applications
Key Publications
  • Journal of Sports Analytics - Peer-reviewed
  • Journal of Quantitative Analysis in Sports
  • International Journal of Performance Analysis
  • arXiv (cs.LG, stat.AP) - Preprints
  • SSRN Sports Research Network
Python

import pandas as pd
from tabulate import tabulate

# Football Analytics Research Taxonomy
research_areas = pd.DataFrame({
    "area": ["Expected Goals Models", "Player Valuation", "Tactical Analysis",
             "Tracking Data", "Injury Prediction", "Match Outcome Prediction",
             "Player Similarity", "Team Style Clustering", "Set Piece Analysis",
             "Goalkeeper Analysis"],
    "maturity": ["High", "High", "Medium-High", "Medium", "Medium",
                 "High", "Medium-High", "Medium", "Medium", "Medium"],
    "key_methods": ["Logistic Regression, XGBoost, Neural Nets",
                    "Market Models, Performance Metrics",
                    "Network Analysis, Clustering",
                    "Computer Vision, Spatial Statistics",
                    "Survival Analysis, Time Series",
                    "Poisson Models, Machine Learning",
                    "Embedding, Distance Metrics",
                    "K-Means, Hierarchical Clustering",
                    "Expected Goals, Game Theory",
                    "GSAA, Positioning Models"],
    "data_requirements": ["Event Data", "Event + Market Data", "Event/Tracking Data",
                          "Tracking Data", "Medical + Physical Data",
                          "Historical Match Data", "Event Data",
                          "Event/Tracking Data", "Event Data", "Event/Tracking Data"],
    "industry_adoption": ["Universal", "High", "Growing", "Elite Clubs",
                          "Growing", "Betting Industry", "Recruitment",
                          "Analysis Teams", "Set Piece Coaches", "Growing"]
})

print("Football Analytics Research Areas")
print("=" * 100)
print(tabulate(research_areas, headers="keys", tablefmt="grid", showindex=False))

# Research opportunity scoring
print("\n\nResearch Opportunity Assessment:")
maturity_scores = {"High": 1, "Medium-High": 2, "Medium": 3, "Low": 4}
research_areas["opportunity_score"] = research_areas["maturity"].map(maturity_scores)

opportunities = research_areas.nlargest(5, "opportunity_score")[["area", "maturity", "key_methods"]]
print("\nHigh-Opportunity Research Areas (Less Mature = More Opportunity):")
print(opportunities.to_string(index=False))

R

library(tidyverse)
library(gt)

# Football Analytics Research Taxonomy
research_areas <- tibble(
  area = c("Expected Goals Models", "Player Valuation", "Tactical Analysis",
           "Tracking Data", "Injury Prediction", "Match Outcome Prediction",
           "Player Similarity", "Team Style Clustering", "Set Piece Analysis",
           "Goalkeeper Analysis"),
  maturity = c("High", "High", "Medium-High", "Medium", "Medium",
               "High", "Medium-High", "Medium", "Medium", "Medium"),
  key_methods = c("Logistic Regression, XGBoost, Neural Nets",
                  "Market Models, Performance Metrics",
                  "Network Analysis, Clustering",
                  "Computer Vision, Spatial Statistics",
                  "Survival Analysis, Time Series",
                  "Poisson Models, Machine Learning",
                  "Embedding, Distance Metrics",
                  "K-Means, Hierarchical Clustering",
                  "Expected Goals, Game Theory",
                  "GSAA, Positioning Models"),
  data_requirements = c("Event Data", "Event + Market Data", "Event/Tracking Data",
                        "Tracking Data", "Medical + Physical Data",
                        "Historical Match Data", "Event Data",
                        "Event/Tracking Data", "Event Data", "Event/Tracking Data"),
  industry_adoption = c("Universal", "High", "Growing", "Elite Clubs",
                        "Growing", "Betting Industry", "Recruitment",
                        "Analysis Teams", "Set Piece Coaches", "Growing")
)

research_areas %>%
  gt() %>%
  tab_header(
    title = "Football Analytics Research Areas",
    subtitle = "Current state and opportunities"
  ) %>%
  cols_label(
    area = "Research Area",
    maturity = "Maturity",
    key_methods = "Key Methods",
    data_requirements = "Data Required",
    industry_adoption = "Industry Adoption"
  ) %>%
  tab_style(
    style = cell_fill(color = "#C8E6C9"),
    locations = cells_body(rows = maturity == "High")
  ) %>%
  tab_style(
    style = cell_fill(color = "#FFF9C4"),
    locations = cells_body(rows = maturity == "Medium")
  )

Conducting Football Analytics Research

Rigorous research methodology separates publishable work from casual analysis. This section covers the research process from hypothesis formation through validation and publication.

Research Workflow

Python

import pandas as pd
import numpy as np
from dataclasses import dataclass, field
from typing import List, Dict, Optional
from datetime import datetime

# Research Project Framework

@dataclass
class ResearchProject:
    """Framework for conducting rigorous football analytics research."""

    title: str
    hypothesis: str
    status: str = "planning"
    data_sources: Dict = field(default_factory=dict)
    methodology: Dict = field(default_factory=dict)
    results: Dict = field(default_factory=dict)

    def __post_init__(self):
        print(f"Research project initialized: {self.title}")

    def conduct_literature_review(self, keywords: List[str],
                                   databases: Optional[List[str]] = None) -> Dict:
        """Phase 1: Systematic literature review."""
        print("Conducting literature review...")

        if databases is None:
            databases = ["Google Scholar", "arXiv", "SSRN"]

        search_strategy = {
            "keywords": keywords,
            "databases": databases,
            "date_range": (2015, 2024),
            "inclusion_criteria": [
                "Peer-reviewed or reputable preprint",
                "Football/soccer specific or methodology transferable",
                "Publicly available or accessible"
            ],
            "exclusion_criteria": [
                "Non-English publications",
                "Conference abstracts only",
                "Superseded by newer work"
            ]
        }

        self.methodology["literature_review"] = search_strategy
        print("Literature review strategy documented")

        return search_strategy

    def setup_data_pipeline(self, sources: List[str],
                            validation_checks: List[str]) -> Dict:
        """Phase 2: Data collection and validation setup."""
        print("Setting up data pipeline...")

        data_pipeline = {
            "sources": sources,
            "collection_date": datetime.now().isoformat(),
            "validation": validation_checks,
            "preprocessing_steps": [
                "Remove duplicates",
                "Handle missing values",
                "Validate data ranges",
                "Cross-reference with external sources"
            ]
        }

        self.data_sources = data_pipeline
        self.status = "data_collection"

        return data_pipeline

    def design_methodology(self, approach: str,
                           baseline_comparisons: List[str],
                           metrics: List[str]) -> Dict:
        """Phase 3: Methodology design."""
        print("Designing methodology...")

        methodology = {
            "approach": approach,
            "baselines": baseline_comparisons,
            "evaluation_metrics": metrics,
            "statistical_tests": ["t-test", "Mann-Whitney U", "Bootstrap CI"],
            "cross_validation": {
                "method": "5-fold stratified",
                "holdout_set": 0.2,
                "temporal_split": True
            }
        }

        self.methodology["analysis"] = methodology
        self.status = "methodology"

        return methodology

    def run_experiments(self, model_configs: List[str]) -> pd.DataFrame:
        """Phase 4: Run experiments."""
        print("Running experiments...")

        experiments = pd.DataFrame({
            "experiment_id": [f"exp_{i+1}" for i in range(len(model_configs))],
            "config": model_configs,
            "status": "pending",
            "start_time": None,
            "end_time": None,
            "primary_metric": None
        })

        self.results["experiments"] = experiments
        self.status = "experimentation"

        return experiments

    def validate_results(self, results_df: pd.DataFrame,
                         alpha: float = 0.05) -> Dict:
        """Phase 5: Statistical validation."""
        print("Validating results...")

        validation = {
            "sample_size": len(results_df),
            "statistical_power": None,
            "confidence_level": 1 - alpha,
            "tests_performed": [],
            "multiple_comparison_correction": "Bonferroni",
            "effect_sizes": {}
        }

        self.results["validation"] = validation
        self.status = "validation"

        return validation

    def generate_summary(self) -> Dict:
        """Generate research summary."""
        return {
            "title": self.title,
            "hypothesis": self.hypothesis,
            "status": self.status,
            "data_sources": len(self.data_sources.get("sources", [])),
            "methodology": self.methodology.get("analysis", {}).get("approach"),
            "key_findings": self.results
        }


# Example: Setting up an xG research project
xg_project = ResearchProject(
    title="Contextual Expected Goals: Incorporating Defensive Pressure",
    hypothesis="Adding defensive pressure features improves xG model accuracy by >5%"
)

# Literature review
lit_review = xg_project.conduct_literature_review(
    keywords=["expected goals", "xG", "defensive pressure", "football analytics"],
    databases=["Google Scholar", "arXiv", "MIT Sloan Archives"]
)

# Data setup
data_pipeline = xg_project.setup_data_pipeline(
    sources=["StatsBomb Open Data", "Wyscout", "Custom Tracking"],
    validation_checks=["Shot count verification", "xG range validation",
                       "Missing coordinate check"]
)

# Methodology
methodology = xg_project.design_methodology(
    approach="Gradient Boosting with feature engineering",
    baseline_comparisons=["Basic xG (distance/angle)", "StatsBomb xG", "Public models"],
    metrics=["Log Loss", "Brier Score", "AUC-ROC", "Calibration"]
)

print("\nResearch Project Summary:")
print("=" * 50)
summary = xg_project.generate_summary()
for key, value in summary.items():
    print(f"{key}: {value}")
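The `validate_results` phase names Bonferroni as its multiple-comparison correction but leaves it unimplemented. A minimal sketch of the adjustment (the p-values below are hypothetical, standing in for comparisons of one model against several baselines):

```python
import numpy as np

def bonferroni_correct(p_values, alpha=0.05):
    """Bonferroni correction: multiply each p-value by the number of
    tests, cap at 1, and reject where the adjusted value stays below
    the family-wise error rate alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    p_adjusted = np.minimum(p * m, 1.0)
    reject = p_adjusted < alpha
    return p_adjusted, reject

# Hypothetical raw p-values from four baseline comparisons
p_raw = [0.012, 0.034, 0.002, 0.048]
p_adj, reject = bonferroni_correct(p_raw)
print(p_adj)    # [0.048 0.136 0.008 0.192]
print(reject)   # [ True False  True False]
```

Note how two comparisons that look significant at 0.05 in isolation no longer survive once the family of tests is accounted for.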

R

library(tidyverse)
library(R6)

# Research Project Framework
ResearchProject <- R6Class("ResearchProject",
  public = list(
    title = NULL,
    hypothesis = NULL,
    data_sources = list(),
    methodology = NULL,
    results = list(),
    status = "planning",

    initialize = function(title, hypothesis) {
      self$title <- title
      self$hypothesis <- hypothesis
      self$status <- "planning"
      message(paste("Research project initialized:", title))
    },

    # Phase 1: Literature Review
    conduct_literature_review = function(keywords, databases = c("Google Scholar",
                                                                  "arXiv", "SSRN")) {
      message("Conducting literature review...")

      # Structure for tracking related work
      literature <- tibble(
        paper_id = character(),
        title = character(),
        authors = character(),
        year = integer(),
        venue = character(),
        key_findings = character(),
        methodology = character(),
        relevance = character()
      )

      # Search strategy
      search_strategy <- list(
        keywords = keywords,
        databases = databases,
        date_range = c(2015, 2024),
        inclusion_criteria = c(
          "Peer-reviewed or reputable preprint",
          "Football/soccer specific or methodology transferable",
          "Publicly available or accessible"
        ),
        exclusion_criteria = c(
          "Non-English publications",
          "Conference abstracts only",
          "Superseded by newer work"
        )
      )

      self$methodology$literature_review <- search_strategy
      message("Literature review strategy documented")

      return(search_strategy)
    },

    # Phase 2: Data Collection & Validation
    setup_data_pipeline = function(sources, validation_checks) {
      message("Setting up data pipeline...")

      data_pipeline <- list(
        sources = sources,
        collection_period = Sys.Date(),
        validation = validation_checks,
        preprocessing_steps = c(
          "Remove duplicates",
          "Handle missing values",
          "Validate data ranges",
          "Cross-reference with external sources"
        )
      )

      self$data_sources <- data_pipeline
      self$status <- "data_collection"

      return(data_pipeline)
    },

    # Phase 3: Methodology Design
    design_methodology = function(approach, baseline_comparisons, metrics) {
      message("Designing methodology...")

      methodology <- list(
        approach = approach,
        baselines = baseline_comparisons,
        evaluation_metrics = metrics,
        statistical_tests = c("t-test", "Mann-Whitney U", "Bootstrap CI"),
        cross_validation = list(
          method = "5-fold stratified",
          holdout_set = 0.2,
          temporal_split = TRUE  # Important for time-series data
        )
      )

      self$methodology$analysis <- methodology
      self$status <- "methodology"

      return(methodology)
    },

    # Phase 4: Run Experiments
    run_experiments = function(model_configs) {
      message("Running experiments...")

      # Experiment tracking
      experiments <- tibble(
        experiment_id = paste0("exp_", seq_along(model_configs)),
        config = model_configs,
        status = "pending",
        start_time = as.POSIXct(NA),
        end_time = as.POSIXct(NA),
        primary_metric = NA_real_,
        secondary_metrics = list()
      )

      self$results$experiments <- experiments
      self$status <- "experimentation"

      return(experiments)
    },

    # Phase 5: Statistical Validation
    validate_results = function(results_df, alpha = 0.05) {
      message("Validating results...")

      validation <- list(
        sample_size = nrow(results_df),
        statistical_power = NA,  # Calculate based on effect size
        confidence_level = 1 - alpha,
        tests_performed = list(),
        multiple_comparison_correction = "Bonferroni",
        effect_sizes = list()
      )

      self$results$validation <- validation
      self$status <- "validation"

      return(validation)
    },

    # Generate research summary
    generate_summary = function() {
      summary <- list(
        title = self$title,
        hypothesis = self$hypothesis,
        status = self$status,
        data_sources = length(self$data_sources$sources),
        methodology = self$methodology$analysis$approach,
        key_findings = self$results
      )

      return(summary)
    }
  )
)

# Example: Setting up an xG research project
xg_project <- ResearchProject$new(
  title = "Contextual Expected Goals: Incorporating Defensive Pressure",
  hypothesis = "Adding defensive pressure features improves xG model accuracy by >5%"
)

# Literature review
lit_review <- xg_project$conduct_literature_review(
  keywords = c("expected goals", "xG", "defensive pressure", "football analytics"),
  databases = c("Google Scholar", "arXiv", "MIT Sloan Archives")
)

# Data setup
data_pipeline <- xg_project$setup_data_pipeline(
  sources = c("StatsBomb Open Data", "Wyscout", "Custom Tracking"),
  validation_checks = c("Shot count verification", "xG range validation",
                        "Missing coordinate check")
)

# Methodology
methodology <- xg_project$design_methodology(
  approach = "Gradient Boosting with feature engineering",
  baseline_comparisons = c("Basic xG (distance/angle)", "StatsBomb xG", "Public models"),
  metrics = c("Log Loss", "Brier Score", "AUC-ROC", "Calibration")
)

print("Research Project Summary:")
print(xg_project$generate_summary())
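Both versions of `design_methodology` flag `temporal_split` as important without showing what it means in practice: random K-fold can leak future matches into training folds. A minimal sketch of a time-ordered holdout, using a hypothetical `match_date` column:

```python
import numpy as np
import pandas as pd

# Hypothetical shot data spanning roughly three seasons
np.random.seed(42)
shots = pd.DataFrame({
    "match_date": pd.date_range("2021-08-01", periods=300, freq="3D"),
    "xg": np.random.uniform(0, 1, 300),
})

def temporal_split(df, date_col, holdout_frac=0.2):
    """Hold out the most recent matches rather than a random sample,
    so the model never trains on events after its test window."""
    df = df.sort_values(date_col)
    cutoff = int(len(df) * (1 - holdout_frac))
    return df.iloc[:cutoff], df.iloc[cutoff:]

train, test = temporal_split(shots, "match_date")
# Every training match precedes every test match
assert train["match_date"].max() < test["match_date"].min()
print(len(train), len(test))  # 240 60
```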

Writing Research Papers

Academic writing in sports analytics follows established conventions while requiring clear communication to both technical readers and domain experts.

Paper Structure

Standard Research Paper Sections
  1. Abstract (150-300 words)
    • Problem statement
    • Methodology summary
    • Key results with numbers
    • Main contribution
  2. Introduction
    • Context and motivation
    • Problem definition
    • Research questions/hypotheses
    • Contributions summary
    • Paper organization
  3. Related Work
    • Prior approaches
    • Gaps in existing research
    • How your work differs
  4. Data
    • Data sources and collection
    • Dataset statistics
    • Preprocessing steps
    • Train/validation/test splits
  5. Methodology
    • Technical approach
    • Model architecture
    • Feature engineering
    • Baseline comparisons
  6. Experiments & Results
    • Experimental setup
    • Evaluation metrics
    • Main results tables/figures
    • Statistical significance
  7. Discussion
    • Interpretation of results
    • Practical implications
    • Limitations
  8. Conclusion
    • Summary of contributions
    • Future work
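One way to keep a draft honest against the outline above is a simple completeness check; `outline_status` below is a hypothetical helper for tracking progress, not part of any publishing toolchain:

```python
# The standard section outline, used as a submission checklist
PAPER_SECTIONS = [
    "Abstract", "Introduction", "Related Work", "Data",
    "Methodology", "Experiments & Results", "Discussion", "Conclusion",
]

def outline_status(completed):
    """Report which standard sections are still missing from a draft."""
    missing = [s for s in PAPER_SECTIONS if s not in completed]
    pct = 100 * (len(PAPER_SECTIONS) - len(missing)) / len(PAPER_SECTIONS)
    return {"pct_complete": pct, "missing": missing}

status = outline_status(["Abstract", "Introduction", "Data"])
print(status["pct_complete"])  # 37.5
print(status["missing"][0])    # Related Work
```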
Python

import pandas as pd
import numpy as np
from scipy import stats

# Creating Publication-Quality Tables and Statistics

def create_results_table(results_data: pd.DataFrame) -> str:
    """
    Create a formatted results table for publication.
    """

    # Highlight best performance
    best_idx = results_data["log_loss"].idxmin()

    # Format table
    table_str = """
    ╔════════════════════════════════════════════════════════════════════════╗
    ║     Expected Goals Model Performance Comparison                        ║
    ╠═══════════════════════╦═══════════╦═════════════╦═════════╦════════════╣
    ║ Model                 ║ Log Loss  ║ Brier Score ║ AUC-ROC ║ Calibration║
    ╠═══════════════════════╬═══════════╬═════════════╬═════════╬════════════╣"""

    for idx, row in results_data.iterrows():
        marker = " *" if idx == best_idx else "  "
        table_str += f"""
    ║ {row['model']:<22}║ {row['log_loss']:.4f}{marker}  ║ {row['brier_score']:.4f}      ║ {row['auc_roc']:.3f}   ║ {row['calibration']:.4f}     ║"""

    table_str += """
    ╚═══════════════════════╩═══════════╩═════════════╩═════════╩════════════╝
    * Best performance (all differences significant at p < 0.05)
    """

    return table_str


# Example results data
model_results = pd.DataFrame({
    "model": ["Baseline (Dist/Angle)", "Random Forest", "XGBoost",
              "Neural Network", "Our Model (Pressure)"],
    "log_loss": [0.3421, 0.3156, 0.3089, 0.3102, 0.2934],
    "brier_score": [0.0923, 0.0867, 0.0842, 0.0851, 0.0798],
    "auc_roc": [0.762, 0.789, 0.801, 0.798, 0.824],
    "calibration": [0.0312, 0.0245, 0.0198, 0.0221, 0.0156]
})

print(create_results_table(model_results))


def report_significance(metric_a: np.ndarray, metric_b: np.ndarray,
                        test_type: str = "paired_t") -> str:
    """
    Generate statistical significance report.
    """

    # Perform statistical test
    if test_type == "paired_t":
        stat, p_value = stats.ttest_rel(metric_a, metric_b)
    else:  # wilcoxon
        stat, p_value = stats.wilcoxon(metric_a, metric_b)

    # Effect size (Cohen's d)
    diff = metric_a - metric_b
    cohens_d = np.mean(diff) / np.std(diff)

    # Bootstrap confidence interval
    n_bootstrap = 1000
    improvements = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(len(metric_a), len(metric_a), replace=True)
        improvements.append(
            (1 - np.mean(metric_b[idx]) / np.mean(metric_a[idx])) * 100
        )

    ci_low, ci_high = np.percentile(improvements, [2.5, 97.5])

    # Format p-value
    p_str = "< 0.001" if p_value < 0.001 else f"= {p_value:.3f}"

    return (f"Improvement: {(1 - np.mean(metric_b) / np.mean(metric_a)) * 100:.1f}% "
            f"(95% CI: [{ci_low:.1f}%, {ci_high:.1f}%], p {p_str}, d = {cohens_d:.2f})")


# Example comparison
np.random.seed(42)
baseline_scores = np.random.normal(0.34, 0.05, 100)
improved_scores = np.random.normal(0.29, 0.05, 100)

print("\nStatistical Report:")
print(report_significance(baseline_scores, improved_scores))

R

library(tidyverse)
library(knitr)
library(kableExtra)

# Creating Publication-Quality Tables

# Example: xG Model Comparison Results Table
create_results_table <- function(results_data) {
  results_data %>%
    kbl(
      caption = "Expected Goals Model Performance Comparison",
      booktabs = TRUE,
      digits = 4,
      col.names = c("Model", "Log Loss", "Brier Score", "AUC-ROC", "Calibration")
    ) %>%
    kable_styling(
      latex_options = c("striped", "hold_position"),
      full_width = FALSE
    ) %>%
    row_spec(
      which.min(results_data$log_loss),
      bold = TRUE,
      background = "#E8F5E9"
    ) %>%
    footnote(
      general = "Best performance highlighted. All differences significant at p < 0.05.",
      general_title = "Note: "
    )
}

# Example results data
model_results <- tibble(
  model = c("Baseline (Dist/Angle)", "Random Forest", "XGBoost",
            "Neural Network", "Our Model (Pressure)"),
  log_loss = c(0.3421, 0.3156, 0.3089, 0.3102, 0.2934),
  brier_score = c(0.0923, 0.0867, 0.0842, 0.0851, 0.0798),
  auc_roc = c(0.762, 0.789, 0.801, 0.798, 0.824),
  calibration = c(0.0312, 0.0245, 0.0198, 0.0221, 0.0156)
)

# Create table
results_table <- create_results_table(model_results)
print(results_table)

# Statistical Significance Reporting
report_significance <- function(metric_a, metric_b, test_type = "paired_t") {
  # Perform statistical test
  if (test_type == "paired_t") {
    test_result <- t.test(metric_a, metric_b, paired = TRUE)
  } else if (test_type == "wilcoxon") {
    test_result <- wilcox.test(metric_a, metric_b, paired = TRUE)
  }

  # Effect size (Cohen's d)
  cohens_d <- (mean(metric_a) - mean(metric_b)) / sd(metric_a - metric_b)

  # Format result
  sprintf(
    "Improvement: %.1f%% (95%% CI: [%.1f%%, %.1f%%], p %s, d = %.2f)",
    (1 - mean(metric_b) / mean(metric_a)) * 100,
    test_result$conf.int[1] / mean(metric_a) * 100,
    test_result$conf.int[2] / mean(metric_a) * 100,
    ifelse(test_result$p.value < 0.001, "< 0.001",
           sprintf("= %.3f", test_result$p.value)),
    cohens_d
  )
}

# Example comparison
set.seed(42)
baseline_scores <- rnorm(100, mean = 0.34, sd = 0.05)
improved_scores <- rnorm(100, mean = 0.29, sd = 0.05)

significance_report <- report_significance(baseline_scores, improved_scores)
print(paste("Statistical Report:", significance_report))

Reproducibility and Open Science

Reproducibility is a cornerstone of credible research. Football analytics research should enable others to verify and build upon your findings.

Python

import pandas as pd
import numpy as np
import hashlib
import json
from typing import Dict, Any
import sys

# Reproducibility Best Practices

def setup_reproducible_environment() -> Dict[str, Any]:
    """
    Document environment for reproducibility.
    """

    environment_doc = {
        "python_version": sys.version,
        "platform": sys.platform,
        "packages": {},  # Would use pip freeze in practice
        "snapshot_date": pd.Timestamp.now().isoformat()
    }

    # In practice, use:
    # pip freeze > requirements.txt
    # or
    # conda env export > environment.yml

    return environment_doc


def create_data_dictionary(dataset: pd.DataFrame,
                           dataset_name: str) -> pd.DataFrame:
    """
    Generate comprehensive data dictionary.
    """

    dictionary = pd.DataFrame({
        "variable": dataset.columns,
        "type": [str(dtype) for dtype in dataset.dtypes],
        "n_unique": [dataset[col].nunique() for col in dataset.columns],
        "n_missing": [dataset[col].isna().sum() for col in dataset.columns],
        "pct_missing": [dataset[col].isna().mean() * 100 for col in dataset.columns],
        "example_values": [
            ", ".join(map(str, dataset[col].dropna().unique()[:3]))
            for col in dataset.columns
        ]
    })

    # Save to file
    # dictionary.to_csv(f"{dataset_name}_dictionary.csv", index=False)

    return dictionary


def document_analysis_pipeline() -> Dict[str, Any]:
    """
    Document the analysis pipeline for reproducibility.
    """

    pipeline = {
        "steps": [
            "1. Data Loading: src/load_raw_data.py",
            "2. Preprocessing: src/preprocess_events.py",
            "3. Feature Engineering: src/create_features.py",
            "4. Model Training: src/train_model.py",
            "5. Evaluation: src/evaluate_model.py",
            "6. Visualization: src/create_figures.py"
        ],
        "execution_order": "Run python main.py or use make",
        "expected_runtime": "~2 hours on standard hardware",
        "hardware_requirements": "16GB RAM recommended for full dataset"
    }

    return pipeline


def create_results_checksum(results_df: pd.DataFrame) -> Dict[str, Any]:
    """
    Create checksum for results verification.
    """

    # Select numeric columns
    numeric_cols = results_df.select_dtypes(include=[np.number])

    checksum = {
        "n_rows": len(results_df),
        "n_cols": len(results_df.columns),
        "column_sums": numeric_cols.sum().to_dict(),
        "md5_hash": hashlib.md5(
            pd.util.hash_pandas_object(results_df).values
        ).hexdigest()
    }

    return checksum


# Example usage
np.random.seed(42)
example_data = pd.DataFrame({
    "shot_id": range(1, 1001),
    "xg": np.random.uniform(0, 1, 1000),
    "goal": np.random.binomial(1, 0.1, 1000),
    "distance": np.random.uniform(5, 35, 1000)
})

# Create documentation
print("Data Dictionary:")
print("=" * 70)
data_dict = create_data_dictionary(example_data, "shots")
print(data_dict.to_string(index=False))

print("\n\nAnalysis Pipeline:")
print("=" * 70)
pipeline_doc = document_analysis_pipeline()
for key, value in pipeline_doc.items():
    print(f"\n{key}:")
    if isinstance(value, list):
        for item in value:
            print(f"  {item}")
    else:
        print(f"  {value}")

print("\n\nResults Checksum:")
print("=" * 70)
checksum = create_results_checksum(example_data)
print(json.dumps(checksum, indent=2))

R

library(tidyverse)
library(renv)

# Reproducibility Best Practices

# 1. Environment Management with renv
setup_reproducible_environment <- function(project_path) {
  # Initialize renv for dependency tracking
  # renv::init()

  # Snapshot current dependencies
  # renv::snapshot()

  # Document R version
  session_info <- sessionInfo()

  environment_doc <- list(
    r_version = paste(R.version$major, R.version$minor, sep = "."),
    platform = R.version$platform,
    packages = installed.packages()[, c("Package", "Version")],
    snapshot_date = Sys.Date()
  )

  return(environment_doc)
}

# 2. Data Documentation
create_data_dictionary <- function(dataset, dataset_name) {
  # Generate data dictionary
  dictionary <- tibble(
    variable = names(dataset),
    type = sapply(dataset, class),
    n_unique = sapply(dataset, function(x) length(unique(x))),
    n_missing = sapply(dataset, function(x) sum(is.na(x))),
    pct_missing = sapply(dataset, function(x) mean(is.na(x)) * 100),
    example_values = sapply(dataset, function(x) {
      paste(head(unique(x), 3), collapse = ", ")
    })
  )

  # Save to file
  # write_csv(dictionary, paste0(dataset_name, "_dictionary.csv"))

  return(dictionary)
}

# 3. Analysis Pipeline Documentation
document_analysis_pipeline <- function() {
  pipeline <- list(
    steps = c(
      "1. Data Loading: load_raw_data.R",
      "2. Preprocessing: preprocess_events.R",
      "3. Feature Engineering: create_features.R",
      "4. Model Training: train_model.R",
      "5. Evaluation: evaluate_model.R",
      "6. Visualization: create_figures.R"
    ),
    execution_order = "Run scripts in numbered order or use make",
    expected_runtime = "~2 hours on standard hardware",
    hardware_requirements = "16GB RAM recommended for full dataset"
  )

  return(pipeline)
}

# 4. Results Verification Checksums
create_results_checksum <- function(results_df) {
  # Create reproducibility checksum
  checksum <- list(
    n_rows = nrow(results_df),
    n_cols = ncol(results_df),
    column_sums = sapply(results_df[sapply(results_df, is.numeric)], sum),
    md5_hash = digest::digest(results_df, algo = "md5")
  )

  return(checksum)
}

# Example usage
set.seed(42)
example_data <- tibble(
  shot_id = 1:1000,
  xg = runif(1000, 0, 1),
  goal = rbinom(1000, 1, 0.1),
  distance = runif(1000, 5, 35)
)

# Create documentation
data_dict <- create_data_dictionary(example_data, "shots")
print("Data Dictionary:")
print(data_dict)

pipeline_doc <- document_analysis_pipeline()
cat("\nAnalysis Pipeline:\n")
print(pipeline_doc)
Code Repository Best Practices
  • README.md - Clear setup and execution instructions
  • requirements.txt / environment.yml - Dependency versions
  • data/ - Sample data or download scripts
  • src/ - Well-organized source code
  • notebooks/ - Exploratory analysis
  • results/ - Output figures and tables
  • tests/ - Unit tests for key functions
  • LICENSE - Clear usage terms
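The checklist above can be scaffolded in a few lines; the directory and file names below follow the suggested layout, not any required convention:

```python
from pathlib import Path

# Suggested research-repo skeleton, mirroring the checklist above
REPO_LAYOUT = {
    "dirs": ["data", "src", "notebooks", "results", "tests"],
    "files": ["README.md", "requirements.txt", "LICENSE"],
}

def scaffold_repo(root):
    """Create the directory skeleton, with .gitkeep placeholders so
    empty directories survive a git commit."""
    root = Path(root)
    for d in REPO_LAYOUT["dirs"]:
        (root / d).mkdir(parents=True, exist_ok=True)
        (root / d / ".gitkeep").touch()
    for f in REPO_LAYOUT["files"]:
        (root / f).touch()
    return sorted(p.name for p in root.iterdir())

print(scaffold_repo("my-xg-paper"))
```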

Industry Publication & Communication

Not all valuable research belongs in academic venues. Industry blogs, conference talks, and open-source projects are equally important for advancing the field.

Python

import pandas as pd
from dataclasses import dataclass
from typing import List, Dict

# Industry Publication Framework

@dataclass
class IndustryPublication:
    """Framework for creating industry-focused content."""

    title: str
    format: str
    audience: str

    def create_blog_outline(self, key_finding: str,
                            data_viz_count: int = 3) -> Dict:
        """Create blog post outline."""

        outline = {
            "hook": "Compelling opening that challenges conventional wisdom",
            "context": "Brief background (2-3 paragraphs)",
            "methodology": "High-level approach without excessive technical detail",
            "key_findings": {
                "finding_1": f"Primary insight with supporting visualization: {key_finding}",
                "finding_2": "Secondary insight",
                "finding_3": "Practical implication"
            },
            "visualizations": f"{data_viz_count} clear, annotated charts",
            "conclusion": "Actionable takeaway for readers",
            "technical_appendix": "Optional: detailed methodology"
        }

        return outline

    def create_talk_outline(self, duration_minutes: int = 20) -> Dict:
        """Create conference talk outline (section timings assume a 20-minute slot)."""

        outline = {
            "intro": {
                "duration": "2 minutes",
                "content": ["Hook/problem statement", "Why this matters",
                           "Preview of key insight"]
            },
            "context": {
                "duration": "3 minutes",
                "content": ["Industry background", "Current approaches",
                           "Gap/opportunity"]
            },
            "methodology": {
                "duration": "5 minutes",
                "content": ["Data sources", "Analytical approach",
                           "Key innovations"]
            },
            "results": {
                "duration": "7 minutes",
                "content": ["Main findings (3 max)", "Visualizations",
                           "Statistical support"]
            },
            "implications": {
                "duration": "2 minutes",
                "content": ["Practical applications", "Limitations",
                           "Future directions"]
            },
            "qa": {
                "duration": "1 minute buffer",
                "content": "Prepare for common questions"
            }
        }

        return outline

    def create_twitter_thread(self, key_points: List[str],
                              max_tweets: int = 10) -> Dict:
        """Create Twitter/X thread outline."""

        thread = {
            "tweet_1": "Hook: Surprising finding or question",
            "tweets_middle": [f"Key point {i+1}" for i in range(min(len(key_points), max_tweets - 2))],
            "final_tweet": "Call to action: Link to full analysis",
            "media": "Include 1-2 compelling visualizations",
            "hashtags": ["#FootballAnalytics", "#xG", "#SportScience"]
        }

        return thread


# Example: Creating content for different formats
content = IndustryPublication(
    title="Why Traditional Assists Are Misleading",
    format="multi-platform",
    audience="Football analysts and enthusiasts"
)

blog_outline = content.create_blog_outline(
    key_finding="xA provides 40% better prediction of future assists"
)

talk_outline = content.create_talk_outline(duration_minutes=15)

print("Blog Post Outline")
print("=" * 50)
for section, description in blog_outline.items():
    print(f"\n{section}:")
    if isinstance(description, dict):
        for k, v in description.items():
            print(f"  - {k}: {v}")
    else:
        print(f"  {description}")

print("\n\nConference Talk Outline")
print("=" * 50)
for section, details in talk_outline.items():
    print(f"\n{section} ({details['duration']}):")
    if isinstance(details["content"], list):
        for item in details["content"]:
            print(f"  - {item}")
    else:
        print(f"  {details['content']}")

library(tidyverse)
library(R6)

# Industry Publication Framework
IndustryPublication <- R6Class("IndustryPublication",
  public = list(
    title = NULL,
    format = NULL,
    audience = NULL,

    initialize = function(title, format, audience) {
      self$title <- title
      self$format <- format
      self$audience <- audience
    },

    # Blog Post Structure
    create_blog_outline = function(key_finding, data_viz_count = 3) {
      outline <- list(
        hook = "Compelling opening that challenges conventional wisdom",
        context = "Brief background (2-3 paragraphs)",
        methodology = "High-level approach without excessive technical detail",
        key_findings = list(
          finding_1 = "Primary insight with supporting visualization",
          finding_2 = "Secondary insight",
          finding_3 = "Practical implication"
        ),
        visualizations = paste(data_viz_count, "clear, annotated charts"),
        conclusion = "Actionable takeaway for readers",
        technical_appendix = "Optional: detailed methodology for interested readers"
      )

      return(outline)
    },

    # Conference Talk Structure
    create_talk_outline = function(duration_minutes = 20) {
      # Section budgets below assume a 20-minute slot; scale proportionally
      scale <- duration_minutes / 20
      minutes <- function(base) paste0(round(base * scale), " minutes")

      outline <- list(
        intro = list(
          duration = minutes(2),
          content = c("Hook/problem statement", "Why this matters", "Preview of key insight")
        ),
        context = list(
          duration = minutes(3),
          content = c("Industry background", "Current approaches", "Gap/opportunity")
        ),
        methodology = list(
          duration = minutes(5),
          content = c("Data sources", "Analytical approach", "Key innovations")
        ),
        results = list(
          duration = minutes(7),
          content = c("Main findings (3 max)", "Visualizations", "Statistical support")
        ),
        implications = list(
          duration = minutes(2),
          content = c("Practical applications", "Limitations", "Future directions")
        ),
        qa = list(
          duration = paste0(round(1 * scale), " minute buffer"),
          content = "Prepare for common questions"
        )
      )

      return(outline)
    },

    # Social Media Thread Structure
    create_twitter_thread = function(key_points, max_tweets = 10) {
      thread <- list(
        tweet_1 = "Hook: Surprising finding or question (with emoji)",
        tweets_2_to_n = paste("Key point", 1:min(length(key_points), max_tweets - 2)),
        final_tweet = "Call to action: Link to full analysis",
        media = "Include 1-2 compelling visualizations",
        hashtags = c("#FootballAnalytics", "#xG", "#SportScience")
      )

      return(thread)
    }
  )
)

# Example: Creating content for different formats
content <- IndustryPublication$new(
  title = "Why Traditional Assists Are Misleading",
  format = "multi-platform",
  audience = "Football analysts and enthusiasts"
)

blog_outline <- content$create_blog_outline(
  key_finding = "xA provides 40% better prediction of future assists than traditional assists"
)

talk_outline <- content$create_talk_outline(duration_minutes = 15)

cat("Blog Post Outline:\n")
print(blog_outline)

cat("\nConference Talk Outline:\n")
print(talk_outline)
Common Publication Pitfalls
  • Overfitting claims - Test on holdout data, not training data
  • Cherry-picking examples - Show representative cases, not just best results
  • Ignoring baselines - Always compare against simple alternatives
  • Data leakage - Ensure temporal integrity in train/test splits
  • P-hacking - Pre-register hypotheses when possible
  • Overcomplicating - Simple models often perform comparably
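
Two of these pitfalls, data leakage and overfitted claims, usually come down to how the train/test split is made. A minimal sketch with hypothetical match data (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical match-level data, one row per match in date order
rng = np.random.default_rng(42)
matches = pd.DataFrame({
    "date": pd.date_range("2023-08-01", periods=200, freq="3D"),
    "home_xg": rng.gamma(2.0, 0.7, size=200),
    "home_win": rng.integers(0, 2, size=200),
}).sort_values("date").reset_index(drop=True)

# A random split lets future matches leak into training; for forecasting,
# split on time so the holdout is strictly later than anything trained on
split = int(len(matches) * 0.8)
train, test = matches.iloc[:split], matches.iloc[split:]

assert train["date"].max() < test["date"].min()
print(f"train: {len(train)} matches, test: {len(test)} matches")
```

Reporting metrics only on the later, untouched `test` block is what separates a defensible claim from a training-set artifact.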

Building Your Research Reputation

A strong research reputation opens doors to collaborations, job opportunities, and influence in the field.

Academic Track
  • Publish in peer-reviewed venues
  • Present at academic conferences
  • Collaborate with university researchers
  • Pursue PhD or postdoc in sports analytics
  • Review papers for journals/conferences
  • Teach courses or workshops
Industry Track
  • Write technical blog posts
  • Speak at industry conferences
  • Release open-source tools
  • Engage on analytics Twitter/X
  • Contribute to open datasets
  • Mentor junior analysts

import pandas as pd
from dataclasses import dataclass, field
from typing import Dict

# Research Portfolio Tracker

@dataclass
class ResearchPortfolio:
    """Track research outputs and impact."""

    publications: pd.DataFrame = field(default_factory=lambda: pd.DataFrame(
        columns=["title", "venue", "year", "type", "citations", "url"]
    ))
    projects: pd.DataFrame = field(default_factory=lambda: pd.DataFrame(
        columns=["name", "description", "github_stars", "downloads", "status"]
    ))
    talks: pd.DataFrame = field(default_factory=lambda: pd.DataFrame(
        columns=["title", "venue", "date", "audience_size", "recording_url"]
    ))
    collaborations: pd.DataFrame = field(default_factory=lambda: pd.DataFrame(
        columns=["partner", "organization", "project", "status"]
    ))
    metrics: Dict = field(default_factory=lambda: {
        "h_index": None,
        "total_citations": None,
        "github_followers": None,
        "twitter_followers": None
    })

    def add_publication(self, title: str, venue: str, year: int,
                        pub_type: str, citations: int = 0,
                        url: str = "") -> None:
        """Add a publication to portfolio."""
        new_pub = pd.DataFrame([{
            "title": title, "venue": venue, "year": year,
            "type": pub_type, "citations": citations, "url": url
        }])
        self.publications = pd.concat([self.publications, new_pub],
                                       ignore_index=True)

    def add_project(self, name: str, description: str,
                    github_stars: int = 0, downloads: int = 0,
                    status: str = "active") -> None:
        """Add an open-source project."""
        new_project = pd.DataFrame([{
            "name": name, "description": description,
            "github_stars": github_stars, "downloads": downloads,
            "status": status
        }])
        self.projects = pd.concat([self.projects, new_project],
                                   ignore_index=True)

    def calculate_impact(self) -> Dict:
        """Calculate impact metrics."""
        return {
            "academic_impact": {
                "publications": len(self.publications),
                "peer_reviewed": len(self.publications[
                    self.publications["type"] == "peer-reviewed"
                ]),
                "total_citations": self.publications["citations"].sum(),
                "h_index": self.metrics.get("h_index")
            },
            "industry_impact": {
                "blog_posts": len(self.publications[
                    self.publications["type"] == "blog"
                ]),
                "talks_given": len(self.talks),
                "open_source_projects": len(self.projects),
                "total_github_stars": self.projects["github_stars"].sum()
            },
            "community_impact": {
                "collaborations": len(self.collaborations),
                "mentees": None,
                "datasets_released": None
            }
        }


# Example portfolio
my_portfolio = ResearchPortfolio()

# Add a publication
my_portfolio.add_publication(
    title="Contextual Expected Goals with Defensive Pressure",
    venue="MIT Sloan Sports Analytics Conference",
    year=2024,
    pub_type="peer-reviewed",
    citations=15,
    url="https://example.com/paper"
)

# Add an open-source project
my_portfolio.add_project(
    name="football-xg",
    description="Open-source expected goals model library",
    github_stars=245,
    downloads=12000,
    status="active"
)

# Calculate impact
impact = my_portfolio.calculate_impact()

print("Research Impact Summary")
print("=" * 50)
for category, metrics in impact.items():
    print(f"\n{category}:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value}")
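
The `h_index` field above is stored as a manually entered metric, but given the citation counts it can also be computed directly. A standalone sketch (not part of the class):

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(ranked, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four papers with >= 4 citations each
```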

library(tidyverse)

# Research Portfolio Tracker
create_research_portfolio <- function() {
  portfolio <- list(
    publications = tibble(
      title = character(),
      venue = character(),
      year = integer(),
      type = character(),  # peer-reviewed, preprint, blog, talk
      citations = integer(),
      url = character()
    ),

    projects = tibble(
      name = character(),
      description = character(),
      github_stars = integer(),
      downloads = integer(),
      status = character()
    ),

    talks = tibble(
      title = character(),
      venue = character(),
      date = character(),
      audience_size = integer(),
      recording_url = character()
    ),

    collaborations = tibble(
      partner = character(),
      organization = character(),
      project = character(),
      status = character()
    ),

    metrics = list(
      h_index = NA,
      total_citations = NA,
      github_followers = NA,
      twitter_followers = NA
    )
  )

  return(portfolio)
}

# Calculate impact metrics
calculate_impact <- function(portfolio) {
  impact <- list(
    academic_impact = list(
      publications = nrow(portfolio$publications),
      peer_reviewed = sum(portfolio$publications$type == "peer-reviewed"),
      total_citations = sum(portfolio$publications$citations, na.rm = TRUE),
      h_index = portfolio$metrics$h_index
    ),

    industry_impact = list(
      blog_posts = sum(portfolio$publications$type == "blog"),
      talks_given = nrow(portfolio$talks),
      open_source_projects = nrow(portfolio$projects),
      total_github_stars = sum(portfolio$projects$github_stars, na.rm = TRUE)
    ),

    community_impact = list(
      collaborations = nrow(portfolio$collaborations),
      mentees = NA,  # Track separately
      datasets_released = NA
    )
  )

  return(impact)
}

# Example portfolio
my_portfolio <- create_research_portfolio()

# Add a publication
my_portfolio$publications <- bind_rows(
  my_portfolio$publications,
  tibble(
    title = "Contextual Expected Goals with Defensive Pressure",
    venue = "MIT Sloan Sports Analytics Conference",
    year = 2024,
    type = "peer-reviewed",
    citations = 15,
    url = "https://example.com/paper"
  )
)

# Add an open-source project
my_portfolio$projects <- bind_rows(
  my_portfolio$projects,
  tibble(
    name = "football-xg",
    description = "Open-source expected goals model library",
    github_stars = 245,
    downloads = 12000,
    status = "active"
  )
)

impact <- calculate_impact(my_portfolio)
print("Research Impact Summary:")
print(impact)

Practice Exercises

Exercise 59.1: Literature Review

Conduct a systematic literature review on a football analytics topic of your choice (e.g., goalkeeper evaluation, pressing metrics, injury prediction). Document at least 10 relevant papers with key findings and methodology.

Use Google Scholar, arXiv (cs.LG, stat.AP), and the MIT Sloan archives. Search for both football-specific papers and methodology papers from other domains that could apply. Track papers in a structured table with columns for: citation, year, key contribution, methodology, data used, limitations.
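
A structured tracker along those lines can be kept as a small DataFrame; the entry below is a placeholder, not a real paper:

```python
import pandas as pd

# Columns suggested above: citation, year, key contribution, methodology,
# data used, limitations
papers = pd.DataFrame(columns=[
    "citation", "year", "key_contribution", "methodology",
    "data_used", "limitations",
])

# Placeholder entry to show the shape of a row
papers.loc[len(papers)] = [
    "Example et al. (2023)", 2023, "Pressure-adjusted xG model",
    "Gradient boosting", "Event + tracking data", "Single league, one season",
]

print(papers.sort_values("year", ascending=False).to_string(index=False))
```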
Exercise 59.2: Write a Technical Blog Post

Write a 1500-2000 word blog post on a football analytics insight. Include at least 3 original visualizations and ensure the content is accessible to non-technical football fans while maintaining analytical rigor.

Start with a hook that challenges a common belief. Use analogies to explain technical concepts. Each visualization should tell a clear story. End with actionable insights for coaches, fans, or analysts. Consider publishing on Medium, Substack, or a personal blog.
Exercise 59.3: Create a Reproducibility Package

Take an analysis you've previously completed and create a full reproducibility package. Include environment specification, data dictionary, documented code, and verification checksums. Test by having someone else reproduce your results.

Create a GitHub repository with: README (clear instructions), requirements.txt/environment.yml, data/ folder with sample data or download script, src/ with numbered scripts, results/ for outputs, and tests/ for verification. Use random seeds and version-pin all dependencies.
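
The verification checksums can be generated with the standard library alone. This sketch assumes a results/ folder and a CHECKSUMS.txt manifest; both names are illustrative:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large result files are handled."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(results_dir: str = "results",
                   manifest_name: str = "CHECKSUMS.txt") -> None:
    """Record one checksum per file so reproduced outputs can be verified."""
    root = Path(results_dir)
    files = [p for p in sorted(root.rglob("*"))
             if p.is_file() and p.name != manifest_name]  # skip the manifest itself
    lines = [f"{sha256_of(p)}  {p.relative_to(root)}" for p in files]
    (root / manifest_name).write_text("\n".join(lines) + "\n")
```

Whoever reproduces the analysis reruns the pipeline, regenerates the manifest, and diffs it against the committed one; any mismatch pinpoints exactly which output changed.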

Chapter Summary

Contributing to football analytics research advances both your career and the field as a whole. Whether through peer-reviewed papers, open-source tools, or thoughtful blog posts, sharing your work enables others to build upon your insights.