Capstone - Complete Analytics System
Introduction to Ethics in Football Analytics
As football analytics becomes more sophisticated and influential, ethical considerations become increasingly important. From player privacy to algorithmic fairness, analysts must navigate complex ethical terrain while delivering valuable insights.
Why Ethics Matter
Analytics decisions can significantly impact players' careers, transfer fees worth millions, and fan experiences. With this influence comes responsibility to ensure our work is fair, transparent, and respectful of individual rights.
Privacy
Protecting personal data and respecting boundaries
Fairness
Avoiding bias in player evaluation
Transparency
Explainable models and clear communication
Accountability
Taking responsibility for outcomes
Data Privacy in Football
Football analytics involves sensitive personal data including physical performance metrics, medical information, and location tracking. Understanding privacy regulations and best practices is essential for responsible data handling.
Types of Sensitive Data in Football
| Data Type | Examples | Sensitivity | Regulations |
|---|---|---|---|
| Performance Data | xG, pass completion, sprint speed | Medium | Generally public |
| Physical/Health Data | Heart rate, injury history, recovery metrics | High | GDPR, HIPAA |
| Location/Tracking Data | GPS coordinates, movement patterns | High | GDPR, player consent |
| Contract/Financial | Wages, bonuses, release clauses | High | Confidential |
| Personal Information | Age, nationality, family status | Medium | GDPR, local laws |
# Privacy-Aware Data Handling in Python
import hashlib
import pandas as pd
from typing import List, Dict, Optional
from dataclasses import dataclass
from enum import Enum
class DataSensitivity(Enum):
    """Sensitivity tiers for club data, from freely shareable to restricted."""
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"
class DataAnonymizer:
    """Privacy helpers: hash identifiers, strip direct identifiers, bucket ages."""

    @staticmethod
    def hash_identifier(identifier: str) -> str:
        """Return a truncated one-way SHA-256 digest of *identifier*."""
        digest = hashlib.sha256(str(identifier).encode()).hexdigest()
        return digest[:16]

    @staticmethod
    def generalize_age(age: int) -> str:
        """Map an exact age onto a coarse, less identifying bracket."""
        brackets = [(21, "U21"), (26, "21-25"), (30, "26-29"), (34, "30-33")]
        for upper, label in brackets:
            if age < upper:
                return label
        return "34+"

    def anonymize(self, df: pd.DataFrame) -> pd.DataFrame:
        """Return a copy of *df* with identifying fields hashed, dropped, or
        generalized; the input frame is left untouched."""
        out = df.copy()
        # Replace the raw id with a one-way hash
        if "player_id" in out.columns:
            out["player_id_hashed"] = out["player_id"].apply(self.hash_identifier)
            out = out.drop(columns=["player_id"])
        # Drop columns that directly identify a person
        direct_identifiers = ("player_name", "date_of_birth", "email", "phone")
        present = [col for col in direct_identifiers if col in out.columns]
        out = out.drop(columns=present)
        # Exact age is quasi-identifying; keep only the bracket
        if "age" in out.columns:
            out["age_group"] = out["age"].apply(self.generalize_age)
            out = out.drop(columns=["age"])
        return out
class DataMinimizer:
    """Enforce data minimization: expose only the columns a purpose requires."""

    # Column whitelist per approved analysis purpose.
    ANALYSIS_REQUIREMENTS = {
        "performance_analysis": ["player_id_hashed", "goals", "xG", "age_group"],
        "injury_risk": ["player_id_hashed", "age_group", "minutes_played"],
        "recruitment": ["player_id_hashed", "age_group", "goals", "xG", "position"],
        "public_stats": ["goals", "assists", "age_group"],
    }

    def minimize(self, df: pd.DataFrame, analysis_purpose: str) -> pd.DataFrame:
        """Project *df* down to the whitelisted columns for *analysis_purpose*.

        Raises ValueError for an unrecognized purpose; whitelisted columns
        missing from *df* are silently skipped.
        """
        if analysis_purpose not in self.ANALYSIS_REQUIREMENTS:
            raise ValueError(f"Unknown analysis purpose: {analysis_purpose}")
        whitelist = self.ANALYSIS_REQUIREMENTS[analysis_purpose]
        keep = [col for col in whitelist if col in df.columns]
        return df[keep]
@dataclass
class AccessControl:
    """Role-based access control for data.

    ROLE_PERMISSIONS maps a role name to the data types it may access.
    A custom mapping may be supplied at construction time; when omitted,
    the default club roles below are installed.
    """
    ROLE_PERMISSIONS: Optional[Dict[str, List[str]]] = None

    def __post_init__(self):
        # Bug fix: the original unconditionally overwrote any mapping the
        # caller passed in; only install the defaults when none was given.
        if self.ROLE_PERMISSIONS is None:
            self.ROLE_PERMISSIONS = {
                "analyst": ["performance", "public_stats"],
                "medical_staff": ["performance", "public_stats", "health_data"],
                "management": ["performance", "public_stats", "health_data", "contracts"],
                "data_scientist": ["performance", "public_stats", "anonymized_tracking"]
            }

    def check_access(self, user_role: str, data_type: str) -> bool:
        """Return True if *user_role* is allowed to access *data_type*.

        Unknown roles get no access at all.
        """
        return data_type in self.ROLE_PERMISSIONS.get(user_role, [])

    def filter_data(self, df: pd.DataFrame, user_role: str,
                    column_sensitivity: Dict[str, str]) -> pd.DataFrame:
        """Drop columns of *df* whose sensitivity the role may not access.

        Columns absent from *column_sensitivity* default to "public".
        """
        allowed_columns = [
            col for col in df.columns
            if self.check_access(user_role, column_sensitivity.get(col, "public"))
        ]
        return df[allowed_columns]
# Example usage: anonymize a small roster, then demonstrate role checks.
player_data = pd.DataFrame({
    "player_id": [1, 2, 3, 4, 5],
    "player_name": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "age": [23, 28, 31, 25, 19],
    "goals": [15, 8, 12, 5, 3],
    "xG": [12.5, 9.2, 10.8, 6.1, 4.2]
})
anonymizer = DataAnonymizer()
anonymized = anonymizer.anonymize(player_data)  # names/ids removed, ages bucketed
print("Anonymized Data:")
print(anonymized)
# Access control: analysts may not read health data, medical staff may.
ac = AccessControl()
print(f"\nAnalyst accessing health_data: {ac.check_access('analyst', 'health_data')}")
print(f"Medical staff accessing health_data: {ac.check_access('medical_staff', 'health_data')}")
# Privacy-Aware Data Handling in R
library(tidyverse)
library(digest)
# Data anonymization functions
# Anonymize a player data frame: hash identifiers, drop direct identifiers,
# and generalize exact ages to coarse brackets (mirrors the Python version).
anonymize_player_data <- function(data) {
  data %>%
    mutate(
      # One-way SHA-256 hash of the player identifier
      player_id_hashed = sapply(player_id, function(x)
        digest(as.character(x), algo = "sha256")),
      # Remove directly identifying information
      player_name = NULL,
      date_of_birth = NULL,
      # Generalize age to ranges. Bug fixes vs. the original:
      #  - break at 20 (not 21) so age 21 falls in "21-25", matching the
      #    labels and the Python implementation
      #  - upper bound Inf so ages above 40 get "34+" instead of NA
      age_group = cut(age,
                      breaks = c(0, 20, 25, 29, 33, Inf),
                      labels = c("U21", "21-25", "26-29", "30-33", "34+"))
    ) %>%
    select(-player_id, -age)
}
# Example usage: build a small roster and anonymize it
player_data <- tibble(
  player_id = 1:5,
  player_name = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  age = c(23, 28, 31, 25, 19),
  date_of_birth = as.Date(c("2001-03-15", "1996-07-22", "1993-01-08",
                            "1999-05-30", "2005-11-12")),
  goals = c(15, 8, 12, 5, 3),
  xG = c(12.5, 9.2, 10.8, 6.1, 4.2)
)
anonymized <- anonymize_player_data(player_data)
print("Anonymized Data:")
print(anonymized)

# Data minimization - only expose the columns a declared purpose needs
minimize_data <- function(data, analysis_purpose) {
  # Column whitelist per approved analysis purpose
  required_fields <- list(
    "performance_analysis" = c("player_id_hashed", "goals", "xG", "age_group"),
    "injury_risk" = c("player_id_hashed", "age_group", "minutes_played"),
    "recruitment" = c("player_id_hashed", "age_group", "goals", "xG", "position")
  )
  fields <- required_fields[[analysis_purpose]]
  if (is.null(fields)) {
    stop("Unknown analysis purpose")
  }
  # any_of(): whitelisted columns absent from the data are silently skipped
  data %>% select(any_of(fields))
}

# Access control simulation: role -> data types the role may read
check_data_access <- function(user_role, data_type) {
  permissions <- list(
    "analyst" = c("performance", "public_stats"),
    "medical_staff" = c("performance", "public_stats", "health_data"),
    "management" = c("performance", "public_stats", "health_data", "contracts"),
    "data_scientist" = c("performance", "public_stats", "anonymized_tracking")
  )
  allowed <- permissions[[user_role]]
  if (is.null(allowed)) {
    # Unknown roles get no access
    return(FALSE)
  }
  return(data_type %in% allowed)
}

# Example access check
cat("\nAccess Control Examples:\n")
cat("Analyst accessing health_data:", check_data_access("analyst", "health_data"), "\n")
cat("Medical staff accessing health_data:", check_data_access("medical_staff", "health_data"), "\n")
Algorithmic Fairness in Player Evaluation
Machine learning models used in player evaluation can perpetuate or amplify biases if not carefully designed. Understanding and mitigating algorithmic bias is crucial for fair treatment of players.
Sources of Bias in Football Analytics
- Data Coverage Bias: Some leagues/players have more data than others
- Historical Bias: Past decisions may reflect discriminatory patterns
- Measurement Bias: Metrics may favor certain playing styles
- Selection Bias: Who gets tracked and analyzed
- Representation Bias: Training data may not represent all contexts
# Detecting and Mitigating Bias in Player Models
import pandas as pd
import numpy as np
from typing import Dict, List
from dataclasses import dataclass
# Sample evaluation data: model predictions vs. realized performance, with
# region and league tier as potential sources of bias.
player_evaluations = pd.DataFrame({
    "player_id": range(1, 11),
    "nationality_region": ["Europe", "Europe", "South America", "Africa", "Europe",
                           "South America", "Africa", "Asia", "Europe", "Africa"],
    "league_tier": [1, 1, 2, 2, 1, 1, 3, 2, 2, 1],
    "predicted_value": [85, 78, 72, 68, 80, 75, 62, 65, 70, 72],
    "actual_performance": [82, 80, 85, 78, 75, 82, 72, 70, 68, 80],
    "was_signed": [True, True, False, False, True, True, False, False, True, False]
})
class BiasAnalyzer:
    """Detect potential bias in player-evaluation predictions."""

    def __init__(self, data: pd.DataFrame):
        self.data = data

    def group_analysis(self, protected_attribute: str) -> pd.DataFrame:
        """Summarise count, mean prediction, mean outcome, and signing rate
        per value of *protected_attribute*."""
        summary = (
            self.data
            .groupby(protected_attribute)
            .agg({
                "player_id": "count",
                "predicted_value": "mean",
                "actual_performance": "mean",
                "was_signed": "mean"
            })
            .reset_index()
        )
        return summary.rename(columns={
            "player_id": "n",
            "predicted_value": "avg_predicted",
            "actual_performance": "avg_actual",
            "was_signed": "signing_rate"
        })

    def calculate_fairness_metrics(self, protected_attribute: str,
                                   prediction: str, outcome: str) -> pd.DataFrame:
        """Per-group selection rate, joint TP/FP rates (relative to the
        overall medians), mean prediction error, and disparate-impact ratio.

        NOTE: tpr/fpr here are joint rates over the whole group (not
        conditioned on actual positives/negatives), mirroring the companion
        R implementation.
        """
        pred_cutoff = self.data[prediction].median()
        outcome_cutoff = self.data[outcome].median()

        rows = []
        for grp in self.data[protected_attribute].unique():
            subset = self.data[self.data[protected_attribute] == grp]
            selected = subset[prediction] > pred_cutoff
            positive = subset[outcome] > outcome_cutoff
            rows.append({
                "group": grp,
                "n": len(subset),
                "selection_rate": selected.mean(),
                "tpr": (selected & positive).mean(),
                "fpr": (selected & ~positive).mean(),
                "avg_error": (subset[outcome] - subset[prediction]).mean()
            })

        metrics_df = pd.DataFrame(rows)
        # Disparate impact: each group's selection rate vs. the best-off
        # group; the "80% rule" flags ratios below 0.8.
        metrics_df["disparate_impact"] = (
            metrics_df["selection_rate"] / metrics_df["selection_rate"].max()
        )
        return metrics_df

    def check_disparate_impact(self, metrics_df: pd.DataFrame,
                               threshold: float = 0.8) -> Dict:
        """Report groups whose disparate-impact ratio falls below *threshold*."""
        flagged = metrics_df[metrics_df["disparate_impact"] < threshold]
        return {
            "has_violation": not flagged.empty,
            "affected_groups": flagged["group"].tolist(),
            "details": flagged[["group", "disparate_impact"]].to_dict("records")
        }
class BiasMitigator:
    """Pre- and post-processing strategies to mitigate model bias."""

    @staticmethod
    def reweighting(data: pd.DataFrame, protected_attribute: str,
                    outcome: str) -> pd.DataFrame:
        """Attach a 'sample_weight' column that up-weights under-represented
        groups so each group contributes equally in expectation.

        *outcome* is accepted for interface compatibility; the weights
        depend only on group sizes.
        """
        weighted = data.copy()
        n_rows = len(weighted)
        n_groups = weighted[protected_attribute].nunique()
        target_size = n_rows / n_groups  # size of each group if balanced
        for grp, count in weighted[protected_attribute].value_counts().items():
            mask = weighted[protected_attribute] == grp
            weighted.loc[mask, "sample_weight"] = target_size / count
        return weighted

    @staticmethod
    def threshold_adjustment(predictions: pd.Series,
                             groups: pd.Series,
                             target_rate: float) -> pd.Series:
        """Binarize *predictions* per group, choosing each group's threshold
        so that roughly *target_rate* of its members are selected (1.0)."""
        decisions = predictions.copy()
        for grp in groups.unique():
            mask = groups == grp
            grp_scores = predictions[mask]
            # Percentile cutoff that leaves ~target_rate of scores above it
            cutoff = np.percentile(grp_scores, (1 - target_rate) * 100)
            decisions[mask] = (grp_scores >= cutoff).astype(float)
        return decisions
# Run the bias analysis on the sample evaluations.
analyzer = BiasAnalyzer(player_evaluations)
print("Bias Analysis by Region:")
print(analyzer.group_analysis("nationality_region").to_string(index=False))
fairness_metrics = analyzer.calculate_fairness_metrics(
    "nationality_region", "predicted_value", "actual_performance"
)
print("\nFairness Metrics:")
print(fairness_metrics.to_string(index=False))
# Flag groups failing the 80% disparate-impact rule.
violations = analyzer.check_disparate_impact(fairness_metrics)
if violations["has_violation"]:
    print(f"\nWARNING: Disparate impact detected!")
    print(f"Affected groups: {violations['affected_groups']}")
# Detecting and Mitigating Bias in Player Models
library(tidyverse)
library(fairness)
# Sample player evaluation data
# Sample player evaluation data: predictions vs. realized performance
player_evaluations <- tribble(
  ~player_id, ~nationality_region, ~league_tier, ~predicted_value,
  ~actual_performance, ~was_signed,
  1, "Europe", 1, 85, 82, TRUE,
  2, "Europe", 1, 78, 80, TRUE,
  3, "South America", 2, 72, 85, FALSE,
  4, "Africa", 2, 68, 78, FALSE,
  5, "Europe", 1, 80, 75, TRUE,
  6, "South America", 1, 75, 82, TRUE,
  7, "Africa", 3, 62, 72, FALSE,
  8, "Asia", 2, 65, 70, FALSE,
  9, "Europe", 2, 70, 68, TRUE,
  10, "Africa", 1, 72, 80, FALSE
)

# Analyze prediction accuracy by group
bias_analysis <- player_evaluations %>%
  group_by(nationality_region) %>%
  summarise(
    n = n(),
    avg_predicted = mean(predicted_value),
    avg_actual = mean(actual_performance),
    # positive error = model undervalues the group on average
    prediction_error = mean(actual_performance - predicted_value),
    signing_rate = mean(was_signed),
    .groups = "drop"
  )
print("Bias Analysis by Region:")
print(bias_analysis)

# Check for systematic under/overvaluation (threshold: 5 rating points)
undervaluation_test <- player_evaluations %>%
  mutate(
    undervalued = actual_performance - predicted_value > 5,
    overvalued = predicted_value - actual_performance > 5
  ) %>%
  group_by(nationality_region) %>%
  summarise(
    undervalued_rate = mean(undervalued) * 100,
    overvalued_rate = mean(overvalued) * 100,
    .groups = "drop"
  )
print("\nUnder/Overvaluation by Region:")
print(undervaluation_test)

# Fairness metrics per protected group.
# NOTE(review): tpr/fpr below are joint rates over each group (selected AND
# above/below the overall outcome median), not rates conditioned on actual
# positives/negatives; they mirror the companion Python implementation.
calculate_fairness_metrics <- function(data, protected_attribute,
                                       outcome, prediction) {
  groups <- unique(data[[protected_attribute]])
  metrics <- map_dfr(groups, function(g) {
    group_data <- data %>% filter(.data[[protected_attribute]] == g)
    # Calculate metrics relative to the overall medians
    tpr <- mean(group_data[[prediction]] > median(data[[prediction]]) &
                  group_data[[outcome]] > median(data[[outcome]]))
    fpr <- mean(group_data[[prediction]] > median(data[[prediction]]) &
                  group_data[[outcome]] <= median(data[[outcome]]))
    selection_rate <- mean(group_data[[prediction]] > median(data[[prediction]]))
    tibble(
      group = g,
      n = nrow(group_data),
      true_positive_rate = tpr,
      false_positive_rate = fpr,
      selection_rate = selection_rate
    )
  })
  # Disparate impact: each group's selection rate vs. the best-off group
  max_selection <- max(metrics$selection_rate)
  metrics <- metrics %>%
    mutate(
      disparate_impact = selection_rate / max_selection
    )
  return(metrics)
}

fairness_metrics <- calculate_fairness_metrics(
  player_evaluations,
  "nationality_region",
  "actual_performance",
  "predicted_value"
)
print("\nFairness Metrics:")
print(fairness_metrics)

# Flag if disparate impact < 0.8 (80% rule)
if (any(fairness_metrics$disparate_impact < 0.8)) {
  cat("\nWARNING: Potential disparate impact detected!")
  problematic <- fairness_metrics %>%
    filter(disparate_impact < 0.8) %>%
    pull(group)
  cat("\nAffected groups:", paste(problematic, collapse = ", "), "\n")
}
Transparency and Explainability
When analytics influence important decisions about players' careers and significant financial investments, stakeholders deserve to understand how conclusions were reached.
# Model Explainability for Player Valuations
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import shap
# Synthetic training set: features drawn at random, target 'value' built from
# a known linear combination plus noise (so importances are checkable).
np.random.seed(42)
n_samples = 100
training_data = pd.DataFrame({
    "goals": np.random.uniform(0, 30, n_samples),
    "assists": np.random.uniform(0, 20, n_samples),
    "xG": np.random.uniform(0, 25, n_samples),
    "age": np.random.randint(18, 36, n_samples),
    "minutes": np.random.uniform(500, 3000, n_samples),
    "league_quality": np.random.randint(1, 6, n_samples)
})
training_data["value"] = (
    2 * training_data["goals"] +
    1.5 * training_data["assists"] +
    1.8 * training_data["xG"] +
    (30 - training_data["age"]) * 0.5 +  # younger players worth more
    training_data["minutes"] / 1000 +
    training_data["league_quality"] * 2 +
    np.random.normal(0, 3, n_samples)    # irreducible noise
)
# Train model
features = ["goals", "assists", "xG", "age", "minutes", "league_quality"]
X = training_data[features]
y = training_data["value"]
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)
class ModelExplainer:
    """Provide global and per-prediction explanations for a tree-based
    model using SHAP values."""

    def __init__(self, model, X_train, feature_names):
        # model is assumed to be a fitted tree ensemble compatible with
        # shap.TreeExplainer — TODO confirm at call sites
        self.model = model
        self.feature_names = feature_names
        self.explainer = shap.TreeExplainer(model)
        self.X_train = X_train  # kept for reference; not used below

    def global_importance(self) -> pd.DataFrame:
        """Return features ranked by the model's built-in importances."""
        importance = pd.DataFrame({
            "feature": self.feature_names,
            "importance": self.model.feature_importances_
        }).sort_values("importance", ascending=False)
        return importance

    def explain_prediction(self, player_data: pd.DataFrame) -> dict:
        """Explain a single prediction (*player_data*: one-row DataFrame).

        Returns the prediction, the explainer's base value, and per-feature
        SHAP contributions sorted by absolute magnitude.
        """
        prediction = self.model.predict(player_data)[0]
        shap_values = self.explainer.shap_values(player_data)
        contributions = pd.DataFrame({
            "feature": self.feature_names,
            "value": player_data.values[0],
            "contribution": shap_values[0]
        }).sort_values("contribution", key=abs, ascending=False)
        return {
            "predicted_value": prediction,
            "base_value": self.explainer.expected_value,
            "contributions": contributions
        }

    def generate_report(self, player_data: pd.DataFrame) -> str:
        """Render a human-readable valuation report for one player."""
        explanation = self.explain_prediction(player_data)
        report = f"""
=== PLAYER VALUATION REPORT ===
Predicted Value: {explanation['predicted_value']:.2f}M
Key Contributing Factors:
"""
        # Top five drivers, with a "+" prefix for positive contributions
        for _, row in explanation["contributions"].head(5).iterrows():
            direction = "+" if row["contribution"] > 0 else ""
            report += f" - {row['feature']}: {row['value']:.1f} ({direction}{row['contribution']:.2f}M)\n"
        # NOTE(review): the figures below are hard-coded; keep them in sync
        # with the actual training setup.
        report += """
Model Information:
- Algorithm: Random Forest (100 trees)
- Training samples: 100 players
- R-squared: ~0.85
Limitations:
- Limited to players with >500 minutes
- Does not account for injury history
- League quality assessment may vary
"""
        return report
# Usage: build the explainer, print global importances, then a one-player report.
explainer = ModelExplainer(model, X, features)
print("Global Feature Importance:")
print(explainer.global_importance().to_string(index=False))
# Explain new player
new_player = pd.DataFrame({
    "goals": [20], "assists": [10], "xG": [18],
    "age": [25], "minutes": [2500], "league_quality": [1]
})
print(explainer.generate_report(new_player))
# Model Explainability for Player Valuations
library(tidyverse)
library(DALEX)
library(randomForest)
# Sample training data
set.seed(42)
training_data <- tibble(
goals = runif(100, 0, 30),
assists = runif(100, 0, 20),
xG = runif(100, 0, 25),
age = sample(18:35, 100, replace = TRUE),
minutes = runif(100, 500, 3000),
league_quality = sample(1:5, 100, replace = TRUE)
) %>%
mutate(
value = 2 * goals + 1.5 * assists + 1.8 * xG +
(30 - age) * 0.5 + minutes/1000 +
league_quality * 2 + rnorm(100, 0, 3)
)
# Train model
model <- randomForest(value ~ ., data = training_data, ntree = 100)
# Create explainer
explainer <- explain(
model,
data = training_data %>% select(-value),
y = training_data$value,
label = "Player Valuation Model"
)
# Global feature importance
importance_plot <- model_parts(explainer)
print("Global Feature Importance:")
print(importance_plot)
# Individual prediction explanation
new_player <- tibble(
goals = 20,
assists = 10,
xG = 18,
age = 25,
minutes = 2500,
league_quality = 1
)
# Break down prediction
breakdown <- predict_parts(explainer, new_data = new_player)
print("\nPrediction Breakdown for New Player:")
print(breakdown)
# Generate explanation report
create_explanation_report <- function(player_data, model, explainer) {
prediction <- predict(model, player_data)
breakdown <- predict_parts(explainer, new_data = player_data)
report <- list(
predicted_value = prediction,
key_factors = breakdown %>%
filter(variable != "intercept") %>%
arrange(desc(abs(contribution))) %>%
head(5) %>%
select(variable, contribution),
confidence = "Model R-squared: 0.85",
methodology = "Random Forest with 100 trees, trained on 100 players",
limitations = c(
"Limited to players with >500 minutes",
"Does not account for injury history",
"League quality assessment may vary"
)
)
return(report)
}
# Example report
report <- create_explanation_report(new_player, model, explainer)
cat("\n=== PLAYER VALUATION REPORT ===\n")
cat("Predicted Value:", round(report$predicted_value, 2), "M\n")
cat("\nKey Contributing Factors:\n")
print(report$key_factors)
cat("\nLimitations:", paste(report$limitations, collapse = "\n- "), "\n")
Responsible Analytics Practices
Beyond specific technical measures, responsible football analytics requires a culture and processes that prioritize ethical considerations throughout the analytics lifecycle.
Ethics Checklist for Football Analytics Projects
Data Collection
- Data collected with appropriate consent
- Minimized to necessary information
- Storage compliant with regulations
- Access controls implemented
Model Development
- Training data checked for bias
- Fairness metrics evaluated
- Model limitations documented
- Explainability features included
Deployment
- Human oversight maintained
- Appeals process available
- Monitoring for drift and bias
- Regular audits scheduled
Communication
- Uncertainty clearly communicated
- Limitations disclosed to users
- Non-technical explanations available
- Stakeholder feedback collected
Case Studies in Ethics
Scenario: A club develops a model predicting player injury risk. The model suggests a player has high injury probability.
Ethical Considerations:
- Should this affect contract negotiations?
- Should the player be informed of their risk score?
- Could this create self-fulfilling prophecies or discrimination?
- What's the model's false positive rate?
Best Practice: Use injury risk models for workload management, not contract decisions. Share results with medical staff, not management. Document model limitations clearly.
Scenario: An academy uses analytics to predict which youth players will succeed, influencing who receives continued development.
Ethical Considerations:
- Youth development is highly uncertain - are we ending careers prematurely?
- Do models disadvantage late developers or certain playing styles?
- What psychological impact does evaluation have on young players?
- Are we measuring talent or current privilege (coaching, facilities)?
Best Practice: Use analytics as one input among many. Provide development pathways for all players. Regularly reassess predictions. Consider developmental trajectory, not just current level.
Scenario: A public analytics account posts that a player has poor underlying metrics despite good traditional stats.
Ethical Considerations:
- Could this affect the player's market value or fan perception?
- Do we have obligation to consider impact on subjects of analysis?
- Is context being adequately provided?
- What's the quality and reliability of the underlying data?
Best Practice: Provide full context and limitations. Avoid sensationalism. Consider whether analysis serves public interest. Acknowledge uncertainty in conclusions.
Regulatory Landscape
Football analytics operates within an evolving regulatory framework that analysts must understand.
| Regulation | Scope | Key Requirements | Relevance to Football |
|---|---|---|---|
| GDPR | EU/UK | Consent, data minimization, right to erasure, portability | Player data, tracking data, youth players |
| EU AI Act | EU | Risk-based AI regulation, transparency requirements | Player evaluation systems, automated decisions |
| FIFA Regulations | Global | Player rights, transfer system, third-party ownership | Data sharing in transfers, agent regulations |
| Employment Law | Various | Discrimination protection, workplace monitoring | Analytics-based decisions about players |
Player Consent and Data Rights
Players have fundamental rights over their personal data. Understanding and respecting these rights is essential for ethical analytics practice.
# Python: Player Consent Management System
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional
from enum import Enum
import hashlib
import json
class ConsentType(Enum):
    """Categories of data use a player can consent to individually."""
    TRACKING_DATA = "tracking_data"
    BIOMETRIC_DATA = "biometric_data"
    VIDEO_ANALYSIS = "video_analysis"
    THIRD_PARTY_SHARING = "third_party_sharing"
    PUBLIC_STATISTICS = "public_statistics"
class DataSensitivity(Enum):
    """Sensitivity grading used by ConsentTypeInfo (low -> restricted).

    NOTE: this section re-declares DataSensitivity with different members
    than the earlier privacy example; the two samples are self-contained.
    """
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    RESTRICTED = "restricted"
@dataclass
class ConsentTypeInfo:
    """Human-readable metadata describing one ConsentType."""
    name: str                      # display name of the consent category
    description: str               # what data collection it covers
    required_for: List[str]        # workflows that need this consent
    sensitivity: DataSensitivity   # how sensitive the underlying data is
# Registry of metadata for every ConsentType (used for consent UIs and audits).
CONSENT_TYPES: Dict[ConsentType, ConsentTypeInfo] = {
    ConsentType.TRACKING_DATA: ConsentTypeInfo(
        name="GPS/Tracking Data Collection",
        description="Collection of location and movement data",
        required_for=["performance_analysis", "workload_management"],
        sensitivity=DataSensitivity.HIGH
    ),
    ConsentType.BIOMETRIC_DATA: ConsentTypeInfo(
        name="Biometric Data Collection",
        description="Heart rate, body composition, physiological metrics",
        required_for=["fitness_monitoring", "injury_prevention"],
        sensitivity=DataSensitivity.RESTRICTED
    ),
    ConsentType.VIDEO_ANALYSIS: ConsentTypeInfo(
        name="Video Analysis",
        description="Analysis of match and training footage",
        required_for=["tactical_analysis", "player_development"],
        sensitivity=DataSensitivity.MEDIUM
    ),
    ConsentType.THIRD_PARTY_SHARING: ConsentTypeInfo(
        name="Third Party Data Sharing",
        description="Sharing anonymized data with analytics partners",
        required_for=["benchmarking", "research"],
        sensitivity=DataSensitivity.HIGH
    ),
    ConsentType.PUBLIC_STATISTICS: ConsentTypeInfo(
        name="Public Statistics Release",
        description="Release of performance statistics to media",
        required_for=["media_relations", "fan_engagement"],
        sensitivity=DataSensitivity.LOW
    )
}
@dataclass
class ConsentRecord:
    """One recorded consent decision by a player."""
    consent_id: str               # unique id: "CNS_<timestamp>_<suffix>"
    player_id: str
    consent_type: ConsentType
    granted: bool                 # True = consent given, False = refused/withdrawn
    timestamp: datetime           # when the decision was recorded
    expires_at: Optional[datetime] = None   # None = does not expire
    purpose: Optional[str] = None
    consent_version: str = "v2.1"           # version of the consent wording

    @classmethod
    def create(cls, player_id: str, consent_type: ConsentType,
               granted: bool, expires_at: Optional[datetime] = None,
               purpose: Optional[str] = None) -> "ConsentRecord":
        """Build a record stamped with the current time and a fresh id.

        Bug fix: the id suffix previously used the built-in hash(), which is
        randomly salted per process for strings, so ids were not reproducible
        across runs; a SHA-256-derived suffix is deterministic.
        """
        suffix = int(hashlib.sha256(str(player_id).encode()).hexdigest(), 16) % 10000
        consent_id = f"CNS_{datetime.now().strftime('%Y%m%d%H%M%S')}_{suffix}"
        return cls(
            consent_id=consent_id,
            player_id=player_id,
            consent_type=consent_type,
            granted=granted,
            timestamp=datetime.now(),
            expires_at=expires_at,
            purpose=purpose
        )
class ConsentManager:
    """In-memory, append-only registry of player consent decisions."""

    def __init__(self):
        # Full history; the latest record per (player, type) is authoritative.
        self.consents: List[ConsentRecord] = []

    def record_consent(self, player_id: str, consent_type: ConsentType,
                       granted: bool, expires_in_days: Optional[int] = None,
                       purpose: Optional[str] = None) -> ConsentRecord:
        """Append a consent decision, optionally expiring after N days.

        Bug fix: the expiry check now uses `is not None` so an explicit
        0-day expiry is honoured instead of being treated as "never expires".
        """
        expires_at = None
        if expires_in_days is not None:
            expires_at = datetime.now() + timedelta(days=expires_in_days)
        record = ConsentRecord.create(player_id, consent_type, granted,
                                      expires_at, purpose)
        self.consents.append(record)
        return record

    def check_consent(self, player_id: str, consent_type: ConsentType) -> Dict:
        """Return the effective consent status for (player, type).

        The most recent record wins; an expired grant counts as denied.
        """
        relevant = [c for c in self.consents
                    if c.player_id == player_id and c.consent_type == consent_type]
        if not relevant:
            return {"granted": False, "reason": "No consent recorded"}
        latest = max(relevant, key=lambda x: x.timestamp)
        # Check if expired
        if latest.expires_at and latest.expires_at < datetime.now():
            return {"granted": False, "reason": "Consent expired"}
        return {
            "granted": latest.granted,
            "recorded_at": latest.timestamp,
            "expires": latest.expires_at
        }

    def withdraw_consent(self, player_id: str, consent_type: ConsentType) -> ConsentRecord:
        """Record a withdrawal (a new granted=False entry; history is kept)."""
        return self.record_consent(player_id, consent_type, False,
                                   purpose="Player withdrawal request")

    def get_player_consents(self, player_id: str) -> Dict[str, Dict]:
        """Map each consent type's string value to its current status.

        (Annotation fixed: keys are ConsentType.value strings, matching the
        actual returned dict.)
        """
        results = {}
        for consent_type in ConsentType:
            results[consent_type.value] = self.check_consent(player_id, consent_type)
        return results

    def export_for_gdpr(self, player_id: str) -> Dict:
        """Serialize the player's full consent history (GDPR data export)."""
        player_records = [c for c in self.consents if c.player_id == player_id]
        return {
            "player_id": player_id,
            "export_date": datetime.now().isoformat(),
            "consent_history": [
                {
                    "consent_type": r.consent_type.value,
                    "granted": r.granted,
                    "timestamp": r.timestamp.isoformat(),
                    "expires_at": r.expires_at.isoformat() if r.expires_at else None
                }
                for r in player_records
            ]
        }
# Usage example: record, check, and enumerate consents for one player.
cm = ConsentManager()
# Record consents (tracking consent expires after a year)
cm.record_consent("player_001", ConsentType.TRACKING_DATA, True, expires_in_days=365)
cm.record_consent("player_001", ConsentType.BIOMETRIC_DATA, True)
cm.record_consent("player_001", ConsentType.THIRD_PARTY_SHARING, False)
# Check consent before accessing data
result = cm.check_consent("player_001", ConsentType.TRACKING_DATA)
if result["granted"]:
    print("Tracking data access permitted")
else:
    print(f"Tracking data access DENIED: {result['reason']}")
# Get all consents for a player
print("\nPlayer 001 Consent Status:")
for consent_type, status in cm.get_player_consents("player_001").items():
print(f" {consent_type}: {'Granted' if status.get('granted') else 'Denied'}")

# R: Player Consent Management System
library(tidyverse)
library(jsonlite)
library(lubridate)
# Consent record structure
# Build one consent record as a single-row tibble
create_consent_record <- function(player_id, consent_type, granted,
                                  expires = NULL, purpose = NULL) {
  tibble(
    # NOTE(review): second-resolution timestamp + random suffix; collisions
    # are unlikely but not impossible under concurrent use
    consent_id = paste0("CNS_", format(Sys.time(), "%Y%m%d%H%M%S"),
                        "_", sample(1000:9999, 1)),
    player_id = player_id,
    consent_type = consent_type,
    granted = granted,
    timestamp = Sys.time(),
    expires_at = expires,
    purpose = purpose,
    ip_address = "recorded_separately", # For audit
    consent_version = "v2.1"
  )
}

# Consent types in football analytics (display metadata per category)
CONSENT_TYPES <- list(
  tracking_data = list(
    name = "GPS/Tracking Data Collection",
    description = "Collection of location and movement data during training and matches",
    required_for = c("performance_analysis", "workload_management"),
    sensitivity = "high"
  ),
  biometric_data = list(
    name = "Biometric Data Collection",
    description = "Heart rate, body composition, and physiological metrics",
    required_for = c("fitness_monitoring", "injury_prevention"),
    sensitivity = "restricted"
  ),
  video_analysis = list(
    name = "Video Analysis",
    description = "Analysis of match and training footage for tactical purposes",
    required_for = c("tactical_analysis", "player_development"),
    sensitivity = "medium"
  ),
  third_party_sharing = list(
    name = "Third Party Data Sharing",
    description = "Sharing anonymized data with analytics partners",
    required_for = c("benchmarking", "research"),
    sensitivity = "high"
  ),
  public_statistics = list(
    name = "Public Statistics Release",
    description = "Release of aggregated performance statistics to media",
    required_for = c("media_relations", "fan_engagement"),
    sensitivity = "low"
  )
)

# Closure-based "class": returns a list of functions sharing mutable state
class_ConsentManager <- function() {
  consents <- tibble()  # append-only consent log, mutated via <<-

  record_consent <- function(player_id, consent_type, granted,
                             expires = NULL, purpose = NULL) {
    new_record <- create_consent_record(player_id, consent_type, granted,
                                        expires, purpose)
    consents <<- bind_rows(consents, new_record)
    return(new_record)
  }

  check_consent <- function(player_id, consent_type) {
    # NOTE(review): expired records are filtered out *before* picking the
    # latest, so an expired grant silently falls back to an older record —
    # the companion Python version instead reports "Consent expired".
    # Confirm which behaviour is intended.
    current <- consents %>%
      filter(player_id == !!player_id,
             consent_type == !!consent_type) %>%
      filter(is.na(expires_at) | expires_at > Sys.time()) %>%
      arrange(desc(timestamp)) %>%
      slice(1)
    if (nrow(current) == 0) {
      return(list(granted = FALSE, reason = "No consent recorded"))
    }
    return(list(
      granted = current$granted,
      recorded_at = current$timestamp,
      expires = current$expires_at
    ))
  }

  withdraw_consent <- function(player_id, consent_type) {
    # Withdrawal is recorded as a new granted = FALSE entry (history kept)
    record_consent(player_id, consent_type, FALSE,
                   purpose = "Player withdrawal request")
  }

  get_player_consents <- function(player_id) {
    # Latest record per consent type for this player
    consents %>%
      filter(player_id == !!player_id) %>%
      group_by(consent_type) %>%
      filter(timestamp == max(timestamp)) %>%
      ungroup() %>%
      select(consent_type, granted, timestamp, expires_at)
  }

  # Public interface of the manager
  list(
    record = record_consent,
    check = check_consent,
    withdraw = withdraw_consent,
    get_all = get_player_consents
  )
}

# Usage example
cm <- class_ConsentManager()

# Record consents (tracking consent expires after a year)
cm$record("player_001", "tracking_data", TRUE,
          expires = Sys.time() + days(365))
cm$record("player_001", "biometric_data", TRUE)
cm$record("player_001", "third_party_sharing", FALSE)

# Check consent before accessing data
if (cm$check("player_001", "tracking_data")$granted) {
  cat("Tracking data access permitted\n")
} else {
  cat("Tracking data access DENIED\n")
}

# Get all consents for a player
cat("\nPlayer 001 Consent Status:\n")
print(cm$get_all("player_001"))

Right to Access
- View all data held about them
- Understand how data is used
- Know who has accessed their data
- Receive data in portable format
Right to Control
- Grant or withhold consent
- Withdraw consent at any time
- Request data correction
- Limit data processing scope
Right to Erasure
- Request data deletion
- Be "forgotten" (with limits)
- Remove from third-party systems
- Exceptions for legal requirements
Data Governance Frameworks
Effective data governance ensures that football analytics data is managed responsibly throughout its lifecycle, from collection to deletion.
# Python: Data Governance Framework
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Literal
from enum import Enum
import json
class DataClassification(Enum):
    """Data sensitivity tiers; each maps to a ClassificationPolicy."""
    PUBLIC = "public"                # published stats, media releases
    INTERNAL = "internal"            # staff-only analysis material
    CONFIDENTIAL = "confidential"    # contracts, medical records
    RESTRICTED = "restricted"        # biometric / psychological data
@dataclass
class ClassificationPolicy:
    """Handling rules attached to one DataClassification tier."""
    classification: DataClassification
    description: str        # examples of data belonging to this tier
    retention_period: str   # human-readable rule, parsed by retention_check
    access_level: str       # role group label used by check_access
    encryption_required: bool
    audit_required: bool
# Policy table: one handling policy per classification tier.
# Built from keyword-argument constructors so each field is self-describing.
CLASSIFICATION_POLICIES: Dict[DataClassification, ClassificationPolicy] = {
    policy.classification: policy
    for policy in (
        ClassificationPolicy(
            classification=DataClassification.PUBLIC,
            description="Published statistics, media releases",
            retention_period="Indefinite",
            access_level="All",
            encryption_required=False,
            audit_required=False,
        ),
        ClassificationPolicy(
            classification=DataClassification.INTERNAL,
            description="Team performance analysis, scouting notes",
            retention_period="5 years",
            access_level="Staff",
            encryption_required=False,
            audit_required=True,
        ),
        ClassificationPolicy(
            classification=DataClassification.CONFIDENTIAL,
            description="Contract details, medical records",
            retention_period="7 years post-employment",
            access_level="Restricted",
            encryption_required=True,
            audit_required=True,
        ),
        ClassificationPolicy(
            classification=DataClassification.RESTRICTED,
            description="Biometric data, psychological assessments",
            retention_period="Contract + 2 years",
            access_level="Named individuals",
            encryption_required=True,
            audit_required=True,
        ),
    )
}
@dataclass
class AccessLogEntry:
    """Single audit-trail entry describing one access to a data asset."""
    timestamp: datetime
    user: str
    action: str                        # e.g. "read"
    purpose: str                       # business justification for the access
    ip_address: Optional[str] = None   # optional network origin
@dataclass
class DataLineage:
    """Track data origin and transformations.

    One instance per registered data asset: records where the data came
    from, every transformation applied to it, and every access made to it.
    """
    data_id: str
    created_at: datetime
    source_type: Literal["raw", "derived", "external"]
    source_name: str
    source_timestamp: datetime
    classification: DataClassification
    # Append-only histories; default_factory avoids shared mutable defaults.
    transformations: List[Dict] = field(default_factory=list)
    access_log: List[AccessLogEntry] = field(default_factory=list)

    def log_access(self, user: str, action: str, purpose: str,
                   ip_address: Optional[str] = None) -> None:
        """Append one timestamped access entry to the asset's audit trail."""
        entry = AccessLogEntry(
            timestamp=datetime.now(),
            user=user,
            action=action,
            purpose=purpose,
            ip_address=ip_address
        )
        self.access_log.append(entry)

    def add_transformation(self, name: str, description: str,
                           performed_by: str) -> None:
        """Record a transformation applied to the data (ISO-format timestamp)."""
        self.transformations.append({
            "name": name,
            "description": description,
            "performed_by": performed_by,
            "timestamp": datetime.now().isoformat()
        })

    def export_for_audit(self) -> Dict:
        """Export lineage for audit purposes.

        Returns a JSON-serializable dict (datetimes rendered via isoformat).
        Note: access-log entries are exported without their ip_address field.
        """
        return {
            "data_id": self.data_id,
            "created_at": self.created_at.isoformat(),
            "source": {
                "type": self.source_type,
                "name": self.source_name,
                "timestamp": self.source_timestamp.isoformat()
            },
            "classification": self.classification.value,
            "transformations": self.transformations,
            "access_log": [
                {
                    "timestamp": entry.timestamp.isoformat(),
                    "user": entry.user,
                    "action": entry.action,
                    "purpose": entry.purpose
                }
                for entry in self.access_log
            ]
        }
class DataGovernanceFramework:
    """Central registry with access-control and retention checks for data assets."""

    def __init__(self):
        # data_id -> lineage record for every registered asset
        self.data_registry: Dict[str, DataLineage] = {}
        self.policies = CLASSIFICATION_POLICIES

    def register_data(self, data_id: str, source_type: str,
                      source_name: str, classification: DataClassification
                      ) -> DataLineage:
        """Create, store and return a lineage record for a new data asset."""
        record = DataLineage(
            data_id=data_id,
            created_at=datetime.now(),
            source_type=source_type,
            source_name=source_name,
            source_timestamp=datetime.now(),
            classification=classification
        )
        self.data_registry[data_id] = record
        return record

    def check_access(self, data_id: str, user_role: str) -> bool:
        """Return True when user_role may access the asset under its policy.

        Unknown assets and unknown access levels both deny access.
        """
        record = self.data_registry.get(data_id)
        if record is None:
            return False
        policy = self.policies[record.classification]
        # Simple role-based access: policy access-level label -> allowed roles
        role_access = {
            "All": ["admin", "staff", "analyst", "public"],
            "Staff": ["admin", "staff", "analyst"],
            "Restricted": ["admin", "medical", "management"],
            "Named individuals": ["admin", "named_list"]
        }
        return user_role in role_access.get(policy.access_level, [])

    def retention_check(self) -> List[Dict]:
        """Flag registered assets older than their policy's retention period."""
        alerts: List[Dict] = []
        for asset_id, record in self.data_registry.items():
            policy = self.policies[record.classification]
            if policy.retention_period == "Indefinite":
                continue
            # Simplified parsing of the human-readable retention text;
            # anything unrecognized falls back to a two-year limit.
            if "5 years" in policy.retention_period:
                limit = timedelta(days=5 * 365)
            elif "7 years" in policy.retention_period:
                limit = timedelta(days=7 * 365)
            else:
                limit = timedelta(days=2 * 365)
            age = datetime.now() - record.created_at
            if age > limit:
                alerts.append({
                    "data_id": asset_id,
                    "classification": record.classification.value,
                    "age_days": age.days,
                    "action_required": "Review for deletion or extension"
                })
        return alerts
# Usage example
dgf = DataGovernanceFramework()
# Register data asset (INTERNAL tier: 5-year retention, staff-level access)
tracking_data = dgf.register_data(
    data_id="TRK_20240115_001",
    source_type="raw",
    source_name="GPS Tracking System",
    classification=DataClassification.INTERNAL
)
# Log access
tracking_data.log_access(
    user="analyst_john",
    action="read",
    purpose="Weekly performance report"
)
# Check access ("analyst" is in the Staff role list)
can_access = dgf.check_access("TRK_20240115_001", "analyst")
print(f"Analyst can access tracking data: {can_access}")
# Export for audit (default=str covers any non-JSON-serializable values)
print("\nData Lineage Record:")
print(json.dumps(tracking_data.export_for_audit(), indent=2, default=str))
# R: Data Governance Framework
library(tidyverse)
library(yaml)
# Data classification and handling policies: one row per sensitivity
# tier, mirroring the Python CLASSIFICATION_POLICIES table above.
data_classification <- tribble(
  ~classification, ~description, ~retention_period, ~access_level,
  ~encryption_required, ~audit_required,
  "Public", "Published statistics, media releases", "Indefinite", "All", FALSE, FALSE,
  "Internal", "Team performance analysis, scouting notes", "5 years", "Staff", FALSE, TRUE,
  "Confidential", "Contract details, medical records", "7 years post-employment", "Restricted", TRUE, TRUE,
  "Restricted", "Biometric data, psychological assessments", "Contract + 2 years", "Named individuals", TRUE, TRUE
)
print("Data Classification Levels:")
print(data_classification)
# Data lineage tracking.
# Build a lineage record for a newly registered data asset. `source`
# must be a list with elements type ("raw"/"derived"/"external"),
# name and timestamp. The access log starts as an empty typed tibble.
create_data_lineage <- function(data_id, source, transformations = list()) {
  empty_log <- tibble(
    timestamp = as.POSIXct(character()),
    user = character(),
    action = character(),
    purpose = character()
  )
  list(
    data_id = data_id,
    created_at = Sys.time(),
    source = list(
      type = source$type,
      name = source$name,
      timestamp = source$timestamp
    ),
    transformations = transformations,
    current_classification = "Internal",
    access_log = empty_log
  )
}
# Log data access.
# Append one entry to a lineage record's access log and return the
# updated record (R copy semantics: the caller must reassign the result).
log_access <- function(lineage, user, action, purpose) {
  lineage$access_log <- lineage$access_log %>%
    add_row(
      timestamp = Sys.time(),
      user = user,
      action = action,
      purpose = purpose
    )
  lineage
}
# Data retention policy enforcement.
# Flag records that have exceeded their policy's retention period.
# FIX: previously only the "5 years" policy was ever evaluated, so
# "7 years post-employment" and "Contract + 2 years" records were never
# flagged - inconsistent with the Python retention_check above. All
# finite periods are now checked (still a simplified text match).
check_retention <- function(data_records, policies) {
  data_records %>%
    left_join(policies, by = "classification") %>%
    mutate(
      age_days = as.numeric(difftime(Sys.time(), created_at, units = "days")),
      retention_check = case_when(
        retention_period == "Indefinite" ~ "OK",
        retention_period == "5 years" & age_days > 365 * 5 ~ "REVIEW_NEEDED",
        retention_period == "7 years post-employment" &
          age_days > 365 * 7 ~ "REVIEW_NEEDED",
        retention_period == "Contract + 2 years" &
          age_days > 365 * 2 ~ "REVIEW_NEEDED",
        TRUE ~ "OK"
      )
    ) %>%
    filter(retention_check == "REVIEW_NEEDED")
}
# Example: Create and track data
# NOTE(review): days() comes from lubridate, which tidyverse attaches
# only from version 2.0 - confirm, or add library(lubridate).
player_tracking <- create_data_lineage(
  data_id = "TRK_20240115_001",
  source = list(
    type = "raw",
    name = "GPS Tracking System",
    timestamp = Sys.time() - days(1)   # data captured one day ago
  )
)
# Log access (reassign: log_access returns an updated copy)
player_tracking <- log_access(
  player_tracking,
  user = "analyst_john",
  action = "read",
  purpose = "Weekly performance report"
)
cat("\nData Lineage Record:\n")
cat("Data ID:", player_tracking$data_id, "\n")
cat("Source:", player_tracking$source$name, "\n")
cat("Access Log:\n")
print(player_tracking$access_log)Data Quality
- Accuracy validation
- Completeness checks
- Timeliness standards
- Consistency rules
Data Security
- Encryption standards
- Access controls
- Audit logging
- Incident response
Data Lifecycle
- Retention policies
- Archival procedures
- Deletion protocols
- Version control
Data Stewardship
- Ownership assignment
- Responsibility matrix
- Training programs
- Compliance monitoring
Ethical AI in Football Analytics
As AI becomes more prevalent in football decision-making, specific ethical guidelines for AI systems become essential to ensure fair and responsible use.
# Python: Ethical AI Monitoring System
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from datetime import datetime
import pandas as pd
import numpy as np
@dataclass
class ModelCard:
    """Standardized model documentation for ethical AI.

    Captures model details, training data, performance, ethical
    considerations and limitations in one structured record, following
    the "model cards" reporting practice.
    """
    # Model Details
    name: str
    version: str
    model_type: str
    intended_use: str
    out_of_scope_uses: List[str]
    training_date: str
    # Training Data
    data_source: str
    data_size: str
    date_range: str
    known_biases: List[str]
    # Performance
    primary_metric: str
    metric_value: str
    subgroup_performance: Dict[str, str]  # subgroup label -> metric summary
    # Ethical Considerations
    potential_harms: List[str]
    mitigation_strategies: List[str]
    human_oversight: str
    appeals_process: str
    # Limitations
    technical_limitations: List[str]
    data_limitations: List[str]
    context_limitations: List[str]

    def to_markdown(self) -> str:
        """Generate markdown documentation.

        chr(10) is a newline: f-string expressions cannot contain
        backslash escapes before Python 3.12, hence the chr(10).join idiom
        for rendering list fields as markdown bullets.
        """
        return f"""
# Model Card: {self.name}
## Model Details
- **Version**: {self.version}
- **Type**: {self.model_type}
- **Training Date**: {self.training_date}
## Intended Use
{self.intended_use}
### Out of Scope Uses
{chr(10).join(f"- {use}" for use in self.out_of_scope_uses)}
## Training Data
- **Source**: {self.data_source}
- **Size**: {self.data_size}
- **Date Range**: {self.date_range}
### Known Biases
{chr(10).join(f"- {bias}" for bias in self.known_biases)}
## Performance
- **Primary Metric**: {self.primary_metric}
- **Value**: {self.metric_value}
### Subgroup Performance
{chr(10).join(f"- {k}: {v}" for k, v in self.subgroup_performance.items())}
## Ethical Considerations
### Potential Harms
{chr(10).join(f"- {harm}" for harm in self.potential_harms)}
### Mitigation Strategies
{chr(10).join(f"- {strat}" for strat in self.mitigation_strategies)}
### Oversight
{self.human_oversight}
### Appeals
{self.appeals_process}
## Limitations
### Technical
{chr(10).join(f"- {lim}" for lim in self.technical_limitations)}
### Data
{chr(10).join(f"- {lim}" for lim in self.data_limitations)}
### Context
{chr(10).join(f"- {lim}" for lim in self.context_limitations)}
"""
class EthicalAIMonitor:
    """Monitor AI systems for ethical issues.

    Collects per-prediction records and derives drift, override-rate
    and fairness signals from them.
    """

    def __init__(self):
        self.prediction_log: List[Dict] = []   # one dict per logged prediction
        self.alerts: List[Dict] = []           # reserved for persisted alerts

    def log_prediction(self, prediction_id: str, model: str,
                       input_data: Dict, prediction: float,
                       confidence: float, actual: Optional[float] = None,
                       human_override: bool = False):
        """Log a prediction for monitoring.

        The absolute error is recorded when a ground-truth outcome is known.
        FIX: the previous `if actual` truthiness test treated a legitimate
        actual value of 0.0 as missing; `is not None` is the correct check.
        The input features are now also stored so fairness_audit can reach
        protected attributes (previously the parameter was discarded).
        """
        self.prediction_log.append({
            "prediction_id": prediction_id,
            "model": model,
            "timestamp": datetime.now(),
            "input_data": input_data,
            "prediction": prediction,
            "confidence": confidence,
            "actual": actual,
            "human_override": human_override,
            "error": abs(prediction - actual) if actual is not None else None
        })

    def analyze_drift(self, window_days: int = 30) -> Dict:
        """Summarize recent predictions and flag signs of model drift.

        Requires at least 10 logged predictions overall and 5 within the
        window; otherwise returns a status-only dict.
        """
        df = pd.DataFrame(self.prediction_log)
        if len(df) < 10:
            return {"status": "Insufficient data"}
        df["timestamp"] = pd.to_datetime(df["timestamp"])
        recent = df[df["timestamp"] > datetime.now() - pd.Timedelta(days=window_days)]
        if len(recent) < 5:
            return {"status": "Insufficient recent data"}
        return {
            "total_predictions": len(recent),
            "avg_confidence": recent["confidence"].mean(),
            "override_rate": recent["human_override"].mean(),
            "avg_error": recent["error"].mean() if recent["error"].notna().any() else None,
            "alerts": self._generate_alerts(recent)
        }

    def _generate_alerts(self, df: pd.DataFrame) -> List[str]:
        """Generate alert strings from a window of monitoring data."""
        alerts = []
        if df["human_override"].mean() > 0.3:
            alerts.append("HIGH: Override rate >30% - model may need retraining")
        if df["error"].notna().any() and df["error"].mean() > 10:
            alerts.append("HIGH: Average error drift detected")
        if df["confidence"].mean() < 0.5:
            alerts.append("MEDIUM: Low average confidence")
        return alerts

    def fairness_audit(self, protected_attribute: str) -> pd.DataFrame:
        """Audit predictions for fairness across groups.

        NOTE: placeholder implementation returning illustrative numbers.
        A real audit would group the logged input_data by the protected
        attribute and compute per-group rates and errors.
        """
        df = pd.DataFrame(self.prediction_log)
        # Simplified example output
        return pd.DataFrame({
            "group": ["Group A", "Group B"],
            "n_predictions": [50, 30],
            "avg_prediction": [25.0, 22.5],
            "avg_error": [4.2, 5.8],
            "disparate_impact": [1.0, 0.85]
        })
# Create example model card for the club's player-valuation model
player_value_model = ModelCard(
    name="Player Market Value Predictor",
    version="2.3.1",
    model_type="Random Forest Regression",
    intended_use="Supporting scouting decisions with value estimates",
    out_of_scope_uses=[
        "Final transfer fee determination",
        "Contract negotiation basis",
        "Automated buying decisions"
    ],
    training_date="2024-01-15",
    data_source="Historical transfers 2018-2023, 5 major leagues",
    data_size="12,500 transfers",
    date_range="2018-2023",
    known_biases=[
        "Under-representation of African leagues",
        "Potential inflation bias in recent years"
    ],
    primary_metric="MAE (Mean Absolute Error)",
    metric_value="4.2M EUR",
    subgroup_performance={
        "Premier League": "3.8M MAE",
        "La Liga": "4.1M MAE",
        "Bundesliga": "4.5M MAE"
    },
    potential_harms=[
        "Undervaluation of players from underrepresented regions",
        "Overconfidence in predictions leading to poor decisions"
    ],
    mitigation_strategies=[
        "Human review required for all major decisions",
        "Confidence intervals provided with all predictions"
    ],
    human_oversight="Scouting director approval for predictions > 20M",
    appeals_process="Players/agents can request model explanation",
    technical_limitations=["Cannot predict injury impact"],
    data_limitations=["Limited youth/academy data"],
    context_limitations=["Market conditions change rapidly"]
)
print("Model Card Generated Successfully")
print(f"Model: {player_value_model.name}")
print(f"Version: {player_value_model.version}")
# R: Ethical AI Monitoring System
library(tidyverse)
# Model card template for documentation.
# Repackages a flat model_info list into the five model-card sections;
# performs no validation of the supplied fields.
create_model_card <- function(model_info) {
  details <- list(
    name = model_info$name,
    version = model_info$version,
    type = model_info$type,
    intended_use = model_info$intended_use,
    out_of_scope_uses = model_info$out_of_scope,
    training_date = model_info$trained_on
  )
  data_section <- list(
    source = model_info$data_source,
    size = model_info$data_size,
    date_range = model_info$date_range,
    known_biases = model_info$known_biases
  )
  perf <- list(
    primary_metric = model_info$primary_metric,
    metric_value = model_info$metric_value,
    subgroup_performance = model_info$subgroup_perf
  )
  ethics <- list(
    potential_harms = model_info$potential_harms,
    mitigation_strategies = model_info$mitigations,
    human_oversight = model_info$oversight,
    appeals_process = model_info$appeals
  )
  limits <- list(
    technical = model_info$tech_limitations,
    data = model_info$data_limitations,
    use_context = model_info$context_limitations
  )
  list(
    model_details = details,
    training_data = data_section,
    performance = perf,
    ethical_considerations = ethics,
    limitations = limits
  )
}
# Example model card: flat input list regrouped by create_model_card()
# into model_details / training_data / performance / ethics / limitations.
player_value_model <- create_model_card(list(
  name = "Player Market Value Predictor",
  version = "2.3.1",
  type = "Random Forest Regression",
  intended_use = "Supporting scouting decisions with value estimates",
  out_of_scope = c(
    "Final transfer fee determination",
    "Contract negotiation basis",
    "Automated buying decisions"
  ),
  trained_on = "2024-01-15",
  data_source = "Historical transfers 2018-2023, 5 major leagues",
  data_size = "12,500 transfers",
  date_range = "2018-2023",
  known_biases = c(
    "Under-representation of African leagues",
    "Potential inflation bias in recent years",
    "Limited data for goalkeepers"
  ),
  primary_metric = "MAE (Mean Absolute Error)",
  metric_value = "4.2M EUR",
  subgroup_perf = list(
    Premier_League = "3.8M MAE",
    La_Liga = "4.1M MAE",
    Bundesliga = "4.5M MAE",
    Serie_A = "4.3M MAE",
    Ligue_1 = "4.8M MAE"
  ),
  potential_harms = c(
    "Undervaluation of players from underrepresented regions",
    "Overconfidence in predictions leading to poor decisions",
    "Creating self-fulfilling market dynamics"
  ),
  mitigations = c(
    "Human review required for all major decisions",
    "Confidence intervals provided with all predictions",
    "Quarterly bias audits conducted"
  ),
  oversight = "Scouting director approval required for predictions > 20M",
  appeals = "Players/agents can request model explanation",
  tech_limitations = c(
    "Cannot predict injury impact",
    "Does not account for contract length",
    "Limited to traditional performance metrics"
  ),
  data_limitations = c(
    "Limited youth/academy data",
    "Salary data incomplete for some leagues"
  ),
  context_limitations = c(
    "Market conditions change rapidly",
    "Club-specific factors not captured"
  )
))
# AI decision monitoring.
# Summarise a prediction log into monthly monitoring metrics and alerts.
# Expects columns: prediction_date, confidence_score, human_override,
# actual_value, predicted_value. floor_date() comes from lubridate.
monitor_ai_decisions <- function(predictions_log) {
  monthly <- predictions_log %>%
    mutate(prediction_month = floor_date(prediction_date, "month")) %>%
    group_by(prediction_month) %>%
    summarise(
      total_predictions = n(),
      avg_confidence = mean(confidence_score),
      human_overrides = sum(human_override),
      override_rate = human_overrides / total_predictions,
      avg_error = mean(abs(actual_value - predicted_value), na.rm = TRUE),
      .groups = "drop"
    )
  # Attach a severity-ordered alert label per month
  monthly %>%
    mutate(
      alert = case_when(
        override_rate > 0.3 ~ "HIGH: Many human overrides - model may need retraining",
        avg_error > 10 ~ "HIGH: Error drift detected",
        avg_confidence < 0.5 ~ "MEDIUM: Low confidence predictions",
        TRUE ~ "OK"
      )
    )
}
# Print a short summary of the generated model card
cat("Model Card Generated Successfully\n")
cat("Model:", player_value_model$model_details$name, "\n")
cat("Version:", player_value_model$model_details$version, "\n")
cat("Primary Metric:", player_value_model$performance$primary_metric,
"-", player_value_model$performance$metric_value, "\n")Principles for Ethical AI in Football
- Human-in-the-Loop: AI should support, not replace, human decision-making for significant player-affecting decisions
- Transparency: Document model capabilities, limitations, and potential biases clearly in model cards
- Fairness: Regularly audit models for bias across demographic groups and playing contexts
- Accountability: Maintain clear chains of responsibility for AI-influenced decisions
- Continuous Monitoring: Track model performance drift and trigger alerts when behavior changes
- Right to Explanation: Provide explanations for AI decisions to affected parties upon request
Practice Exercises
Exercise 49.1: Bias Audit
Take a player valuation or recruitment model and conduct a full bias audit. Analyze performance across different demographic groups and document any fairness concerns.
- Check for disparate impact using the 80% rule
- Analyze prediction errors by nationality/league
- Document data coverage gaps
Exercise 49.2: Privacy Impact Assessment
Conduct a privacy impact assessment for a tracking data analytics system. Identify risks and propose mitigations.
- Map all data flows and storage locations
- Identify sensitive data elements
- Propose technical and organizational controls
Exercise 49.3: Explainability Report
Create an explainability report for a player evaluation model that could be shared with non-technical stakeholders including agents and players.
- Use plain language, avoid jargon
- Include visual explanations
- Clearly state limitations and uncertainty
Exercise 49.4: Consent Management System
Build a consent management system for a football club that tracks player consent across different data types (tracking, biometric, video). Include consent expiration, withdrawal functionality, and GDPR export capabilities.
- Use versioned consent records to track history
- Implement consent expiration with automatic renewal prompts
- Build GDPR Article 15 compliant data export
- Create dashboard for data protection officer oversight
Exercise 49.5: Model Card Generator
Create a tool that automatically generates model cards from trained machine learning models. Include fairness metrics, performance breakdowns by subgroup, and limitation documentation.
- Extract model metadata automatically where possible
- Calculate fairness metrics (disparate impact, equalized odds)
- Generate markdown/PDF output for distribution
- Include versioning to track model changes over time
Exercise 49.6: Data Anonymization Pipeline
Implement a data anonymization pipeline that applies k-anonymity and l-diversity to player performance data while maintaining analytical utility. Test re-identification risk after anonymization.
- Identify quasi-identifiers (age, nationality, position combination)
- Apply generalization hierarchies for categorical variables
- Measure information loss vs. privacy gain trade-off
- Test with re-identification attacks to validate privacy
Exercise 49.7: Ethics Review Board Simulation
Simulate an ethics review board process for a new analytics project. Create documentation templates, evaluation criteria, and a decision framework for assessing whether projects should proceed, require modifications, or be rejected.
- Define stakeholder impact categories
- Create risk assessment matrices
- Include escalation procedures for high-risk projects
- Build ongoing monitoring requirements into approvals
Exercise 49.8: AI Fairness Dashboard
Build a monitoring dashboard that tracks AI model fairness over time. Include automated alerts when fairness metrics drift below thresholds, and provide actionable recommendations for remediation.
- Track multiple fairness metrics simultaneously
- Implement statistical significance testing for drift detection
- Create drill-down capabilities to investigate specific groups
- Generate automated remediation recommendations
Summary
Key Takeaways
- Privacy Matters: Football analytics involves sensitive personal data that must be handled with appropriate care and consent
- Watch for Bias: ML models can perpetuate or amplify existing biases if not carefully monitored and mitigated
- Be Transparent: Stakeholders deserve to understand how analytics conclusions are reached and what limitations exist
- Consider Impact: Analytics decisions can significantly affect careers and should be made responsibly
- Stay Compliant: Regulations like GDPR apply to football data and must be followed
- Consent is Fundamental: Players have rights over their data; consent must be informed, specific, and easily withdrawable
- Data Governance Required: Implement classification, access control, retention policies, and audit trails for all data
- AI Needs Oversight: Human-in-the-loop is essential for significant decisions; document models with model cards
Common Pitfalls
- Assuming Public Data is Fair Game: Even publicly available data may have restrictions on commercial use or aggregation
- Ignoring Consent Scope: Consent for one purpose doesn't cover all purposes; track consent granularly by use case
- Training on Biased History: Historical decisions may reflect past biases; don't blindly train models to replicate them
- Black Box Models: Using models you can't explain makes it impossible to identify or fix bias and erodes stakeholder trust
- One-Time Bias Audit: Bias can drift over time; continuous monitoring is required, not just initial checks
- Compliance vs. Ethics: Legal compliance is the floor, not the ceiling; ethical practice goes beyond minimum requirements
- Data Hoarding: Keeping data "just in case" violates data minimization principles and increases breach risk
- Assuming Technical Solutions: Ethics isn't just technical; it requires process, culture, and organizational commitment
| Category | R | Python |
|---|---|---|
| Fairness Metrics | fairness, fairmodels | fairlearn, aif360 |
| Model Explainability | DALEX, iml | shap, lime |
| Data Anonymization | sdcMicro, synthpop | pyarx, sdv |
| Data Hashing | digest | hashlib |
| Access Control | shinymanager | python-jose, casbin |
| Audit Logging | logger | structlog, auditlog |
| Level | Characteristics | Key Practices |
|---|---|---|
| 1. Reactive | Ethics considered only when problems arise | None formal; ad-hoc responses |
| 2. Compliant | Meet minimum legal requirements | GDPR compliance, basic consent forms |
| 3. Proactive | Ethics integrated into project planning | Ethics checklists, bias audits, documentation |
| 4. Embedded | Ethics is part of organizational culture | Ethics boards, continuous monitoring, training |
| 5. Leading | Sets industry standards, shares best practices | Open-source tools, research, advocacy |
Project Start
- Privacy impact assessment
- Data minimization review
- Stakeholder impact mapping
- Consent requirements defined
- Ethics board approval (if needed)
Development
- Training data bias audit
- Fairness metrics calculated
- Model card created
- Explainability tested
- Access controls implemented
Deployment & Ongoing
- Human oversight documented
- Monitoring dashboards live
- Appeals process available
- Regular bias re-audits scheduled
- Incident response plan ready
| Regulation | Key Rights | Football Analytics Impact |
|---|---|---|
| GDPR Art. 15 | Right of access | Players can request all data held about them |
| GDPR Art. 17 | Right to erasure | Players can request deletion (with limits) |
| GDPR Art. 20 | Data portability | Players can take data to new clubs |
| GDPR Art. 22 | Automated decisions | Right to human review of AI decisions |
| EU AI Act | AI transparency | High-risk AI systems need documentation |
# Python: Complete Ethics Framework Implementation
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional
import hashlib
import pandas as pd
@dataclass
class EthicsFramework:
    """Comprehensive ethics framework for football analytics.

    Bundles consent tracking, anonymization, audit logging, fairness
    checks and GDPR exports behind one object. All state is in-memory.
    """
    consent_records: List[Dict] = field(default_factory=list)
    data_registry: Dict = field(default_factory=dict)
    audit_log: List[Dict] = field(default_factory=list)
    model_cards: Dict = field(default_factory=dict)

    # 1. Consent Management
    def record_consent(self, player_id: str, purpose: str, granted: bool) -> Dict:
        """Append a versioned consent decision and return the new record."""
        entry = {
            "consent_id": f"CNS_{datetime.now().strftime('%Y%m%d%H%M%S')}",
            "player_id": player_id,
            "purpose": purpose,
            "granted": granted,
            "timestamp": datetime.now()
        }
        self.consent_records.append(entry)
        self._log_event("CONSENT_RECORDED", player_id, purpose)
        return entry

    def check_consent(self, player_id: str, purpose: str) -> bool:
        """Return True only when the latest matching record grants consent."""
        matching = sorted(
            (c for c in self.consent_records
             if c["player_id"] == player_id and c["purpose"] == purpose),
            key=lambda c: c["timestamp"]
        )
        if matching and matching[-1]["granted"]:
            self._log_event("CONSENT_CHECK_PASSED", player_id, purpose)
            return True
        self._log_event("CONSENT_CHECK_FAILED", player_id, purpose)
        return False

    # 2. Data Anonymization
    def anonymize(self, df: pd.DataFrame, id_cols: List[str],
                  quasi_ids: List[str]) -> pd.DataFrame:
        """Return a copy with IDs hashed and quasi-identifiers generalized."""

        def _mask(value) -> str:
            # Truncated SHA-256 keeps rows linkable without exposing raw IDs.
            return hashlib.sha256(str(value).encode()).hexdigest()[:16]

        out = df.copy()
        for column in id_cols:
            if column in out.columns:
                out[column] = out[column].apply(_mask)
        if "age" in quasi_ids and "age" in out.columns:
            # Replace exact age with a coarse bucket
            out["age_group"] = pd.cut(
                out["age"],
                bins=[0, 21, 25, 29, 33, 40],
                labels=["U21", "21-25", "26-29", "30-33", "34+"]
            )
            out = out.drop(columns=["age"])
        self._log_event("DATA_ANONYMIZED", ",".join(id_cols), str(len(df)))
        return out

    # 3. Audit Logging
    def _log_event(self, event_type: str, entity: str, details: str):
        """Append one entry to the in-memory audit trail."""
        self.audit_log.append({
            "timestamp": datetime.now(),
            "event_type": event_type,
            "entity": entity,
            "details": details
        })

    def get_audit_log(self) -> pd.DataFrame:
        """Return the audit trail as a DataFrame."""
        return pd.DataFrame(self.audit_log)

    # 4. Bias Monitoring
    def check_fairness(self, df: pd.DataFrame, prediction_col: str,
                       actual_col: str, protected_attr: str) -> pd.DataFrame:
        """Calculate selection-rate fairness metrics per protected group."""
        grouped = df.groupby(protected_attr).agg({
            prediction_col: ["count", lambda x: (x > 0.5).mean()],
            actual_col: "mean"
        }).reset_index()
        grouped.columns = [protected_attr, "n", "selection_rate", "avg_actual"]
        # Disparate impact: each group's rate relative to the best group;
        # the conventional four-fifths rule flags ratios under 0.8.
        best_rate = grouped["selection_rate"].max()
        grouped["disparate_impact"] = grouped["selection_rate"] / best_rate
        grouped["alert"] = grouped["disparate_impact"] < 0.8
        self._log_event("FAIRNESS_CHECK", protected_attr,
                        str(grouped["alert"].sum()))
        return grouped

    # 5. GDPR Export
    def gdpr_export(self, player_id: str) -> Dict:
        """Generate GDPR Article 15 compliant data export"""
        consent_history = [c for c in self.consent_records
                           if c["player_id"] == player_id]
        accesses = [
            e for e in self.audit_log
            if e["entity"] == player_id
            and e["event_type"] in ["DATA_READ", "DATA_PROCESSED"]
        ]
        return {
            "export_date": datetime.now().isoformat(),
            "player_id": player_id,
            "consent_history": consent_history,
            "data_accessed": accesses,
            "note": "Contact DPO for questions about this export"
        }
# Usage example
ef = EthicsFramework()
# Record consent
ef.record_consent("player_001", "tracking_analysis", True)
# Check consent before accessing data (also writes to the audit log)
if ef.check_consent("player_001", "tracking_analysis"):
    print("Access granted")
# View audit log
print("\nAudit Log:")
print(ef.get_audit_log().to_string(index=False))
# R: Complete Ethics Framework Implementation
library(tidyverse)
library(digest)
# Comprehensive ethics framework for football analytics.
# Returns a closure-based object bundling consent management,
# anonymization, audit logging and fairness checks. All state lives in
# the enclosing environment and is mutated with <<-.
ethics_framework <- function() {
  # Initialize storage
  consent_records <- tibble()
  data_registry <- tibble()
  audit_log <- tibble()
  model_cards <- list()

  # 1. Consent Management
  # Append a versioned consent decision and return the new record.
  record_consent <- function(player_id, purpose, granted) {
    record <- tibble(
      consent_id = paste0("CNS_", format(Sys.time(), "%Y%m%d%H%M%S")),
      player_id = player_id,
      purpose = purpose,
      granted = granted,
      timestamp = Sys.time()
    )
    consent_records <<- bind_rows(consent_records, record)
    log_event("CONSENT_RECORDED", player_id, purpose)
    record
  }

  # TRUE only when the most recent matching record grants consent.
  check_consent <- function(player_id, purpose) {
    consent <- consent_records %>%
      filter(player_id == !!player_id, purpose == !!purpose) %>%
      arrange(desc(timestamp)) %>%
      slice(1)
    if (nrow(consent) == 0 || !consent$granted) {
      log_event("CONSENT_CHECK_FAILED", player_id, purpose)
      return(FALSE)
    }
    log_event("CONSENT_CHECK_PASSED", player_id, purpose)
    TRUE
  }

  # 2. Data Anonymization
  # Hash identifier columns and generalize quasi-identifiers.
  anonymize <- function(data, id_cols, quasi_ids) {
    # BUG FIX: digest() is not vectorized - the previous
    # ~digest(as.character(.)) hashed each whole column once, giving
    # every row the same value. map_chr() hashes each element
    # individually, matching the Python per-value hashing.
    anonymized <- data %>%
      mutate(across(
        all_of(id_cols),
        ~ map_chr(.x, function(v) digest(as.character(v), algo = "sha256"))
      ))
    # Generalize quasi-identifiers
    if ("age" %in% quasi_ids && "age" %in% names(data)) {
      anonymized <- anonymized %>%
        mutate(age_group = cut(age, breaks = c(0, 21, 25, 29, 33, 40),
                               labels = c("U21", "21-25", "26-29", "30-33", "34+"))) %>%
        select(-age)
    }
    log_event("DATA_ANONYMIZED", paste(id_cols, collapse = ","), nrow(data))
    anonymized
  }

  # 3. Audit Logging
  log_event <- function(event_type, entity, details) {
    entry <- tibble(
      timestamp = Sys.time(),
      event_type = event_type,
      entity = entity,
      details = as.character(details)
    )
    audit_log <<- bind_rows(audit_log, entry)
  }

  # 4. Bias Monitoring
  # Selection-rate fairness per protected group (four-fifths / 0.8 rule).
  # NOTE(review): bind_cols relies on the 1-row attribute tibble being
  # recycled across groups - confirm acceptable for multi-group inputs.
  check_fairness <- function(predictions, actual, protected_attr) {
    results <- tibble(attribute = protected_attr) %>%
      bind_cols(
        predictions %>%
          group_by(!!sym(protected_attr)) %>%
          summarise(
            n = n(),
            selection_rate = mean(prediction > 0.5),
            avg_error = mean(abs(prediction - actual))
          )
      )
    max_rate <- max(results$selection_rate)
    results <- results %>%
      mutate(
        disparate_impact = selection_rate / max_rate,
        alert = disparate_impact < 0.8
      )
    log_event("FAIRNESS_CHECK", protected_attr, sum(results$alert))
    results
  }

  # Return public interface
  list(
    consent = list(record = record_consent, check = check_consent),
    anonymize = anonymize,
    fairness = check_fairness,
    audit = function() audit_log
  )
}
# Usage
ef <- ethics_framework()
# Record consent
ef$consent$record("player_001", "tracking_analysis", TRUE)
# Check consent before accessing data (also writes to the audit log)
if (ef$consent$check("player_001", "tracking_analysis")) {
  cat("Access granted\n")
}
# View audit log
print(ef$audit())Ethics in football analytics isn't just about compliance - it's about building trust and ensuring our work genuinely helps the sport and its participants. By embedding ethical considerations throughout the analytics lifecycle, we can deliver valuable insights while respecting the rights and dignity of the people our work affects. The football analytics community has an opportunity to lead in responsible AI and data practices, setting standards that other sports and industries can follow. This requires ongoing commitment, continuous learning, and the willingness to prioritize ethics even when it's inconvenient.