Football Analytics - Data Preparation

From Web Scrapes to Cleaned Datasets

The dataset is built by a Python script that scrapes the Transfermarkt website through a RESTful API service implemented with FastAPI. FastAPI is used in this repo as follows:

  • Define the API Endpoints: The endpoints (URLs) that the API responds to are defined in ‘app’. Each endpoint is associated with a Python function that handles the request.
  • Data Models: FastAPI uses Pydantic models to define the structure of the request and response data. This ensures that the data is validated and serialized correctly.
  • Asynchronous Requests: FastAPI supports asynchronous programming, allowing you to handle multiple requests concurrently, which is useful for I/O-bound operations like web scraping.

The API has the following endpoints:

Competitions

  • GET /competitions/search/{competition_name}: Search Competitions
  • GET /competitions/{competition_id}/clubs: Get Competition Clubs

Clubs

  • GET /clubs/search/{club_name}: Search Clubs
  • GET /clubs/{club_id}/profile: Get Club Profile
  • GET /clubs/{club_id}/players: Get Club Players

Players

  • GET /players/search/{player_name}: Search Players
  • GET /players/{player_id}/profile: Get Player Profile
  • GET /players/{player_id}/market_value: Get Player Market Value
  • GET /players/{player_id}/transfers: Get Player Transfers
  • GET /players/{player_id}/jersey_numbers: Get Player Jersey Numbers
  • GET /players/{player_id}/stats: Get Player Stats
  • GET /players/{player_id}/injuries: Get Player Injuries
  • GET /players/{player_id}/achievements: Get Player Achievements

Example response from API call (Get Player Profile):

{
   "updatedAt": "2025-02-01T19:51:24.727835",
   "id": "28003",
   "url": "https://www.transfermarkt.com/lionel-messi/profil/spieler/28003",
   "name": "Lionel Messi",
   "description": "Lionel Messi, 37, from Argentina ➤ Inter Miami CF, since 2023 ➤ Right Winger ➤ Market value: €20.00m ➤ * Jun 24, 1987 in Rosario, Argentina",
   "nameInHomeCountry": "Lionel Andrés Messi Cuccitini",
   "imageUrl": "https://img.a.transfermarkt.technology/portrait/header/28003-1710080339.jpg?lm=1",
   "dateOfBirth": "1987-06-24",
   "placeOfBirth": {
      "city": "Rosario",
      "country": "Argentina"
   },
   "age": 37,
   "height": 170,
   "citizenship": [
      "Argentina",
      "Spain"
   ],
   "isRetired": false,
   "position": {
      "main": "Right Winger",
      "other": [
         "Centre-Forward",
         "Second Striker"
      ]
   },
   "foot": "left",
   "shirtNumber": "#10",
   "club": {
      "id": "69261",
      "name": "Miami",
      "joined": "2023-07-15",
      "contractExpires": "2025-12-31"
   },
   "marketValue": 20000000,
   "agent": {
      "name": "Relatives"
   },
   "outfitter": "adidas",
   "socialMedia": [
      "http://www.facebook.com/LeoMessi",
      "http://instagram.com/leomessi",
      "http://www.leomessi.com/"
   ],
   "trainerProfile": {},
   "relatives": [
      {
         "id": "55306",
         "url": "/maximiliano-biancucchi/profil/spieler/55306",
         "name": "Maximiliano Biancucchi",
         "profileType": "player"
      },
      {
         "id": "82727",
         "url": "/emanuel-biancucchi/profil/spieler/82727",
         "name": "Emanuel Biancucchi",
         "profileType": "player"
      }
   ]
}
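
A nested response like this has to be flattened before it can become one row of a table. The sketch below shows one possible way to do that with a trimmed copy of the response above. The mapping from response keys to column names (for example `citizenship` to `National`) is an assumption made for illustration, not necessarily the mapping the scraper uses.

```python
import json


def flatten_profile(profile: dict) -> dict:
    """Flatten a nested profile response into one row of scalar values."""
    position = profile.get("position", {})
    club = profile.get("club", {})
    return {
        "id": profile.get("id"),
        "name": profile.get("name"),
        "dateOfBirth": profile.get("dateOfBirth"),
        "Age": profile.get("age"),
        "Height": profile.get("height"),
        "Foot": profile.get("foot"),
        "Position": position.get("main"),
        # Lists are joined into a single string so they fit in one CSV cell.
        "OtherPosition": ", ".join(position.get("other", [])),
        "National": ", ".join(profile.get("citizenship", [])),
        "Club_name": club.get("name"),
        "ContractExpiry": club.get("contractExpires"),
        "Outfitter": profile.get("outfitter"),
        "MarketValue": profile.get("marketValue"),
    }


# Trimmed copy of the example response shown above.
sample = json.loads("""
{
  "id": "28003",
  "name": "Lionel Messi",
  "dateOfBirth": "1987-06-24",
  "age": 37,
  "height": 170,
  "citizenship": ["Argentina", "Spain"],
  "position": {"main": "Right Winger", "other": ["Centre-Forward", "Second Striker"]},
  "foot": "left",
  "club": {"id": "69261", "name": "Miami", "contractExpires": "2025-12-31"},
  "marketValue": 20000000,
  "outfitter": "adidas"
}
""")

row = flatten_profile(sample)
```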

Let’s see how we gathered raw data and transformed it into a clean and usable dataset. The journey started with extracting player data from the web and ended with a structured dataset ready for analysis.

1. Gathering the Player Universe

Our first step was to build a bridge between two popular football data sources: FBref (a source for detailed game statistics) and Transfermarkt (a source for player profiles and market values). We began with a CSV file (fbref_to_tm_mapping.csv) that maps FBref player URLs to Transfermarkt player URLs; it was obtained from this library. This file was the starting point for our initial player universe:

  • We read the file into a pandas DataFrame (links_df).
  • We extracted the FBref player URLs and created a list (fbref_links).
  • We extracted player names from the URLs and added them as a new column (‘Name’).
  • We removed unnecessary columns, renamed the Transfermarkt URL column, and extracted player IDs from the Transfermarkt URLs.
  • The result was saved as player_links.csv, a CSV file containing Url, Name, and ID. It served as a critical reference for scraping: the ID is needed to scrape the player’s Transfermarkt page, and the Name gives an easy reference during analysis.
  • This step gave each player a unique identifier (their Transfermarkt ID), which we use to look up all of their data.
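
The steps above can be sketched roughly as follows. The column names `UrlFBref` and `UrlTmarkt` are assumptions about the mapping file, and the single row is illustrative sample data built in memory rather than read from fbref_to_tm_mapping.csv.

```python
import pandas as pd

# Stand-in for pd.read_csv("fbref_to_tm_mapping.csv"); column names are assumed.
links_df = pd.DataFrame({
    "UrlFBref": ["https://fbref.com/en/players/d70ce98e/Lionel-Messi"],
    "UrlTmarkt": ["https://www.transfermarkt.com/lionel-messi/profil/spieler/28003"],
})

fbref_links = links_df["UrlFBref"].tolist()

# The player name is the last path segment of the FBref URL, hyphens as spaces.
links_df["Name"] = [
    link.rstrip("/").split("/")[-1].replace("-", " ") for link in fbref_links
]

# Drop the FBref column, rename the Transfermarkt URL column,
# and take the trailing number of the Transfermarkt URL as the player ID.
player_links = links_df.drop(columns=["UrlFBref"]).rename(columns={"UrlTmarkt": "Url"})
player_links["ID"] = player_links["Url"].str.extract(r"/spieler/(\d+)", expand=False)
player_links = player_links[["Url", "Name", "ID"]]

# player_links.to_csv("player_links.csv", index=False) would persist the result.
```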

2. Scraping Player Data

Now that we had our list of player IDs, we moved on to the data mining phase. Here’s how we did it:

  • Scraping Logic: We used a Python class called GetPlayerStats, which takes a player’s Transfermarkt ID and scrapes all available information about them. This includes:
    • Profile Data: Name, Date of Birth, Age, Height, Foot, Position, Nationality, Market Value, Outfitter, Club, and Contract information.
    • Season-by-Season Stats: Yellow cards, Second yellow cards, Red Cards, Goals, Assists, Minutes Played, and Appearances for multiple seasons.
    • Market Value History: A timeline of player market values, allowing us to calculate average values each year.
    • Achievements: Number of cups and titles won by the player.
  • Concurrency for Efficiency: Rather than scraping one player after another, we used concurrent.futures.ThreadPoolExecutor to scrape multiple players concurrently, which significantly speeds up data gathering. We limited the pool to 5 threads to reduce the risk of hitting rate limits.
  • Data Storage: The scraped data for each player was stored as a row in a player_stats.csv file. We wrote in append mode so that if the scraping process fails or is interrupted, we can continue from where we stopped.
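
A minimal sketch of the concurrency and append-mode pattern described above, using only the standard library. `scrape_player` is a hypothetical stand-in for the GetPlayerStats class and makes no network calls; the two-column layout and the resume-by-ID check are likewise assumptions for illustration.

```python
import csv
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

FIELDS = ["id", "name"]  # assumed subset of the real columns


def scrape_player(player_id: str) -> dict:
    # Hypothetical stand-in for the real scraper; returns one row per player.
    return {"id": player_id, "name": f"player-{player_id}"}


def already_scraped(path: str) -> set:
    """Return the IDs already present in the output file, if it exists."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["id"] for row in csv.DictReader(f)}


def scrape_all(player_ids, path, max_workers=5):
    """Scrape the players not yet in `path` and append one row per player."""
    done = already_scraped(path)
    todo = [pid for pid in player_ids if pid not in done]
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(scrape_player, pid): pid for pid in todo}
            # Rows are written from the main thread as each worker finishes.
            for future in as_completed(futures):
                try:
                    writer.writerow(future.result())
                    f.flush()  # keep finished rows on disk if the run is interrupted
                except Exception as exc:
                    print(f"player {futures[future]} failed: {exc}")
    return len(todo)
```

Calling `scrape_all(ids, "player_stats.csv")` a second time skips every ID already in the file, which is what makes an interrupted run resumable.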

3. Data Cleaning and Transformation

Raw scraped data is rarely ready for analysis, so we cleaned and prepared it as follows.

  • Loading the Data: The player_stats.csv file was loaded into a pandas DataFrame (df).
  • Handling Missing Values:
    • We first identified which columns had missing values to determine the best strategy for handling them.
    • Several columns had missing values: Height, Age, Foot, Position, National, Outfitter, Club_name, ContractExpiry, and ContractOption.
    • We dropped all rows with missing values in Height, Age, Club_name, Position, or National.
    • We set Foot to ‘hand’ for goalkeepers where the value was missing.
    • We filled the remaining missing values in Foot with the most common foot for each player’s position, using a dictionary lookup.
    • Finally, we filled the remaining missing values in Outfitter with ‘Unknown’ and in ContractOption with ‘None’.
  • Fixing Encoding Issues: Player names and club names often had encoding issues introduced by the scraping process. We repaired them with ftfy.fix_text() to improve data consistency.
  • Type Conversion: The Age column was converted to int. The Height column was converted to float after removing non-numeric characters and replacing commas with decimal points.
  • Saving the Cleaned Data: The cleaned and transformed DataFrame was saved to a cleaned_player_stats.csv file.
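
The cleaning steps above might look roughly like this in pandas. The sample rows are made up, the raw Height format (`1,70 m`) is an assumption, and the ftfy encoding repair is left out to keep the sketch free of extra dependencies.

```python
import pandas as pd

# Made-up sample rows standing in for player_stats.csv.
df = pd.DataFrame({
    "Height": ["1,70 m", "1,88 m", None, "1,80 m", "1,75 m"],
    "Age": [37.0, 30.0, 25.0, 28.0, 22.0],
    "Foot": ["left", None, "right", None, "left"],
    "Position": ["Right Winger", "Goalkeeper", "Centre-Back", "Right Winger", "Right Winger"],
    "National": ["Argentina", "Germany", "Spain", "Brazil", "France"],
    "Club_name": ["Miami", "Club A", "Club B", "Club C", "Club D"],
    "Outfitter": ["adidas", None, None, "Nike", None],
    "ContractOption": [None, "club option 1 year", None, None, None],
})

# Drop rows that are missing any of the essential fields.
df = df.dropna(subset=["Height", "Age", "Club_name", "Position", "National"]).copy()

# Goalkeepers with no recorded foot get the value 'hand'.
gk_missing = (df["Position"] == "Goalkeeper") & df["Foot"].isna()
df.loc[gk_missing, "Foot"] = "hand"

# Build a position -> most common foot dictionary and use it as a lookup.
common_foot = (
    df.dropna(subset=["Foot"])
    .groupby("Position")["Foot"]
    .agg(lambda s: s.mode().iloc[0])
    .to_dict()
)
df["Foot"] = df["Foot"].fillna(df["Position"].map(common_foot))

# Fill the remaining gaps with explicit placeholder strings.
df["Outfitter"] = df["Outfitter"].fillna("Unknown")
df["ContractOption"] = df["ContractOption"].fillna("None")

# Type conversion: Age to int, Height from '1,70 m' to 1.70.
df["Age"] = df["Age"].astype(int)
df["Height"] = (
    df["Height"]
    .str.replace("m", "", regex=False)
    .str.replace(",", ".", regex=False)
    .str.strip()
    .astype(float)
)
```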

Summary

After the process above we had:

  1. Player ID Mapping: A file containing a unique ID for each player (player_links.csv).
  2. Clean Player Data: A structured cleaned_player_stats.csv file containing all of the scraped player data, with a consistent and clean format. This file is now ready for data analysis.

This dataset contains detailed performance metrics, contract details, and market value trends for 12,952 football players. The data spans multiple years, covering key statistics such as goals, assists, appearances, and disciplinary records.

Dataset Structure

The dataset consists of 64 columns, categorized into the following groups:

1. Player Information

  • id - Unique identifier for each player.
  • name - Full name of the player.
  • dateOfBirth - Player’s date of birth.
  • Age - Current age of the player.
  • Height - Player’s height (in meters).
  • Foot - Preferred foot (left/right, or ‘hand’ for goalkeepers whose value was missing).
  • Position - Primary playing position.
  • OtherPosition - Additional positions the player can play.
  • National - Player’s nationality.

2. Contract and Club Details

  • Club_name - Current club of the player.
  • ContractExpiry - Expiration date of the current contract.
  • ContractOption - Additional contract details (if applicable).
  • Outfitter - Brand sponsoring the player’s kit (if available).

3. Performance Metrics (Yearly Stats)

Performance metrics are recorded for multiple years (2020-2025), indicated by the prefix (20, 21, 22, etc.). Each year contains:

  • YC - Yellow Cards.
  • YC2 - Second Yellow Cards.
  • RC - Red Cards.
  • G - Goals scored.
  • A - Assists.
  • MP - Minutes played.
  • AP - Appearances.

Example:

  • 20G - Goals scored in 2020.
  • 21MP - Minutes played in 2021.

4. Market Value

  • MarketValue - Current estimated market value of the player.
  • 2020AvgMV, 2021AvgMV, 2022AvgMV, 2023AvgMV, 2024AvgMV, 2025AvgMV - Average market value for each year.
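
One way the yearly averages could be derived from a market value history is sketched below. The shape of the history entries (`date` and `value` keys) and the sample values are assumptions made for illustration, not real data.

```python
from collections import defaultdict
from statistics import mean

# Made-up history entries; the real response shape may differ.
history = [
    {"date": "2023-06-27", "value": 10000000},
    {"date": "2023-12-14", "value": 20000000},
    {"date": "2024-05-30", "value": 30000000},
    {"date": "2024-10-24", "value": 25000000},
]


def yearly_average(history, years=range(2020, 2026)):
    """Average the recorded values per calendar year; None if a year has no entries."""
    by_year = defaultdict(list)
    for entry in history:
        by_year[int(entry["date"][:4])].append(entry["value"])
    return {
        f"{year}AvgMV": (mean(by_year[year]) if by_year[year] else None)
        for year in years
    }


averages = yearly_average(history)
```

Years with no recorded value come back as `None` here, so they surface as missing values rather than zeros.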

5. Rankings & Achievements

  • Ranking - Player ranking based on performance.
  • TotalCups - Total number of cups/titles won.

This approach gave us a consistent, analysis-ready dataset in which missing values, encodings, and data types have been handled explicitly.
