Football Analytics - Data Preparation
From Web Scrapes to Cleaned Datasets
The dataset is created by a Python script that scrapes data from the Transfermarkt website through a RESTful API service built with FastAPI. How FastAPI works in this repo:
- Define the API Endpoints: The endpoints (URLs) that the API responds to are defined in ‘app’. Each endpoint is associated with a Python function that handles the request.
- Data Models: FastAPI uses Pydantic models to define the structure of the request and response data. This ensures that the data is validated and serialized correctly.
- Asynchronous Requests: FastAPI supports asynchronous programming, allowing you to handle multiple requests concurrently, which is useful for I/O-bound operations like web scraping.
The API has the following endpoints:
Competitions
- GET /competitions/search/{competition_name}: Search Competitions
- GET /competitions/{competition_id}/clubs: Get Competition Clubs
Clubs
- GET /clubs/search/{club_name}: Search Clubs
- GET /clubs/{club_id}/profile: Get Club Profile
- GET /clubs/{club_id}/players: Get Club Players
Players
- GET /players/search/{player_name}: Search Players
- GET /players/{player_id}/profile: Get Player Profile
- GET /players/{player_id}/market_value: Get Player Market Value
- GET /players/{player_id}/transfers: Get Player Transfers
- GET /players/{player_id}/jersey_numbers: Get Player Jersey Numbers
- GET /players/{player_id}/stats: Get Player Stats
- GET /players/{player_id}/injuries: Get Player Injuries
- GET /players/{player_id}/achievements: Get Player Achievements
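As a minimal sketch of how a client might call these endpoints, the helpers below build request URLs and decode the JSON body using only the standard library. The base URL (a local server on port 8000) and the helper names `player_url`, `search_url`, and `fetch_json` are assumptions for illustration, not part of the repo.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the FastAPI service is running locally on port 8000.
BASE_URL = "http://localhost:8000"


def player_url(player_id, resource="profile"):
    """Build the URL for one of the /players/{player_id}/... endpoints."""
    return f"{BASE_URL}/players/{quote(str(player_id))}/{resource}"


def search_url(kind, name):
    """Build a search URL such as /players/search/{player_name}."""
    return f"{BASE_URL}/{kind}/search/{quote(name)}"


def fetch_json(url, timeout=10.0):
    """GET a URL and decode the JSON body (needs the service to be running)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_json(player_url("28003"))` would request the Get Player Profile endpoint for the player with ID 28003.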
Example response from API call (Get Player Profile):
{
  "updatedAt": "2025-02-01T19:51:24.727835",
  "id": "28003",
  "url": "https://www.transfermarkt.com/lionel-messi/profil/spieler/28003",
  "name": "Lionel Messi",
  "description": "Lionel Messi, 37, from Argentina ➤ Inter Miami CF, since 2023 ➤ Right Winger ➤ Market value: €20.00m ➤ * Jun 24, 1987 in Rosario, Argentina",
  "nameInHomeCountry": "Lionel Andrés Messi Cuccitini",
  "imageUrl": "https://img.a.transfermarkt.technology/portrait/header/28003-1710080339.jpg?lm=1",
  "dateOfBirth": "1987-06-24",
  "placeOfBirth": {
    "city": "Rosario",
    "country": "Argentina"
  },
  "age": 37,
  "height": 170,
  "citizenship": [
    "Argentina",
    "Spain"
  ],
  "isRetired": false,
  "position": {
    "main": "Right Winger",
    "other": [
      "Centre-Forward",
      "Second Striker"
    ]
  },
  "foot": "left",
  "shirtNumber": "#10",
  "club": {
    "id": "69261",
    "name": "Miami",
    "joined": "2023-07-15",
    "contractExpires": "2025-12-31"
  },
  "marketValue": 20000000,
  "agent": {
    "name": "Relatives"
  },
  "outfitter": "adidas",
  "socialMedia": [
    "http://www.facebook.com/LeoMessi",
    "http://instagram.com/leomessi",
    "http://www.leomessi.com/"
  ],
  "trainerProfile": {},
  "relatives": [
    {
      "id": "55306",
      "url": "/maximiliano-biancucchi/profil/spieler/55306",
      "name": "Maximiliano Biancucchi",
      "profileType": "player"
    },
    {
      "id": "82727",
      "url": "/emanuel-biancucchi/profil/spieler/82727",
      "name": "Emanuel Biancucchi",
      "profileType": "player"
    }
  ]
}
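A nested response like this has to be flattened before it can become one row of a CSV file. The sketch below shows one way that could be done. The function name `profile_to_row` and the mapping from API fields to column names are assumptions made for illustration; the height conversion assumes the API reports centimetres (170 in the sample) while the final dataset stores meters.

```python
def profile_to_row(profile):
    """Flatten a Get Player Profile response into a flat dict (one CSV row)."""
    position = profile.get("position", {})
    club = profile.get("club", {})
    height_cm = profile.get("height")
    return {
        "id": profile.get("id"),
        "name": profile.get("name"),
        "dateOfBirth": profile.get("dateOfBirth"),
        "Age": profile.get("age"),
        # Assumption: the API gives centimetres, the dataset uses meters.
        "Height": height_cm / 100 if height_cm is not None else None,
        "Foot": profile.get("foot"),
        "Position": position.get("main"),
        "OtherPosition": ", ".join(position.get("other", [])),
        "National": ", ".join(profile.get("citizenship", [])),
        "Club_name": club.get("name"),
        "ContractExpiry": club.get("contractExpires"),
        "Outfitter": profile.get("outfitter"),
        "MarketValue": profile.get("marketValue"),
    }


# A trimmed copy of the example response above.
sample = {
    "id": "28003",
    "name": "Lionel Messi",
    "height": 170,
    "citizenship": ["Argentina", "Spain"],
    "position": {"main": "Right Winger", "other": ["Centre-Forward", "Second Striker"]},
    "foot": "left",
    "club": {"name": "Miami", "contractExpires": "2025-12-31"},
    "marketValue": 20000000,
}
row = profile_to_row(sample)
```

Using `.get()` throughout means a missing field becomes `None` or an empty string rather than raising an error, which suits scraped data where fields are often absent.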
Let’s see how we gathered raw data and transformed it into a clean and usable dataset. The journey started with extracting player data from the web and ended with a structured dataset ready for analysis.
1. Gathering the Player Universe
Our first step was to create a bridge between two popular football data sources: FBref (a source for detailed game statistics) and Transfermarkt (a source for player profiles and market values). We began with a .csv file (fbref_to_tm_mapping.csv) that contained mappings between FBref player URLs and Transfermarkt player URLs. This was obtained from this library. This file was our starting point in creating our initial player universe:
- We read the file into a pandas DataFrame (links_df).
- We extracted the FBref player URLs and created a list (fbref_links).
- We extracted player names from the URLs and added them as a new column (‘Name’).
- We removed unnecessary columns, renamed the Transfermarkt URL column, and extracted player IDs from the Transfermarkt URLs.
- The result was saved as player_links.csv, a CSV containing Url, Name, and ID. This was a critical reference for scraping player data, because we need the ID to scrape from the Transfermarkt page and the Name for easy reference in the analysis.
- We used this step to create a unique identifier for each player (their Transfermarkt ID), which we use to find all their data.
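The URL handling in the steps above can be sketched with the standard library alone. This is not the original pandas code. The input column names `UrlFBref` and `UrlTmarkt`, the FBref URL shape (`.../players/<hash>/<First-Last>`), and the helper names are all assumptions made for illustration; the Transfermarkt URL shape matches the example response shown earlier.

```python
import csv
import io


def name_from_fbref_url(url):
    """Take the last path segment ('Lionel-Messi') and turn hyphens into spaces."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    return slug.replace("-", " ")


def id_from_tm_url(url):
    """The Transfermarkt player ID is the last path segment of the profile URL."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def build_player_links(mapping_csv_text):
    """Return rows with the Url, Name, and ID columns described above."""
    rows = []
    for record in csv.DictReader(io.StringIO(mapping_csv_text)):
        tm_url = record["UrlTmarkt"]  # hypothetical column name
        rows.append({
            "Url": tm_url,
            "Name": name_from_fbref_url(record["UrlFBref"]),  # hypothetical column name
            "ID": id_from_tm_url(tm_url),
        })
    return rows


mapping_text = (
    "UrlFBref,UrlTmarkt\n"
    "https://fbref.com/en/players/d70ce98e/Lionel-Messi,"
    "https://www.transfermarkt.com/lionel-messi/profil/spieler/28003\n"
)
links = build_player_links(mapping_text)
```

One caveat of deriving names from URL slugs is that hyphenated names lose their hyphens, so the Name column is best treated as a readable label rather than a key. The ID column is the reliable identifier.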
2. Scraping Player Data
Now that we had our list of player IDs, we moved on to the data mining phase. Here’s how we did it:
- Scraping Logic: We used a Python class called GetPlayerStats, which takes a player’s Transfermarkt ID and scrapes all available information about them. This includes:
  - Profile Data: Name, Date of Birth, Age, Height, Foot, Position, Nationality, Market Value, Outfitter, Club, and Contract information.
  - Season-by-Season Stats: Yellow cards, Second yellow cards, Red Cards, Goals, Assists, Minutes Played, and Appearances for multiple seasons.
  - Market Value History: A timeline of player market values, allowing us to calculate average values each year.
  - Achievements: Number of cups and titles won by the player.
- Concurrency for Efficiency: Rather than scrape one player after another, we used concurrent.futures.ThreadPoolExecutor to scrape multiple players concurrently. This significantly speeds up the data gathering process. We limited the pool to 5 threads to reduce the risk of hitting rate limits.
- Data Storage: The scraped data for each player was stored as a row in a player_stats.csv file. We did this in append mode so that if the scraping process fails or is interrupted, we can continue from where we stopped.
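The concurrency and append-mode storage described above can be sketched as follows. This is a simplified illustration, not the project's code: `scrape_player` is a placeholder standing in for the GetPlayerStats class, the two-column `FIELDS` list is shortened, and the resume logic (skipping IDs already present in the file) is one assumed way to "continue from where we stopped".

```python
import csv
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

FIELDS = ["id", "name"]  # shortened for the sketch; the real file has many more columns


def scrape_player(player_id):
    """Placeholder for the real scraper; returns one row as a dict."""
    return {"id": player_id, "name": f"player-{player_id}"}


def already_scraped(path):
    """Collect the IDs already stored in the CSV so a rerun can skip them."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["id"] for row in csv.DictReader(f)}


def scrape_all(player_ids, path="player_stats.csv", max_workers=5, scrape=scrape_player):
    """Scrape pending players concurrently and append each result to the CSV."""
    done = already_scraped(path)
    pending = [pid for pid in player_ids if pid not in done]
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    written = 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(scrape, pid): pid for pid in pending}
            for future in as_completed(futures):
                try:
                    row = future.result()
                except Exception as exc:
                    print(f"skipping {futures[future]}: {exc}")
                    continue
                # Rows are written from the main thread only, so no lock is needed.
                writer.writerow(row)
                f.flush()  # keep the file current in case the run is interrupted
                written += 1
    return written
```

Only the network-bound scraping runs in worker threads. Writing happens in the `as_completed` loop on the main thread, which avoids interleaved rows without any explicit locking.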
3. Data Cleaning and Transformation
Raw scraped data is rarely ready for analysis, so we needed to do some data cleaning and preparation.
- Loading the Data: The player_stats.csv file was loaded into a pandas DataFrame (df).
- Handling Missing Values:
  - We first identified which columns had missing values to determine the best strategy for handling them.
  - Several columns had missing values, such as: Height, Age, Foot, Position, National, Outfitter, Club_name, ContractExpiry, ContractOption.
  - We dropped all rows that had missing values in Height, Age, Club_name, Position or National.
  - We set Foot to ‘hand’ for goalkeepers where the value was missing.
  - We filled the remaining missing values in Foot by assigning the most common foot for each player’s position using a dictionary lookup for each value.
  - Finally, we filled all remaining missing values for Outfitter with ‘Unknown’ and for ContractOption with ‘None’.
- Fixing Encoding Issues: Player names and club names often had encoding issues due to the scraping process. ftfy.fix_text() fixed this to improve data consistency.
- Type Conversion: The Age column was converted to int, and the Height column, after removing extra characters and replacing commas, was converted to float.
- Saving the Cleaned Data: The cleaned and transformed DataFrame was saved to a cleaned_player_stats.csv file.
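Two of the cleaning steps above, filling Foot from a per-position lookup and converting Height to a float, can be sketched in plain Python. This is an illustration rather than the original pandas code. The position label "Goalkeeper", the raw height format "1,70 m", and the function names are assumptions.

```python
from collections import Counter, defaultdict


def parse_height(raw):
    """Turn a raw string such as '1,70 m' into the float 1.7."""
    cleaned = raw.replace("m", "").replace("\xa0", "").strip()
    return float(cleaned.replace(",", "."))


def most_common_foot_by_position(rows):
    """Build the lookup dictionary: position -> most frequent known foot."""
    counts = defaultdict(Counter)
    for row in rows:
        if row.get("Foot"):
            counts[row["Position"]][row["Foot"]] += 1
    return {pos: counter.most_common(1)[0][0] for pos, counter in counts.items()}


def fill_missing_foot(rows):
    """Fill empty Foot values: 'hand' for goalkeepers, the position's mode otherwise."""
    lookup = most_common_foot_by_position(rows)
    for row in rows:
        if row.get("Foot"):
            continue
        if row["Position"] == "Goalkeeper":  # assumed label
            row["Foot"] = "hand"
        else:
            # Stays None if no player in this position has a known foot.
            row["Foot"] = lookup.get(row["Position"])
    return rows


players = [
    {"Position": "Centre-Back", "Foot": "right"},
    {"Position": "Centre-Back", "Foot": "right"},
    {"Position": "Centre-Back", "Foot": "left"},
    {"Position": "Centre-Back", "Foot": ""},
    {"Position": "Goalkeeper", "Foot": None},
]
fill_missing_foot(players)
```

The lookup is built once from the rows that already have a value, so filling one row never changes the statistics used for the next.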
Summary
After the process above we had:
- Player ID Mapping: A file containing a unique ID for each player (player_links.csv).
- Clean Player Data: A structured cleaned_player_stats.csv file containing all of the scraped player data, with a consistent and clean format. This file is now ready for data analysis.
This dataset contains detailed performance metrics, contract details, and market value trends for 12,952 football players. The data spans multiple years, covering key statistics such as goals, assists, appearances, and disciplinary records.
Dataset Structure
The dataset consists of 64 columns, categorized into the following groups:
1. Player Information
- id - Unique identifier for each player.
- name - Full name of the player.
- dateOfBirth - Player’s date of birth.
- Age - Current age of the player.
- Height - Player’s height (in meters).
- Foot - Preferred foot (Left/Right).
- Position - Primary playing position.
- OtherPosition - Additional positions the player can play.
- National - Player’s nationality.
2. Contract and Club Details
- Club_name - Current club of the player.
- ContractExpiry - Expiration date of the current contract.
- ContractOption - Additional contract details (if applicable).
- Outfitter - Brand sponsoring the player’s kit (if available).
3. Performance Metrics (Yearly Stats)
Performance metrics are recorded for multiple years (2020-2025), indicated by the prefix (20, 21, 22, etc.). Each year contains:
- YC - Yellow Cards.
- YC2 - Second Yellow Cards.
- RC - Red Cards.
- G - Goals scored.
- A - Assists.
- MP - Minutes played.
- AP - Appearances.
Example:
- 20G - Goals scored in 2020.
- 21MP - Minutes played in 2021.
4. Market Value Trends
- MarketValue - Current estimated market value of the player.
- 2020AvgMV, 2021AvgMV, 2022AvgMV, 2023AvgMV, 2024AvgMV, 2025AvgMV - Average market value for each year.
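The yearly averages could be derived from a market value timeline roughly as sketched below. The shape of the history (a list of entries with an ISO `date` string and a numeric `value`) and the function name are assumptions, and the numbers in the example are illustrative, not real data.

```python
from collections import defaultdict
from statistics import mean


def yearly_average_market_value(history, years=range(2020, 2026)):
    """Average the recorded values per calendar year; None if a year has no entries."""
    by_year = defaultdict(list)
    for entry in history:
        year = int(entry["date"][:4])  # assumes dates look like 'YYYY-MM-DD'
        by_year[year].append(entry["value"])
    return {
        f"{year}AvgMV": mean(by_year[year]) if by_year[year] else None
        for year in years
    }


# Illustrative values only.
history = [
    {"date": "2023-03-01", "value": 10_000_000},
    {"date": "2023-10-01", "value": 20_000_000},
    {"date": "2024-05-01", "value": 30_000_000},
]
averages = yearly_average_market_value(history)
```

Years with no recorded valuation are left as `None` here rather than zero, so that a missing value is not mistaken for a worthless player.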
5. Rankings & Achievements
- Ranking - Player ranking based on performance.
- TotalCups - Total number of cups/titles won.
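As a quick consistency check on the naming scheme, the sketch below rebuilds the column names from the five groups listed above and counts them. The ordering of the columns is an assumption; the names themselves come from the lists above.

```python
PLAYER_INFO = ["id", "name", "dateOfBirth", "Age", "Height", "Foot",
               "Position", "OtherPosition", "National"]
CONTRACT = ["Club_name", "ContractExpiry", "ContractOption", "Outfitter"]
STATS = ["YC", "YC2", "RC", "G", "A", "MP", "AP"]
YEARS = range(2020, 2026)  # 2020 through 2025

# Yearly stats use a two-digit year prefix, e.g. 20G or 21MP.
yearly_stats = [f"{year % 100}{stat}" for year in YEARS for stat in STATS]

# Market value columns use the full year, e.g. 2020AvgMV.
market_value = ["MarketValue"] + [f"{year}AvgMV" for year in YEARS]

rankings = ["Ranking", "TotalCups"]

all_columns = PLAYER_INFO + CONTRACT + yearly_stats + market_value + rankings
print(len(all_columns))  # 9 + 4 + 42 + 7 + 2 = 64
```

The count matches the 64 columns stated for the dataset: 6 years times 7 statistics gives 42 yearly columns, and the remaining 22 come from the other four groups.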
This approach gave us a consistent, well-structured dataset, which is a solid foundation for the analysis that follows.
