Football Analytics - Model Implementation - Support Vector Machine
Overview
Support Vector Machines (SVMs) are one of the most powerful and versatile supervised learning models in machine learning, especially useful for classification tasks.
At their core, SVMs are linear classifiers — they aim to draw a hyperplane between classes in a dataset. But what makes SVMs truly powerful is their ability to handle non-linear data through a method called the kernel trick.
Visual Representation of SVM Classifier
How SVM Works
Imagine a dataset where two classes are neatly split by a straight line. This is the sweet spot for a linear SVM, which seeks to find the best separating line (or in higher dimensions, a hyperplane). SVMs look for the maximum margin separator — the hyperplane that is as far as possible from the nearest points of both classes, known as support vectors. This margin-based approach gives SVMs strong generalization abilities — they often perform well even on unseen data.
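As a quick illustration of the margin and support vectors, the sketch below fits a linear SVM on a small synthetic 2-D dataset with scikit-learn (toy data, not the project's football dataset) and prints the learned hyperplane parameters and the support vectors that define the margin.

```python
# Minimal sketch: fit a (nearly) hard-margin linear SVM on toy 2-D data
# and inspect the support vectors that define the maximum-margin hyperplane.
# Illustrative only; not the project's football dataset.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],   # class 0
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3)  # large C approximates a hard margin
clf.fit(X, y)

print("weight vector w:", clf.coef_[0])
print("bias b:", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)  # points closest to the hyperplane
```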
The Role of the Dot Product
SVMs make decisions based on the formula:
$f(x) = w^T \cdot x + b$ where:
- $w$ is the weight vector,
- $x$ is the input vector,
- $b$ is the bias term.
The dot product $w^T \cdot x$ measures the alignment between the weight vector and the input vector; the sign of $f(x)$ then determines the class: if it is positive, the input belongs to one class, and if negative, to the other.
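A minimal sketch of this decision rule, using hypothetical weights and a hypothetical bias rather than values learned from the project's data:

```python
import numpy as np

# Hypothetical weight vector and bias, chosen only for illustration.
w = np.array([0.4, -0.2, 0.1])
b = -0.05

def decision(x):
    """f(x) = w^T x + b; the sign of the result selects the class."""
    return np.dot(w, x) + b

x_new = np.array([1.0, 2.0, 0.5])
score = decision(x_new)
label = 1 if score > 0 else -1
print(f"f(x) = {score:.3f} -> class {label}")
```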
Kernel Trick
The kernel trick is a powerful technique that allows SVMs to operate in high-dimensional spaces without explicitly transforming the data. Instead of computing the coordinates of the data in a higher-dimensional space, SVMs use kernel functions to compute the dot product directly in that space.
This is particularly useful for non-linear classification tasks. Common kernel functions include:
- Linear Kernel: This is the simplest kernel, where the decision boundary is a straight line (or hyperplane in higher dimensions). It is defined as the dot product of two input vectors. Formula: $K(x_i, x_j) = x_i^T \cdot x_j$.
- Polynomial Kernel: This kernel allows for non-linear decision boundaries by computing a polynomial of the dot product of two input vectors. Formula: $K(x_i, x_j) = (x_i^T \cdot x_j + r)^d$, where $r$ is a constant and $d$ is the degree of the polynomial.
- Radial Basis Function (RBF) Kernel: This kernel is particularly effective for non-linear data. It computes the similarity between two input vectors based on their distance in the feature space. Formula: $K(x_i, x_j) = e^{-\gamma \left\| x_i - x_j \right\|^2}$, where $\gamma$ is a parameter that defines the width of the Gaussian kernel.
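For concreteness, the three kernels above can be written directly in NumPy; the parameter values below ($r$, $d$, $\gamma$) are arbitrary choices for demonstration, not the values used in the experiments later in this report.

```python
import numpy as np

# Illustrative NumPy versions of the three kernels described above.
def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, r=1.0, d=2):
    return (np.dot(xi, xj) + r) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

a, b = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(linear_kernel(a, b), polynomial_kernel(a, b), rbf_kernel(a, b))
```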
Visual Representation of Kernel Trick
Polynomial Kernel Example (r = 1, d = 2)
With $r=1$ and $d=2$, the polynomial kernel becomes:
$K(x_i, x_j) = (x_i^T \cdot x_j + 1)^2$
To avoid confusion, let’s denote $x_i$ as $x$ and $x_j$ as $y$. The polynomial kernel can be expressed as:
$K(x, y) = (x^T \cdot y + 1)^2 = (\sum x_i y_i + 1)^2 = (x_{1}y_{1} + x_{2}y_{2} + 1)^2$
Expanding this expression gives us:
$K(x, y) = x_{1}^2y_{1}^2 + x_{2}^2y_{2}^2 + 2x_{1}y_{1}x_{2}y_{2} + 2x_{1}y_{1} + 2x_{2}y_{2} + 1$
The corresponding feature map is:
$ \phi(x_1, x_2) = (x_{1}^2, x_{2}^2, \sqrt{2}x_{1}x_{2}, \sqrt{2}x_{1}, \sqrt{2}x_{2}, 1)$
Now the original 2D point $(x_1, x_2)$ is mapped to 6-dimensional space.
Let’s say we have two points:
- Point A: $(1, 2)$
- Point B: $(3, 4)$
Let x = (1, 2) and y = (3, 4).
$\phi(1,2) = (1^2, 2^2, \sqrt{2} \cdot 1 \cdot 2, \sqrt{2} \cdot 1, \sqrt{2} \cdot 2, 1)$
$\phi(1,2) = (1, 4, 2\sqrt{2}, \sqrt{2}, 2\sqrt{2}, 1)$
$\phi(3,4) = (3^2, 4^2, \sqrt{2} \cdot 3 \cdot 4, \sqrt{2} \cdot 3, \sqrt{2} \cdot 4, 1)$
$\phi(3,4) = (9, 16, 12\sqrt{2}, 3\sqrt{2}, 4\sqrt{2}, 1)$
Taking the dot product of the mapped vectors gives $\phi(1,2) \cdot \phi(3,4) = 9 + 64 + 48 + 6 + 16 + 1 = 144$, which is exactly the kernel evaluated directly in the original space: $K(x, y) = (1 \cdot 3 + 2 \cdot 4 + 1)^2 = 12^2 = 144$. The polynomial kernel therefore lets us create a non-linear decision boundary in the original 2D space by implicitly mapping the data to this higher-dimensional space.
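The same check can be done numerically; the snippet below confirms that the kernel evaluated in the original 2-D space equals the dot product of the explicitly mapped 6-D vectors.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel with r = 1."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

x, y = np.array([1.0, 2.0]), np.array([3.0, 4.0])

k_direct = (np.dot(x, y) + 1) ** 2   # kernel computed in the original 2-D space
k_mapped = np.dot(phi(x), phi(y))    # dot product in the 6-D feature space

print(k_direct, k_mapped)  # both evaluate to 144.0
```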
Advantages of SVM
- Effective in high-dimensional spaces (e.g., text classification, image recognition).
- Works well with small datasets where the number of features is greater than the number of samples.
- Robust to outliers (when using soft-margin SVM).
- Supports both linear and non-linear classification using kernel tricks.
Disadvantages of SVM
- Computationally expensive for large datasets.
- Sensitive to parameter tuning, especially C and kernel choice.
- Not easily interpretable, unlike decision trees.
- Struggles with overlapping classes, as it focuses on margin maximization.
Use Cases of SVM
- Text classification (e.g., spam detection, sentiment analysis).
- Image recognition (e.g., face detection).
- Medical diagnosis (e.g., cancer classification).
- Bioinformatics (e.g., gene classification).
Dataset Preparation
The dataset consists of football player attributes, including categorical and numerical features. These features help analyze how different factors influence a player’s market value.
Numerical Features:
- Performance Metrics Across Seasons: Includes appearances (MP), goals (G), assists (A), yellow cards (YC), red cards (RC), and other relevant statistics.
Target Variable:
- ValueCategory: Represents different categories of market value for a player, classifying them into three groups (0, 1 or 2).
Dataset before preprocessing:
Data Preprocessing and Train-Test Split
Before training the model, several preprocessing steps are applied to ensure data quality and improve model performance:
- Class Balancing: The dataset may have an imbalance in player value categories, meaning some categories may have significantly fewer samples than others. To address this, we perform resampling to ensure each class has an equal number of samples (1,000 per category).
- Train-Test Split: The dataset is split into 80% training and 20% testing, ensuring that the model learns from a diverse set of data while being evaluated on unseen examples.
- Outlier Handling: To prevent the model from encountering values in the test set that were not seen during training, any test set values exceeding the maximum observed training values for a particular feature are capped.
- Standardization: The numerical features are standardized to have a mean of 0 and a standard deviation of 1. This step is crucial for SVMs, as they are sensitive to the scale of the input features.
The train-test split random seed is the same for all models to ensure consistency in the results.
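A sketch of this preprocessing pipeline is shown below. It assumes a pandas DataFrame loaded from a placeholder file with a "ValueCategory" target column; the file path, the seed value, and the resampling call are illustrative stand-ins for the project's actual code, following the 1,000-per-category figure stated above.

```python
# Sketch of the preprocessing steps described above (placeholder names/paths).
import pandas as pd
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("players.csv")  # placeholder path

# 1. Class balancing: resample each ValueCategory class to 1,000 rows.
balanced = pd.concat([
    resample(df[df["ValueCategory"] == c], n_samples=1000,
             replace=True, random_state=42)
    for c in df["ValueCategory"].unique()
])

X = balanced.drop(columns=["ValueCategory"])
y = balanced["ValueCategory"]

# 2. 80/20 train-test split with a fixed seed (shared across all models).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 3. Outlier handling: cap test values at the per-feature training maximum.
X_test = X_test.clip(upper=X_train.max(), axis=1)

# 4. Standardization: fit on the training set only, then apply to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```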
Screenshots and Links to Datasets
X-train
X-train for SVM
The shape of X-train is (1600, 44). Link to dataset
y-train
y-train for SVM
The shape of y-train is (1600,). Link to dataset
X-test
X-test for SVM
The shape of X-test is (400, 44). Link to dataset
y-test
y-test for SVM
The shape of y-test is (400,). Link to dataset
Code files
The code used for dataset preparation and training the Support Vector Machine models is available here
Results
We perform experiments with the Linear, Polynomial, and RBF kernels. We also test different values of C (0.1, 1, and 10) to see how they affect the model's performance. The decision boundary is visualized using PCA to reduce the dimensionality of the data to 2D.
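The experiments can be reproduced with a loop like the one below. It assumes the X_train, X_test, y_train, and y_test arrays from the preprocessing step; the decision-boundary plot refits the SVM on the 2-D PCA projection of the training data, which is one common way to produce such a visualization and may differ in detail from the project's code.

```python
# Sketch of the experiment loop: train SVMs with each kernel and C value,
# report metrics, and plot a 2-D PCA decision boundary.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix

kernels = ["linear", "rbf", "poly"]  # "poly" defaults to degree 3
C_values = [0.1, 1, 10]

for kernel in kernels:
    for C in C_values:
        model = SVC(kernel=kernel, C=C)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        print(f"\n{kernel} kernel, C={C}")
        print(classification_report(y_test, y_pred))
        print(confusion_matrix(y_test, y_pred))

        # Decision boundary in the 2-D PCA projection: refit the same SVM
        # on the reduced training data and predict over a grid.
        pca = PCA(n_components=2)
        X2 = pca.fit_transform(X_train)
        model2d = SVC(kernel=kernel, C=C).fit(X2, y_train)

        xx, yy = np.meshgrid(
            np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
            np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200))
        Z = model2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

        plt.contourf(xx, yy, Z, alpha=0.3)
        plt.scatter(X2[:, 0], X2[:, 1], c=y_train, s=10)
        plt.title(f"{kernel} kernel, C={C} (2-D PCA projection)")
        plt.show()
```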
Linear Kernel
C = 0.1
Classification Report for Linear Kernel SVM with C=0.1
Confusion Matrix for Linear Kernel SVM with C=0.1
Decision Boundary on 2-D PCA for Linear Kernel SVM with C=0.1
C = 1
Classification Report for Linear Kernel SVM with C=1
Confusion Matrix for Linear Kernel SVM with C=1
Decision Boundary on 2-D PCA for Linear Kernel SVM with C=1
C = 10
Classification Report for Linear Kernel SVM with C=10
Confusion Matrix for Linear Kernel SVM with C=10
Decision Boundary on 2-D PCA for Linear Kernel SVM with C=10
RBF Kernel
C = 0.1
Classification Report for RBF Kernel SVM with C=0.1
Confusion Matrix for RBF Kernel SVM with C=0.1
Decision Boundary on 2-D PCA for RBF Kernel SVM with C=0.1
C = 1
Classification Report for RBF Kernel SVM with C=1
Confusion Matrix for RBF Kernel SVM with C=1
Decision Boundary on 2-D PCA for RBF Kernel SVM with C=1
C = 10
Classification Report for RBF Kernel SVM with C=10
Confusion Matrix for RBF Kernel SVM with C=10
Decision Boundary on 2-D PCA for RBF Kernel SVM with C=10
Polynomial Kernel
C = 0.1
Classification Report for Polynomial Kernel SVM with C=0.1
Confusion Matrix for Polynomial Kernel SVM with C=0.1
Decision Boundary on 2-D PCA for Polynomial Kernel SVM with C=0.1
C = 1
Classification Report for Polynomial Kernel SVM with C=1
Confusion Matrix for Polynomial Kernel SVM with C=1
Decision Boundary on 2-D PCA for Polynomial Kernel SVM with C=1
C = 10
Classification Report for Polynomial Kernel SVM with C=10
Confusion Matrix for Polynomial Kernel SVM with C=10
Decision Boundary on 2-D PCA for Polynomial Kernel SVM with C=10
Comparison of Models
Comparison of SVM Models
For both the linear and polynomial kernels, accuracy increased as C increased. The RBF kernel, however, did not show significant improvement with higher C values: its accuracy plateaued after C = 1, indicating that the model was not learning effectively from the data.
Linear Kernel
The linear kernel showed steadily improving performance as the cost parameter increased. At C = 0.1, the model had moderate accuracy (0.71), but when increased to C = 1, the model improved significantly across all metrics, reaching 0.79 in accuracy, precision, recall, and F1-score. At C = 10, it performed the best among linear models, achieving an accuracy of 0.80 and an F1-score of 0.80, indicating strong generalization with minimal overfitting.
RBF Kernel
The RBF kernel underperformed relative to the other kernels, especially at C = 0.1, where it had an accuracy of 0.61. Even at C = 1 and C = 10, the performance plateaued around 0.65 accuracy and F1-score, showing that the model struggled to find effective nonlinear boundaries for the given data. This suggests that the RBF kernel may not be well-suited to this dataset or requires more extensive hyperparameter tuning (e.g., gamma).
Polynomial Kernel (Degree 3)
The polynomial kernel with degree 3 showed mixed results. At C = 0.1, it performed the worst across all models with an accuracy of 0.56 and F1-score of 0.55. However, as C increased, its performance improved significantly. At C = 1, it reached an accuracy of 0.73, and at C = 10, it tied with the best linear model with an accuracy and F1-score of 0.80. This indicates that higher capacity polynomial models can be effective with the right regularization.
Conclusion
The best performing model overall is the SVM with either a linear kernel or a polynomial kernel (degree 3) and C = 10, both achieving an accuracy of 80%. However, between the two, the linear kernel model is preferred due to its simpler structure, which generally leads to faster training, better interpretability, and lower risk of overfitting compared to higher-degree polynomial kernels.
