Tracking players and ball in football videos
Master's thesis in Data Science and AI
Hugo Ganelius, Jhanzaib Humayun
Department of Electrical Engineering
Chalmers University of Technology
Gothenburg, Sweden 2024

© Hugo Ganelius, Jhanzaib Humayun, 2024.
Supervisor: Lennart Svensson, Department of Electrical Engineering
Advisor: Anders Sjöberg, Fraunhofer Chalmers Centre
Examiner: Lennart Svensson, Department of Electrical Engineering
Master's Thesis 2024
Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Cover: Players and ball detected using YOLO.
Typeset in LaTeX
Gothenburg, Sweden 2024

Abstract

This master's thesis explores the challenges and methodologies of accurately tracking players and the ball within football video sequences, employing advanced object detection and tracking techniques. The research is integrated into a larger project aimed at reconstructing football sequences in a virtual reality (VR) environment, thus enhancing the training and strategic analysis capabilities in sports environments. The thesis primarily utilizes models based on convolutional neural networks, such as the YOLOv9 model for object detection, to identify the positions of the players and the ball within different frames. Advanced methods such as the online tracking algorithm BoT-SORT and the minimum cost flow algorithm are employed for detection-based tracking, optimizing the accuracy of continuous player and ball trajectories in the complex, fast-moving setting of a football game.
We also employ the TrackNet V3 model for predicting ball trajectories, as an alternative to detection-based tracking. TrackNet utilizes multiple sequential frames to predict the trajectory of the ball, enhancing detection accuracy even during fast play or when the ball is momentarily occluded.

Results from these experiments indicate promising player tracking, though challenges persist when it comes to, e.g., ID-switching. Ball tracking is shown to be more difficult due to the ball's small size and high movement speed, which often leads to reduced detection reliability. TrackNet successfully predicts the ball's trajectory when it is moving quickly and not occluded, but struggles when the ball is still or frequently occluded (such as during dribbling). The outcomes contribute towards the understanding of object tracking in sports and the development of applications for enhanced game analysis.

Keywords: computer vision, multiple object tracking, convolutional neural networks, deep neural networks, engineering, project, thesis.

Acknowledgements

We would like to extend our thanks to our academic supervisor and examiner Lennart Svensson, who first had the idea for this project and graciously chose us to be the ones to see it through. We would also like to thank Fraunhofer Chalmers Centre for providing us with a space to work and resources. In particular, we would like to thank Anders Sjöberg at FCC for acting as our advisor and providing us with invaluable ideas and input. Thanks also go out to the football club IFK Göteborg for providing video data, as well as meeting with us at several points in time to discuss what a professional club would want out of a project like this one. Lastly, we wish to thank the other participants of the Football in VR project: Filip Anjou, Albin Ekström, Joakim Osterman, and Olof Sjögren. Despite being deeply engaged in their own theses, they have generously shared ideas that have enriched our work.
Hugo Ganelius, Jhanzaib Humayun, Gothenburg, 2024-06-13

Contents

List of Figures
List of Tables

1 Introduction
1.1 Purpose of the Thesis
1.2 The football in VR project
1.3 Specific Challenges in Tracking for Football
1.4 Scope

2 Background
2.1 Object Detection
2.1.1 Convolutional Neural Networks
2.1.2 Methods in Object Detection
2.1.3 YOLO Series
2.2 Online vs offline tracking
2.3 Kalman Filter
2.3.1 Kalman Filter Equations
2.3.2 Application in Tracking
2.4 Appearance feature embedding
2.5 Foreground Extraction
2.6 Important metrics in multiple object tracking
2.6.1 Intersection over Union
2.7 Some Online Trackers
2.7.1 SORT
2.7.2 Deep-SORT
2.7.3 ByteTrack
2.7.4 BoT-SORT
2.8 Minimum cost flow
2.8.1 Graph Construction and Edge Definition
2.8.2 Edge Cost and Capacity
2.8.3 Pathfinding via Flow Optimization
2.9 TrackNet
2.10 Datasets

3 Methodology
3.1 Detection
3.2 Foreground Extraction
3.2.1 Removing Stationary Detections
3.3 Player tracking pipeline
3.4 Minimum Cost Flow implementation details
3.4.1 Graph Construction
3.4.2 Cost Calculation
3.4.3 Improbable Connections for the Players
3.4.4 Ball Tracking Using Minimum Cost Flow
3.4.5 Quadratic Interpolation for Ball Path Tracking
3.5 Tracking the ball using TrackNet
3.5.1 Adaption to football
3.5.2 Training the main tracking model
3.5.3 Training the rectification model

4 Results and discussion
4.1 Detection
4.1.1 Player detection evaluation
4.1.2 Ball detection evaluation
4.2 Results of tracking the players
4.2.1 Challenges in Automated ID Switch Correction
4.3 TrackNet training loss
4.4 Evaluation of ball methods on test set
4.4.1 Visual analysis of TrackNet result
5 Conclusion
5.1 Summary of Findings
5.2 Contributions to the Field

6 Future work
6.1 Correcting ID-switches for player Minimum Cost Flow
6.1.1 Jersey Colors and Deep Features Learned by Neural Networks
6.1.2 Jersey Numbers
6.1.3 Pose Estimation
6.1.4 Speed Analysis
6.2 Better tracking of ball
6.3 Predicting the ball's position in 3D space

Bibliography

List of Figures

1.1 Examples demonstrating why the ball is difficult to detect. Note in all of these the lack of distinct recognisable features.
2.1 Graph simplification where each node represents a detection. Nodes that are aligned horizontally are detections from the same frame. Edges represent possible transitions between detections, i.e. possible movement of objects. This graph demonstrates the flexibility in track management, including frame skipping.
2.2 Model architecture of TrackNet V3. The numbers at each level correspond to the number of channels for each layer on that level.
2.3 Example frames from the SoccerNet dataset with visible bounding boxes and object ID.
2.4 Example frames from our own annotated dataset. The point annotations are indicated with a small red circle.
3.1 Batches used for training the detection model, after data augmentation is applied.
3.2 Player Tracking Pipeline: This diagram illustrates the sequential processing steps involved in tracking players in a sports video.
The pipeline includes player detection and tracking using either BoT-SORT or the minimum cost flow algorithm.
3.3 Simple illustration of a minimum cost flow graph from two frames, each with three detections numbered 1 to 6. Each detection is split into "In" and "Out" nodes.
3.4 Ball Tracking Pipeline: This figure illustrates the sequence of processing steps involved in the ball tracking system, starting from the video stream and progressing through detection, filtering stationary detections, tracking, and interpolating the path for missing frames.
4.1 Training and validation loss per epoch for the player detection model.
4.2 Training and validation loss per epoch for the ball detection model.
4.3 Various metrics for the player model training and evaluation.
4.4 Various metrics for the ball model training and evaluation.
4.5 YOLOv9 detection boxes with confidence scores, identifying players' white shorts, and text in the background, as balls.
4.6 Foreground extraction. The leftmost image displays the original frame, the middle image illustrates the background after extraction, and the rightmost image shows the foreground with the moving objects clearly isolated.
4.7 Zoomed-in view of the ball with background masked.
4.8 Detections outputted by the YOLO model. Note the presence of background elements, which are candidates for removal via foreground extraction.
4.9 Comparison of MCF and BoT-SORT algorithms on various metrics, showing similar performance.
4.10 Loss for training data and validation data during six separate training runs using different sets of parameters.
The x-axis is expressed in epochs.
4.11 Loss for training data and validation data during three separate training runs, all using sequence length 30 but different initial learning rates. The initial learning rates are 5e-5 (purple), 5e-4 (orange), and 1e-3 (blue). The x-axis is expressed in epochs.
4.12 Loss for validation data during three separate training runs using an initial learning rate of 5e-4. The sequence lengths are 20 (green), 30 (orange), and 40 (pink).
4.13 Positions predicted by our TrackNet model overlaid on top of the original video. Videos are from the test set annotated by us.

List of Tables

4.1 Total number of predictions of each category across all eight test set sequences, using a distance threshold of 20 pixels.
4.2 Performance metrics comparison across eight annotated sequences using a distance threshold of 20 pixels. The shown values represent the mean, with the ± symbol indicating the standard deviation.
4.3 Total number of predictions with the wrong location across all eight test set sequences, using different distance thresholds of 20, 100, and 150 pixels.

1 Introduction

Football, also known as soccer in some regions of the world, is not just a sport but a global phenomenon that captures the hearts of millions. Its universal appeal and the dramatic moments it produces make it a significant cultural and social fixture. With its popularity there naturally arises a demand for high-quality analytical tools that can be used for analyzing play. The excitement and unpredictability of football make it an excellent candidate for advanced analysis.
This master's thesis is one component of a broader initiative consisting of three distinct thesis projects, each targeting a different aspect of the same overarching goal: the reconstruction of football sequences in a virtual reality environment. The initiative aims to leverage the latest in VR and deep learning technology to create a dynamic and interactive model of football games.

1.1 Purpose of the Thesis

The primary focus of this thesis is to explore the challenges of performing accurate MOT (Multiple Object Tracking) of football games. The goal is the development of robust detection and tracking systems for both the football and the players involved in the game, for the purpose of game sequence recreation and play analysis. We explore different solutions to find out their strengths and shortcomings, in an effort to find the best approach to solving the task.

1.2 The football in VR project

The project of which this thesis is a component seeks to offer a novel way to experience football, providing perspectives and insights that are not possible with traditional camera perspectives. The completed project would allow a user to stand and move around on a virtual field as an accurate replay of a certain game sequence plays out. This would allow coaches and players to achieve a more precise and complete understanding of the situation, such as distances between players and the ball. Furthermore, one could render the sequence from the point of view of a player, making it possible to understand exactly what that player was able to see based on their field of view. The project's three main parts are as follows:

1. Detection and Tracking (subject of this thesis): Developing the methods and technologies necessary to detect and continuously track the position and movement of the players and the ball.

2. Pose Estimation (subject of [1]): Once tracking is reliably handled, the next step is estimating the poses of the players.
This involves understanding the positions and movements of individual player limbs and their orientation during the game.

3. Virtual Reality Rendering (subject of [2]): The final part of the pipeline will take the tracking data and pose information to render these elements in a virtual reality environment, creating an immersive experience that can be used for analytical purposes.

1.3 Specific Challenges in Tracking for Football

Tracking football presents unique challenges due to the dynamic nature of the game and the environment in which it is played. Here are some aspects of football footage that make tracking difficult:

• Fast-Paced Action: Football is a fast-paced sport where players often cluster together, further complicating tracking efforts. Rapid changes in direction, speed, and group formations can confuse algorithms designed to follow individual movements.

• Occlusions: Players frequently block each other or the ball. Both are occasionally obscured by other elements on the field, such as the referee or goal posts. These occlusions can cause tracking algorithms to temporarily lose track of a player or the ball.

• Distance from Camera: The wide shots used to capture football matches mean that objects can appear very small in the frame. This distance reduces the resolution and detail available to detection and tracking models, making it more difficult to identify specific features such as numbers on jerseys. Further complicating matters, players of the same team wear identical jerseys, making it difficult to distinguish one player from another, especially from distant camera angles. In professional football, it is also common for players on the same team to be of the same gender, ethnicity, and physical build. This uniformity can lead to frequent ID switches in tracking systems.

When it comes to the ball, we do not need to concern ourselves with object ID, as this thesis only considers tracking a single ball.
Even if the methods were to be expanded such that they track multiple balls simultaneously, ID-switching is not a big issue for sports analytics, as it seldom matters which ball is which. Instead, the difficulty comes from actually detecting the ball in the first place.

The small resolution and lack of distinguishable features mean there is little for a detection model to latch on to, unlike players, where it can identify features such as arms and legs. Compression artifacts and motion blur further complicate the issue, as the ball does not have a consistent recognisable shape. Beyond this, there is often a lot of noise in the image, making it difficult even for humans to discern what is the ball and what is something else. Figure 1.1 contains some example images that demonstrate why a computer model might struggle to identify the ball in still frames.

Figure 1.1: Examples demonstrating why the ball is difficult to detect. Note in all of these the lack of distinct recognisable features. (a) Small resolution. (b) Compression artifacts. (c) High background noise; the ball is above the S at the bottom.

1.4 Scope

As this thesis is a part of a project with a specific goal, some demarcations are made in the types of solutions that are explored to better fit this goal. For the purpose of this thesis, we will only apply methods on footage using a fixed camera perspective. The end goal of this thesis is a program that can create three-dimensional recreations of football games. Dynamic camera calibration for translating from pixel coordinates to real-world coordinates is not in the scope of the project at large, and thus dynamic perspective changes such as tilt, pan, and zoom will not be handled in this thesis.

The scope also does not include the development of a real-time tracking system. Instead, the focus is on post-processed analysis, aiming to maximize accuracy and reliability in identifying and tracking players and the ball across pre-recorded footage.
2 Background

In the upcoming sections, we will delve into prior research and developments in the field of MOT. We will explore various tracking methodologies, explaining and distinguishing between them to provide a comprehensive understanding of the subject. Furthermore, we will detail the specific models employed in this thesis, shedding light on their unique characteristics and the fundamental principles they leverage. This will include an examination of the underlying concepts and techniques that these models utilize to achieve effective tracking performance.

2.1 Object Detection

Object detection is the task of classifying and estimating the location of objects within an image. This involves not only recognizing the types of objects present in an image but also pinpointing their specific positions and sizes, typically by drawing bounding boxes around each object. Object detection serves as a fundamental component in numerous applications including autonomous driving, video surveillance, and interactive robotic systems. Tracking systems based on the results of a detection model are known as detection-based trackers.

This section will detail some fundamental concepts in object detection and describe one of the state-of-the-art series of detection models: the YOLO series of general-purpose object detectors.

2.1.1 Convolutional Neural Networks

CNNs (Convolutional Neural Networks) are a class of deep neural networks highly effective in areas such as image recognition and classification. CNNs have become fundamental to the development of object detection models due to their ability to automatically detect and learn optimal features from visual data, without the need for manual feature extraction. The architecture of a CNN typically involves several layers that transform the input image to produce a desired output such as a class label or a set of bounding box coordinates.
The key layers used in a CNN include:

• Convolutional Layers: These layers apply a number of convolutional filters to the input to create feature maps that summarize the presence of detected features in the input. Each filter detects different aspects of the image, such as edges, textures, or more complex patterns in deeper layers.

• Activation Functions: Non-linear activation functions, such as the ReLU function [3], are applied after each convolution operation to introduce non-linear properties into the network, allowing it to learn more complex patterns.

• Pooling Layers: Pooling (usually max pooling) reduces the dimensionality of each feature map but retains the most important information. Pooling layers help to make the detection of features invariant to changes in scale and translation.

• Fully Connected Layers: Towards the end of the network, fully connected layers use the features extracted by the convolutional and pooling layers to perform various tasks. These could include classification, regression, or other types of predictions based on the training dataset.

• Normalization Layers: Batch normalization [4] is often applied to the inputs of each layer to stabilize learning. This technique often reduces the number of training epochs required to effectively train deep networks.

2.1.2 Methods in Object Detection

Object detection methods can be broadly categorized into two types: one-stage detectors and two-stage detectors. Two-stage detectors, such as R-CNN [5] and its variants (Fast R-CNN [6], Faster R-CNN [7]), first generate potential regions of interest and then classify and locate objects within those regions. In contrast, one-stage detectors like YOLO (You Only Look Once) [8] and SSD (Single Shot MultiBox Detector) [9] bypass the region proposal stage and predict object classes and locations directly from the image features in a single pass, achieving faster processing times at the cost of some accuracy.
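The convolution, activation, and pooling operations described in Section 2.1.1 can be illustrated with a minimal NumPy sketch; the filter values and the toy input below are illustrative, not taken from any trained model:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation) of a single-channel image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Element-wise non-linearity."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling, shrinking each spatial dimension."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A dark-to-bright vertical-edge filter applied to a toy 6x6 image.
image = np.zeros((6, 6))
image[:, 3:] = 1.0                       # right half bright, left half dark
edge_filter = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])
feature_map = relu(conv2d(image, edge_filter))  # 5x5 map, peaks along the edge
pooled = max_pool(feature_map)                  # 2x2 summary, edge info kept
```

Each filter slides over the image and responds where its pattern matches; pooling then discards exact position while keeping the strongest responses, which is what gives the features their translation invariance.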
2.1.3 YOLO Series

Among the one-stage methods, the YOLO series has been particularly influential due to its speed and efficiency [8]. YOLO frames object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Over several iterations, YOLO has been refined and improved. The latest in this series, YOLOv9, continues this evolution.

The backbone of YOLO is typically a CNN that is responsible for extracting features from the image. Initially, YOLO used a custom lightweight CNN, which was composed of convolutional layers with batch normalization and Leaky ReLU activations. As the versions progressed, the architecture incorporated more advanced backbones for improved feature extraction.

From YOLOv3 [10] onward, the architecture started using feature pyramids to improve detection at multiple scales. This approach uses pathways to bring high-resolution features from earlier layers together with lower-resolution features from deeper layers, enhancing the network's ability to detect objects at different sizes.

At the time of writing, YOLOv9 [11] is the latest in the YOLO series. It introduces several novel concepts in the domain of deep learning and object detection, emphasizing the use of PGI (Programmable Gradient Information) and a new network architecture called GELAN (Generalized Efficient Layer Aggregation Network). The innovations presented in YOLOv9 aim to address the critical issues of information loss during deep network computations and to enhance the learning efficacy of object detection systems.

The concept of PGI is central to YOLOv9. It provides a mechanism to control gradient propagation through the network, ensuring that gradients are more representative of the underlying data features needed for accurate predictions. In traditional models, as data passes through multiple layers, some information is invariably lost.
This loss can degrade the quality of the gradients, leading to poor training outcomes. PGI addresses this by ensuring that the gradient flow maintains high fidelity to the original data features necessary for accurate predictions. PGI is designed to work across various model sizes, from lightweight to larger models. It does this by adjusting the gradient flow based on the model's architecture, making it a versatile approach in different deployment scenarios.

GELAN is a new lightweight network architecture that leverages gradient path planning to optimize the network's learning capacity and efficiency. This architecture is designed to minimize parameter count while maximizing inference speed and accuracy, making it suitable for real-time applications.

YOLOv9 represents a significant advancement in the field of object detection, addressing several limitations of previous YOLO versions and other contemporary detection systems. Its innovative approach to managing gradient information and network architecture allows for highly efficient and accurate object detection, suitable for various applications including autonomous driving, surveillance, and real-time video analysis.

2.2 Online vs offline tracking

Algorithms for MOT can be categorized into two main types: online and offline algorithms. These two types differ in the kind of data they use and the situations in which they can be applied [12].

Online tracking refers to trackers that only rely on current and previous information when processing a frame in a video. In other words, online trackers do not make use of information from future frames and thus can run in real time. This real-time capability makes online tracking algorithms highly valuable in applications where immediate feedback is crucial, such as in live surveillance, autonomous driving, and interactive augmented reality systems.
Since they run in real time, these trackers are often designed to be lightweight and efficient, balancing the need for speed with the requirement for precision.

Offline tracking algorithms, in stark contrast to online algorithms, utilize information from future frames when estimating object trajectories. This approach allows for more accurate and coherent tracking because it takes into account the entire sequence of frames rather than making decisions based solely on the current and past frames. This foresight helps in resolving ambiguities that might arise in real-time tracking scenarios and can lead to significantly improved tracking performance, especially in complex and dynamic environments. The trade-off is that these algorithms are often more computationally expensive and less viable to run live.

2.3 Kalman Filter

The Kalman Filter [13] is a powerful predictive and corrective mathematical algorithm used for estimating the state of linear dynamic systems from a series of incomplete and noisy measurements. It is extensively used in the fields of automated control systems, robotics, and aerospace, and, what is relevant for our purpose, in MOT algorithms. The Kalman Filter operates in a two-step process, consisting of the predict step and the update step:

• Predict: The filter predicts the current state and covariance estimates to advance the state ahead temporally. This prediction is based on the system's previous state and the known control inputs affecting the system.

• Update: When new measurements are available, the filter updates its estimates to correct the state vector and covariance estimates. This correction is based on the discrepancy (residual) between the predicted estimates and the actual measurements obtained at that step.

2.3.1 Kalman Filter Equations

This section outlines the key equations of the Kalman Filter along with a detailed description of the matrices and variables involved:

1.
State Prediction:
\[ \hat{x}_{k|k-1} = F_k \hat{x}_{k-1|k-1} + B_k u_k, \]
where:
• \(\hat{x}_{k|k-1}\) is the predicted state estimate,
• \(F_k\) is the state transition model applied to the previous state \(\hat{x}_{k-1|k-1}\),
• \(B_k\) is the control-input model that scales the control vector \(u_k\),
• \(u_k\) is the control vector, representing external inputs to the system.

2. Covariance Prediction:
\[ P_{k|k-1} = F_k P_{k-1|k-1} F_k^{T} + Q_k, \]
where:
• \(P_{k|k-1}\) is the predicted covariance estimate of the state,
• \(P_{k-1|k-1}\) is the covariance of the previous state estimate,
• \(F_k^{T}\) is the transpose of the state transition model,
• \(Q_k\) is the process noise covariance matrix.

3. Kalman Gain Calculation:
\[ K_k = P_{k|k-1} H_k^{T} \left( H_k P_{k|k-1} H_k^{T} + R_k \right)^{-1}, \]
where:
• \(K_k\) is the Kalman Gain, a factor that reflects the importance of the incoming measurement compared to the current estimate,
• \(H_k\) is the measurement model that relates the state estimate to the measured data,
• \(R_k\) is the measurement noise covariance, representing the expected variability in the measurements.

4. State Update:
\[ \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left( z_k - H_k \hat{x}_{k|k-1} \right), \]
where:
• \(\hat{x}_{k|k}\) is the updated estimate of the state after incorporating the measurement,
• \(z_k\) is the actual measurement observed at time \(k\).

5. Covariance Update:
\[ P_{k|k} = \left( I - K_k H_k \right) P_{k|k-1}, \]
where:
• \(P_{k|k}\) is the updated state covariance after the measurement has been incorporated,
• \(I\) is the identity matrix, used in updating the covariance matrix.

2.3.2 Application in Tracking

In many tracking applications, the Kalman Filter estimates the state of moving objects, considering their position and velocity. In online tracking algorithms such as BoT-SORT, a Kalman filter is used to predict the next bounding box of a previously tracked object [14]. This track is then associated with a detected bounding box through metrics such as IoU (Intersection over Union, see section 2.6.1). More detail about BoT-SORT is found in section 2.7.4.
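The predict and update equations above can be exercised with a minimal NumPy sketch for a one-dimensional constant-velocity model; the noise covariances and the measurement sequence below are illustrative values, not parameters from any tracker in this thesis:

```python
import numpy as np

# Constant-velocity model: state x = [position, velocity].
F = np.array([[1.0, 1.0],     # position += velocity each time step
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])    # we only measure position
Q = np.eye(2) * 1e-3          # process noise covariance
R = np.array([[0.5]])         # measurement noise covariance

def predict(x, P):
    """State and covariance prediction (no control input, so B u = 0)."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the predicted state with measurement z."""
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ (z - H @ x)              # state update
    P = (np.eye(2) - K @ H) @ P          # covariance update
    return x, P

x = np.array([[0.0], [0.0]])   # initial state: at origin, at rest
P = np.eye(2)                  # initial uncertainty
for z in [1.0, 2.1, 2.9, 4.2]:           # noisy positions, ~1 unit per frame
    x, P = predict(x, P)
    x, P = update(x, P, np.array([[z]]))
# x[1] now estimates the velocity, close to 1 unit per frame
```

In a tracker like BoT-SORT the state vector instead holds bounding-box coordinates and their rates of change, but the predict/update cycle is exactly this one.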
2.4 Appearance feature embedding

MOT involves detecting multiple objects within a video sequence and maintaining their identities over time. One of the critical components in MOT is the effective representation and matching of an object's visual features, which enables accurate data association between different frames. This section discusses the concepts of appearance feature embedding and matching, which are essential for robust MOT systems.

Feature embedding refers to the process of transforming raw data, such as images or object detections, into a lower-dimensional space where the characteristics of the data are preserved [15]. This transformation is typically achieved using deep neural networks, which learn to encode relevant information into a compact vector representation. The embedded features should ideally capture the unique attributes of objects, such as shape, color, and texture, while being invariant to changes in scale, illumination, and viewpoint.

In the context of MOT, appearance embeddings are often extracted from detected bounding boxes using convolutional neural networks. These embeddings are then used as a metric for associating the same object across different frames based on appearance. The use of appearance embeddings for MOT can be attributed to Deep-SORT [16]. More information about Deep-SORT and other tracking algorithms can be found in section 2.7.

2.5 Foreground Extraction

Foreground extraction is a key process in computer vision that isolates dynamic elements, such as players and the ball, from static backgrounds in video sequences. A prevalent approach to achieving accurate foreground extraction is the use of a median frame as the background model. The median frame is derived by computing the median value of each pixel across a set of frames. This technique is effective because it minimizes the impact of transient elements that might appear in only a few frames, thus creating a stable representation of the static background.
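The median-frame idea can be sketched in a few lines of NumPy; the toy grayscale frames and the difference threshold here are illustrative, not the values used in this thesis:

```python
import numpy as np

def median_background(frames):
    """Estimate the static background as the per-pixel median over frames."""
    return np.median(np.stack(frames), axis=0)

def foreground_mask(frame, background, threshold=30.0):
    """Mark pixels whose intensity differs notably from the background."""
    return np.abs(frame.astype(float) - background) > threshold

# Toy grayscale sequence: a static background (value 100) with a bright
# "ball" (value 255) at a different position in each frame.
frames = []
for t in range(5):
    f = np.full((8, 8), 100.0)
    f[2, t] = 255.0            # moving object occupies each pixel only once
    frames.append(f)

bg = median_background(frames)          # the transient object is suppressed
mask = foreground_mask(frames[0], bg)   # only the ball pixel survives
```

Because the moving object occupies any given pixel in only a minority of frames, the per-pixel median recovers the background value there, which is exactly why transient elements are filtered out.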
The subsequent extraction of the foreground involves comparing the current frame of the video to this median background, highlighting areas with significant pixel intensity changes. These differences typically indicate motion, and thus, the presence of foreground elements.

This foreground extraction method is effective primarily under specific conditions. It assumes a stationary camera and a relatively static background, as the median frame relies on consistent background content. However, this method has limitations and may not perform well in dynamic environments with moving shadows, fluctuating lighting, or changing background elements. Camera movement or shake can also disrupt the alignment between the median background frame and current frames, resulting in inaccurate foreground masks and detections.

Prerequisites:
• Stationary camera
• Relatively static background
• Minimal background changes (e.g., lighting, shadows)

2.6 Important metrics in multiple object tracking

This section outlines a comprehensive set of metrics used in data analysis, computer vision, and specifically in machine learning model evaluation for object detection. These metrics assess various aspects of model performance, from general accuracy to specific losses associated with object detection tasks.
• Accuracy: Measures the overall correctness of the model, calculated as the ratio of correctly predicted observations to the total observations:
Accuracy = (Number of Correct Predictions) / (Total Number of Predictions).
• Precision: Reflects the ratio of correctly predicted positive observations to the total predicted positives:
Precision = True Positives / (True Positives + False Positives).
• Recall (Sensitivity): Measures the ratio of correctly predicted positive observations to all actual positives:
Recall = True Positives / (True Positives + False Negatives).
• F1 Score: The harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall).
• Specificity: Measures the ratio of correctly predicted negative observations to all actual negatives:
Specificity = True Negatives / (True Negatives + False Positives).
• MAE (Mean Absolute Error) and MSE (Mean Squared Error): Measure the accuracy of predictions, with MAE averaging absolute differences and MSE squaring differences before averaging:
MAE = (1/n) ∑_{i=1}^{n} |y_i − ŷ_i|, (2.1)
MSE = (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)², (2.2)
where n is the number of observations, y_i represents the actual values of the observations, and ŷ_i represents the predicted values.
• Box Loss: Reflects the model's accuracy in predicting bounding boxes around detected objects. It measures the discrepancy between the predicted and actual bounding box coordinates. Lower values indicate higher precision in locating objects.
• Classification Loss: Gauges the model's capability to correctly classify objects within the detected bounding boxes. It evaluates the error in classification predictions, with lower scores denoting better classification performance.
• DFL (Distribution Focal Loss): A loss function used in object detection that focuses on improving the accuracy of bounding box predictions by modeling the distribution of box offsets instead of predicting values directly. This method enhances the model's precision in localizing objects.
• WBCE (Weighted Binary Cross Entropy) Loss: A loss function used to handle imbalanced classes in binary classification tasks. It assigns different weights to the positive and negative classes, which helps in emphasizing the minority class and improving the overall model performance. The WBCE can be calculated as:
WBCE = −(1/N) ∑_{i=1}^{N} [w⁺ y_i log(ŷ_i) + w⁻ (1 − y_i) log(1 − ŷ_i)],
where N is the total number of observations, y_i is the actual class label, ŷ_i is the predicted probability, w⁺ is the weight for the positive class, and w⁻ is the weight for the negative class.
• Mean Average Precision (mAP): A metric used to evaluate the accuracy of object detection models in computer vision. It calculates the average precision across all classes, considering the precision and recall of the model's predictions, providing a comprehensive measure of the model's overall effectiveness in detecting various objects.
– mAP50 calculates the mean average precision at 50% Intersection over Union (IoU), providing a benchmark for model accuracy in object localization at a moderate threshold.
– mAP50-95 extends this evaluation across a range of IoU thresholds from 50% to 95%, averaging the precision values at these levels to offer a comprehensive view of the model's performance across varying degrees of localization strictness.
• MOTA (Multiple Object Tracking Accuracy): Measures the accuracy of a tracking system by accounting for false positives, false negatives, and identity switches. It is defined as:
MOTA = 1 − (∑_t (FP_t + FN_t + IDSW_t)) / (∑_t GT_t),
where FP_t is the number of false positives, FN_t is the number of false negatives, IDSW_t is the number of identity switches, and GT_t is the number of ground truth objects at time t [17].
• MOTP (Multiple Object Tracking Precision): Measures the precision of the tracking system by evaluating the average dissimilarity between the predicted and the ground truth object positions. It is defined as:
MOTP = (∑_{i=1}^{N} ∑_{t=1}^{T} d_{i,t}) / (∑_t c_t),
where d_{i,t} is the distance between the detected and ground truth position of object i at time t, N is the number of objects, T is the total number of frames, and c_t is the number of matches found in frame t [17].

These metrics collectively form a robust framework for evaluating and monitoring model performance, crucial for the effective deployment and optimization of computer vision technologies in various applications. Another metric worth mentioning, while not used by us in this thesis, is the one proposed by [18].
It is a metric of similarity between sets of trajectories, designed for the evaluation of MOT algorithms. The proposed metric is designed to be a true metric space function: it is non-negative, satisfies the identity of indiscernibles, is symmetric, and obeys the triangle inequality. The metric is formulated through solving a multi-dimensional assignment problem, which cannot be solved in polynomial time and is not feasible when the sets contain a large number of trajectories. The paper details a lower bound for the metric that can be computed in polynomial time using linear programming.

2.6.1 Intersection over Union

IoU (Intersection over Union) is a crucial metric in the field of computer vision, particularly for tasks such as object detection and segmentation. IoU measures the overlap between two bounding boxes, such as one predicted by a tracking algorithm and one detected by a detection model. Mathematically, IoU is defined as the area of overlap between the two bounding boxes divided by the area of their union:

IoU = Area of Overlap / Area of Union.

The IoU value ranges from 0 to 1, where 0 indicates no overlap and 1 signifies perfect alignment. A higher IoU score generally indicates a more accurate prediction.

In the context of MOT, IoU plays a pivotal role as a metric for data association as well as in assessing the performance of tracking algorithms. MOT involves detecting objects in video frames and maintaining their identities across frames, which requires robust and accurate tracking mechanisms. IoU is used in several key aspects of MOT:
• Tracking through Association: In each frame of a video, objects are detected, and IoU is used to associate these detections and decide if they belong to the same track. For online tracking, the IoU is usually calculated between a prediction based on an existing track and a new detection that may or may not belong to this track.
A high overlap means the detection is more likely to belong to the tracked object. For offline tracking, IoU is usually computed between detections in different frames.
• Performance Evaluation: IoU serves as a critical metric for evaluating the performance of MOT systems. Metrics such as MOTA and MOTP [17] often incorporate IoU to measure how well the tracking algorithm maintains accurate and consistent trajectories of objects over time. High IoU scores indicate better tracking performance, as they reflect more accurate localization of objects.

Overall, IoU is a fundamental metric in multiple object tracking, enabling precise object localization, effective identity association, and comprehensive performance evaluation. Its application ensures that tracking algorithms can reliably follow multiple objects across complex and dynamic video scenes, thereby enhancing the robustness and accuracy of MOT systems.

2.7 Some Online Trackers

Online trackers process each frame sequentially and make immediate decisions about object identities without access to future frames. This section introduces and discusses several prominent online tracking algorithms that have made significant contributions to the field.

2.7.1 SORT

SORT (Simple Online and Realtime Tracking) [19] is a minimalist approach to online tracking which primarily uses the Hungarian algorithm [20] for frame-to-frame association. This method employs a simple linear constant velocity model that predicts the motion of each tracked object between frames and matches these predictions with incoming detections. Detections are assigned to existing tracks based on the IoU metric, which measures the overlap between predicted bounding boxes and new detections. SORT is designed to be lightweight and efficient, but it struggles with long-term occlusions and identity switches, primarily because it does not incorporate appearance information to aid in re-identification of objects.
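As a concrete illustration of the association step, the IoU between a predicted box and a detection can be computed as below (boxes in (x1, y1, x2, y2) corner format); in SORT the resulting matrix of pairwise IoU values is then fed to the Hungarian algorithm. This is a sketch, not a reference implementation:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero so disjoint boxes give zero overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Using 1 − IoU as an assignment cost turns frame-to-frame association into the linear assignment problem that the Hungarian algorithm solves.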
2.7.2 Deep-SORT

Deep-SORT [16] enhances the SORT algorithm by integrating appearance information, significantly improving tracking performance, especially in scenarios where objects are occluded or interact closely. It extends the standard SORT by employing a deep learning-based feature extractor, which generates a unique appearance descriptor for each detected object. These descriptors are used to compute an appearance (cosine) distance for matching detections to existing tracks, combined with a Mahalanobis distance based on the motion model and the IoU strategy used in SORT.

The Deep-SORT algorithm also incorporates a more sophisticated motion model using a Kalman filter, which helps in predicting object positions more accurately from one frame to the next. The addition of an age parameter and a hit counter for each track allows Deep-SORT to handle temporary occlusions by retaining identity even when objects disappear for short periods. These enhancements make Deep-SORT more robust against the challenges of tracking objects in crowded scenes where frequent interactions and occlusions occur.

2.7.3 ByteTrack

ByteTrack [21] stands out as a significant advancement in online tracking algorithms. It improves on previous tracking algorithms primarily by utilizing low-confidence detections in its tracking, unlike previous trackers that simply discarded them. It first tries to find a suitable detection for each tracklet among detections that reach a certain confidence threshold. If no match is found, it then tries to find a suitable detection for that tracklet among the detections that did not reach the threshold.

2.7.4 BoT-SORT

BoT-SORT [14] further builds upon the foundation of ByteTrack, enhancing its capabilities through several significant innovations. These improvements are designed to tackle some of the inherent limitations of previous "SORT-like" algorithms and to integrate the advantages of both motion and appearance information effectively.
Here are the primary ways in which BoT-SORT advances over ByteTrack:
• Camera Motion Compensation (CMC): Integrates camera motion estimation and correction to align predicted bounding boxes with actual detections accurately. This reduces identity switches and false negatives caused by camera movement.
• Enhanced Kalman Filter State Vector: Modifies the Kalman filter's state vector to estimate the width and height of bounding boxes directly, rather than the aspect ratio. This leads to more accurate bounding box predictions, especially in scenes with varying perspectives.
• Robust Association with IoU and Re-ID Fusion: Employs a fusion of IoU and Re-Identification (Re-ID) features to associate detections to tracks robustly. This method uses deep appearance features and a dual-thresholding strategy to ensure only reliable associations are maintained.
• Superior Performance: Demonstrates significant improvements over existing methods in key MOT metrics (MOTA [17], IDF1 [22], and HOTA [23]) on the MOT17 [24] and MOT20 [25] datasets, indicating better tracking accuracy and identity maintenance.

These enhancements collectively contribute to BoT-SORT's high performance, making it suitable for real-time tracking in dynamic and crowded environments.

2.8 Minimum cost flow

One effective method for offline tracking is to formulate the problem as an optimal path graph problem. One such formulation of object tracking as a graph problem is known as MOT based on MCF (Minimum Cost Flow) [26]. The core of this method lies in establishing a comprehensive graph where each node represents a detection and each edge signifies a possible transition between detections across consecutive frames. The task of the algorithm is to find the optimal paths through this graph, where each path corresponds to the trajectory of an object over time.
This allows the algorithm to leverage global information and make more informed decisions about the continuity and identity of each object. The cost assigned to each edge reflects the likelihood or confidence that the transition between these detections is valid, encompassing criteria such as spatial proximity, motion consistency, and appearance similarity. To construct trajectories, the algorithm traces the path of the flow from its start to its end across the network. By following the minimum-cost flow, we can link detections from frame to frame, thereby mapping out the trajectory of each object throughout the sequence.

2.8.1 Graph Construction and Edge Definition

In MOT using the MCF algorithm, a directed graph is typically constructed where each detection from the video frames is represented as a node. Special nodes, namely the source and sink, are introduced to manage the initiation and termination of object tracks:
• Source Node: Acts as the entry point for potential tracks.
• Sink Node: Serves as the exit point where tracks are terminated.

Figure 2.1 shows a graph structure with five detections across three frames. It illustrates connections that allow track initiation or termination at any time step, as well as "skip" connections that allow tracks to travel across multiple frames at once in case of missing detections.

2.8.2 Edge Cost and Capacity

Edges within the graph are specifically defined by their costs and capacities to optimize tracking:
• Edge Cost: The cost of travelling between two nodes is calculated as a weighted sum of costs for certain characteristics. These are usually characteristics such as spatial proximity, appearance similarity, the number of frames being skipped, and other relevant metrics, influencing the path selection by making certain transitions more favorable. There are also separate costs for creating and terminating tracks.
• Edge Capacity: When minimum cost flow is used for MOT, capacities are typically set to 1 between detection nodes to ensure that each detection is used exactly once per track. The capacity on the edge connecting the source to the sink reflects the expected maximum number of concurrent tracks.

Figure 2.1: Graph simplification where each node represents a detection. Nodes that are aligned horizontally are detections from the same frame. Edges represent possible transitions between detections, i.e. possible movement of objects. This graph demonstrates the flexibility in track management, including frame skipping.

2.8.3 Pathfinding via Flow Optimization

MCF is an optimization problem; this section lists the objective function and the constraints for the problem. For this section we use the following definitions.
• V denotes the set of all nodes in the graph, where in our case each node represents an object detection.
• s and t denote the source and sink nodes, respectively.
• E denotes the set of all edges in the graph, where each edge connects a pair of nodes, indicating potential transitions from one detection to another or from/to special nodes like the source and sink.
• f_{ij} represents the flow on the edge from node i to node j, indicating how many tracks (or parts of tracks) pass through this edge.
• c_{ij} represents the cost associated with the edge from node i to node j.

Objective Function to Minimize

We wish to minimize the sum of the costs of all flows through the graph, expressed by

∑_{(i,j)∈E} f_{ij} c_{ij}. (2.3)

Flow Conservation

The total flow entering any node i (except for the source s and sink t) must equal the total flow exiting node i. This is crucial for maintaining consistent object tracks without discontinuities. The equation for this constraint is

∑_{j∈V} f_{ij} − ∑_{j∈V} f_{ji} = 0 for all i ∈ V \ {s, t}. (2.4)

Non-Negative Flow

The flow on any edge cannot be negative, reflecting the physical interpretation of flow representing the number of tracks. This is expressed as

f_{ij} ≥ 0 for all (i, j) ∈ E. (2.5)

Flow Equality at Source and Sink

The total flow entering the network from the source must equal the total flow exiting the network to the sink. This guarantees that all initiated tracks are properly terminated. The equation for this constraint is

∑_{i∈V\{s}} f_{si} = ∑_{i∈V\{t}} f_{it} = Total number of tracks. (2.6)

In summary, this optimization formulation encapsulates the logic needed to connect detections across frames in a way that respects the flow of objects through space and time, aiming to construct the most probable tracks at the minimal possible cost. This setup is fundamental for algorithms designed to handle complex object tracking scenarios, such as in surveillance or sports analytics, where accuracy and computational efficiency are crucial.

2.9 TrackNet

TrackNet [27] is a tracking model designed for tracking the ball in sports video, specifically tennis and badminton. Rather than relying on object detection to predict bounding boxes which are then connected into tracks, the TrackNet architecture is designed to directly predict the trajectory of the ball from the video data. The TrackNet model uses a neural network whose input is a sequence of subsequent RGB frames, and whose output is a sequence of heatmaps of the same resolution as the input, representing probable ball locations.

The model follows a U-Net architecture [28], with the input first passing through a fully convolutional network extracting features from the sequence, followed by a deconvolutional network constructing the heatmaps. The U-Net model is characterized by its encoder-decoder structure, with skip connections that help preserve spatial hierarchies between input and output.
This design enables TrackNet to capture both high-level features and fine details from the video frames, which is essential for accurate localization of fast-moving objects across frames. The integration of convolutional and deconvolutional layers allows the model to process and reconstruct image data effectively. The convolutional layers act as feature extractors, analyzing the frames to detect patterns and features that signify the presence and movement of the ball. Conversely, the deconvolutional layers project these lower-dimensional feature representations back into the original image space, creating a detailed and accurate heatmap.

The fact that the model considers a sequence of images allows it to intrinsically identify movement. This ability is crucial in dynamic sports like tennis and badminton, as well as football, where the ball is often difficult to make out in single frames due to its small size, lack of distinguishable features, and motion blur.

TrackNet V3 [29], designed solely for badminton, builds upon TrackNet by concatenating the match background to the input of the TrackNet model, allowing it to more easily identify non-static objects. The background is calculated as the mean RGB values for each pixel across the match/video sequence. This introduces the requirement that the camera must be fixed, as changing the perspective makes it impossible to calculate and use a background frame. The architecture of the TrackNet V3 model is visualised in figure 2.2.

TrackNet V3 also improves performance by adding a separate rectification model that learns to correct the trajectory of the main tracking model in cases where the original trajectory estimation fails, such as when the shuttlecock is occluded. This separate model is named InpaintNet, and is inspired by inpainting techniques used by generative image models. The architecture of the InpaintNet model follows a U-Net architecture, similarly to the main tracking model.
The input of the model is the trajectory produced by the main model, while the output is the rectified trajectory. In the training phase, the primary tracking module, TrackNet, employs weighted binary cross entropy loss to optimize performance. Meanwhile, the rectification module, InpaintNet, utilizes mean squared error loss to refine its accuracy.

Figure 2.2: Model architecture of TrackNet V3. The numbers at each level correspond to the number of channels for each layer on that level.

2.10 Datasets

SoccerNet [30] has multiple datasets publicly available for different prediction tasks related to football. Their data for tracking is made from broadcast footage (non-static camera), and contains bounding box annotations with object id. The annotations are of eight classes, these being 'player team left', 'player team right', 'goalkeeper team left', 'goalkeeper team right', 'main referee', 'side referee', 'staff', and 'ball'. The dataset comprises one hundred 30-second clips, each shot at 25 frames per second, resulting in a total of 75,000 frames across various matches. Some images from this dataset are shown in figure 2.3.

We also annotated our own dataset for the ball specifically. The video data for this set was provided by the football club IFK Göteborg. Unlike the SoccerNet data, our dataset uses a fixed camera perspective for each sequence. Rather than bounding boxes, the annotations in our set are simple point annotations for the center of the ball in each frame. There is also a binary indicator showing whether or not the ball is visible or hidden due to e.g. occlusion or being out of frame.

Our training set consists of video sequences from 7 separate professional matches at 1080p, 25 fps. In total the training set includes 19,483 frames, roughly 13 minutes. We also have a test set consisting of eight shorter sequences, totaling 4,050 frames or 2.7 minutes. Examples of frames from our dataset can be seen in figure 2.4.
Figure 2.3: Example frames from the SoccerNet dataset with visible bounding boxes and object ID.

Figure 2.4: Example frames from our own annotated dataset. The point annotations are indicated with a small red circle.

3 Methodology

This chapter delineates the various methodologies explored to facilitate effective player and ball tracking in a football context. Different approaches have been tested for player and ball tracking, to address the unique difficulties of the two tasks.

3.1 Detection

We fine-tuned YOLOv9c, a variant of YOLOv9 with 25.5 million parameters, on the SoccerNet [30] dataset using the Ultralytics python package [31]. Of the dataset's 100 clips, 10 were reserved for the validation dataset and another 10 were used for the test dataset.

Two separate models were trained: one for detecting the ball and another for detecting the players. The models were trained for 30 epochs using default parameters provided by the Ultralytics library [31]. Multiple data augmentations were applied to the dataset, such as changes in rotation, scale, translation, and color values. Figure 3.1a shows a batch used for training the ball detection model after data augmentation, and figure 3.1b shows the same for training the player detection model.

3.2 Foreground Extraction

Implementing the foreground extraction method involved several specific steps aligned with the theoretical principles discussed earlier:
• Median Frame Calculation: We selected multiple frames at random intervals from the video sequence to form a diverse sample. The median of these frames was computed pixel-by-pixel to establish a robust background model.
• Grayscale Conversion: Both the median background frame and each current frame were converted to grayscale to simplify the process of detecting changes due to motion.
• Frame Differencing: The absolute difference between the grayscale median background and the current frames was calculated, identifying significant changes in pixel intensity.
• Threshold Application: A predefined threshold of 15 was applied to the difference image. Pixels exceeding this threshold were turned white, while those below it were turned black, with 0 representing black and 255 representing white.
• Morphological Refinement: To improve the quality of the foreground mask, morphological operations such as dilation and erosion [32] were applied, which helped eliminate noise and close gaps in detected foreground objects.

Figure 3.1: Batches used for training the detection model, after data augmentation is applied. (a) A batch for ball detection. (b) A batch for player detection.

This practical application of the median frame approach enables the effective isolation and tracking of foreground objects, crucial for tasks such as ball tracking in sports videos.

3.2.1 Removing Stationary Detections

To further refine the detections, stationary objects are removed using a calculated foreground percentage within each bounding box. The process involves the following steps:
Algorithm 1: Video Detection Filtering Algorithm
Result: Filtered detection results updated in CSV file
Initialization: Read video and corresponding detections from a CSV file;
foreach frame in video do
    Generate Foreground Mask: Generate a foreground mask using the median background frame and the current frame;
    Foreground Analysis and Detection Filtering:
    foreach detection in frame do
        Extract and Analyze: Extract bounding box coordinates and dimensions;
        Crop the foreground mask to the area defined by the bounding box;
        Count the number of white pixels (foreground) in the cropped area;
        Calculate the percentage of foreground pixels over total pixels in the bounding box;
        Apply Threshold Filter:
        if foreground percentage < threshold then
            Remove the detection from the CSV file;
        end
    end
end

By applying this method, stationary objects that do not contribute to the dynamic aspects of the game are effectively filtered out, improving the accuracy and relevance of the detection results.

3.3 Player tracking pipeline

For player tracking we have tried two detection-based algorithms, BoT-SORT and MCF, both using detections from our fine-tuned YOLOv9 model. The pipelines for player tracking are depicted in figure 3.2.

BoT-SORT, which is incorporated from the BoxMOT library [33], uses several parameters that we have configured. These include:
• New Track Threshold: Set to 0.7,
• High Threshold: Set to 0.4,
• Low Threshold: Set to 0.1,
• Track Life: Defined as 4 seconds.

Detections with a confidence level below 0.1 are discarded, while those above 0.4 are considered reliable. Detections with confidence levels between these thresholds are classified as uncertain, and are only used to refine the tracks.
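The per-detection filtering step of Algorithm 1 can be sketched in NumPy as below; the mask format (uint8 with 255 for foreground), the box format, and the threshold value are illustrative assumptions, and the helper names are hypothetical:

```python
import numpy as np

def foreground_fraction(mask, box):
    """Fraction of foreground (white) pixels inside a bounding box.

    mask: H x W uint8 foreground mask (255 = foreground, 0 = background).
    box:  (x, y, w, h) bounding box in pixel coordinates.
    """
    x, y, w, h = box
    crop = mask[y:y + h, x:x + w]
    if crop.size == 0:
        return 0.0
    return float((crop == 255).sum()) / crop.size

def filter_stationary(detections, mask, threshold=0.2):
    """Keep only detections whose bounding box contains enough motion.

    detections: list of (x, y, w, h) boxes; the threshold is illustrative.
    """
    return [box for box in detections
            if foreground_fraction(mask, box) >= threshold]
```

A box lying entirely on static background yields a foreground fraction near zero and is discarded, while a box around a moving player or ball retains a high fraction.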
Figure 3.2: Player Tracking Pipeline: This diagram illustrates the sequential processing steps involved in tracking players in a sports video. The pipeline includes player detection, followed by tracking using either BoT-SORT or the minimum cost flow algorithm.

The MCF algorithm used is implemented from scratch by us; the details of this implementation can be found in section 3.4.

3.4 Minimum Cost Flow implementation details

The code for this implementation is written in Python. The graph is constructed using the library NetworkX [34] and the MCF problem is solved using OR-Tools [35]. This algorithm optimizes the flow through the graph to find the most feasible set of trajectories for the tracked objects. This section explains the implementation details of this method.

3.4.1 Graph Construction

A directed graph is constructed using the player and ball detections from the input data. The nodes and edges in the graph represent the states and transitions of the tracked objects across frames.
• Nodes:
– Each detection is represented by two nodes: an 'in' node and an 'out' node.
– A 'source' node and a 'sink' node are added to the graph to handle the start and end of trajectories.
• Edges:
– An edge from 'source' to each detection's 'in' node with a birth cost.
– An edge from each detection's 'out' node to 'sink' with a death cost.
– An edge connecting each detection's 'in' node to its corresponding 'out' node with a confidence cost.
– Edges connecting 'out' nodes of detections in one frame to 'in' nodes of detections in subsequent frames, representing possible transitions.
The weight of these edges is calculated based on a custom cost function considering factors like IoU, center distance, and frame differences; the details of this cost function are presented in section 3.4.2.
– An edge from source to sink, representing not creating a new track.

Edges are limited to only being used by one object track. By representing each detection using two nodes and a connecting edge, the nodes also become limited to a single track due to the edge. This setup ensures that each detection contributes to only one track. This dual-node representation is not the only possible implementation with these characteristics but is chosen specifically because it provides a straightforward method to enforce this exclusivity by restricting the capacity of the edges. Figure 3.3 shows a graph example involving two frames, each with three detections.

Figure 3.3: Simple illustration of a minimum cost flow graph from two frames, each with three detections numbered 1 to 6. Each detection is split into "In" and "Out" nodes.

3.4.2 Cost Calculation

To reflect the likelihood and feasibility of transitions between detections, we assign the following costs to edges. Note that negative costs are advantageous.
• Node Confidence Cost: This cost is based on the confidence level of each detection and is applied to edges connecting the 'in' and 'out' nodes of the same detection. The cost ranges from -1 to 1; it is -1 when the confidence is 100% and 1 when the confidence is 0%. This corresponds to the cost of using a certain detection regardless of the path taken to and from it.
• Transition Costs: These costs are calculated for edges representing transitions between detections across different frames. The custom cost function includes:
– IoU: Measures the overlap between bounding boxes of detections.
The cost is -1 when the boxes perfectly overlap and 0 when there is no overlap.
– Center Distance: Calculates the distance between the centers of bounding boxes. Edges are not connected if the distance exceeds a certain threshold, expressed in pixels. This threshold is adjusted depending on the resolution.
– Frame Difference: Penalizes transitions with a cost of +1 for each frame skipped, accounting for the temporal gap between detections.
• Birth and Death Costs: These are predefined costs set for initiating and terminating a trajectory, applied to edges connecting the 'source' to 'in' nodes and 'out' nodes to 'sink', respectively. Both costs are set to 10. Given the frame rate of 25 fps, a good track lasting 25 frames would typically accumulate a score of around -30. Thus, setting the combined birth and death costs to 20 implies that a track should be followed if it can be reliably tracked for approximately one second.
• No Track Cost: The cost of going straight from the source node to the sink node is zero. This ensures the algorithm will only create tracks with a net negative total cost.

These cost assignments are strategically designed to initiate tracking when a track can be reliably followed for about one second, and to remove unlikely edges, thereby optimizing the solver's efficiency.

3.4.3 Improbable Connections for the Players

Post-processing involves identifying and penalizing improbable connections to improve player tracking accuracy. This includes detecting ID switches using color changes and acceleration patterns.
• Color Consistency: Detect possible ID switches if a tracked player's color changes between frames.
• Acceleration Analysis: Identify ID switches if a player exhibits rapid acceleration or deceleration, suggesting unrealistic speed changes.

A new graph is constructed with added penalties for these improbable connections:
• Graph Update: Identify potential ID switches and add penalties to the corresponding edges.
• Recalculate MCF: Solve the updated graph to find the optimized set of trajectories with penalized improbable connections.

This additional step ensures that only realistic transitions are considered, minimizing ID switches.

3.4.4 Ball Tracking Using Minimum Cost Flow

Tracking the ball in a football game involves specific strategies to address its small size, rapid movement, and potential occlusions. Effective ball tracking comprises several crucial steps:

• Background Removal: Implement a foreground mask to eliminate background detections, reducing false positives by focusing on moving objects.

• Bounding Box Adjustment: Enlarge the bounding box of the ball to maintain effective IoU across frames for consistent detection.

• Single Ball Tracking: Focus the tracking mechanism on a single ball at a time to prevent tracking errors.

• Interpolate ball position in missing frames: When skip connections are utilized, use interpolation to fill in the gaps. This process is explained in detail in section 3.4.5.

The complete pipeline for tracking the ball using minimum cost flow is shown in figure 3.4.

Figure 3.4: Ball Tracking Pipeline: This figure illustrates the sequence of processing steps involved in the ball tracking system, starting from the video stream and progressing through ball detection with YOLOv9, removal of stationary detections using background extraction, tracking with minimum cost flow, interpolation of the path for missing frames, and export to csv.

3.4.5 Quadratic Interpolation for Ball Path Tracking

When the ball becomes occluded or is blurred in video sequences, quadratic interpolation is employed to estimate its trajectory during these obscured periods. This method is particularly effective for capturing the parabolic motion typical of projectiles, such as a ball in flight.
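As a sketch of how this fit could be implemented (assuming NumPy is available; the positions below are hypothetical, not taken from our data), `numpy.polyfit` solves the same three-equation system directly:

```python
import numpy as np

def fit_ball_parabola(points):
    """Fit y = a*x^2 + b*x + c through three known ball positions.

    `points` is a list of three (x, y) pixel coordinates observed
    before and after the occlusion. Returns the coefficients (a, b, c).
    """
    xs, ys = zip(*points)
    # polyfit with deg=2 solves the 3x3 linear system for a, b, c.
    a, b, c = np.polyfit(xs, ys, deg=2)
    return a, b, c

# Hypothetical observed positions around an occlusion (pixels).
coeffs = fit_ball_parabola([(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)])

# Estimate the vertical position at an intermediate horizontal position.
y_mid = np.polyval(coeffs, 1.5)
```

In practice the fitted curve would be evaluated at positions between the last detection before the gap and the first detection after it, filling in the frames where the ball was not detected.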
Quadratic interpolation uses the ball’s positions at three specific points in time, typically before and after the occlusion, denoted as t₁, t₂, and t₃, with corresponding positions (x₁, y₁), (x₂, y₂), and (x₃, y₃). The objective is to fit a parabolic equation of the form

y = ax² + bx + c,

where a, b, and c are coefficients determined by the interpolation. To find these coefficients, we solve the following system of equations:

ax₁² + bx₁ + c = y₁,
ax₂² + bx₂ + c = y₂,
ax₃² + bx₃ + c = y₃.

Solving this system yields the coefficients a, b, and c, defining the quadratic curve that best fits the trajectory of the ball among the known points. This curve then allows us to estimate the ball’s position at any intermediate time t, enabling continuous tracking even when direct visual observation is compromised. This method not only smooths the trajectory in the temporal domain but also significantly enhances the accuracy of the tracking system by bridging gaps where direct data are unavailable. The application of quadratic interpolation is justified by the natural parabolic paths followed by objects under the influence of gravity and initial velocities, making it a robust choice for predicting the motion of footballs in mid-air.

3.5 Tracking the ball using TrackNet

We used the TrackNet V3 [29] implementation found on GitHub [36] and trained it from scratch using the football video data annotated by us.

3.5.1 Adaptation to football

As TrackNet was originally made with badminton in mind, we found it appropriate to make some adjustments to better suit our task. The original TrackNet V3 paper uses a sequence length of eight frames for heatmap prediction and 16 frames for trajectory rectification [29]. This means they look at eight subsequent frames when making a prediction, and 16 subsequent predictions when rectifying one prediction. In football, the ball is often occluded for longer sequences and follows more complex movement patterns compared to badminton.
To address this, we decided to use a greater sequence length, granting the model more context for each prediction. More details about the chosen sequence length can be found in section 3.5.2 about training the model. Ahead of training the rectification model, a mask must be generated specifying which frames rectification should be performed on. This is because the model should only rectify trajectories inside the frame, not outside. Since TrackNet V3 is designed for badminton, where the shuttlecock often reaches great height, there is a check to make sure segments where the shuttlecock is above the frame do not get rectified. For our purpose, in football, the ball can often exit the frame on either side or at the bottom. We therefore thought it appropriate to extend this check to cover all edges of the frame.

3.5.2 Training the main tracking model

We tried various combinations of hyperparameters when training the main TrackNet model. Out of the many parameters available, we considered sequence length and learning rate the most appropriate to change. We trained the model with sequence lengths of 10 and 20, as longer sequences led to issues with computational resources. The learning rate varied between 1e-3 and 1e-4.

3.5.3 Training the rectification model

The InpaintNet model was trained for 300 epochs, using a learning rate scheduler that reduced the learning rate by a factor of 10 every 100 epochs. As the rectification model is significantly more lightweight than the tracking module, we were not as limited by computational resources. Thus, we were able to use a longer sequence length, train for more epochs, and perform more total training runs. Various combinations of sequence length and initial learning rate were used in different training runs. For sequence length we tried 20, 30, and 40. We tried a multitude of initial learning rates, between 1e-2 and 1e-5.
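The step schedule described above (a factor-of-10 drop every 100 epochs) is equivalent to the following function, where `initial_lr` stands for whichever initial learning rate a given run used:

```python
def step_lr(epoch, initial_lr, drop_factor=0.1, step_size=100):
    """Learning rate at a given epoch for a step schedule that
    multiplies the rate by `drop_factor` every `step_size` epochs."""
    return initial_lr * drop_factor ** (epoch // step_size)

# For a run starting at 1e-3, the rate over 300 epochs becomes:
# epochs 0-99: 1e-3, epochs 100-199: 1e-4, epochs 200-299: 1e-5.
lr_at_250 = step_lr(250, 1e-3)
```

If the training loop is in PyTorch, the same behaviour is available as `torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)`; whether the original implementation used that class is an assumption here.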
4 Results and discussion

This chapter presents the results obtained and discusses their implications. It includes the outcomes from training YOLOv9 on both the ball and player datasets. Additionally, it features comparisons between the BoT-SORT and MCF algorithms for player tracking, and between TrackNet V3 and MCF for ball tracking. Each comparison aims to highlight the distinct advantages and limitations of the different methods.

4.1 Detection

The results of training YOLOv9c on the SoccerNet player dataset are shown in figure 4.1. For our purposes, the classification loss is not relevant since the model only trained on one class. Comparing the box loss, DFL, and different metrics of training and validation, it becomes apparent that the model started overfitting after epoch 10, which is why the model at epoch 10 was chosen as the best model. The results of training YOLOv9c on the SoccerNet ball dataset are shown in figure 4.2. Once again, comparing the box loss and DFL of training and validation, the loss stops decreasing after epoch 25. The metrics mAP50, precision, and recall also start to decrease, which is why that model was chosen as the optimal one.

4.1.1 Player detection evaluation

The performance of the model is measured using F1-score, precision, recall, and precision-recall. The graphs for these are shown in figure 4.3. The performance curves presented in figure 4.3 illustrate a robust model capable of maintaining high levels of precision, recall, and F1-score across varying confidence thresholds. Notably, the F1 curve (figure 4.3a) demonstrates exceptional stability at high scores until a sharp decline near the maximum confidence, indicating that the model performs with high accuracy and balance between precision and recall for most confidence levels.
The precision curve (figure 4.3b) highlights the model’s effectiveness in ensuring that positive predictions are accurate, particularly as the confidence threshold increases, which is crucial for applications where false positives are especially undesirable.

Figure 4.1: Training and validation loss per epoch for the player detection model.

Figure 4.2: Training and validation loss per epoch for the ball detection model.

The precision-recall curve (figure 4.3c) showcases an almost ideal scenario where the model achieves high precision without a significant sacrifice in recall, up until very high recall levels. This behavior is critical for scenarios where it is essential to capture as many positive instances as possible without introducing many errors. Lastly, the recall curve (figure 4.3d) indicates that the model can identify almost all positives at lower confidence thresholds, which is beneficial in applications where missing a positive detection has serious consequences. Overall, these performance metrics suggest that the model not only learns well but also generalizes effectively to new data, balancing the trade-offs between detecting positives and minimizing false detections. This capability makes it highly suitable for deployment in critical scenarios where reliable precision and comprehensive recall are required.

Figure 4.3: Various metrics for the player model training and evaluation: (a) F1 curve, (b) Precision curve, (c) Precision-Recall curve, (d) Recall curve.

Figure 4.4: Various metrics for the ball model training and evaluation: (a) F1 curve, (b) Precision curve, (c) Precision-Recall curve, (d) Recall curve.

4.1.2 Ball detection evaluation

The performance of the model is measured in the same way as for the player model and is shown in figure 4.4. It immediately becomes clear that the model struggles considerably with detecting the ball.
The F1 curve peaks at 0.7 at confidence 0.3, which is very low compared to the player model. The sharp decline after a certain confidence threshold suggests a deterioration in precision and recall, likely due to the model failing to generalize ball detection under varying conditions. The precision-recall curve illustrates a struggle to balance recall and precision. The observed continuous decrease in precision as recall increases indicates that the model, in its effort to detect every possible ball, increasingly misidentifies other objects as the ball, leading to a rise in false positives. This aspect is particularly critical in our scenario: while a missed ball can be somewhat mitigated by interpolating between preceding and subsequent frames, this approach becomes ineffective when numerous false detections are present. Overall, while the model exhibits some capability in detecting the ball, its performance in minimizing false positives is suboptimal. This is particularly concerning in environments where high precision is critical to avoid false alarms, such as automated sports tracking systems where each detection needs to be highly reliable. The challenge lies in enhancing the model’s precision without substantially sacrificing recall, ensuring that it can reliably differentiate the ball from other similar objects under diverse conditions.

Figure 4.5: YOLOv9 detection boxes with confidence scores, incorrectly identifying players’ white shorts and text in the background as balls.

Visually inspecting the YOLO detections in video, it is clear that the model suffers from many false positives, often incorrectly detecting e.g. players’ white shorts as the ball, as in figure 4.5. While the confidences of these detections are low, this is often true for the actual ball as well. In fact, the model also struggles to detect the actual ball at all, especially when it is far from the camera.
Foreground Extraction

The foreground extraction process effectively isolated dynamic elements from the static background, as demonstrated in figure 4.6. The algorithm was particularly adept at identifying and removing stationary objects, such as balls in the background, thereby ensuring that only the moving ball remained visible in the foreground. This selective masking underscores the efficiency of our foreground extraction technique. Figure 4.7b presents a zoomed-in view of the extraction, focusing on the area where the ball’s background is masked. This detailed view clearly depicts the ball isolated against the background, showcasing the precision of our method. Additionally, figure 4.8 illustrates the detections outputted by the YOLO model. Although the detections include background elements, these can be effectively eliminated using our foreground extraction technique, as demonstrated in the earlier figures. This capability highlights the practical utility of our approach in real-world scenarios where background noise must be minimized to focus on dynamic elements.

4.2 Results of tracking the players

To assess the performance of BoT-SORT and our MCF algorithm in player tracking, we employed two key metrics: MOTA and MOTP [17]. Additionally, we recorded the number of false positives and the number of identity switches during tracking.

Figure 4.6: Foreground extraction. The leftmost image displays the original frame, the middle image illustrates the background after extraction, and the rightmost image shows the foreground with the moving objects clearly isolated.

Figure 4.7: Zoomed-in view of the ball with background masked: (a) original frame with the ball, (b) foreground with masked background.

This evaluation utilized the SoccerNet dataset as the ground truth, applying both the BoT-SORT and the MCF algorithm to the player tracking data. For these evaluations, we used the ‘motmetrics’ library in Python [37].
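The evaluation itself relied on motmetrics, but the headline metric is simple to state: MOTA aggregates misses, false positives, and identity switches relative to the total number of ground-truth objects [17]. A minimal illustration (the counts below are made up, not from our experiments):

```python
def mota(num_misses, num_false_positives, num_switches, num_gt_objects):
    """Multiple Object Tracking Accuracy: 1 minus the ratio of all
    tracking errors to the total number of ground-truth objects,
    summed over all frames of a sequence."""
    errors = num_misses + num_false_positives + num_switches
    return 1.0 - errors / num_gt_objects

# Hypothetical totals over one sequence: 800 misses, 150 false
# positives, and 30 ID switches against 20,000 ground-truth boxes.
score = mota(800, 150, 30, 20000)  # approximately 0.951
```

MOTP, by contrast, averages the positional error of matched prediction/ground-truth pairs; both are computed for us by the motmetrics accumulator rather than by hand as above.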
The central box in each plot represents the interquartile range, covering the middle 50% of scores from the 25th to the 75th percentile. The median within this box marks the typical performance value, effectively dividing the dataset in half. In our case there were no outliers, meaning the whiskers each also cover 25% of the data, extending from the quartiles. The total number of unique IDs recorded was identical for MCF and BoT-SORT. Both methods only generated new IDs when necessary, such as when new players entered the scene. Figure 4.9a illustrates the MOTA scores for both tracking algorithms. The results indicate very similar performance levels, with minor variations across different test scenarios. This metric reflects the overall tracking accuracy, where both algorithms excel by maintaining high accuracy in complex dynamic scenes. MCF holds a slight edge over BoT-SORT, especially on the worst performing samples.

Figure 4.8: Detections outputted by the YOLO model. Note the presence of background elements, which are candidates for removal via foreground extraction.

Similarly, as shown in figure 4.9b, the MOTP scores, which assess the precision of object positioning by the trackers, are also comparable. This suggests that both algorithms are effective in precisely locating objects across frames, though here BoT-SORT performs slightly better. Notably, as depicted in figure 4.9c, the MCF implementation tends to miss fewer detections than BoT-SORT: the BoT-SORT algorithm tends to ignore certain detections, while minimum cost flow makes sure to include them. However, despite winning over BoT-SORT in other metrics, the MCF method exhibits a higher number of identity switches, as shown in figure 4.9d. This indicates a potential area for improvement in the MCF algorithm, particularly in its ability to maintain consistent identity tracking over extended sequences.
Overall, the analysis suggests that both the MCF and BoT-SORT algorithms perform comparably well in terms of tracking accuracy and precision. However, while the MCF method is more robust in maintaining detection continuity, it tends to have more identity switches than BoT-SORT. This disparity in identity preservation might be attributed to the use of the Kalman filter in the BoT-SORT algorithm, which effectively aids in maintaining object identity across frames. The absence of a similar mechanism in the MCF algorithm could be a contributing factor to its higher rate of identity switches.

4.2.1 Challenges in Automated ID Switch Correction

In our MCF algorithm, correcting an ID switch is feasible once the switch is detected. This assertion is supported by a manual test, during which an identified ID switch was successfully removed from the tracking data. To streamline this process, we explored automating the detection of ID switches by analyzing factors such as team colors and abrupt changes in acceleration. However, due to time constraints, these solutions were not fully implemented and their effectiveness remains unproven. Further development and testing are required to enhance the robustness of these methods.

Figure 4.9: Comparison of MCF and BoT-SORT algorithms on various metrics, showing similar performance: (a) MOTA scores, higher is better; (b) MOTP scores, lower is better; (c) number of missed detections, lower is better; (d) number of identity switches, lower is better.
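Since the automated checks were never fully implemented, the following is only one possible sketch of the motion-plausibility idea: flag frames where a track's implied speed jumps beyond what a player could achieve. All names, units, and thresholds here are hypothetical, not taken from our implementation:

```python
def flag_unrealistic_motion(track, fps=25.0, max_speed=12.0):
    """Return frame indices where a tracked player's implied speed
    exceeds a plausibility threshold, hinting at a possible ID switch.

    `track` is a list of (x, y) positions in metres, one per frame.
    `max_speed` (m/s) is a hypothetical cap on human sprinting speed.
    """
    flags = []
    for i in range(1, len(track)):
        dx = track[i][0] - track[i - 1][0]
        dy = track[i][1] - track[i - 1][1]
        speed = (dx * dx + dy * dy) ** 0.5 * fps  # metres per second
        if speed > max_speed:
            flags.append(i)
    return flags

# A track that jumps 2 m between consecutive frames (50 m/s) is flagged.
track = [(0.0, 0.0), (0.2, 0.0), (2.2, 0.0), (2.4, 0.0)]
suspicious = flag_unrealistic_motion(track)  # [2]
```

The flagged edges could then receive the penalties described in section 3.4.3 before rerunning the MCF solver; a full acceleration-based check would compare consecutive speed estimates rather than single-step speeds.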
4.3 TrackNet training loss

Figure 4.10 shows the loss during training of the tracking model TrackNet V3 using different sets of parameters. The graph shows that all of the models struggle with generalisation, as the validation loss does not come down along with the training loss. This suggests our own annotated data may not have been enough, and that more data is necessary to improve the model’s ability to generalise. As we did not have time to collect a significant amount of additional data, we decided to move ahead with the model with sequence length 10 and learning rate 5e-4 (green curve), as it seemed to have the lowest validation loss overall. A subset of the combinations of sequence length and learning rate tried for the rectification model is presented in figures 4.11 and 4.12. This subset was chosen for presentation as it was among the best in terms of validation loss.

Figure 4.10: Loss for training data and validation data during six separate training runs using different sets of parameters. The x-axis is expressed in epochs.

Figure 4.11: Loss for training data and validation data during three separate training runs, all using sequence length 30 but different initial learning rates: 5e-5 (purple), 5e-4 (orange), and 1e-3 (blue). The x-axis is expressed in epochs. Panels: (a) training and validation, (b) only training loss, (c) only validation loss.

Unsurprisingly, both of these figures show the same issue as the tracking module, namely a failure to generalise. A sequence length of 30 seems to have generalised best, though it shows the worst loss on the training data, where sequence length 40 is best. Due to its superior generalisation, we decided to settle on the model with sequence length 30.

4.4 Evaluation of ball methods on test set

For ball tracking, it was essential to employ a common evaluation strategy suitable for both methods under comparison.
Given that we are only predicting one trajectory, we have not treated ball tracking as a MOT problem; instead, we treated it as a classification problem. If the predicted position is within a certain distance threshold of the ground truth label, we consider it a correct prediction. We opted not to use distance-based metrics such as mean squared error, as it does not matter where the model incorrectly predicts the ball: all incorrect predictions are equally incorrect. More comprehensively, this is how we decided to categorise each prediction:

• True positive: The model has predicted a location and it is within threshold.
• Wrong location: The model has predicted a location but it is not within threshold.
• False negative: The model has not predicted a location while there is a ball in the frame.
• True negative: The model has not predicted a location and there is no visible ball.
• False positive: The model has predicted a location while there is no ball.

Figure 4.12: Loss for validation data during three separate training runs using an initial learning rate of 5e-4. The sequence lengths are 20 (green), 30 (orange), and 40 (pink). Panels: (a) training and validation, (b) only training loss, (c) only validation loss.

Table 4.1: Total number of predictions of each category across all eight test set sequences, using a distance threshold of 20 pixels.

Method      True positives   Wrong location   False negatives
YOLO+MCF    2488             66               1086
TrackNet    1940             1126             568

Method      True negatives   False positives
YOLO+MCF    396              20
TrackNet    162              254

Considering that the average diameter of the ball is approximately 40 pixels, we established a threshold of 20 pixels. The evaluation was conducted using the test set we annotated ourselves; more detail about it can be found in section 2.10. Table 4.1 shows the total number of predictions of each category for both of our tracking methods. By these metrics YOLO+MCF outperforms TrackNet in all but false negatives.
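The per-frame categorisation described above can be sketched as a small function; the function name and the convention of using `None` for "no prediction / no visible ball" are assumptions for illustration:

```python
import math

def categorise(pred, gt, threshold=20.0):
    """Assign one of the five evaluation categories to a single frame.

    `pred` and `gt` are (x, y) pixel positions, or None when no ball is
    predicted / visible. `threshold` is the distance cutoff in pixels.
    """
    if pred is None:
        return "true negative" if gt is None else "false negative"
    if gt is None:
        return "false positive"
    dist = math.dist(pred, gt)
    return "true positive" if dist <= threshold else "wrong location"

# A prediction 10 px from the label counts as correct at threshold 20.
assert categorise((100, 100), (110, 100)) == "true positive"
assert categorise((100, 100), (200, 200)) == "wrong location"
assert categorise(None, (50, 50)) == "false negative"
```

Summing these categories over every frame of the eight test sequences yields the totals reported in table 4.1.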
These numbers indicate that YOLO+MCF is better at finding the correct ball location, as TrackNet has a very high number of predictions in the wrong location. To achieve a better overview of the performance of the complete MCF ball tracking pipeline versus TrackNet V3, we calculated accuracy, precision, recall, F1-score, and specificity. Since we have recorded wrong locations in addition to false negatives and false positives, we decided to slightly alter the formulas for precision and recall by incorporating the number of wrong location predictions in the denominators:

• Precision: Reflects the ratio of correctly predicted positive observations to the total predicted positives.

Precision = True Positives / (True Positives + False Positives + Wrong Locations).

• Recall (Sensitivity): Measures the ratio of correctly predicted positive observations to all actual positives.

Recall = True Positives / (True Positives + False Negatives + Wrong Locations).

These modifications were made because a wrong location prediction can be considered both an actual positive and a predicted positive, despite being an incorrect prediction. The rest of the metrics are calculated as usual, the way they are presented in section 2.6. Note that F1 is affected by these changes, as it is based on precision and recall. Specificity was not changed, as that metric only concerns actual negative samples. The results are presented in table 4.2.

Table 4.2: Performance metrics comparison across eight annotated sequences using a distance threshold of 20 pixels. The shown values represent the mean, with the ± symbol indicating the standard deviation.

Method      Accuracy         Precision        Recall
YOLO+MCF    72.9% ± 13.3%    96.8% ± 4.9%     70.2% ± 15.4%
TrackNet    54.3% ± 20.3%    57.4% ± 15.7%    55.3% ± 21.1%

Method      F1               Specificity
YOLO+MCF    80.7% ± 11.1%    95.2% ± 6.3%
TrackNet    55.7% ± 19.0%    36.8% ± 41.0%
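The modified formulas can be expressed directly; the counts below are illustrative only (table 4.2 reports per-sequence means, not values computed from pooled totals):

```python
def modified_precision(tp, fp, wrong_loc):
    """Precision with wrong-location predictions counted among the
    predicted positives that were not correct."""
    return tp / (tp + fp + wrong_loc)

def modified_recall(tp, fn, wrong_loc):
    """Recall with wrong-location predictions counted among the actual
    positives that were not recovered correctly."""
    return tp / (tp + fn + wrong_loc)

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative counts for a single hypothetical sequence.
p = modified_precision(tp=80, fp=10, wrong_loc=10)  # 0.8
r = modified_recall(tp=80, fn=15, wrong_loc=5)      # 0.8
f1 = f1_score(p, r)
```

Because a wrong-location prediction inflates both denominators, it lowers precision and recall simultaneously, which is exactly the intended penalty.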
From these quantitative results we can see that our MCF implementation generally performs better, outperforming TrackNet in every metric. Neither performance, however, is satisfactory for creating a convincing recreation of sequences. While the performance of YOLO+MCF at this stage is more promising, it is important to note that the TrackNet model has a significantly higher standard deviation across the eight separate sequences. This is another sign pointing to poor generalisation, leading to different performance on different data. We hypothesise that the TrackNet architecture has a high potential to yield better results if it is trained on more data. Since TrackNet has a high number of wrong location predictions, we wanted to see whether these were close to the correct location. To this end, we ran the evaluation again with increased distance thresholds of 100 and 150 pixels for correct predictions. The result of this change can be seen in table 4.3. We see that a significant number of wrong location predictions are somewhat close to the correct position, though the majority are farther away still.

Table 4.3: Total number of predictions with the wrong location across all eight test set sequences, using distance thresholds of 20, 100, and 150 pixels.

Method      20 pixels   100 pixels   150 pixels
YOLO+MCF    66          55           46
TrackNet    1126        868          777

4.4.1 Visual analysis of TrackNet result

By observing the predicted trajectories overlaid on top of the original video, we can see that the model accurately predicts the trajectory while the ball is traveling freely, but struggles if the ball is still, moving slowly, or being dribbled. Examples of these scenarios can be seen in figure 4.13. We take this to mean that the model did not struggle to learn to find the ball in motion against a grassy background. However, the model’s movement-based design predictably struggles to detect still balls.
Furthermore, situations where the ball is being dribbled are too visually distinct from one another for the model to successfully generalise across different players seen from different angles. Figure 4.13c shows the model missing the ball when it gets very close to the camera and appears larger than usual. This may indicate that the model has not seen enough large balls during training, yet another sign of poor generalisation. Figure 4.13d, on the other hand, shows the model defaulting its prediction to a spot close to the middle of the screen while the ball is not in motion. This is something we have observed happening occasionally, but it is unclear to us why. Presumably, the model has learnt that this location is generally more likely to contain the ball than other parts of the frame. Other times, the model correctly refrains from predicting a ball when the ball goes off-screen. This may be fixable with more training data, or it may require tuning parameters such as the weight in the weighted binary cross entropy loss, to make the model more likely to choose not to predict a ball.

Figure 4.13: Positions predicted by our TrackNet model overlaid on top of the original video: (a) a correct prediction, (b) model missing a dribbled ball, (c) model missing a large ball (close to the camera), (d) model missing a motionless ball. Videos are from the test set annotated by us.

5 Conclusion

In this thesis, we explored the complex task of tracking players and the ball in football videos, leveraging advanced computer vision techniques and deep learning models. Our primary goal was achieving accurate and reliable object detection and tracking in the dynamic and visually complex environment of football games. By doing so, we aimed to contribute to the broader project of reconstructing football sequences in a virtual reality environment, thus improving the tools available for sports analysis and training.
Throughout our research, we evaluated several methods and models, including YOLOv9 for object detection and the BoT-SORT and MCF algorithms for tracking. Additionally, we explored the potential of the TrackNet V3 model for ball trajectory prediction.

5.1 Summary of Findings

The MCF algorithm demonstrated robust performance in tracking players, maintaining high precision and effectively minimizing false positives. However, it exhibited a higher frequency of identity switches than the BoT-SORT algorithm, which indicates a need for improved mechanisms to maintain player identity across frames. Tracking the ball proved more challenging due to its small size, rapid movement, and frequent occlusions. The combination of YOLOv9 for detection and MCF for tracking outperformed TrackNet V3 in terms of accuracy and overall stability. However, TrackNet V3 showed potential for improved generalisation, suggesting it could outperform detection-based tracking given more parameter tuning and training data.

5.2 Contributions to the Field

Our work contributes to the understanding and development of object tracking in sports, particularly in football. By identifying specific challenges such as ID switching and ball detection, we have provided insights that could inform the development of more robust tracking systems. Such systems are crucial for applications in sports analytics, coaching, and even in creating immersive virtual reality experiences.

6 Future work

This chapter details areas of potential improvement and additional ideas we would have liked to explore, given more time.

6.1 Correcting ID-switches for player Minimum Cost Flow

The MCF algorithm has shown promising results in player tracking but struggles with ID switches. Future work should focus on developing and integrating methods to more effectively identify and correct identity switches.
Once an identity switch is detected, the MCF algorithm can be rerun with penalisations applied to the edges that led to the ID switch. These strategies are aimed at refining the identity tracking capabilities of the MCF method, thereby enhancing its applicability and reliability for real-world player tracking scenarios. To further enhance its accuracy and robustness, we propose integrating additional information sources into the algorithm. These enhancements could significantly improve the performance of the tracking system by providing more discriminative features and reducing the incidence of errors.

6.1.1 Jersey Colors and Deep Features Learned by Neural Networks

Incorporating jersey colors into the tracking algorithm could provide a straightforward and effective method for distinguishing between players, especially in sports with clearly defined team uniforms. Additionally, leveraging deep features learned by neural networks could allow the algorithm to capture more complex patterns and player-specific characteristics that are not immediately apparent to simpler models.

6.1.2 Jersey Numbers

Jersey numbers offer a direct method of identifying players. By training an optical character recognition (OCR) component within our system, we could continuously update player IDs based on the visibility of their jersey numbers. This would likely reduce the number of ID switches, particularly in high-quality footage where numbers are clearly visible.

6.1.3 Pose Estimation

Integrating pose estimation can provide contextual clues that enhance tracking accuracy. By understanding the orientation and actions of players, the system can better predict their movements and interactions, reducing the likelihood of misidentifications, especially in crowded scenes. Pose estimation has been explored separately in the Football in VR project [1].
6.1.4 Speed Analysis

Analyzing the speed and direction of movement can also be instrumental in improving tracking. Sudden changes in speed or atypical movement patterns can indicate potential errors in player identification or tracking, prompting a re-evaluation of the tracking decision at that instant. Kalman filters have the potential to be useful in this area.

6.2 Better tracking of ball

We have seen that YOLO+MCF is the better performer when tracking the ball, though TrackNet has shown more signs of potential for improvement. In our opinion, both of these methods are worth exploring further. Regarding the detection-based MCF tracking, the bottleneck is YOLO's struggle to detect the ball. Future work should go into finding ways to improve detection, whether that is by improving the performance of YOLO or by testing state-of-the-art two-stage detectors such as Detectron2 [38]. For TrackNet, the biggest issue is that the validation loss does not follow the training loss. Ef