Importance Sampling in Deep Learning Object Detection
An empirical investigation into accelerating deep learning object detection training by exploiting informative data points
Master's thesis in Computer Science and Engineering
Alexander Huang, Johannes Kvernes
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2023

© Alexander Huang, Johannes Kvernes, 2023.
Academic supervisor: Yinan Yu, Department of Computer Science and Engineering
Industrial supervisors: Olle Månsson and Willem Verbeke, Zenseact
Examiner: Krasimir Angelov, Department of Computer Science and Engineering

Master's Thesis 2023
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg, Telephone +46 31 772 1000
Typeset in LaTeX

Abstract

In recent years, the field of deep learning object detection has witnessed a notable surge in progress, largely fueled by the growth of available data.
While the ever-growing amount of data has enabled complex object detection models to improve generalization performance and to facilitate training phases that are less prone to over-fitting, the time to reach desired performance with this remarkable amount of data has been reported to be an emerging problem in practice. This is certainly the case when each sample consists of high-resolution image data that can burden the capacity of loading data. This thesis explores the possibilities of leveraging importance sampling as a means to accelerate gradient-based training of an object detection model. This avenue of research aims at biasing the sampling of training examples during training, in the hope of exploiting informative samples and reducing the amount of computation spent on uninformative, noisy and out-of-distribution samples. While previous art shows compelling evidence for importance sampling in deep learning, it is consistently reported in the context of less complex tasks. In this work, we propose techniques that can be applied to single-stage anchor-free object detectors in order to adopt importance sampling to accelerate training. Our methods do not require modifications to the objective function, and allow for a biased sampling procedure that remains consistent across runs and samples.

Our results suggest that an uncertainty-based heuristic outperforms loss-based heuristics, and that object detection training can be subject to a remarkable speed-up in terms of reaching the baseline's performance in fewer iterations, where the baseline samples the training examples uniformly without replacement. Furthermore, the empirical observations reported in this work also indicate that an increased final generalization performance can be achieved given an equal amount of training time when compared to the baseline.
Keywords: Importance sampling, deep learning, object detection, accelerate training, sampling heuristic, gradient descent, loss, uncertainty, threshold-closeness.

Acknowledgements

We would like to show our greatest appreciation to Zenseact for the opportunity to work on an interesting topic with a fantastic team, and for lending us the equipment and computational resources that have made this project possible. Additionally, we would like to thank our industrial supervisors, Willem Verbeke and Olle Månsson, whose guidance and mentorship have been instrumental in shaping this project. With great expertise and sheer commitment to this thesis, they have played a crucial role in resolving complex problems and have equipped us with knowledge that has enriched not only the project, but also ourselves as future practitioners. Lastly, we would like to extend our gratitude to our academic supervisor, Yinan Yu, who has been active throughout the project, making the academic formalities a smooth process as well as providing us with valuable insights and pointers into cutting-edge research through relevant literature.

Alexander Huang, Gothenburg, 2023-06-13
Johannes Kvernes, Gothenburg, 2023-06-13

Contents

List of Acronyms
1 Introduction
  1.1 Purpose
  1.2 Collaboration
2 Theory
  2.1 Multilayer Perceptron
    2.1.1 Training a Neural Network
    2.1.2 Convolutional Neural Networks
  2.2 Object Detection
    2.2.1 Evaluation Metrics
      2.2.1.1 Intersection over Union
      2.2.1.2 Confusion Matrix
      2.2.1.3 Average Precision
      2.2.1.4 Mean Average Precision
    2.2.2 Single-stage Object Detectors
    2.2.3 Anchor-free Detectors
    2.2.4 Non-Maximum Suppression
    2.2.5 Fully Convolutional One-Stage Object Detection
      2.2.5.1 Architecture
      2.2.5.2 Loss Function
  2.3 Importance Sampling
    2.3.1 Importance Sampling in Deep Learning
3 Related Work
4 Methods
  4.1 Object Detection
    4.1.1 Backbone
    4.1.2 Architecture
    4.1.3 Don't-Care Tokens
    4.1.4 Loss Function
    4.1.5 Non-Maximum Suppression
  4.2 Importance Sampling
    4.2.1 Baseline
    4.2.2 Smoothing
    4.2.3 Loss-based Importance Sampling
    4.2.4 Object-wise Reduction
    4.2.5 Error-based Importance Sampling
      4.2.5.1 Uncertainty-based Importance Sampler - Decision Boundary Closeness
    4.2.6 Reducible Hold-out Loss
    4.2.7 Rejection-based Sampling
  4.3 Regression-agnostic Sampling
  4.4 Dataset
  4.5 Experiment Details
5 Results
  5.1 Loss and Uncertainty Samplers
  5.2 Rejection-based Sampling Strategy
  5.3 Reducible Hold-Out Loss
  5.4 Regression-agnostic Importance Sampler
6 Result Analysis
  6.1 Interpretation of the Results
    6.1.1 A Closer Look at Loss and Uncertainty Samplers
    6.1.2 RHO Sampling
    6.1.3 Rejection-based Sampling
    6.1.4 Ignoring Regression
  6.2 Consistency with Importance Sampling
  6.3 Limitations
    6.3.1 Compatibility with Bells and Whistles
    6.3.2 The Background Class and False Positives
    6.3.3 Sacrificing Performance
  6.4 Conclusion
7 Future Work
Bibliography
A Appendix 1
B Appendix 2
C Appendix 3

List of Acronyms

AD: Autonomous Driving
ADAS: Advanced Driver Assistance Systems
AP: Average Precision
CNN: Convolutional Neural Network
CPU: Central Processing Unit
DL: Deep Learning
DLA: Deep Layer Aggregation
FCN: Fully Convolutional Network
FCOS: Fully Convolutional One-Stage Object Detector
FN: False Negative
FP: False Positive
FPN: Feature Pyramid Network
IoU: Intersection over Union
IS: Importance Sampling
LIS: Loss Importance Sampler
LIS-IR: Loss Importance Sampler with Ignored Regression
LIS-RB: Loss Importance Sampler with Rejection-based Sampling
LIS-RHO: Loss Importance Sampler with Reducible Hold-Out framework
mAP: Mean Average Precision
ML: Machine Learning
MLP: Multilayer Perceptron
NMS: Non-Maximum Suppression
OD: Object Detection
RHO: Reducible Hold-out
RoI: Region of Interest
SSD: Single-Stage Detector
TN: True Negative
TP: True Positive
UIS: Uncertainty Importance Sampler
UIS-IR: Uncertainty Importance Sampler with Ignored Regression
UIS-RB: Uncertainty Importance Sampler with Rejection-based Sampling
ZOD: Zenseact Open Dataset

1 Introduction

The remarkable increase in available data has enabled rapid development of deep neural networks. Various highly complex deep neural network models have been shown to outperform other methods in a plethora of tasks in fields such as computer vision and natural language processing. However, big datasets and complex models are generally mirrored by a substantial computational cost, which entails a central challenge when training these types of machine learning models and highlights the importance of improving the efficiency of gradient-based training algorithms. One aspect of streamlining training is to supervise the model with samples from the training data at a frequency proportional to their respective importance scores. For context, it has become obvious to many practitioners that not all samples are of equal difficulty for a Machine Learning (ML) model to solve. During the training phase, deep learning models tend to converge rapidly in the early iterations, whereas in the intermediate and later stages, a vast majority of samples become significantly easier for the model to predict correctly, resulting in minuscule gradient updates. Compellingly, a portion of samples could be sampled less frequently, while other, more informative samples could be exploited without compromising the performance of the final model.
This suggests a potential to accelerate training. This phenomenon has been showcased in numerous works related to data pruning and core-set selection [1], [2], example difficulty [3]–[5], curriculum learning [6], and various other paradigms including Importance Sampling (IS) [7]–[12]. This thesis explores the application of various IS techniques as a means to accelerate training of deep learning Object Detection (OD) models. By leveraging IS, training can be concentrated on samples that tend to contribute more to the process of finding a local minimum of the optimization problem at hand, supposedly leading to fewer training iterations and/or better generalization performance in comparison to the standard procedure of uniform sampling without replacement. As such, training times would be less susceptible to an ever-growing amount of data and could, in practice, free up computational resources for other problems or accelerate research and development, since the time saved by shorter training runs could be allocated to conducting more experiments.

1.1 Purpose

The purpose of this thesis is to empirically investigate, develop and propose a representative approach for applying importance sampling to the task of OD, an application of importance sampling that has, as opposed to image classification and to the best of our knowledge at the time of writing, yet to be studied thoroughly in the current art. In our work, we aim to showcase examples of how importance sampling can be leveraged to accelerate the training of an OD model in a computationally constrained environment for a real-time application such as Autonomous Driving (AD). The main research questions for this thesis are the following:

1. How do different types of importance sampling heuristics perform in the task of deep learning object detection?
2.
What are the effects of different sampling strategies when training a deep learning object detector, and how do they perform?
3. Can importance sampling be used to accelerate learning in low-density regions of a data distribution?

1.2 Collaboration

This project was conducted at Zenseact, a technology company centered around developing AD and Advanced Driver Assistance Systems (ADAS) [13]. Zenseact thrives on recent advances in the field of machine learning and leverages its R&D operations to provide state-of-the-art solutions for renowned vehicle manufacturers. Under those circumstances, practitioners deemed it necessary from the outset to accelerate the training process of the in-house OD model, and as such, this project was conducted in close collaboration with our supervisors at the company. Before closing this section, we want to highlight the rigorous work from which this thesis was extended [14]. Similarly to us, Johansson and Lindberg [14] completed their project at Zenseact and investigated IS in the context of deep learning. The authors even touched on its application to training an object detector. Their study provided a strong base of knowledge to proceed from.

2 Theory

ML is a subfield of artificial intelligence that has gained significant traction in recent years, as it enables models to learn from data and make predictions without being explicitly programmed in a rule-based manner. The field has evolved rapidly, and it encompasses a wide range of techniques, including supervised learning, unsupervised learning, and reinforcement learning. The process typically involves training a model on a dataset to identify patterns and relationships between data points. Once a model is trained, it can run inference on new data.
Advancements in ML, particularly in Deep Learning (DL), have revolutionized the field, enabling machines to achieve state-of-the-art performance on a wide range of complex tasks such as semantic segmentation [15], image classification [16], and OD [17]. All of these are fundamental tasks in computer vision and enable a vast space of applications in various domains. Semantic segmentation pertains to outputting, given for instance an input image, a class prediction from a set of predefined classes for each pixel. Similarly, image classification predicts one or multiple classes, but for an image as a whole. While this is far from trivial, OD takes it further and resolves spatial predictions for each object of interest on the image plane. As such, DL OD generally relies on more complex architectures and techniques to reach sufficient performance levels. DL OD models have shown great promise in a wide range of applications [17]. For instance, autonomous cars use object detection to identify pedestrians, vehicles, and other objects in their environment and make decisions accordingly.

Despite rapid advancements in the field, OD and DL are still active areas of research, and many challenges remain. One of them is computationally expensive and time-consuming training phases. To address this, researchers have proposed various ways to accelerate training of DL models. Examples include adaptive learning rates [18], momentum [19], normalization [20], weight decay [21], and numerous other variance reduction techniques. While these are all fundamental aspects of accelerating model training, prioritizing informative samples has been shown to induce a similar effect [6], [7], [10], [22]. The common trajectory of the performance curve when training deep learning models generally starts with a high rate of convergence, and any clean sample from the dataset may sufficiently contribute when updating the parameters at this stage.
As the model continues to learn, the rate of convergence slows down, implying that a good portion of the samples has been learned and has become too easy for the model to predict correctly. In these instances, the model is subject to negligible contributions. Intuitively, treating each entry in the dataset equally could be sub-optimal, and computing resources could instead be concentrated on informative samples in an effort to accelerate training. In fact, this is a problem well suited for a method titled Importance Sampling (IS) [23], [24].

In this section, we tread through some of the fundamental concepts in deep neural networks in order to later motivate why IS can accelerate training of deep learning models. We will also examine how DL OD works and visit some ideas that have enabled high-performing models in a real-time setting, such as AD. Lastly, we will touch on what importance sampling is, how it works, and its application in deep learning training.

2.1 Multilayer Perceptron

Multilayer Perceptrons (MLPs) are a class of models within ML that are inspired by the structure and function of biological neurons in the brain [25]. These types of models consist of layers of interconnected processing units, also called neurons, that transform an input and propagate the information through the layers until a final output is obtained. This structure has enabled remarkable success in solving highly complex non-linear problems. Before proceeding with the architecture, it is important to understand the computations around a neuron, which is visualized as a graph in figure 2.1. The basic computation for a single neuron is expressed as a = σ(Σ_i w_i x_i), where w_i is a weight scalar, x_i is an element of an input vector x, and σ(x) is an activation function, which is a crucial part for inducing non-linear capabilities.
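As an illustration, the neuron computation above can be sketched in a few lines of NumPy. The input values, weights, and layer sizes below are arbitrary choices for demonstration, not taken from the thesis:

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation: squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# A single neuron: a = sigma(sum_i w_i * x_i)
x = np.array([0.5, -1.0, 2.0])   # input vector (arbitrary values)
w = np.array([0.1, 0.4, -0.2])   # weights (arbitrary values)
a = sigmoid(np.dot(w, x))

# Vectorized to a layer of neurons: one row of weights per neuron
W = np.array([[0.1, 0.4, -0.2],
              [0.3, -0.1, 0.5]])
b = np.zeros(2)                  # optional bias term
layer_out = sigmoid(W @ x + b)   # one activation per neuron
```

Stacking such layers, with each layer consuming the previous layer's activations, yields the forward pass described below.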
Without non-linearity, the output would just be a linear transformation of the input, which limits the network to modeling merely simple relationships between them.

Figure 2.1: A simple computational graph of a neuron.

The structure of an MLP is visualized in figure 2.2. It consists of an input layer (the first aggregate of neurons), the output layer (the last aggregate of neurons, one in this case), hidden layers which correspond to all the layers between the input and the output layer, and weights visualized as connections between neurons. If we further vectorize the neuron computation from figure 2.1 to handle an aggregate of neurons, we get a = σ(W^(l) x^(l)), which corresponds to the vectorized activation from layer l. It is also common, but not mandatory, to add a bias term b which is simply added to the preactivation output, i.e. a = σ(W^(l) x^(l) + b^(l)). From this, we can express the nested computations which capture the whole forward pass from input to output: a^(l) = σ(W^(l) a^(l−1) + b^(l)). Fundamentally, the current layer's input is the previous layer's activation output.

Figure 2.2: The structure of an MLP.

2.1.1 Training a Neural Network

As with many other ML models, when training a neural network, we are trying to optimize an objective function. Informally, we want to tweak the parameters θ, i.e. the weights and biases, to accurately map the input to the desired output. The underlying algorithm for achieving this is called Gradient Descent [26]. It works by iteratively stepping in the negative gradient's direction and updating the parameters of the function of interest. This process is visualized in figure 2.3.

Figure 2.3: A simple case of gradient descent where p_i ∈ R².

The gradient ∇f is a vector-valued function whose components are the partial derivatives of f, i.e.

∇f(p) = (∂f(p)/∂p_1, ..., ∂f(p)/∂p_n)^T

A fundamental property of the gradient is that, given a point p, it points in the direction in which the function locally increases the most, and so if we are minimizing, it is in our interest to step in the negative direction, where the function decreases the most. The algorithm is outlined below.

Algorithm 1: General gradient descent
  Initialize starting point p, function f, convergence criterion ε
  while ||∇f(p)|| > ε do
    Find the search direction (negative gradient): s = −∇f(p)
    Obtain the step size with a line search: α = argmin_α f(p + αs)
    Update the state: p ← p + αs
  end while

In the context of neural networks, p is a point in the space of possible parameters θ. The objective function we want to minimize is often referred to as a loss function (also known as a cost function). Given an MLP parameterized by θ, the loss function L_θ(x, y) quantifies how poor the prediction was according to a ground truth y. Let

J(θ) = E_π[L_θ(x, y)],  (x, y) ∼ π.

Essentially, we want to obtain the optimal parameters that yield the lowest expected loss for the problem at hand: θ* = argmin_θ J(θ).

Before we can apply the concept of Gradient Descent, we must be able to compute the gradient of the loss function. The gradient will essentially inform the model in what direction the parameters W^(l) and b^(l) should be updated in order to minimize the loss, which consequently improves the performance of the model. An algorithm for computing the gradients is referred to as Backpropagation [27]. The idea of Backpropagation is to propagate the loss back through the network, from the output layer to the input layer, using the chain rule of calculus. Let us assume that we run a forward pass and have computed the loss L. Backpropagation computes all the partial derivatives of the loss with respect to each weight, ∂L/∂W^(l)_ij, and with respect to each bias, ∂L/∂b^(l)_j, in order to obtain the components of the gradient.
These partial derivatives hold information about the change in loss w.r.t. the change in parameters. Now consider an example where we have one output layer and two hidden layers, i.e. ŷ = σ(W^(3) σ(W^(2) σ(W^(1) x))) and a^(l) = σ(W^(l) a^(l−1)). Note that we have omitted the bias term to simplify the example. Let all the activation functions denote the sigmoid function σ(x) = 1 / (1 + e^(−x)), and let the non-parametrized loss function denote a simple mean squared error function L(ŷ, y) = ½(y − ŷ)², which can also be expressed explicitly as a function of all the weights of each layer:

L(ŷ, y) = L(σ(W^(3) a^(2)), y)
        = L(σ(W^(3) σ(W^(2) a^(1))), y)
        = L(σ(W^(3) σ(W^(2) σ(W^(1) x))), y)

We have the following derivatives for the non-parameterized loss function and the sigmoid activation function:

∂L/∂ŷ = ŷ − y
∂σ/∂x = σ(x)(1 − σ(x))

Now, to obtain the partial derivatives, we work backward through the network to calculate ∂L/∂W^(3), then ∂L/∂W^(2), and then ∂L/∂W^(1). The calculation for the gradient in the third layer gives

∂L/∂W^(3) = (∂L/∂ŷ)(∂ŷ/∂W^(3)) = (ŷ − y) σ(W^(3) a^(2))(1 − σ(W^(3) a^(2))) a^(2)

For calculating the gradient in the second layer, we have to use the chain rule twice:

∂L/∂W^(2) = (∂L/∂ŷ)(∂ŷ/∂a^(2))(∂a^(2)/∂W^(2))
          = (ŷ − y) · σ(W^(3) a^(2))(1 − σ(W^(3) a^(2))) W^(3) · σ(W^(2) a^(1))(1 − σ(W^(2) a^(1))) a^(1)

Similarly, the gradient for the first layer can be calculated from the expression

∂L/∂W^(1) = (∂L/∂ŷ)(∂ŷ/∂a^(2))(∂a^(2)/∂a^(1))(∂a^(1)/∂W^(1))

Note that some factors appear in multiple calculations, so computationally it is not necessary to re-calculate these expressions. Only the factors that are particular to the layer must be evaluated, while the factors that are common to previous layers can easily be reused.
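The worked example above can be checked numerically. The sketch below implements the forward and backward passes for the bias-free three-layer sigmoid network with the ½(y − ŷ)² loss; the layer sizes and random weights are arbitrary choices made for this demonstration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# y_hat = sigma(W3 sigma(W2 sigma(W1 x))), biases omitted as in the text
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(1, 4))
x = rng.normal(size=3)
y = np.array([0.7])

# Forward pass, caching the activations needed by the backward pass
a1 = sigmoid(W1 @ x)
a2 = sigmoid(W2 @ a1)
y_hat = sigmoid(W3 @ a2)
loss = 0.5 * np.sum((y - y_hat) ** 2)

# Backward pass: each delta is dL w.r.t. a layer's preactivation
delta3 = (y_hat - y) * y_hat * (1.0 - y_hat)
dW3 = np.outer(delta3, a2)                  # dL/dW3
delta2 = (W3.T @ delta3) * a2 * (1.0 - a2)
dW2 = np.outer(delta2, a1)                  # dL/dW2
delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)
dW1 = np.outer(delta1, x)                   # dL/dW1
```

Note how `W3.T @ delta3` reuses the factor already computed for the third layer, mirroring the reuse of common factors pointed out above.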
Since we now have a way of deriving the gradient, we can, similarly to algorithm 1, outline gradient descent in the context of training a neural network:

Algorithm 2: Gradient descent for training a neural network
  Input: initial parameters θ, learning rate η, data D = {(x_i, y_i)}_{i=1}^N
  for t in 1, 2, 3, ..., T do
    Randomly sample D_t ⊆ D, where |D_t| = n_b is the batch size
    x, y ← D_t
    Compute the output from the forward pass: ŷ ← Forward(x)
    Compute the loss: L ← L(ŷ, y)
    Compute the average gradient across the batch: ĝ ← (1/n_b) Σ_{i=1}^{n_b} ∇L_θ^(i), where ∇L_θ^(i) ← Backward(L)
    Update the parameters: θ ← θ − ηĝ
  end for

The following clarifications should be noted:
• The above process is usually repeated for a number of so-called epochs.
• The learning rate η is a hyperparameter influencing the magnitude of the update step.
• D_t usually selects a subset of samples from D by sampling uniformly without replacement.

As implied by the last point, since gradient descent is an iterative algorithm, when applying it to train a neural network, the process must define the quantity of training data to propagate at each iteration. In theory, it is possible to forward- and backpropagate the whole dataset at each step, i.e. n_b = |D|. In fact, assuming that the training data is a perfect representation of the real-world problem, this would yield the optimal direction to step in. However, the computation and memory consumption required to perform this operation make it infeasible in practice for large-scale datasets. This approach is known as batch gradient descent. On the other extreme, we have stochastic gradient descent. Instead of propagating the whole dataset at each iteration, this approach selects only one training example per iteration, i.e. n_b = 1. The number of iterations per time unit would be remarkably better, and the memory consumption would be kept at a minimum.
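Algorithm 2 can be sketched end-to-end on a toy problem. The snippet below trains a plain linear model with a squared-error loss, since the sampling and update logic is the same regardless of the model; the synthetic data, learning rate, and batch size are arbitrary choices for demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data: y = X @ w_true + small noise
N, d = 256, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=N)

theta = np.zeros(d)        # initial parameters
eta = 0.1                  # learning rate
n_b = 32                   # batch size
T = 500                    # number of iterations

for t in range(T):
    # Randomly sample D_t: a batch of n_b examples, uniformly without replacement
    idx = rng.choice(N, size=n_b, replace=False)
    xb, yb = X[idx], y[idx]
    # Average gradient of the squared-error loss across the batch
    g_hat = xb.T @ (xb @ theta - yb) / n_b
    theta -= eta * g_hat   # parameter update
```

With n_b = N this reduces to batch gradient descent, and with n_b = 1 to stochastic gradient descent.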
However, since the whole parameter space is influenced by solely one training example per iteration, this approach suffers from highly fluctuating gradients and is very prone to stepping in sub-optimal directions. As a compromise, mini-batch gradient descent propagates multiple examples per step. The number of samples per batch is typically determined by a hyperparameter which is set to accommodate hardware capabilities. This approach is less prone to high variance than stochastic gradient descent, and is computationally feasible, unlike batch gradient descent in most cases.

2.1.2 Convolutional Neural Networks

MLPs are generally not the preferred way to solve computer vision tasks. If the input/output has to deal with large images, the model will generally suffer from heavy parameterization, as each layer is fully connected. Furthermore, since the input and the weights have a static layout where each pixel will be mapped to one input node, the model will not be spatially invariant, which can be a desirable property of models in computer vision. Spatial invariance allows the model to capture features regardless of their position in the input image [28].

A Convolutional Neural Network (CNN) is a type of neural network that is commonly applied to computer vision tasks. At the core of a CNN is the convolution, an operation that allows the network to extract useful features from images while being spatially invariant [28]. It works by applying a set of learnable filters, one small region at a time, to produce a new feature map, as illustrated in figure 2.4. Specifically, one convolution takes the dot product between the current region of the image across channels and the filter to obtain a preactivation scalar. The value of the preactivation denotes how well the convolved area matches the values/pattern in the filter.
The filters slide over the image, taking the elementwise (Hadamard) product with each position's region and summing the result, which in the end forms a complete feature map. By nesting this operation over a stack of layers with multiple filters per layer, CNNs are able to learn increasingly complex and hierarchical features from the input. Much like in MLPs, the feature maps from each layer are passed through an activation function to induce non-linearity. The hyperparameters for a convolutional layer are primarily:
• kernel size - the size of the weight matrix that corresponds to a filter
• stride - defines the step size of the filter as it slides over the input
• padding - an offset that defines how far out from the input's border the filter can go
• number of channels - the number of filters in the layer; this number also denotes the number of feature maps the output will have

Figure 2.4: A convolution operation¹. I is the input, K represents the filter and I ∗ K is the resulting feature map.

2.2 Object Detection

OD is a complex task in computer vision that pertains to identifying and localizing objects of interest in an image. Not only does the algorithm have to be able to classify which types of objects are depicted, but it also needs the ability to propose assigned regions for each identified object on the image plane. These region proposals are oftentimes represented by either 2- or 3-dimensional bounding boxes and, when visualized, overlaid on the image, as illustrated in figure 2.5. OD has seen numerous applications in various fields such as robotics, healthcare, surveillance, and AD. It is typically used as a critical subsystem to solve many different problems [17], [29]. An example is AD, where one or more cameras are mounted on the car, and the car has to be able to decipher its surroundings in order to act in a highly dynamic environment.
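Returning to the convolution of section 2.1.2, the sliding-window computation can be sketched for a single-channel input. This is a minimal version without padding or multiple filters, and the example image and kernel are arbitrary:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    # "Valid" convolution as used in CNNs (technically cross-correlation):
    # slide the kernel over the image and, at each position, sum the
    # elementwise (Hadamard) product with the covered region.
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = image[i * stride:i * stride + kh,
                           j * stride:j * stride + kw]
            out[i, j] = np.sum(region * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, -1.0]])        # a tiny horizontal difference filter
feature_map = conv2d(image, kernel)     # shape (4, 3)
```

A real convolutional layer additionally sums over input channels and keeps one such kernel per output channel, but the inner loop is exactly this region-wise product and sum.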
¹Detection and Tracking of Pallets using a Laser Rangefinder and Machine Learning Techniques - scientific figure on ResearchGate. Available from: https://www.researchgate.net/figure/An-example-of-convolution-operation-in-2D-2_fig3_324165524 [accessed 20 Mar, 2023]

Figure 2.5: A training example from Zenseact Open Dataset (ZOD)² [30] where the 2D bounding box annotation has been overlaid onto the image.

Due to the task's complex intrinsic nature, the state of the art has mainly been dominated by DL approaches in recent years [17]. These methods have rapidly revolutionized the field using CNNs as a core component. In this section, we will present some project-relevant concepts of DL OD such as Single-Stage Detectors (SSDs), anchor-free detectors, and Non-Maximum Suppression (NMS), along with a thorough walk-through of a specific well-established object detector which applies all the aforementioned concepts. However, before we jump into any details, we will describe the theory of how to evaluate object detectors, to get a sense of what the end goal is regardless of how the optimization problem is formulated.

2.2.1 Evaluation Metrics

Evaluation is a crucial part of ML. Rigorously defined and tested evaluation metrics are the standard way of measuring the performance of ML models. There are several evaluation metrics that are commonly used in OD, including Intersection over Union (IoU), Precision and Recall, F1 score, Average Precision (AP), Mean Average Precision (mAP), etc. This section intends to describe the metrics used in the thesis in more detail.

2.2.1.1 Intersection over Union

The Intersection over Union (IoU) of two geometric shapes S1 and S2 measures to what extent the two shapes overlap. This is used extensively in OD to determine the overlap of bounding boxes, as a high overlap between predicted and ground truth boxes indicates good performance for the model.
The IoU ranges from 0 to 1, where a value of 0 indicates no overlap between the bounding boxes, and a value of 1 indicates perfect overlap. For 2D OD, let $S_1 = b_1$ be a bounding box represented by the coordinate tuple $b_1 = (x_1^{(1)}, y_1^{(1)}, x_2^{(1)}, y_2^{(1)})$, where $(x_1^{(1)}, y_1^{(1)})$ is the top-left coordinate and $(x_2^{(1)}, y_2^{(1)})$ the bottom-right coordinate, and let $S_2 = b_2$ follow a similar definition. A visual representation of this can be seen in figure 2.6. The area of the intersection between the bounding boxes can be calculated as follows:

$$|b_1 \cap b_2| = \max(0, \min(x_2^{(1)}, x_2^{(2)}) - \max(x_1^{(1)}, x_1^{(2)})) \cdot \max(0, \min(y_2^{(1)}, y_2^{(2)}) - \max(y_1^{(1)}, y_1^{(2)})) \quad (2.1)$$

The area of the union is simply the summed area of the bounding boxes minus the area of the intersection, i.e.

$$|b_1 \cup b_2| = (x_2^{(1)} - x_1^{(1)}) \cdot (y_2^{(1)} - y_1^{(1)}) + (x_2^{(2)} - x_1^{(2)}) \cdot (y_2^{(2)} - y_1^{(2)}) - |b_1 \cap b_2| \quad (2.2)$$

The IoU of the two bounding boxes is then

$$\mathrm{IoU}(b_1, b_2) = \frac{|b_1 \cap b_2|}{|b_1 \cup b_2|} \quad (2.3)$$

² For this dataset, Zenseact AB has taken all reasonable measures to remove all personally identifiable information, including faces and license plates. To the extent that you would like to request removal of specific images from the dataset, please contact privacy@zenseact.com.
Figure 2.6: A visual representation of the components of the IoU between two 2D bounding boxes.
2.2.1.2 Confusion Matrix
The confusion matrix is a set of valuable metrics that measure how many classifications were correct and how many were wrong. In the case of binary classification, it is a 2-by-2 matrix where the rows and columns represent the ground truth and predictions respectively, see figure 2.7. One row and one column represent positive locations, and one of each represents negative locations. The result is a matrix from which we can easily read off how many correct and incorrect predictions the model made of each type, i.e. the true and false positives and negatives.
They work as follows:
• True Positives (TPs): The number of samples that were supposed to be true, and that the model predicted to be true.
• True Negatives (TNs): The number of samples that were supposed to be false, and that the model predicted to be false.
• False Positives (FPs): The number of samples that were supposed to be false, but that the model predicted to be true.
• False Negatives (FNs): The number of samples that were supposed to be true, but that the model predicted to be false.
In the case of multi-class classification with C classes, the matrix is C-by-C. In this case, each row and each column represent a class. The matrix value at row i and column j then represents the number of times the model predicted class j when the ground truth class was i. In OD, TPs are counted every time there is a ground truth object and a predicted object with the same class and an IoU above some threshold. FPs are counted when there is a predicted object which does not overlap sufficiently with any ground truth object of the same class. FNs are counted when there is a ground truth object which does not overlap sufficiently with any predicted object of the same class.
Figure 2.7: A binary confusion matrix
2.2.1.3 Average Precision
AP is the average precision over the recall values [31]. Precision is the fraction of correct positive classifications out of all the positive classifications:

$$P = \frac{TP}{TP + FP}$$

Recall is the fraction of correct positive classifications out of all the actual positive cases:

$$R = \frac{TP}{TP + FN}$$

Preferably, we want both precision and recall to be high (close to 1), but there is typically a trade-off between them for a given model; as the recall increases, the precision tends to decrease. To compute AP, we first rank all the object predictions by their confidence score in descending order.
Then, going from top to bottom, computing and storing the current precision and recall values each time we visit a new prediction yields a sequence of precision-recall pairs which we can use to compute the final metric. If we plot this, we get the precision-recall curve, as in figure 2.8. As can be seen in the figure, the precision wiggles up and down in a zigzag pattern as the recall increases, which is due to the variations of correct and incorrect predictions when iterating through the ranked detections [32].
Figure 2.8: An example precision-recall curve.
In order to account for these variations, and as a means to increase the robustness of the metric, it is standard to, for each point $i$, define the interpolated precision value $P^{\mathrm{interp}}_i$ as the maximum precision value over all points at or above recall level $i$:

$$P^{\mathrm{interp}}_i = \max(P_i, P_{i+1}, \ldots, P_n)$$

The AP is then the average interpolated precision per recall value, which is equivalent to the area under the interpolated precision-recall curve, see figure 2.9.
Figure 2.9: An example precision-recall curve with and without interpolated precision values.
Note that, in the literature, several versions of AP are used, based mainly on the dataset that the model is being evaluated on. For example, the PASCAL-VOC Challenge [32] used to compute the AP of each class using 11 equally spaced recall values, but changed in 2010 to using all values.
2.2.1.4 Mean Average Precision
mAP is the standard way of measuring the performance of an OD model. The AP is computed separately for each class, and the mAP is the mean of the class-wise APs. Since the expected value of AP is invariant to the number of objects, the mAP computed on a finite dataset is an unbiased estimate of the true mAP that would be obtained given an infinite dataset. Note that the literature sometimes uses the terms AP and mAP interchangeably.
2.2.2 Single-stage Object Detectors
Overall, DL OD architectures can be divided into two categories: two-stage object detectors and SSDs.
As the name suggests, two-stage object detectors consist of two stages, i.e. region proposal followed by object classification. In the first stage, the algorithm generates a set of candidate object locations, and in the second stage, it classifies each region proposal into one of the object categories or background. Two-stage detectors are known for their high performance but are generally slow and computationally inefficient as a result of the separate handling of tasks. In contrast, SSDs deal with both sub-problems in a single forward pass, hence enabling efficiency, speed, and simplicity. Nonetheless, this generally results in worse performance when compared to their two-stage counterparts. While high performance is crucial in the context of AD, two-stage detectors are commonly deemed unsuitable, as the algorithms must, by requirement, be able to operate in real-time. With that being said, this thesis adopts an SSD approach.
2.2.3 Anchor-free Detectors
In traditional DL OD, anchor boxes are used to generate object proposals and refine object localization. They are in essence a set of predefined fixed-sized rectangles of various scales and aspect ratios evenly distributed across the image. During training, the model is trained to predict the offsets and scales between each relevant anchor box and the corresponding ground truth bounding box for each object in an image. This allows the model to better match the object's bounding box in the image compared to a naive approach where the predicted bounding box is limited to a single aspect ratio and/or scale. However, anchor boxes are also limiting in the sense that they have to be predefined and adjusted specifically for the dataset and domain at hand. If the predefined anchor boxes do not align well with the bounding boxes in the dataset, the model might not be able to accurately localize the objects.
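To make the anchor-box concept concrete, the following is a minimal sketch of how such a predefined grid of boxes could be generated for one feature level. The function and its parameters are illustrative, not taken from any particular detector implementation; real detectors typically tile several scales per level and clip boxes to the image.

```python
def generate_anchors(feat_h, feat_w, stride, scales, ratios):
    """Generate one anchor box (x1, y1, x2, y2) per scale/aspect-ratio
    combination, centred on every stride-spaced location of a
    feat_h x feat_w feature map."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # centre of the receptive-field cell in image coordinates
            cx = j * stride + stride / 2
            cy = i * stride + stride / 2
            for s in scales:
                for r in ratios:
                    # keep the box area ~ s^2 while varying width/height
                    w = s * r ** 0.5
                    h = s / r ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors
```

For a 2×2 feature map with stride 16, one scale and one ratio, this yields four 32×32 anchors centred 16 pixels apart, which illustrates how the set of candidate shapes is fixed before any data is seen.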
Anchor-free object detectors are a class of OD models that do not rely on predefined templates in order to localize objects in an image. Instead, they typically localize objects by directly predicting all the necessary bounding box parameters, obtained from features extracted by deep convolutional layers. Anchor-free models tend to have a host of benefits as they are simple, flexible, and agnostic to the distribution of bounding box scales and aspect ratios for any given dataset. Regarding anchor-free detectors, it is worth mentioning that they are commonly presented in the context of dense-based object detectors. These object detectors project their predictions onto a dense prediction map, which is a high-dimensional array whose shape resembles the shape of the input image. As such, they enable fine-grained localization and instance segmentation that offers rich and detailed information about the input. In contrast to these types of object detectors are models that view the problem as a set prediction problem. The output from a set-based object detector for a given sample is simply a set of class and bounding box tuples. Transformer models are notably good at this, since they work by taking a set as input and outputting a set of features [33]. A well-known set-based framework is DEtection TRansformer (DETR), where image features are extracted, divided, and spatially embedded into a set of object queries [34]. The object queries serve as inputs to a transformer model, which is followed by a prediction head to output a set of predictions. While set-based object detectors can omit certain post-processing techniques, the computations revolving around the transformer module can be costly when there are many object queries to process. Set-based object detectors are also arguably less interpretable, since they do not offer any explicit spatial references in the predictions, as opposed to a dense prediction map.
For these reasons, this thesis employs a dense-based object detector, which is explained in greater detail in 4.1.
2.2.4 Non-Maximum Suppression
Dense-based OD models typically generate multiple candidate bounding boxes around objects of interest, leading to the need to filter out some of these boxes in order to keep just one bounding box per relevant object. NMS is a post-processing technique used to filter redundant overlapping object region proposals. It does so by identifying and removing proposals whose IoU with the highest-confidence proposal of the same class exceeds a predefined IoU threshold (see figure 2.10). It is also common to first filter out bounding boxes with a confidence score below a predefined confidence threshold. The overarching procedure is expressed in the following pseudo code:

Algorithm 3 Pseudo code of NMS
Input: B = {b1, ..., bN}, S = {s1, ..., sN}, th_iou, th_conf
    B: collection of bounding boxes
    S: collection of corresponding confidence scores
    th_iou: IoU threshold
    th_conf: confidence threshold
for i in 1, ..., N do
    if s_i < th_conf then
        B ← B \ {b_i}; S ← S \ {s_i}
    end if
end for
B̂ ← ∅; Ŝ ← ∅
while B ≠ ∅ do
    m ← argmax(S)
    B̂ ← B̂ ∪ {b_m}; Ŝ ← Ŝ ∪ {s_m}
    B ← B \ {b_m}; S ← S \ {s_m}
    for each b_j ∈ B do
        if iou(b_m, b_j) ≥ th_iou then
            B ← B \ {b_j}; S ← S \ {s_j}
        end if
    end for
end while
return B̂, Ŝ

Figure 2.10: An example of non-maximum suppression. The left image depicts the candidate bounding box predictions before non-maximum suppression. The right image depicts the processed image, where two bounding box predictions were filtered out because they exceeded the IoU threshold with another, more confident prediction.
2.2.5 Fully Convolutional One-Stage Object Detection
The object detector that was used for this thesis resembles, but is not exactly equal to, a model called Fully Convolutional One-Stage Object Detector (FCOS) [35].
It is an anchor-free single-stage object detector, where the main idea is to perform object detection directly on feature maps produced by a Fully Convolutional Network (FCN). It does so by projecting its class and bounding box predictions onto a dense prediction map with a shape similar to the input image, but downscaled and with more channels. As such, the predictions can be interpreted in a per-pixel manner, which somewhat resembles how a typical DL image semantic segmentation network works [15]. For each target pixel, i.e. a strided region of the input image, the model predicts a class score and a left, top, right, and bottom offset from the location $(x, y)$. The left, top, right, and bottom predictions form a box where the pixel's location is an interior point, and the offsets define the distances from the interior point to the corresponding box boundaries. Let the ground-truth objects for an input image be defined as $o_i = (c^{(i)}, x_1^{(i)}, y_1^{(i)}, x_2^{(i)}, y_2^{(i)}) \in \{1, 2, \ldots, C\} \times \mathbb{R}^4$, where $(x_1^{(i)}, y_1^{(i)})$ and $(x_2^{(i)}, y_2^{(i)})$ are the top-left and bottom-right coordinates of object $i$'s bounding box respectively, and $c^{(i)}$ is the class index, with $C$ being the number of classes. For each location $(x, y)$ on the $s$-strided target feature map, there exists a corresponding location $(\frac{s}{2} + xs, \frac{s}{2} + ys)$ in the original image. Each location can be seen as a separate pixel sample, and a location is considered positive if it lies within a ground truth bounding box; otherwise negative. Naturally, each positive location holds the ground truth class $c^*$ and a regression target $u^* = (l^*, t^*, r^*, b^*)$, where the elements are the ground truth offsets in the left, top, right and bottom directions respectively. Assuming location
$(x, y)$ is a positive location for ground truth $o_i$, the regression target becomes:

$$l^*_{x,y} = \frac{x - x_1^{(i)}}{s}, \quad t^*_{x,y} = \frac{y - y_1^{(i)}}{s}, \quad r^*_{x,y} = \frac{x_2^{(i)} - x}{s}, \quad b^*_{x,y} = \frac{y_2^{(i)} - y}{s} \quad (2.4)$$

If a location falls within two different objects' bounding boxes, the location is denoted as an ambiguous sample, and the smallest object takes priority when constructing the target feature map.
2.2.5.1 Architecture
The network architecture proposed by Tian et al. [35] is a Feature Pyramid Network (FPN) layered on top of a ResNet backbone. An FPN acts as a modular component in a CNN that takes a feature map and generates proportionally sized feature maps at multiple levels. One benefit of this is to alleviate the problem of variance in object scale. The higher resolution feature maps are more detail-aware, whereas the lower resolution feature maps are more context-aware. Fusing happens between adjacent layers so that higher resolution maps benefit from context-aware features. Furthermore, each level in the FPN tunnels into a shared head which outputs a dense prediction map. The benefit of multi-level predictions is twofold. Firstly, the difficulty of identifying objects across a large span of different sizes is alleviated. Secondly, ambiguous samples can be stratified and impose less of a problem. For this to work properly, each output must have a corresponding target feature map where the objects are stratified by size. All targets are initialized the same, but then, for each target pixel of each feature level $i$, a location is converted to a negative location if it satisfies $\max(l^*_{x,y}, t^*_{x,y}, r^*_{x,y}, b^*_{x,y}) > m_i$, where $m_i$ is a predefined threshold on the maximum distance from the location center, upper-bounding the size of objects for a given feature level. For each location in the dense prediction map, the network outputs a prediction for the 4 regression targets, i.e. $u_{x,y} = (l_{x,y}, t_{x,y}, r_{x,y}, b_{x,y})$, and a class prediction vector $c_{x,y}$.
These are stacked along the channel dimension, and the full dense prediction will be $y \in \mathbb{R}^{\frac{h}{s} \times \frac{w}{s} \times (C+4)}$. As such, for each target location, $C$ binary classifiers are trained for the classification task along with 4 regressors. Since the model is trained to perform both classification and bounding box regression, the final activation functions are separated per task. The sigmoid $\sigma(c_{x,y,z}) = \frac{1}{1 + e^{-c_{x,y,z}}}$ is used for each (binary) classification prediction, and the rectified linear unit $\mathrm{ReLU}(a) = \max(0, a)$ is used for the regression predictions.
2.2.5.2 Loss Function
The loss function has two terms due to the duality of the problem and is defined as follows:

$$L((c_{x,y}, u_{x,y}), (c^*_{x,y}, u^*_{x,y})) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}(c_{x,y}, c^*_{x,y}) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{c^*_{x,y} > 0}\, L_{reg}(u_{x,y}, u^*_{x,y}) \quad (2.5)$$

Here, $L_{reg}$ is the GIoU loss, i.e.

$$L_{reg} = 1 - \left( \frac{|b \cap b^*|}{|b \cup b^*|} - \frac{|h \setminus (b \cup b^*)|}{|h|} \right),$$

which describes an inverse metric of the overlap between the predicted bounding box $b$ derived from $u_{x,y}$, the ground truth bounding box $b^*$ derived from $u^*_{x,y}$, and the smallest box $h$ that encloses both $b$ and $b^*$. The first term in the parenthesis is equal to the IoU, an overlap metric described in more detail in 2.2.1.1. Furthermore, $\mathbb{1}_{c^*_{x,y} > 0}$ is an indicator function which is 1 if the current location is a positive location, and $N_{pos}$ is the number of positive locations. $L_{cls}$ is the binary focal loss computed over each entry in the predicted probability vector $c_{x,y} \in \mathbb{R}^C$ against the ground truth class $c^*_{x,y}$. Focal loss is a modification of the regular cross-entropy loss and can be expressed as follows:

$$FL(c_{x,y,z}, c^*_{x,y}) = \begin{cases} -\alpha (1 - c_{x,y,z})^{\gamma} \log(c_{x,y,z}) & \text{if } z = c^*_{x,y} \\ -(1 - \alpha)\, c_{x,y,z}^{\gamma} \log(1 - c_{x,y,z}) & \text{otherwise} \end{cases} \quad (2.6)$$

Here, $\alpha$ is a balancing factor, and $\gamma$ is a hyperparameter that reduces the relative loss for well-classified examples. This gives $L_{cls}(c_{x,y}, c^*_{x,y}) = \sum_{z=1}^{C} FL(c_{x,y,z}, c^*_{x,y})$. Finally, $\lambda$ is just a balancing factor, commonly set to 1.
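Equation 2.6, summed over the C class entries of one location as in $L_{cls}$, can be sketched as follows. This is an illustrative implementation, not the thesis code: we assume 1-based class indices with 0 reserved for negative (background) locations, and post-sigmoid scores strictly inside (0, 1).

```python
import math

def focal_loss(c_pred, c_star, alpha=0.25, gamma=2.0):
    """Binary focal loss of equation 2.6 summed over the C entries of
    one location's class vector.
    c_pred: post-sigmoid class scores in (0, 1);
    c_star: ground-truth class index (1-based; 0 = no positive class)."""
    total = 0.0
    for z, p in enumerate(c_pred, start=1):
        if z == c_star:
            # well-classified positives (p near 1) are down-weighted
            total += -alpha * (1.0 - p) ** gamma * math.log(p)
        else:
            # well-classified negatives (p near 0) are down-weighted
            total += -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
    return total
```

The modulating factors $(1-p)^\gamma$ and $p^\gamma$ are what distinguish focal loss from plain cross-entropy: a confidently correct location contributes almost nothing, so the loss is dominated by hard locations.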
It is worth mentioning that FCOS applies a concept known as centerness to suppress low-quality bounding boxes in the post-processing stage, which adds a new component to the loss and the predictions. This thesis does not employ this concept, in order to keep a cleaner scientific setting with regard to IS, and will not go into further detail about it.
2.3 Importance Sampling
Importance Sampling (IS) is commonly known as a Monte Carlo method that plays a crucial role in the estimation of statistical properties of complex systems [24], [23]. It operates on the premise that certain values of the input random variables carry greater significance for the estimate than others. By placing a higher emphasis on these important values through more frequent sampling, the variance of the estimate can be decreased. Formally, in the continuous case, we are interested in calculating

$$\int p(x) f(x) \, dx = \mathbb{E}_p[f(x)]$$

where $p$ is the target distribution and $f(x)$ is the function of interest. However, if the dimensionality of $x$ is high, the integral may be very difficult to compute. We could instead approximate this expectation numerically by computing the following sample average:

$$\mathbb{E}_p[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) = s, \qquad x_i \sim p$$

Note that since each $x_i$ is drawn from a probability distribution, we could redraw a new sample average $s$, which therefore assumes its own probability distribution. In fact, as $N$ tends to infinity, by the central limit theorem [36], $s$ will follow a Gaussian distribution centered around the true expectation $\mathbb{E}_p[f(x)]$, i.e.

$$s \xrightarrow{dist} \mathcal{N}(\mu, \sigma^2), \qquad \mu = \mathbb{E}_p[f(x)], \qquad \sigma^2 = \frac{1}{N} \mathrm{Var}_p[f(x)]$$

Now, by introducing a proposal distribution $q$, which should be selected heuristically, we have the possibility to bias the sampling while keeping the estimand in question unchanged.
The trick is as follows:

$$\mathbb{E}_p[f(x)] = \int p(x) f(x) \frac{q(x)}{q(x)} \, dx = \int q(x) \left[ \frac{p(x)}{q(x)} f(x) \right] dx = \mathbb{E}_q\!\left[ \frac{p(x)}{q(x)} f(x) \right]$$

We are yet again subject to approximating this new integral numerically, but this time sampling from $q$:

$$\mathbb{E}_q\!\left[ \frac{p(x)}{q(x)} f(x) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)} f(x_i) = r, \qquad x_i \sim q$$

As before, we can conclude that the distribution of $r$ approaches a Gaussian with the following parameters as $N$ approaches infinity:

$$r \xrightarrow{dist} \mathcal{N}(\mu, \sigma^2), \qquad \mu = \mathbb{E}_q\!\left[ \frac{p(x)}{q(x)} f(x) \right], \qquad \sigma^2 = \frac{1}{N} \mathrm{Var}_q\!\left[ \frac{p(x)}{q(x)} f(x) \right]$$

Interestingly, we now have a new, possibly improved variance. The aim is to satisfy

$$\frac{1}{N} \mathrm{Var}_q\!\left[ \frac{p(x)}{q(x)} f(x) \right] < \frac{1}{N} \mathrm{Var}_p[f(x)]$$

as this will effectively decrease the uncertainty in the estimate of $\mathbb{E}_p[f(x)]$. To achieve this, $q(x)$ should be high where $|p(x) f(x)|$ is high. With an appropriately chosen $q$, the sampling effort is concentrated on the regions where the integrand of the target expectation has high values, leading to more accurate results with fewer samples.
Figure 2.11: Comparison of three different choices for q(x). The rightmost plot showcases an example where the choice of q(x) would provide a favorable variance reduction over the uniform density function depicted in the center plot. Conversely, the leftmost plot illustrates a poorly chosen q(x) that would increase the variance of the estimator.
It is crucial to emphasize that the choice of $q$ is an essential but non-trivial consideration. A poorly chosen $q$ would induce a density ratio $\frac{p(x)}{q(x)}$ for which the estimate of the expected value would have a high variance. Consequently, the sample average would have high uncertainty and be dominated by a small portion of the samples. The hope is that we have some understanding of $p(x)$ and $f(x)$ that can guide our selection. For instance, we may know where $f(x)$ is high, or perhaps even know a certain approximation of $p(x)$.
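The variance-reduction argument can be made concrete with a classic toy estimator of the Gaussian tail probability $P(X > 3)$ for $X \sim \mathcal{N}(0, 1)$ (true value $\approx 0.00135$). The setup and names below are illustrative, not from the thesis: with `shift=0` the estimator is plain Monte Carlo, which wastes almost every draw because the integrand is zero below 3; with `shift=4` the proposal $q = \mathcal{N}(4, 1)$ concentrates samples where the integrand is non-zero, and each draw is reweighted by $p(x)/q(x)$ to keep the estimate unbiased.

```python
import math
import random

def normal_pdf(x, mu=0.0):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def estimate_tail(n, shift=0.0, seed=0):
    """Estimate P(X > 3) for X ~ N(0, 1), sampling from q = N(shift, 1)
    and weighting each draw by the density ratio p(x)/q(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(shift, 1.0)
        w = normal_pdf(x) / normal_pdf(x, mu=shift)  # importance weight
        total += w * (1.0 if x > 3.0 else 0.0)       # f(x) = 1{x > 3}
    return total / n
```

With the shifted proposal, a few thousand samples already pin the tail probability down to several significant digits, whereas plain Monte Carlo at the same budget sees only a handful of non-zero draws.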
2.3.1 Importance Sampling in Deep Learning
The intuition for leveraging IS in DL can be motivated by the unknown distribution of the data we are dealing with. To illustrate this, let us consider an abstract example, visualized in figure 2.12, which depicts the unknown data distribution for a binary classification task where we want to distinguish cats from dogs. Note that we have condensed the whole input feature space into a single variable x for visualization purposes. A typical trajectory when training a DL model on a task like this is a fast rate of convergence in the early stages of training. At these stages, the model becomes quite good at correctly predicting easy samples. Easy samples in this context could be samples where cat or dog features are prominent. On rare occasions, the model gets exposed to samples where e.g. a strange-looking cat has almost indistinguishable features from a dog, or samples where the animal's features are rare. These are the more challenging samples and lie in the long-tail regions of the distribution densities. As conveyed by the figure, in these regions the densities are relatively small, and the model will rarely get exposed to these types of samples if sampling uniformly at random. In fact, the model will have to run through far more samples to become good at predicting the challenging regions correctly as opposed to predicting the easy regions correctly. This motivates the utility of importance sampling. By oversampling the challenging regions, the model will be less subject to allocating compute resources to insignificant gradient contributions obtained from uninformative samples, and can instead focus on learning from more informative samples, which induce a greater albeit favorable change in the parameters. As such, the parameters of the model can possibly reach a local optimum with less computation.
This is especially lucrative if the dataset is large and consists of a big proportion of uninformative samples that do not vary significantly in input feature space.
Figure 2.12: An illustration of a toy example: cat vs dog unknown data distributions. Note that the x-axis is an abstract and arbitrary representation of the input feature space. The lower middle animal image depicts a Lykoi, which is a cat breed that resembles a wolf or a dog.
To formally reason about why importance sampling could be beneficial in DL, let $L$ be a loss function, $\theta$ the learnable parameters of a deep learning model, and $D = \{(x_i, y_i)\}_{i=1}^{n}$ the data, of which $D_t$ is a batch from $D$ at timestep $t$. A gradient descent update is expressed as follows:

$$\theta_{t+1} = \theta_t - \eta_t \nabla L_{\theta_t}(D_t) \quad (2.7)$$

where $\eta_t$ is the learning rate at timestep $t$. With this in mind, we can express the error term $h_t$ at time $t$ as the squared distance between the current parameters $\theta_t$ and the optimal solution $\theta^*$:

$$h_t = \|\theta_t - \theta^*\|_2^2 \quad (2.8)$$

The convergence rate can now be defined as the difference in the errors between two consecutive iterations, i.e.

$$\begin{aligned} h_{t+1} - h_t &= \|\theta_{t+1} - \theta^*\|_2^2 - \|\theta_t - \theta^*\|_2^2 \\ &= (\theta_{t+1} + \theta_t - 2\theta^*)^\top (\theta_{t+1} - \theta_t) \\ &= (2\theta_t - 2\theta^* - \eta_t \nabla L_{\theta_t}(D_t))^\top (-\eta_t \nabla L_{\theta_t}(D_t)) \\ &= -2\eta_t (\theta_t - \theta^*)^\top \nabla L_{\theta_t}(D_t) + \eta_t^2 \|\nabla L_{\theta_t}(D_t)\|^2 \end{aligned} \quad (2.9)$$

Taking the expectation over the batches gives us the expected contribution of a single iteration:

$$\begin{aligned} \mathbb{E}_p[h_{t+1} - h_t] &= -2\eta_t (\theta_t - \theta^*)^\top \mathbb{E}_p[\nabla L_{\theta_t}(D_t)] + \eta_t^2\, \mathbb{E}_p[\|\nabla L_{\theta_t}(D_t)\|^2] \\ &= -2\eta_t (\theta_t - \theta^*)^\top \mathbb{E}_p[\nabla L_{\theta_t}(D_t)] + \eta_t^2 \left( \|\mathbb{E}_p[\nabla L_{\theta_t}(D_t)]\|^2 + \mathrm{Var}_p[\nabla L_{\theta_t}(D_t)] \right) \end{aligned} \quad (2.10)$$

Now, reducing $\mathrm{Var}_p[\nabla L_{\theta_t}(D_t)]$ in equation 2.10 implies improving the convergence rate, since the difference in the distances to the optimal solution should ideally be as negative as possible, i.e. we want $h_{t+1} \ll h_t$. This can be achieved by biasing the sampling of $D_t$.
In particular, when the aim is to enhance the convergence rate in the long tails of the data distribution where the data is sparse, it is in our interest to select a heuristic $q$ that can outperform uniform sampling in reducing $\mathrm{Var}_p[\nabla L_{\theta_t}(D_t)]$ when $D_t$ consists of such samples.
It is worth highlighting the challenge of estimating the importance of each sample as the model evolves. Specifically, when dealing with mini-batch gradient descent, the gradient estimates are obtained from mini-batches of data at certain parameter states. IS will enable the training phase to compile mini-batches with a high concentration of important samples for a given state of the parameters. Because only a fraction of the training examples are sampled every iteration, the importance scores retrieved from previous observations may poorly reflect the current state of the model's parameters. It is consequently of interest to leverage an update rule $q_t(x) \rightarrow q_{t+1}(x)$ which accounts for this change in importance, possibly even for non-sampled data points with e.g. smoothing [11]. In practice, this could more or less incur the following overarching procedure:

Algorithm 4 Simple importance sampling strategy in mini-batch gradient-descent training
Data: D
Model parameters: θ
q ∝ U(1, |D|)
for t = 1, ..., T do
    D_t ← n_b data points sampled with q from D
    [s, ∇L] ← ForwardBackward(D_t, θ)
    θ ← θ − η_t · ∇L
    q ← UpdateImportanceScores(q, s)
end for

Note that this pseudo code is overly simplified and far from representative of all IS strategies. In other words, several steps may differ substantially depending on the method, task, and heuristic. The interesting step, however, is the stochastic process where the entries of $D_t$ are sampled with $q$, implying that we can concentrate on the samples that tend to contribute more to the update of $\theta$, consequently accelerating training and spending less computation on less informative samples.
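The procedure of Algorithm 4 can be sketched as a runnable loop. The model and its gradient are stand-ins here (a scalar parameter, a caller-supplied `forward_backward`, and a caller-supplied `update_scores` that may apply e.g. smoothing); none of these names come from the thesis implementation.

```python
import random

def train_with_is(data, n_b, T, theta, eta, forward_backward, update_scores):
    """Mini-batch gradient descent where each batch is drawn from the
    (unnormalised) proposal q instead of uniformly.
    forward_backward(batch, theta) -> (per-sample scores, gradient)
    update_scores(q, idx, scores) -> refreshed proposal q"""
    q = [1.0] * len(data)  # q starts out uniform
    for _ in range(T):
        # sample a batch of n_b indices with probability proportional to q
        idx = random.choices(range(len(data)), weights=q, k=n_b)
        batch = [data[i] for i in idx]
        scores, grad = forward_backward(batch, theta)
        theta = theta - eta * grad              # parameter update
        q = update_scores(q, idx, scores)       # refresh importance scores
    return theta, q
```

Only the sampled entries get fresh scores each iteration, which is exactly the staleness problem discussed above: the rest of `q` reflects older parameter states until those samples are drawn again.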
3 Related Work
Estimating the importance of each sample to enable accelerated training is a concept also present in other techniques such as curriculum learning [6] and core-set selection [2]. Similarly to IS, curriculum learning prioritizes certain samples during training to ultimately enhance the training procedure. The difference, however, is that curriculum learning aims at prioritizing easy samples in the beginning and progressively introduces more challenging samples as the training advances. While the benefits of curriculum learning have been demonstrated by a significant amount of empirical data [37], the method requires defining the order of the samples before training. This is in itself a costly task, as the sample difficulties must be assessed beforehand. As tedious as it is, this is commonly done either manually or by transferring knowledge from a pre-trained model. However, self-paced learning [38] alleviates this problem and reformulates this ad-hoc scheme by utilizing training dynamics to decide whether a sample is yet too challenging, sorting the data according to that heuristic.
On the other end of the spectrum are methods related to core-set selection. These methods aim at finding a subset of the most informative samples in a labeled or unlabeled dataset. Applications include querying the most important samples to label, or pruning data for budgeted and/or accelerated training. Toneva et al. [1] propose pruning data based on the number of forgetting events. The authors show that samples that are easily forgotten by the model during training tend to be more informative, while samples that are rarely forgotten tend to be less important. Excluding rarely forgotten examples and training the model on a fraction of the dataset may therefore improve convergence without degrading the model. Pleiss et al. [39] achieve better final performance by excluding mislabeled or overly ambiguous samples.
This was achieved using a classification-based metric called margin [40], [41], [42], in their case the final layer margin. The final layer margin is computed as the difference between the ground truth logit and the highest logit among the other classes, i.e.

$$M(x, c) = \underbrace{z_c(x)}_{\text{assigned logit}} - \underbrace{\max_{i \neq c} z_i(x)}_{\text{largest other logit}}$$

where $c$ is the index of the ground truth class and $z(x) \in \mathbb{R}^C$ is the pre-softmax output (logits) of the model given input $x$. Paul et al. [4] suggest using the error vector as an importance heuristic to prune data after merely a few epochs of training. Sener and Savarese [2] propose a method which involves optimizing a k-center problem. The key idea is to find a subset of so-called center points in the data that, when trained on, achieve a performance as close as possible to the one that would be obtained when training on the whole dataset. With a distance heuristic, the algorithm tries to choose b center points such that the largest distance between a data point and its nearest center point is minimized. For computing the distance between samples, the authors suggest using the activations of the final fully-connected layer.
While the notion of informative samples is shared between IS and core-set selection approaches, core-set selection typically requires a full training phase, or at least a pretraining stage, before being able to assign sample scores. IS can be seen as a middle-ground approach between the two aforementioned concepts [43]. Like curriculum learning, IS involves prioritizing samples online during training. Like core-set selection, IS needs to assign samples' importance scores dynamically. Various suggestions for measuring importance, and for how to use importance sampling in a training algorithm, have been proposed in the literature. One popular suggestion has been to sample training examples based on the loss.
Loshchilov and Hutter [12] employ a strategy where data points are ranked with respect to their latest loss and then sampled based on an exponentially decaying function applied on top of the ordered samples. Ioannou et al. [22] suggest an approach where, at the end of each epoch, a portion of the lowest-loss samples is replaced with a portion of the highest-loss samples. At the cost of introducing a hyperparameter, the authors also suggest utilizing a momentum component for the importance score, yielding a more balanced sampling scheme and a prominent improvement in convergence speed. The paper states that this approach achieves up to 40% convergence speed-up on CIFAR10 [44]. Katharopoulos and Fleuret [45] propose a strategy that trains a second neural network in parallel to predict the loss of the main model for the sampling procedure. They call this method approximative importance sampling. This study [45] claims a speed-up of 20% to 30% on image classification on MNIST [46] and CIFAR10. However, it is important to note that the authors evaluate the speed-up based on the convergence in training loss, which should be biased in this setting. Katharopoulos and Fleuret [7] later published a study where they leverage an upper bound on the gradient norm instead. In fact, it has been theoretically shown that the optimal proposal distribution should be proportional to the per-sample gradient norm [10]. However, the per-sample gradient norm is computationally prohibitive to compute, so it may be impractical to apply in practice. Nonetheless, the authors suggest that the gradient of the loss with respect to the pre-activation outputs of the last neural network layer gives an upper bound on the gradient norm over all the parameters, and can be utilized in IS to converge faster to a lower test error.
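A generic sketch of loss-rank-based sampling in the spirit of the approaches above can look as follows. This is not any one paper's exact scheme: the function name, the decay base, and the use of sampling with replacement are all illustrative choices.

```python
import random

def rank_based_sampler(latest_losses, k, decay=0.99, seed=None):
    """Sample k indices with replacement, favouring high-loss samples:
    selection weight decays exponentially with the loss rank, so the
    highest-loss sample is the most likely to be drawn."""
    rng = random.Random(seed)
    # rank samples by their most recent loss, highest first
    order = sorted(range(len(latest_losses)),
                   key=lambda i: -latest_losses[i])
    weights = [decay ** rank for rank in range(len(order))]
    return rng.choices(order, weights=weights, k=k)
```

The decay base controls how aggressive the bias is: `decay` close to 1 approaches uniform sampling, while a small value concentrates almost all probability mass on the top-ranked samples.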
In contrast to the aforementioned studies, other studies have shown compelling evidence for importance samplers that concentrate on uncertain samples rather than purely high-loss samples. Chang et al. [11] compared Loshchilov and Hutter [12]'s approach with samplers that, instead of using loss, focus on samples with high prediction variance or threshold-closeness. They concluded that preferring high-loss samples works well when the task is relatively easy (i.e. both training and testing accuracy are high). However, when the task is complex and the dataset is challenging and/or noisy, a preference towards uncertain examples proved more beneficial for the final generalization performance.

Online batch selection A popular method for addressing the problem of a rapidly evolving importance distribution involves doing additional forward passes. Some of the aforementioned studies employ this concept, and several other works have explored this route [7], [9], [12], [47], [48]. They motivate its advantage over the naive sampling from Algorithm 4 when measuring convergence based on wall-clock time instead of the number of forward propagations. The idea is to forward-propagate a large batch, from which a sample heuristic can be used to select a subset for the backward pass. Assuming that the backward pass is the bottleneck of the algorithm, this approach could potentially result in a speed-up. However, many samples then have to be loaded from disk into memory and decoded by the Central Processing Unit (CPU) at a time, which in itself may be a greater bottleneck than the backward pass when working with large images. For this reason, online batch selection does not appear useful in OD for AD and is not investigated in this thesis.

Reducible holdout framework Mindermann et al. [9] proposed Reducible Holdout (RHO), a framework that is used in combination with IS.
It first trains a smaller model on the same task using a held-out part of the dataset, denoted Dho, that is consequently excluded from the main dataset. This smaller model is trained once, after which it performs one forward pass over the whole dataset. Using this forward pass and a heuristic, in their case the loss, a score for each sample is computed. The loss produced by the small model is denoted the irreducible loss. When training the real model, this irreducible loss is subtracted from the normal loss produced by the main model to form a new heuristic called the reducible loss. The formal notation for this importance heuristic is simply

RHO loss = L(x, y; Dt) − L(x, y; Dho), (3.1)

where the first term is the training loss and the second term is the irreducible holdout loss. The general intuition behind RHO loss selection relies on the empirical observation that it avoids easy, noisy, and out-of-distribution samples. Noisy samples are challenging but uninformative. Such samples include samples with incorrect labeling and may contribute to degrading the model's performance. Out-of-distribution samples are, in other words, outliers. Despite not being mislabeled, from the perspective of the test data evaluation, it could be hurtful for the model to put too much emphasis on these samples. The following points are stated in the paper:
• Since easy samples tend to have a low training loss, and the RHO loss is always less than the training loss, such samples are less favored. If the model were to forget an easy sample, the metric would account for this, since the training loss would become high, which increases the chance of revisiting it.
• Noisy samples are harder for the holdout model to generalize to, incurring a high irreducible loss and consequently a low RHO loss.
• Out-of-distribution samples follow a similar reasoning as the noisy samples. Consider an outlier d′ and a non-outlier d.
Since Dho is an unbiased split from D, P(d′) < P(d) holds true for both datasets. Naturally, the holdout model will perform worse on out-of-distribution samples and induce a high irreducible loss and a low RHO loss.

Several of the aforementioned studies suggest that accelerated training can be obtained by exploiting informative samples using importance sampling. However, none of them show any empirical evaluations in the task of OD, which is a seemingly more complex task when compared to image classification. Not only is the complexity of OD higher than the tasks that IS has been reported with, but the dataset [30] used in this thesis is vastly more complex than CIFAR10 and CIFAR100, which are among the popular choices in these studies. Previous studies have investigated other ideas for enhancing training convergence, which also try to address the long-tail problem in the context of DL OD [49]. These methods are generally concerned with some function for selecting the most important regions within images rather than operating on a dataset level, and often require modifications to the optimization problem by introducing some component to the loss or the backpropagation stage. Online Hard Example Mining (OHEM) by Shrivastava et al. [49] has shown to be one of the more prominent methods, where the idea is to consider only the most beneficial regions of interest for the backpropagation. This method is, however, limited to two-stage object detectors, as the regions are parsed from a RoI pooling stage. Based on the idea of OHEM is a method called Loss Rank Mining by Yu et al. [50]. The method can be applied to one-stage detectors and works similarly by filtering out all but the k highest prediction-wise losses from the final feature map. Nevertheless, this approach is mainly tailored to grid-based and template-based detectors and does not translate to a stable solution for detectors where the image-plane stride of the dense prediction map is low.
The culprit is the fixed k, along with the fact that the number of positive target pixels can vary substantially between samples. Another method worth mentioning is Prime Sample Attention (PISA) [51]. It presents an intermediate preference between hard mining and random sampling of image regions. While it does propose an importance-based sampling scheme, it does not, similarly to the other aforementioned methods that are applicable in OD, operate on a dataset level.

4 Methods

This thesis studies the possibility of leveraging importance sampling to accelerate training in DL OD. Given this objective, the project branches into two main areas of practice: 1) the development of a sufficiently performing, low-latency object detection system along with its full data preprocessing, training, evaluation, and inference pipeline, and 2) the development of an IS framework for applying IS methods and conducting relevant experiments. The methods presented in this chapter rely heavily on the theory provided in chapter 2 and the related work from chapter 3. However, due to the absence of literature on IS in OD, related work has been adapted from a predominantly image classification setting to an OD setting. This discrepancy alone has revealed multiple challenges, which, together with other applied topics, are presented in this chapter in more detail. The sections below first cover details about the object detection system used in the project. Subsequently, the IS framework and the design choices for the system as a whole are presented.

4.1 Object Detection

This thesis employs a single-stage anchor-free object detector that closely resembles FCOS [35], presented in section 2.2.5.
Similarly to FCOS, the model used in this project is fully convolutional and directly computes its predictions on dense location-aware feature maps, where all the predictions for classification and regression at a location (x, y) are stacked along the channel axis, as illustrated in figure 4.1.

4.1.1 Backbone

Tian et al. [35] use ResNet50 and ResNet101 as backbones, i.e. ResNet models with 50 and 101 convolutional layers respectively. This thesis uses a ResNet34 where each convolutional layer has only 50% of the channel count of the original model proposed by He et al. [52]. This is due to the constraints of AD, where the model has to be small enough to run in real-time in a car without compromising too much performance, and to accommodate the size of the dataset to minimize the risk of severe overfitting.

4.1.2 Architecture

Tian et al. [35] use an FPN in FCOS in order to handle differently scaled objects. This project instead uses Deep Layer Aggregation (DLA) [53], where the last 3 stages are aggregated to go from 32x to 8x down-sampling. Hence, a single feature map is returned from the model. This keeps the architecture simple and preserves scale invariance in the latent feature space. For this thesis, achieving state-of-the-art performance is deemed secondary, and simplicity is prioritized to have a clean environment for studying IS.

Figure 4.1: An abstract overview of our model's architecture. The ResNet backbone extracts features from the input image, while the Deep Layer Aggregation module aggregates the extracted feature maps to produce the output in the form of a dense prediction map.

4.1.3 Don't-Care Tokens

In a dataset, there may be objects whose class is unknown, so-called "don't-care" objects. This label can for example represent an object where the annotator could tell that there was an object, but could not tell which class it belongs to.
For this kind of object, it is desired to propagate gradients only for the bounding box, but not for the classification part. To accomplish this, a mask is defined during data processing, which marks any pixel that is contained by a don't-care object. The mask is 0 for these pixels and 1 otherwise, and it is pixel-wise multiplied with the classification loss in order to propagate gradients only at pixels where the class is known.

4.1.4 Loss Function

In their paper, Tian et al. [35] use equation 2.5 as the loss function. Instead of using focal loss, or training C binary classifiers (one for each class), this thesis uses a Softmax activation function followed by categorical cross-entropy loss. Softmax is in this context expressed as

σ(z_{x,y,i}) = e^{z_{x,y,i}} / Σ_{j=1}^{C} e^{z_{x,y,j}},

where z_{x,y,i} is the pre-activation logit for the classification prediction at location (x, y) and class index i. As such, the output represents a probability vector, which conforms to the fact that an object can only belong to one of the classes. Categorical cross-entropy loss is expressed as follows:

L_cls(c_{x,y}, c*_{x,y}) = − Σ_{i=1}^{C} 1_{i = c*_{x,y}} · log(c_{x,y,i}).

Additionally, it is worth noting that the original FCOS loss function does not consider don't-care tokens. Consequently, the loss function in this project is as follows:

L((c_{x,y}, u_{x,y}), (c*_{x,y}, u*_{x,y})) = (λ_cls / N_care) Σ_{x,y} 1_{care,x,y} L_cls(c_{x,y}, c*_{x,y}) + (λ_reg / N_pos) Σ_{x,y} 1_{c*_{x,y} > 0} L_reg(u_{x,y}, u*_{x,y}), (4.1)

where
• λ_cls and λ_reg are weights for the classification and bounding box regression losses respectively (and are set to 1 unless otherwise specified).
• 1_{care,x,y} is the don't-care mask value at pixel (x, y), as described in section 4.1.3.
• N_care is the number of positive pixels in the don't-care mask, i.e. Σ_{x,y} 1_{care,x,y}.

4.1.5 Non-Maximum Suppression

Since every pixel in an image predicts a bounding box, which crowds the region proposals, a method is required to filter out the most relevant predictions.
There are two kinds of bounding boxes that need to be filtered away: low-confidence bounding boxes, and bounding boxes predicting the same object as another bounding box. As described in section 2.2.4, NMS takes care of both of these cases and is used during inference in this project. The confidence and IoU thresholds are both set to 0.5.

4.2 Importance Sampling

To make the evaluation of several IS methods convenient, a framework is developed which allows for customization of the strategy. This framework works by keeping a list of importance scores, one for each training example. Whenever the model is going to train, one or more examples are sampled using the importance distribution. The importance scores are all positive, and each example is sampled with a probability proportional to its importance score:

p_i = s_i / Σ_{j=1}^{N} s_j, (4.2)

where s_i is the importance score of sample i and N is the number of training samples. After the model has trained on these examples, the importance scores of the selected examples are updated using the outcome of training on them. The outcome of the training refers to any attribute that can be computed from a single forward and backward pass, e.g. loss, prediction error, gradient norm, etc. How the new importance score is computed from these attributes differs, and is here called a sampling heuristic. Various heuristics are explained in this section. There are also strategies that are invariant to which heuristic is used, but which influence the sampling in other ways.

4.2.1 Baseline

The baseline approach is to not use any importance sampling at all. In this approach, the order of the training examples is randomized before every epoch to remove any bias in how the data is ordered. The model then trains on every training sample exactly once per epoch. This approach serves as a baseline to compare all other approaches against and is referred to as either "baseline" or "sequential" in the rest of the report.
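The sampling step defined by equation 4.2 can be sketched as follows; the function name and NumPy-based interface are our own illustration of the framework's behaviour:

```python
import numpy as np

def sample_indices(scores: np.ndarray, batch_size: int,
                   rng: np.random.Generator) -> np.ndarray:
    """Draw training-example indices with probability proportional to the
    importance scores, i.e. p_i = s_i / sum_j s_j (equation 4.2)."""
    probs = scores / scores.sum()
    return rng.choice(len(scores), size=batch_size, p=probs)
```

After the forward and backward pass, the framework would overwrite the scores at the drawn indices with the values produced by the chosen heuristic, so subsequent draws reflect the updated importance distribution.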
4.2.2 Smoothing

Smoothing is a heuristic-invariant approach that can serve two purposes: 1) induce more exploration, and 2) reduce the effect of stale importance values. It is important to note that smoothing can either be accumulated after each iteration or simply used as a non-accumulating bias term on the importance score. When the probabilities are calculated as in equation 4.2, the resulting distribution is closer to uniform than it would have been otherwise, which induces more exploration, since all samples' importance scores are increased, not just the ones that were selected and trained on. In this thesis, smoothing acts as a non-accumulating bias term, and the value is dynamically set to the mean importance divided by the number of examples in the dataset:

smoothing = (1 / N²) Σ_{i=1}^{N} s_i. (4.3)

4.2.3 Loss-based Importance Sampling

The Loss Importance Sampler (LIS) is the first IS heuristic. The importance of a sample is simply equal to the loss of that sample: s_i = L_θ(x_i, y_i). As conveyed in chapter 3, the loss is a common IS heuristic and has been used by many practitioners. However, when dealing with the loss of an object detector, it is important to be mindful of the differences between the optimization problem and the indirect goal of the importance sampler. Recall that the loss in this project is produced pixel-wise in the dense prediction tensor. Since the aim is to accelerate the training of detecting objects, computing an importance score that is homogeneously reduced per pixel is counterproductive, because the importance scores can then be dominated by large objects (i.e. objects that occupy a great number of pixels) along with negative locations (i.e. background).

4.2.4 Object-wise Reduction

One way of addressing the above-mentioned problem is to reduce the scores per object. This is done by keeping track of which pixels belong to each object and averaging the values in an object-wise manner.
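The per-object averaging just described, followed by an average over the objects (the reduction mainly used in this thesis), can be sketched as below; the helper and its array-based interface are our own illustration, assuming a per-pixel score map and an integer object-id map where the background is treated as one additional object:

```python
import numpy as np

def object_average_reduction(pixel_scores: np.ndarray,
                             object_ids: np.ndarray) -> float:
    """Average per-pixel scores within each object (including the
    background as one extra object), then average the per-object values
    into a single sample-wise importance score."""
    object_means = [pixel_scores[object_ids == oid].mean()
                    for oid in np.unique(object_ids)]
    return float(np.mean(object_means))
```

With this reduction, a small object contributes as much to the sample score as a large one, regardless of its pixel count.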
This yields one importance value per object, which can then be reduced further into a sample-wise value. Some suggestions for the second reduction are as follows:
• Sum: the object-wise importances are summed.
• Average: the object-wise importances are averaged.
• Area-weighted mean: the object-wise importances are scaled by some value proportional to the area of their objects, and then averaged. They can for example be scaled by the square root of the areas.
• Max: only the highest of the object-wise importances is picked to represent the sample.
• Root mean square: the root of the mean of the squares of the object-wise importances is computed.
This thesis mainly considers the average as the object-wise reduction technique, and it is the one referred to unless otherwise stated. An illustration of how the average reduction works can be seen in figure 4.2.

Figure 4.2: Visual representation of the average object reduction. Note that the set of background pixels is also considered to be an additional object.

The average is prioritized in this project because it is the most stable of the reductions, and the other methods have effects that might be undesired. For example, using the sum would generally give high scores to samples with many objects, and using the area-weighted average would generally give high scores to samples with large objects. The average is, however, not perfect, as it risks drowning out important objects if there are many unimportant objects in the same sample. The max reduction would not do this, but it is not as stable and would be susceptible to noise.

4.2.5 Error-based Importance Sampling

The error-based sampler assigns a score that scales with how poorly the model performs in its prediction for each sample. This is similar to how the loss sampler works. One of the original methods is presented by Chang et al. [11]. They compute a score that depends on the history of an error-based heuristic over the epochs.
For classification, the error-based heuristic is (1/M) Σ_{t=1}^{M} (1 − yᵀ ŷ_t), where M is the current epoch. Paul et al. [4] adopt a similar heuristic but compute the score as the norm of the error vector and do not average the score over multiple epochs. The error-based sampler used for this thesis is similar to the approach from Chang et al. [11], but does not consider the history, and is adapted in several ways to include both the multi-modality of OD and the dense output feature map of the FCOS approach. For the classification part, the score for a sample is defined as

(1 / N_care) Σ_{x,y} −log((c*_{x,y})ᵀ c_{x,y}) · 1_{c*_{x,y} > 0}. (4.4)

The score for the bounding box regression task is similar, but uses the per-pixel IoU loss and the regression mask instead of the classification (don't-care) mask. These two components, classification and regression, each being a masked average, are then summed, yielding a single value representing the poorness of the prediction for the sample. In the case that object-wise reduction is desired, it can be used instead of a masked average. Note how the calculations, similarly to cross-entropy, use the negative logarithm to enable an importance score that can range from zero to infinity. In the case of an averaged object-wise reduction, this trick can alleviate the problem of hard, underrepresented classes being drowned out by other, easy classes within the same image.

4.2.5.1 Uncertainty-based Importance Sampler – Decision Boundary Closeness

It can be beneficial for the model to train not just on samples on which it makes big mistakes, but also on samples where the model is uncertain, i.e. where the confidence of a prediction is close to the decision boundary. In this case, the computations of the two components of the error can be modified to instead represent the model's uncertainty. Let a_{x,y} = (c*_{x,y})ᵀ c_{x,y}. Then, replace a_{x,y} with 4 · a_{x,y} · (1 − a_{x,y}).
The classification part of the score for a sample is now

(1 / N_care) Σ_{x,y} −log(4 · a_{x,y} · (1 − a_{x,y})) · 1_{c*_{x,y} > 0}. (4.5)

Since a_{x,y} ∈ [0, 1], a_{x,y} · (1 − a_{x,y}) reaches its highest point when a_{x,y} = 0.5, i.e. at the decision boundary. The expression is multiplied by 4 simply to normalize it back into the range [0, 1]. This technique is applied similarly for the regression task, but with the IoU loss instead, which is also in the range [0, 1]. We denote this sampler the Uncertainty Importance Sampler (UIS); it is analogous to the threshold-closeness-based sampler from Chang et al. [11], which was shown to outperform loss-based approaches. Similarly to the loss-based sampler, UIS uses the object-wise reduction instead of the masked average in equation 4.5.

4.2.6 Reducible Hold-out Loss

The loss value of a noisy or out-of-distribution sample can be very high, so using it directly would give high importance to such training examples, which is undesirable. For this reason, the RHO framework is considered. One discrepancy from the original paper by Mindermann et al. [9] is that they use online batch selection, which this thesis does not. Another difference is how the reducible losses are computed. The authors compute the reducible loss as the difference between the training loss and the irreducible loss, see equation 3.1. In this thesis, the reducible losses are produced in a similar way, but are also shifted and re-scaled: let s = L(x, y; Dt) − L(x, y; Dho); then

ŝ_i = (s_i − min s) / (max s − min s). (4.6)

This is done in order to keep the reducible scores non-negative and compatible with equation 4.2. Finally, the authors sample training examples by picking those with the highest reducible scores. This is problematic when not using online batch selection, as some of the importance scores can be stale. In this project, the training samples are instead sampled with the probabilities given in equation 4.2. Hyperparameters for the holdout training are given in table 4.1 below.
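The shift-and-rescale step in equation 4.6 can be sketched as follows; this is a minimal illustration of our own, assuming the per-sample losses over the dataset are available as arrays:

```python
import numpy as np

def rescaled_rho_scores(train_losses: np.ndarray,
                        irreducible_losses: np.ndarray) -> np.ndarray:
    """Reducible holdout scores: the training loss minus the irreducible
    holdout loss, min-max rescaled into [0, 1] (equation 4.6) so the
    scores are non-negative and usable with probability-proportional
    sampling (equation 4.2)."""
    s = train_losses - irreducible_losses
    return (s - s.min()) / (s.max() - s.min())
```

Note that the sample with the smallest reducible loss is mapped to exactly 0; in practice the smoothing bias from section 4.2.2 keeps such samples from being excluded entirely.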
Table 4.1: Hyperparameters for the holdout-training in our experiments.

Hyperparameter           Value
Backbone                 ResNet34
Feature multiplier       0.25
Num epochs               100
Batch size               128
Holdout training split   0.15

4.2.7 Rejection-based sampling

Estimating each sample's importance while the model's parameters evolve is a challenge that has been pointed out in numerous previous studies [7], [9], [43], [48]. Outdated importance scores may decrease the effectiveness of importance sampling, as the proposal distribution in such a case poorly reflects the target importance distribution. This problem is often addressed using online batch selection, which appears inapplicable for this project for reasons discussed in chapter 3. Apart from smoothing, to further alleviate this issue, we explore a method analogous to a random walk in the dataset. The idea is to accept or reject samples based on a comparison between one of the latest forward-propagated samples and a randomly drawn proposed sample. This is done instead of sampling naively based on the scheme highlighted in algorithm 4. Although the proposed sample may have an outdated importance score, at least the latest forward-propagated sample should be up-to-date. For this approach, there is a lack of theoretical guarantees. It is however an intriguing approach to investigate empirically as a means to explore alternative sampling strategies. The algorithm in question is outlined as follows:
Algorithm 5 Rejection-based sampling scheme as an alternative to the naive sampling scheme in algorithm 4
  Data: D, batch size n_b, model parameters θ
  q_0 ∝ U(1, |D|)
  D_0 ← n_b data points sampled with q from D
  for t = 1, ..., T do
      for i = 1, ..., n_b do
          Sample candidate sample x_c ∼ q
          Get active sample x_a ← D_{t−1,i}
          if q(x_a) > 2 · q(x_c) or q(x_c) < p · (q(x_c) + q(x_a)), where p ∼ U(0, 1), then
              Reject candidate sample: D_{t,i} ← x_a
          else
              Accept candidate sample: D_{t,i} ← x_c
          end if
      end for
      [s, ∇L] ← ForwardBackward(D_t, θ)
      θ ← θ − η_t · ∇L
      q ← UpdateImportanceScores(q, s)
  end for

4.3 Regression-agnostic sampling

As the results in chapter 5 will show, some sampling strategies intrinsically prioritize and improve performance for certain sizes of objects, even when using object-wise reduction. For this reason, it is interesting to completely ignore regression in the importance calculations. In the heuristics presented above, the importance scores are always computed as the sum of a classification part and a regression part. When ignoring regression, the regression part of the equation is simply set to 0.

4.4 Dataset

The ZOD [54] is a recently released open dataset from Zenseact [30]. ZOD includes many modalities useful for AD, e.g. road semantic segmentation, and 2D and 3D bounding box detection in both single frames and sequences. For this thesis, only 2D bounding boxes for dynamic objects in single frames are used. This includes approximately 90,000 training images and 10,000 validation images. As can be seen in figure 4.3, the license plates of cars in ZOD are anonymized, and this is also the case for faces. ZOD provides two types of anonymization: blurring and deepfakes. However, the anonymization is automatic and imperfect, and occasionally leaves images visually corrupted. In order to use as clean a dataset as possible, this thesis uses the original, non-anonymized version of the data, which is not available publicly.
Since ZOD consists of frames with a resolution of 3848 × 2168, the images are cropped in the data preprocessing pipeline to streamline the experiments and minimize data-loading bottlenecks. The Region of Interest (RoI) is approximately the middle third in the x-dimension and the middle sixth in the y-dimension. This crop covers objects that are far away in a typical scenario, so it is called the far RoI, and this version of the dataset is called ZODFAR. Figure 4.3 depicts how the crop is applied.

Figure 4.3: Visualization of a sample with its annotations. The orange box highlights a wide RoI while the pink box highlights a far RoI. Although both RoIs are important in a real scenario, only the far RoI is used in ZODFAR as a means to keep the system simple. The original image has dimensions 3848 × 2168, while the processed image has dimensions 1280 × 384.

There are several reasons why this is the region of interest. Firstly, it reduces the size of the input image to the network by almost a factor of 20, which is necessary to keep the model running in a streamlined setting. Secondly, cropping a small part of the image is preferred over cropping a large part of the image and scaling it down, which would leave many of the bounding boxes extremely small in comparison to the total image size. The number of objects in ZODFAR is stratified by class and size in table 4.2. Small objects are defined as objects with an area of less than 0.33% of the image size, medium objects have an area between 0.33% and 3%, and large objects have an area greater than 3% of the image size. These thresholds are mainly based on COCO's size stratifications [55].

Table 4.2: Number of objects in ZODFAR's dataset stratified by size and class. The number to the left of the parenthesis is the number of objects in the training set, while the number inside the parenthesis is the number of objects in the validation set.
        Pedestrian     Vehicle          VulnerableVehicle  Animal      DontCare
large   2342 (326)     93084 (11435)    2120 (353)         61 (5)      0 (0)
medium  32295 (3936)   217714 (21478)   11315 (1614)       578 (79)    0 (0)
small   66581 (7563)   193465 (24913)   12942 (1736)       1050 (135)  85999 (9168)

4.5 Experiment Details

Each sampling strategy was run several times (10 unless otherwise specified), and all runs of the same kind were averaged at each time step in order to give a robust estimate of the performance. Each run used a batch size of 128 and was trained for 350 epochs with 89,976 training samples on an Nvidia A100 40 GB GPU, taking roughly 40-50 hours per run to complete. Note that since our system runs compute-intensive operations to log and collect certain data for monitoring and summaries of each iteration and validation, 40-50 hours might not be representative of the time it would take in a streamlined setting. Also note that for the runs with importance samplers, the notion of an epoch disappears after the first epoch is complete, since they no longer iterate over the data sequentially. Lastly, all object detectors were trained using the Adam optimizer [18], with an initial learning rate of η = 10^−3, β₁ = 0.9 (the exponential decay rate for the first moment estimates), β₂ = 0.999 (the exponential decay rate for the second moment estimates), and a numerical stability constant ε = 10^−7.

5 Results

In this section, we present the results of our project. As stated, the purpose of this thesis is not to achieve state-of-the-art performance in OD, but to evaluate IS methods in terms of how much they speed up training in comparison to the baseline. As such, the results focus on the performance influenced by importance samplers and their various configurations. The results are presented with different levels of stratification, ranging from general mAP to AP stratified by both object size and class. The AP is computed over all recall values, and with interpolated precision values.
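As a reference, an all-point interpolated AP of the kind described above can be sketched as follows, assuming the (recall, precision) pairs are sorted by increasing recall; this is an illustration of the metric, not the exact evaluation code used in this work:

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP over all recall values with interpolated precision: precision at
    each recall level is replaced by the maximum precision at any recall
    greater than or equal to that level, then integrated over recall."""
    # append sentinels so the envelope and the integration are well defined
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # interpolation: running maximum taken from the right
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # integrate over the points where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

A detector that reaches recall 1.0 at precision 1.0 obtains an AP of 1.0, while lower precision at any recall level reduces the integral accordingly.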
mAP_small, mAP_medium, and mAP_large are also computed, i.e. the same computation as above but filtered on small, medium, and large objects respectively. The performance of an IS method is primarily measured by how much faster, in terms of the number of iterations, it reaches the maximum average mAP of the baseline, as in figure 1 in [9]. Although wall-clock time is what one ultimately wants to decrease, measuring speed-up based on wall-clock time is not meaningful in practice, as it can depend on miscellaneous implementation details. Instead, the speed-up is based on the rate of learning and is calculated as follows:

speed-up = num_iterations_baseline / num_iterations_other_method − 1, (5.1)

where
• num_iterations_baseline is the iteration at which the baseline reaches its maximum mAP,
• num_iterations_other_method is the iteration at which the mAP of the other method surpasses the maximum mAP of the baseline.

Furthermore, as mentioned, and as a means to delve deeper into the detection performance of certain subgroups in the data, we assess how the importance sampler affects performance across sizes and classes. It is especially interesting to observe how IS affects the performance on data points that lie in the low-density regions of the unknown data distribution. Since these types of samples are so scarce in the data, generalizing to them could require significantly more training iterations when sampling the batches uniformly without replacement. It is in our best interest to accelerate training on these types of samples, as they could bottleneck the finish point of training. As a proxy for these types of samples, we will evaluate the performance on the Animal class in greater detail. The Animal class is vastly underrepresented in the data, as shown in table 4.2.
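The speed-up metric from equation 5.1 amounts to a one-line computation (shown here purely for illustration):

```python
def speedup(num_iterations_baseline: int, num_iterations_other_method: int) -> float:
    """Speed-up from equation 5.1. A value of 0.95 (95%) means the other
    method needs roughly half as many iterations as the baseline to reach
    the baseline's maximum mAP."""
    return num_iterations_baseline / num_iterations_other_method - 1.0
```

For example, a baseline that peaks at 200,000 iterations and another method that matches that mAP at 100,000 iterations gives a speed-up of 1.0, i.e. 100%.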
5.1 Loss and Uncertainty Samplers

In figure 5.1 we can see that, averaged over 10 runs, sampling training examples using the mean-object-reduced UIS is beneficial and speeds up training by 95%, meaning that the training time could be almost halved if no further performance is required. On the other hand, sampling based on the mean-object-reduced LIS does not appear to accelerate training. It is worth noting that measuring the speed-up this way may be subject to volatility and is best assessed with a significant number of runs averaged together. In this case, however, the plot in figure 5.1 indicates a consistent and clear superiority of the UIS when compared to the baseline.

Figure 5.1: Comparison of overall mAP on ZODFAR for models trained with Sequential (baseline), LIS, and UIS. The closed black line marks the speed-up for achieving the baseline's highest averaged mAP. The filled-in area marks ±1 standard deviation.

Looking closer at the results, specifically comparing performance on the different sizes of objects in figure 5.2, it is clear that both sampling approaches improve the learning of small objects. This is also where the model performs worst, as indicated by the mAP for small objects being lower than for medium or large ones. For medium objects, the UIS appears to outperform the baseline, which the LIS does not. For large objects, neither sampling strategy reaches the highest mAP of the baseline, so no speed-up can be measured.
Figure 5.2: Comparison of mAP, mAP_small, mAP_medium, and mAP_large on ZODFAR for models trained with Sequential (baseline), LIS, and UIS.

Figure 5.3 reveals the performance of the different samplers stratified by both class and size. The UIS outperforms both the baseline and the LIS in the majority of categories and only degrades the model noticeably in AP_large,VulnerableVehicle. It is outperformed by the LIS in AP_large,Animal. The LIS does, however, degrade the AP in most categories. Specifically, the performance in all large and all medium categories, except for AP_large,Animal, seems to be degraded with the use of the LIS.
Figure 5.3: Comparison of AP on ZODFAR stratified by bounding box size and class for runs with three different sampling heuristics: Sequential (baseline), LIS, and UIS. Each curve is plotted as an exponential moving average with a smoothing factor of 0.8 for visualization purposes. The corresponding runs can be found unsmoothed in figure A.3 in the appendix.

Through closer examination of AP_Animal and the stratified categories AP_small,Animal, AP_medium,Animal, and AP_large,Animal (see figure 5.4), it is apparent t