Importance Sampling in Deep Learning Object Detection
An empirical investigation into accelerating deep learning object detection training by exploiting informative data points
Master's thesis in Computer Science and Engineering
Alexander Huang, Johannes Kvernes
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2023

© Alexander Huang, Johannes Kvernes, 2023.
Academic supervisor: Yinan Yu, Department of Computer Science and Engineering
Industrial supervisors: Olle Månsson and Willem Verbeke, Zenseact
Examiner: Krasimir Angelov, Department of Computer Science and Engineering

Master's Thesis 2023
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg, Telephone +46 31 772 1000
Typeset in LaTeX

Abstract

In recent years, the field of deep learning object detection has witnessed a notable surge in progress, largely fueled by the growth of available data.
While the ever-growing amount of data has enabled complex object detection models to improve generalization performance and to facilitate training phases that are less prone to over-fitting, the time to reach desired performance with this remarkable amount of data has been reported to be an emerging problem in practice. This is certainly the case when each sample consists of high-resolution image data that can burden the capacity of loading data. This thesis explores the possibilities of leveraging importance sampling as a means to accelerate gradient-based training of an object detection model. This avenue of research aims at biasing the sampling of training examples during training, in the hope of exploiting informative samples and reducing the amount of computation spent on uninformative, noisy and out-of-distribution samples. While previous art shows compelling evidence for importance sampling in deep learning, it is consistently reported in the context of less complex tasks. In this work, we propose techniques that can be applied to single-stage anchor-free object detectors in order to adopt importance sampling to accelerate training. Our methods do not require modifications to the objective function, and allow for a biased sampling procedure that remains consistent across runs and samples.

Our results suggest that an uncertainty-based heuristic outperforms loss-based heuristics, and that object detection training can be subject to a remarkable speed-up in terms of reaching the baseline's performance in fewer iterations, where the baseline samples the training examples uniformly without replacement. Furthermore, the empirical observations reported in this work also indicate that an increased final generalization performance can be achieved given an equal amount of training time when compared to the baseline.
Keywords: Importance sampling, deep learning, object detection, accelerate training, sampling heuristic, gradient descent, loss, uncertainty, threshold-closeness.

Acknowledgements

We would like to show our greatest appreciation to Zenseact for the opportunity to work on an interesting topic with a fantastic team, and for lending us the equipment and computational resources that have made this project possible. Additionally, we would like to thank our industrial supervisors, Willem Verbeke and Olle Månsson, whose guidance and mentorship have been instrumental in shaping this project. With great expertise and sheer commitment to this thesis, they have played a crucial role in resolving complex problems and have equipped us with knowledge that has enriched not only the project, but also ourselves as future practitioners. Lastly, we would like to extend our gratitude to our academic supervisor, Yinan Yu, who has been active throughout the project, making the academic formalities a smooth process as well as providing us with valuable insights and pointers into cutting-edge research through relevant literature.

Alexander Huang, Gothenburg, 2023-06-13
Johannes Kvernes, Gothenburg, 2023-06-13

Contents

List of Acronyms
1 Introduction
  1.1 Purpose
  1.2 Collaboration
2 Theory
  2.1 Multilayer Perceptron
    2.1.1 Training a Neural Network
    2.1.2 Convolutional Neural Networks
  2.2 Object Detection
    2.2.1 Evaluation Metrics
      2.2.1.1 Intersection over Union
      2.2.1.2 Confusion Matrix
      2.2.1.3 Average Precision
      2.2.1.4 Mean Average Precision
    2.2.2 Single-stage Object Detectors
    2.2.3 Anchor-free Detectors
    2.2.4 Non-Maximum Suppression
    2.2.5 Fully Convolutional One-Stage Object Detection
      2.2.5.1 Architecture
      2.2.5.2 Loss Function
  2.3 Importance Sampling
    2.3.1 Importance Sampling in Deep Learning
3 Related Work
4 Methods
  4.1 Object Detection
    4.1.1 Backbone
    4.1.2 Architecture
    4.1.3 Don't-Care Tokens
    4.1.4 Loss Function
    4.1.5 Non-Maximum Suppression
  4.2 Importance Sampling
    4.2.1 Baseline
    4.2.2 Smoothing
    4.2.3 Loss-based Importance Sampling
    4.2.4 Object-wise Reduction
    4.2.5 Error-based Importance Sampling
      4.2.5.1 Uncertainty-based Importance Sampler - Decision Boundary Closeness
    4.2.6 Reducible Hold-out Loss
    4.2.7 Rejection-based Sampling
  4.3 Regression-agnostic Sampling
  4.4 Dataset
  4.5 Experiment Details
5 Results
  5.1 Loss and Uncertainty Samplers
  5.2 Rejection-based Sampling Strategy
  5.3 Reducible Hold-Out Loss
  5.4 Regression-agnostic Importance Sampler
6 Result Analysis
  6.1 Interpretation of the Results
    6.1.1 A Closer Look at Loss and Uncertainty Samplers
    6.1.2 RHO Sampling
    6.1.3 Rejection-based Sampling
    6.1.4 Ignoring Regression
  6.2 Consistency with Importance Sampling
  6.3 Limitations
    6.3.1 Compatibility with Bells and Whistles
    6.3.2 The Background Class and False Positives
    6.3.3 Sacrificing Performance
  6.4 Conclusion
7 Future Work
Bibliography
A Appendix 1
B Appendix 2
C Appendix 3

List of Acronyms

AD: Autonomous Driving
ADAS: Advanced Driver Assistance Systems
AP: Average Precision
CNN: Convolutional Neural Network
CPU: Central Processing Unit
DL: Deep Learning
DLA: Deep Layer Aggregation
FCN: Fully Convolutional Network
FCOS: Fully Convolutional One-Stage Object Detector
FN: False Negative
FP: False Positive
FPN: Feature Pyramid Network
IoU: Intersection over Union
IS: Importance Sampling
LIS: Loss Importance Sampler
LIS-IR: Loss Importance Sampler with Ignored Regression
LIS-RB: Loss Importance Sampler with Rejection-based Sampling
LIS-RHO: Loss Importance Sampler with Reducible Hold-Out framework
mAP: Mean Average Precision
ML: Machine Learning
MLP: Multilayer Perceptron
NMS: Non-Maximum Suppression
OD: Object Detection
RHO: Reducible Hold-out
RoI: Region of Interest
SSD: Single-Stage Detector
TN: True Negative
TP: True Positive
UIS: Uncertainty Importance Sampler
UIS-IR: Uncertainty Importance Sampler with Ignored Regression
UIS-RB: Uncertainty Importance Sampler with Rejection-based Sampling
ZOD: Zenseact Open Dataset

1 Introduction

The remarkable increase in available data has enabled rapid development of deep neural networks. Various highly complex deep neural network models have been shown to outperform other methods in a plethora of tasks in fields such as computer vision and natural language processing. However, big datasets and complex models are generally mirrored by a substantial computational cost, which entails a central challenge when training these types of machine learning models and highlights the importance of improving the efficiency of gradient-based training algorithms. One aspect of streamlining training is to supervise the model with samples from the training data at a frequency proportional to their respective importance scores. For context, it has become obvious to many practitioners that not all samples are of equal difficulty for a Machine Learning (ML) model to solve. During the training phase, deep learning models tend to converge rapidly in the early iterations, whereas in the intermediate and later stages, a vast majority of samples become significantly easier for the model to predict correctly, resulting in minuscule gradient updates. Compellingly, a portion of samples could be sampled less frequently, while other, more informative samples could be exploited without compromising the performance of the final model.
This suggests a potential to accelerate training. This phenomenon has been showcased in numerous works related to data pruning and core-set selection [1], [2], example difficulty [3]–[5], curriculum learning [6], and various other paradigms including Importance Sampling (IS) [7]–[12]. This thesis explores the application of various IS techniques as a means to accelerate training of deep learning Object Detection (OD) models. By leveraging IS, training can be concentrated on samples that tend to contribute more to the process of finding a local minimum of the optimization problem at hand, supposedly leading to fewer training iterations and/or better generalization performance in comparison to the standard procedure of uniform sampling without replacement. As such, training times would be less susceptible to an ever-growing amount of data and could, in practice, free up computational resources for other problems or accelerate research and development, since the time saved by shorter training runs could be allocated to conducting more experiments.

1.1 Purpose

The purpose of this thesis is to empirically investigate, develop and propose a representative approach for applying importance sampling to the task of OD, an application of importance sampling that has, as opposed to image classification and to the best of our knowledge at the time of writing, yet to be studied thoroughly in the current art. In our work, we aim to showcase examples of how importance sampling can be leveraged to accelerate the training of an OD model in a computationally constrained environment for a real-time application such as Autonomous Driving (AD). The main research questions for this thesis are the following:

1. How do different types of importance sampling heuristics perform in the task of deep learning object detection?
2.
What are the effects of different sampling strategies when training a deep learning object detector, and how do they perform?
3. Can importance sampling be used to accelerate learning in low-density regions of a data distribution?

1.2 Collaboration

This project was conducted at Zenseact, a technology company centered around developing AD and Advanced Driver Assistance Systems (ADAS) [13]. Zenseact thrives on recent advances in the field of machine learning and leverages its R&D operations to provide state-of-the-art solutions for renowned vehicle manufacturers. Under those circumstances, practitioners deemed it necessary from the outset to accelerate the training process of the in-house OD model, and as such, this project was conducted in close collaboration with our supervisors at the company. Before closing this section, we want to highlight the rigorous work from which this thesis was extended [14]. Similarly to us, Johansson and Lindberg [14] completed their project at Zenseact and investigated IS in the context of deep learning. The authors even touched on its application to training an object detector. Their study provided a strong base of knowledge to proceed from.

2 Theory

ML is a subfield of artificial intelligence that has gained significant traction in recent years, as it enables models to learn from data and make predictions without being explicitly programmed in a rule-based manner. The field has evolved rapidly, and it encompasses a wide range of techniques, including supervised learning, unsupervised learning, and reinforcement learning. The process typically involves training a model on a dataset to identify patterns and relationships between data points. Once a model is trained, it can run inference on new data.
Advancements in ML, particularly in Deep Learning (DL), have revolutionized the field, enabling machines to achieve state-of-the-art performance on a wide range of complex tasks such as semantic segmentation [15], image classification [16], and OD [17]. All of these are fundamental tasks in computer vision and enable a vast space of applications in various domains. Semantic segmentation pertains to outputting, given for instance an input image, a class prediction from a set of predefined classes for each pixel. Similarly, image classification predicts one or multiple classes, but for an image as a whole. While this is far from trivial, OD takes it further and resolves spatial predictions for each object of interest on the image plane. As such, DL OD generally relies on more complex architectures and techniques to reach sufficient performance levels. DL OD models have shown great promise in a wide range of applications [17]. For instance, autonomous cars use object detection to identify pedestrians, vehicles, and other objects in their environment and make decisions accordingly.

Despite rapid advancements in the field, OD and DL are still active areas of research, and many challenges remain. One of them is computationally expensive and time-consuming training phases. To address this, researchers have proposed various ways to accelerate training of DL models. Examples include adaptive learning rates [18], momentum [19], normalization [20], weight decay [21], and numerous other variance reduction techniques. While these are all fundamental aspects of accelerating model training, prioritizing informative samples has been shown to induce a similar effect [6], [7], [10], [22]. The common trajectory of the performance curve when training deep learning models generally starts with a high rate of convergence, and any clean sample from the dataset may sufficiently contribute when updating the parameters at this stage.
As the model continues to learn, the rate of convergence slows down, implying that a good portion of the samples has been learned and has become too easy for the model to predict correctly. In these instances, the model is subject to negligible contributions. Intuitively, treating each entry in the dataset equally could be sub-optimal, and computing resources could instead be concentrated on informative samples in an effort to accelerate training. In fact, this is a problem well suited for a method titled Importance Sampling (IS) [23], [24].

In this section, we tread through some of the fundamental concepts in deep neural networks in order to later motivate why IS can accelerate training of deep learning models. We will also examine how DL OD works and visit some ideas that have enabled high-performing models in a real-time setting, such as AD. Lastly, we will touch on what importance sampling is, how it works, and its application in deep learning training.

2.1 Multilayer Perceptron

Multilayer Perceptrons (MLPs) are a class of models within ML that are inspired by the structure and function of biological neurons in the brain [25]. These types of models consist of layers of interconnected processing units, also called neurons, that transform an input and propagate the information through the layers until a final output is obtained. This structure has enabled remarkable success in solving highly complex non-linear problems. Before proceeding with the architecture, it is important to understand the computations around a neuron, which is visualized as a graph in figure 2.1. The basic computation for a single neuron is expressed as a = σ(Σ_i w_i x_i), where w_i is a weight scalar, x_i is an element of an input vector x, and σ(x) is an activation function, which is a crucial part for inducing non-linear capabilities.
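As an illustration, the neuron computation above can be sketched in a few lines of NumPy. The input values, weights, and layer sizes below are arbitrary choices for demonstration, not taken from the thesis:

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation: squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# A single neuron: a = sigma(sum_i w_i * x_i)
x = np.array([0.5, -1.0, 2.0])   # input vector (arbitrary values)
w = np.array([0.1, 0.4, -0.2])   # weights (arbitrary values)
a = sigmoid(np.dot(w, x))

# Vectorized to a layer of neurons: one row of weights per neuron
W = np.array([[0.1, 0.4, -0.2],
              [0.3, -0.1, 0.5]])
b = np.zeros(2)                  # optional bias term
layer_out = sigmoid(W @ x + b)   # one activation per neuron
```

Stacking such layers, with each layer consuming the previous layer's activations, yields the forward pass described below.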
Without non-linearity, the output would just be a linear transformation of the input, which limits the network to modeling merely simple relationships between them.

Figure 2.1: A simple computational graph of a neuron.

The structure of an MLP is visualized in figure 2.2. It consists of an input layer (the first aggregate of neurons), the output layer (the last aggregate of neurons, one in this case), hidden layers which correspond to all the layers between the input and the output layer, and weights visualized as connections between neurons. If we further vectorize the neuron computation from figure 2.1 to handle an aggregate of neurons, we get a = σ(W^(l) x^(l)), which corresponds to the vectorized activation from layer l. It is also common, but not mandatory, to add a bias term b which is simply added to the preactivation output, i.e. a = σ(W^(l) x^(l) + b^(l)). From this, we can express the nested computations which capture the whole forward pass from input to output: a^(l) = σ(W^(l) a^(l−1) + b^(l)). Fundamentally, the current layer's input is the previous layer's activation output.

Figure 2.2: The structure of an MLP.

2.1.1 Training a Neural Network

As with many other ML models, when training a neural network, we are trying to optimize an objective function. Informally, we want to tweak the parameters θ, i.e. the weights and biases, to accurately map the input to the desired output. The underlying algorithm for achieving this is called Gradient Descent [26]. It works by iteratively stepping in the negative gradient's direction and updating the parameters of the function of interest. This process is visualized in figure 2.3.

Figure 2.3: A simple case of gradient descent where p_i ∈ R².

The gradient ∇f is a vector-valued function whose components are the partial derivatives of f, i.e.

∇f(p) = (∂f(p)/∂p_1, ..., ∂f(p)/∂p_n)^T

A fundamental property of the gradient is that, given a point p, it points in the direction in which the function locally increases the most, and so if we are minimizing, it is in our interest to step in the negative direction, where the function decreases the most. The algorithm is outlined below.

Algorithm 1: General gradient descent
  Initialize starting point p, function f, convergence criterion ε
  while ||∇f(p)|| > ε do
    Find the search direction (negative gradient): s = −∇f(p)
    Obtain the step size with a line search: α = argmin_α f(p + αs)
    Update the state: p ← p + αs
  end while

In the context of neural networks, p is a point in the space of possible parameters θ. The objective function we want to minimize is often referred to as a loss function (also known as a cost function). Given an MLP parameterized by θ, the loss function L_θ(x, y) quantifies how poor the prediction was according to a ground truth y. Let

J(θ) = E_π[L_θ(x, y)],  (x, y) ∼ π.

Essentially, we want to obtain the optimal parameters that yield the lowest expected loss for the problem at hand: θ* = argmin_θ J(θ).

Before we can apply the concept of Gradient Descent, we must be able to compute the gradient of the loss function. The gradient will essentially inform the model in what direction the parameters W^(l) and b^(l) should be updated in order to minimize the loss, which consequently improves the performance of the model. An algorithm for computing the gradients is referred to as Backpropagation [27]. The idea of Backpropagation is to propagate the loss back through the network, from the output layer to the input layer, using the chain rule of calculus. Let us assume that we run a forward pass and have computed the loss L. Backpropagation computes all the partial derivatives of the loss with respect to each weight, ∂L/∂W^(l)_ij, and with respect to each bias, ∂L/∂b^(l)_j, in order to obtain the components of the gradient.
These partial derivatives hold information about the change in loss w.r.t. the change in parameters. Now consider an example where we have one output layer and two hidden layers, i.e. ŷ = σ(W^(3) σ(W^(2) σ(W^(1) x))) and a^(l) = σ(W^(l) a^(l−1)). Note that we have omitted the bias term to simplify the example. Let all the activation functions denote the sigmoid function σ(x) = 1 / (1 + e^(−x)), and let the non-parametrized loss function denote a simple mean squared error function L(ŷ, y) = ½(y − ŷ)², which can also be expressed explicitly as a function of all the weights of each layer:

L(ŷ, y) = L(σ(W^(3) a^(2)), y)
        = L(σ(W^(3) σ(W^(2) a^(1))), y)
        = L(σ(W^(3) σ(W^(2) σ(W^(1) x))), y)

We have the following derivatives for the non-parameterized loss function and the sigmoid activation function:

∂L/∂ŷ = ŷ − y
∂σ/∂x = σ(x)(1 − σ(x))

Now, to obtain the partial derivatives, we work backward through the network to calculate ∂L/∂W^(3), then ∂L/∂W^(2), and then ∂L/∂W^(1). The calculation for the gradient in the third layer gives

∂L/∂W^(3) = (∂L/∂ŷ)(∂ŷ/∂W^(3)) = (ŷ − y) σ(W^(3) a^(2))(1 − σ(W^(3) a^(2))) a^(2)

For calculating the gradient in the second layer, we have to use the chain rule twice:

∂L/∂W^(2) = (∂L/∂ŷ)(∂ŷ/∂a^(2))(∂a^(2)/∂W^(2))
          = (ŷ − y) · σ(W^(3) a^(2))(1 − σ(W^(3) a^(2))) W^(3) · σ(W^(2) a^(1))(1 − σ(W^(2) a^(1))) a^(1)

Similarly, the gradient for the first layer can be calculated from the expression

∂L/∂W^(1) = (∂L/∂ŷ)(∂ŷ/∂a^(2))(∂a^(2)/∂a^(1))(∂a^(1)/∂W^(1))

Note that some factors appear in multiple calculations, so computationally it is not necessary to re-calculate these expressions. Only the factors that are particular to the layer must be evaluated, while the factors that are common to previous layers can easily be reused.
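The worked example above can be checked numerically. The sketch below implements the forward and backward passes for the bias-free three-layer sigmoid network with the ½(y − ŷ)² loss; the layer sizes and random weights are arbitrary choices made for this demonstration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# y_hat = sigma(W3 sigma(W2 sigma(W1 x))), biases omitted as in the text
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(1, 4))
x = rng.normal(size=3)
y = np.array([0.7])

# Forward pass, caching the activations needed by the backward pass
a1 = sigmoid(W1 @ x)
a2 = sigmoid(W2 @ a1)
y_hat = sigmoid(W3 @ a2)
loss = 0.5 * np.sum((y - y_hat) ** 2)

# Backward pass: each delta is dL w.r.t. a layer's preactivation
delta3 = (y_hat - y) * y_hat * (1.0 - y_hat)
dW3 = np.outer(delta3, a2)                  # dL/dW3
delta2 = (W3.T @ delta3) * a2 * (1.0 - a2)
dW2 = np.outer(delta2, a1)                  # dL/dW2
delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)
dW1 = np.outer(delta1, x)                   # dL/dW1
```

Note how `W3.T @ delta3` reuses the factor already computed for the third layer, mirroring the reuse of common factors pointed out above.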
Since we now have a way of deriving the gradient, we can, similarly to algorithm 1, outline gradient descent in the context of training a neural network:

Algorithm 2: Gradient descent for training a neural network
  Input: initial parameters θ, learning rate η, data D = {(x_i, y_i)}_{i=1}^N
  for t in 1, 2, 3, ..., T do
    Randomly sample D_t ⊆ D, where |D_t| = n_b is the batch size
    x, y ← D_t
    Compute the output from the forward pass: ŷ ← Forward(x)
    Compute the loss: L ← L(ŷ, y)
    Compute the average gradient across the batch: ĝ ← (1/n_b) Σ_{i=1}^{n_b} ∇L_θ^(i), where ∇L_θ^(i) ← Backward(L)
    Update the parameters: θ ← θ − ηĝ
  end for

The following clarifications should be noted:
• The above process is usually repeated for a number of so-called epochs.
• The learning rate η is a hyperparameter influencing the magnitude of the update step.
• D_t usually selects a subset of samples from D by sampling uniformly without replacement.

As implied by the last point, since gradient descent is an iterative algorithm, when applying it to train a neural network, the process must define the quantity of training data to propagate at each iteration. In theory, it is possible to forward- and backpropagate the whole dataset at each step, i.e. n_b = |D|. In fact, assuming that the training data is a perfect representation of the real-world problem, this would yield the optimal direction to step in. However, the computation and memory consumption required to perform this operation make it infeasible in practice for large-scale datasets. This approach is known as batch gradient descent. On the other extreme, we have stochastic gradient descent. Instead of propagating the whole dataset at each iteration, this approach selects only one training example per iteration, i.e. n_b = 1. The number of iterations per time unit would be remarkably better, and the memory consumption would be kept at a minimum.
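Algorithm 2 can be sketched end-to-end on a toy problem. The snippet below trains a plain linear model with a squared-error loss, since the sampling and update logic is the same regardless of the model; the synthetic data, learning rate, and batch size are arbitrary choices for demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data: y = X @ w_true + small noise
N, d = 256, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=N)

theta = np.zeros(d)        # initial parameters
eta = 0.1                  # learning rate
n_b = 32                   # batch size
T = 500                    # number of iterations

for t in range(T):
    # Randomly sample D_t: a batch of n_b examples, uniformly without replacement
    idx = rng.choice(N, size=n_b, replace=False)
    xb, yb = X[idx], y[idx]
    # Average gradient of the squared-error loss across the batch
    g_hat = xb.T @ (xb @ theta - yb) / n_b
    theta -= eta * g_hat   # parameter update
```

With n_b = N this reduces to batch gradient descent, and with n_b = 1 to stochastic gradient descent.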
However, since the whole parameter space is influenced by solely one training example per iteration, this approach suffers from highly fluctuating gradients and is very prone to stepping in sub-optimal directions. As a compromise, mini-batch gradient descent propagates multiple examples per step. The number of samples per batch is typically determined by a hyperparameter which is set to accommodate hardware capabilities. This approach is less prone to high variance than stochastic gradient descent, and is computationally feasible, unlike batch gradient descent in most cases.

2.1.2 Convolutional Neural Networks

MLPs are generally not the preferred way to solve computer vision tasks. If the input/output has to deal with large images, the model will generally suffer from heavy parameterization, as each layer is fully connected. Furthermore, since the input and the weights have a static layout where each pixel will be mapped to one input node, the model will not be spatially invariant, which can be a desirable property of models in computer vision. Spatial invariance allows the model to capture features regardless of their position in the input image [28].

A Convolutional Neural Network (CNN) is a type of neural network that is commonly applied to computer vision tasks. At the core of a CNN is the convolution, an operation that allows the network to extract useful features from images while being spatially invariant [28]. It works by applying a set of learnable filters, one small region at a time, to produce a new feature map, as illustrated in figure 2.4. Specifically, one convolution takes the dot product between the current region of the image across channels and the filter to obtain a preactivation scalar. The value of the preactivation denotes how well the convolved area matches the values/pattern in the filter.
The filters slide over the image, taking the elementwise (Hadamard) product with each position's region and summing the result, which in the end forms a complete feature map. By nesting this operation over a stack of layers with multiple filters per layer, CNNs are able to learn increasingly complex and hierarchical features from the input. Much like in MLPs, the feature maps from each layer are passed through an activation function to induce non-linearity. The hyperparameters for a convolutional layer are primarily:
• kernel size - the size of the weight matrix that corresponds to a filter
• stride - defines the step size of the filter as it slides over the input
• padding - an offset that defines how far out from the input's border the filter can go
• number of channels - the number of filters in the layer; this number also denotes the number of feature maps the output will have

Figure 2.4: A convolution operation¹. I is the input, K represents the filter and I ∗ K is the resulting feature map.

2.2 Object Detection

OD is a complex task in computer vision that pertains to identifying and localizing objects of interest in an image. Not only does the algorithm have to be able to classify which types of objects are depicted, but it also needs the ability to propose assigned regions for each identified object on the image plane. These region proposals are oftentimes represented by either 2- or 3-dimensional bounding boxes and, when visualized, overlaid on the image, as illustrated in figure 2.5. OD has seen numerous applications in various fields such as robotics, healthcare, surveillance, and AD. It is typically used as a critical subsystem to solve many different problems [17], [29]. An example is AD, where one or more cameras are mounted on the car, and the car has to be able to decipher its surroundings in order to act in a highly dynamic environment.
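Returning to the convolution of section 2.1.2, the sliding-window computation can be sketched for a single-channel input. This is a minimal version without padding or multiple filters, and the example image and kernel are arbitrary:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    # "Valid" convolution as used in CNNs (technically cross-correlation):
    # slide the kernel over the image and, at each position, sum the
    # elementwise (Hadamard) product with the covered region.
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = image[i * stride:i * stride + kh,
                           j * stride:j * stride + kw]
            out[i, j] = np.sum(region * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, -1.0]])        # a tiny horizontal difference filter
feature_map = conv2d(image, kernel)     # shape (4, 3)
```

A real convolutional layer additionally sums over input channels and keeps one such kernel per output channel, but the inner loop is exactly this region-wise product and sum.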
¹Detection and Tracking of Pallets using a Laser Rangefinder and Machine Learning Techniques - scientific figure on ResearchGate. Available from: https://www.researchgate.net/figure/An-example-of-convolution-operation-in-2D-2_fig3_324165524 [accessed 20 Mar, 2023]

Figure 2.5: A training example from Zenseact Open Dataset (ZOD)² [30] where the 2D bounding box annotation has been overlaid onto the image.

Due to the task's complex intrinsic nature, the state of the art has mainly been dominated by DL approaches in recent years [17]. These methods have rapidly revolutionized the field using CNNs as a core component. In this section, we will present some project-relevant concepts of DL OD such as Single-Stage Detectors (SSDs), anchor-free detectors, and Non-Maximum Suppression (NMS), along with a thorough walk-through of a specific well-established object detector which applies all the aforementioned concepts. However, before we jump into any details, we will describe the theory of how to evaluate object detectors, to get a sense of what the end goal is regardless of how the optimization problem is formulated.

2.2.1 Evaluation Metrics

Evaluation is a crucial part of ML. Rigorously defined and tested evaluation metrics are the standard way of measuring the performance of ML models. There are several evaluation metrics that are commonly used in OD, including Intersection over Union (IoU), Precision and Recall, F1 score, Average Precision (AP), Mean Average Precision (mAP), etc. This section intends to describe the metrics used in the thesis in more detail.

2.2.1.1 Intersection over Union

The Intersection over Union (IoU) of two geometric shapes S1 and S2 measures to what extent the two shapes overlap. This is used extensively in OD to determine the overlap of bounding boxes, as a high overlap between predicted and ground truth boxes indicates good performance for the model.
The IoU ranges from 0 to 1, where a value of 0 indicates no overlap between the bounding boxes, and a value of 1 indicates perfect overlap. For 2D OD, let $S_1 = b_1$ be a bounding box represented by the coordinate tuple $b_1 = (x_1^{(1)}, y_1^{(1)}, x_2^{(1)}, y_2^{(1)})$, where $(x_1^{(1)}, y_1^{(1)})$ is the top-left coordinate and $(x_2^{(1)}, y_2^{(1)})$ the bottom-right coordinate, and let $S_2 = b_2$ follow a similar definition. A visual representation of this can be seen in figure 2.6. The area of the intersection between the bounding boxes can be calculated as follows:

$$|b_1 \cap b_2| = \max(0, \min(x_2^{(1)}, x_2^{(2)}) - \max(x_1^{(1)}, x_1^{(2)})) \cdot \max(0, \min(y_2^{(1)}, y_2^{(2)}) - \max(y_1^{(1)}, y_1^{(2)})) \quad (2.1)$$

The area of the union is simply the summed area of the bounding boxes minus the area of the intersection, i.e.

$$|b_1 \cup b_2| = (x_2^{(1)} - x_1^{(1)}) \cdot (y_2^{(1)} - y_1^{(1)}) + (x_2^{(2)} - x_1^{(2)}) \cdot (y_2^{(2)} - y_1^{(2)}) - |b_1 \cap b_2| \quad (2.2)$$

The IoU of the two bounding boxes is then

$$\mathrm{IoU}(b_1, b_2) = \frac{|b_1 \cap b_2|}{|b_1 \cup b_2|} \quad (2.3)$$

² For this dataset, Zenseact AB has taken all reasonable measures to remove all personally identifiable information, including faces and license plates. To the extent that you would like to request removal of specific images from the dataset, please contact privacy@zenseact.com.
Figure 2.6: A visual representation of the components of the IoU between two 2D bounding boxes.
2.2.1.2 Confusion Matrix
The confusion matrix is a set of valuable metrics that measure how many classifications were correct and how many were wrong. In the case of binary classification, it is a 2-by-2 matrix where the rows and columns represent the ground truth and predictions respectively, see figure 2.7. One row and one column represent positive locations, and one of each represents negative locations. The result is a matrix from which we can easily read off how many correct and incorrect predictions the model made of each type, i.e. the true and false positives and negatives.
They work as follows:
• True Positives (TPs): The number of samples that were supposed to be true, and that the model predicted to be true.
• True Negatives (TNs): The number of samples that were supposed to be false, and that the model predicted to be false.
• False Positives (FPs): The number of samples that were supposed to be false, but that the model predicted to be true.
• False Negatives (FNs): The number of samples that were supposed to be true, but that the model predicted to be false.
In the case of multi-class classification with C classes, the matrix is C-by-C. In this case, each row and each column represent a class. The matrix value at row i and column j then represents the number of times the model predicted class j when the ground truth class was i. In OD, TPs are counted every time there is a ground truth object and a predicted object with the same class and an IoU above some threshold. FPs are counted when there is a predicted object which does not overlap sufficiently with any ground truth object of the same class. FNs are counted when there is a ground truth object which does not overlap sufficiently with any predicted object of the same class.
Figure 2.7: A binary confusion matrix
2.2.1.3 Average Precision
AP is the average precision over the recall values [31]. Precision is the fraction of correct positive classifications out of all the positive classifications:

$$P = \frac{TP}{TP + FP}$$

Recall is the fraction of correct positive classifications out of all the actual positive cases:

$$R = \frac{TP}{TP + FN}$$

Preferably, we want both precision and recall to be high (close to 1), but there is typically a trade-off between them for a given model; as the recall increases, the precision tends to decrease. To compute AP, we first rank all the object predictions by their confidence score in descending order.
Then, going from top to bottom, computing and storing the current precision and recall values each time we visit a new prediction yields a sequence of precision-recall pairs which we can use to compute the final metric. If we plot this, we get the precision-recall curve, as in figure 2.8. As can be seen in the figure, the precision wiggles up and down in a zigzag pattern as the recall increases, which is due to the variations of correct and incorrect predictions when iterating through the ranked detections [32].
Figure 2.8: An example precision-recall curve.
In order to account for these variations, and as a means to increase the robustness of the metric, it is standard to, for each point $i$, define the interpolated precision value $P^{\mathrm{interp}}_i$ as the maximum precision value over all points at or above recall level $i$:

$$P^{\mathrm{interp}}_i = \max(P_i, P_{i+1}, \ldots, P_n)$$

The AP is then the average interpolated precision per recall value, which is equivalent to the area under the interpolated precision-recall curve, see figure 2.9.
Figure 2.9: An example precision-recall curve with and without interpolated precision values.
Note that, in the literature, several versions of AP are used, based mainly on the dataset that the model is being evaluated on. For example, the PASCAL-VOC Challenge [32] used to compute the AP of each class using 11 equally spaced recall values, but changed in 2010 to using all values.
2.2.1.4 Mean Average Precision
mAP is the standard way of measuring the performance of an OD model. The AP is computed separately for each class, and the mAP is the mean of the class-wise APs. Since the expected value of AP is invariant to the number of objects, the mAP computed on a finite dataset is an unbiased estimate of the true mAP that would be obtained given an infinite dataset. Note that the literature sometimes uses the terms AP and mAP interchangeably.
2.2.2 Single-stage Object Detectors
Overall, DL OD architectures can be divided into two categories: two-stage object detectors and SSDs.
As the name suggests, two-stage object detectors consist of two stages, i.e. region proposal followed by object classification. In the first stage, the algorithm generates a set of candidate object locations, and in the second stage, it classifies each region proposal into one of the object categories or background. Two-stage detectors are known for their high performance but are generally slow and computationally inefficient as a result of the separate handling of tasks. In contrast, SSDs deal with both sub-problems in a single forward pass, hence enabling efficiency, speed, and simplicity. Nonetheless, this generally results in worse performance when compared to their two-stage counterparts. While high performance is crucial in the context of AD, two-stage detectors are commonly deemed unsuitable, as the algorithms must, by requirement, be able to operate in real-time. With that being said, this thesis adopts an SSD approach.
2.2.3 Anchor-free Detectors
In traditional DL OD, anchor boxes are used to generate object proposals and refine object localization. They are in essence a set of predefined fixed-sized rectangles of various scales and aspect ratios evenly distributed across the image. During training, the model is trained to predict the offsets and scales between each relevant anchor box and the corresponding ground truth bounding box for each object in an image. This allows the model to better match the object's bounding box in the image compared to a naive approach where the predicted bounding box is limited to a single aspect ratio and/or scale. However, anchor boxes are also limiting in the sense that they have to be predefined and adjusted specifically for the dataset and domain at hand. If the predefined anchor boxes do not align well with the bounding boxes in the dataset, the model might not be able to accurately localize the objects.
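To make the anchor-box concept concrete, the following is a minimal sketch of how such a predefined grid of boxes could be generated for one feature level. The function and its parameters are illustrative, not taken from any particular detector implementation; real detectors typically tile several scales per level and clip boxes to the image.

```python
def generate_anchors(feat_h, feat_w, stride, scales, ratios):
    """Generate one anchor box (x1, y1, x2, y2) per scale/aspect-ratio
    combination, centred on every stride-spaced location of a
    feat_h x feat_w feature map."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # centre of the receptive-field cell in image coordinates
            cx = j * stride + stride / 2
            cy = i * stride + stride / 2
            for s in scales:
                for r in ratios:
                    # keep the box area ~ s^2 while varying width/height
                    w = s * r ** 0.5
                    h = s / r ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors
```

For a 2×2 feature map with stride 16, one scale and one ratio, this yields four 32×32 anchors centred 16 pixels apart, which illustrates how the set of candidate shapes is fixed before any data is seen.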
Anchor-free object detectors are a class of OD models that do not rely on predefined templates in order to localize objects in an image. Instead, they typically localize objects by directly predicting all the necessary bounding box parameters, obtained from features extracted by deep convolutional layers. Anchor-free models tend to have a host of benefits as they are simple, flexible, and agnostic to the distribution of bounding box scales and aspect ratios for any given dataset. Regarding anchor-free detectors, it is worth mentioning that they are commonly presented in the context of dense-based object detectors. These object detectors project their predictions onto a dense prediction map, which is a high-dimensional array whose shape resembles the shape of the input image. As such, they enable fine-grained localization and instance segmentation that offers rich and detailed information about the input. In contrast to these types of object detectors are models that view the problem as a set prediction problem. The output from a set-based object detector for a given sample is simply a set of class and bounding box tuples. Transformer models are notably good at this, since they work by taking a set as input and outputting a set of features [33]. A well-known set-based framework is DEtection TRansformer (DETR), where image features are extracted, divided, and spatially embedded into a set of object queries [34]. The object queries serve as inputs to a transformer model, which is followed by a prediction head to output a set of predictions. While set-based object detectors can omit certain post-processing techniques, the computations revolving around the transformer module can be costly when there are many object queries to process. Set-based object detectors are also arguably less interpretable, since they do not offer any explicit spatial references in the predictions, as opposed to a dense prediction map.
For these reasons, this thesis employs a dense-based object detector, which is explained in greater detail in 4.1.
2.2.4 Non-Maximum Suppression
Dense-based OD models typically generate multiple candidate bounding boxes around objects of interest, leading to the need to filter out some of these boxes in order to keep just one bounding box per relevant object. NMS is a post-processing technique used to filter redundant overlapping object region proposals. It does so by identifying and removing proposals whose IoU with the highest-confidence proposal of the same class exceeds a predefined IoU threshold (see figure 2.10). It is also common to first filter out bounding boxes with a confidence score below a predefined confidence threshold. The overarching procedure is expressed in the following pseudo code:

Algorithm 3 Pseudo code of NMS
Input: B = {b1, ..., bN}, S = {s1, ..., sN}, th_iou, th_conf
    B: collection of bounding boxes
    S: collection of corresponding confidence scores
    th_iou: IoU threshold
    th_conf: confidence threshold
for i in 1, ..., N do
    if s_i < th_conf then
        B ← B \ {b_i}; S ← S \ {s_i}
    end if
end for
B̂ ← ∅; Ŝ ← ∅
while B ≠ ∅ do
    m ← argmax(S)
    B̂ ← B̂ ∪ {b_m}; Ŝ ← Ŝ ∪ {s_m}
    B ← B \ {b_m}; S ← S \ {s_m}
    for each b_j ∈ B do
        if iou(b_m, b_j) ≥ th_iou then
            B ← B \ {b_j}; S ← S \ {s_j}
        end if
    end for
end while
return B̂, Ŝ

Figure 2.10: An example of non-maximum suppression. The left image depicts the candidate bounding box predictions before non-maximum suppression. The right image depicts the processed image, where two bounding box predictions were filtered out because they exceeded the IoU threshold with another, more confident prediction.
2.2.5 Fully Convolutional One-Stage Object Detection
The object detector that was used for this thesis resembles, but is not exactly equal to, a model called Fully Convolutional One-Stage Object Detector (FCOS) [35].
It is an anchor-free single-stage object detector, where the main idea is to perform object detection directly on feature maps produced by a Fully Convolutional Network (FCN). It does so by projecting its class and bounding box predictions onto a dense prediction map with a shape similar to the input image, but downscaled and with more channels. As such, the predictions can be interpreted in a per-pixel manner, which somewhat resembles how a typical DL image semantic segmentation network works [15]. For each target pixel, i.e. a strided region of the input image, the model predicts a class score and a left, top, right, and bottom offset from the location $(x, y)$. The left, top, right, and bottom predictions form a box where the pixel's location is an interior point, and the offsets define the distances from the interior point to the corresponding box boundaries. Let the ground-truth objects for an input image be defined as $o_i = (c^{(i)}, x_1^{(i)}, y_1^{(i)}, x_2^{(i)}, y_2^{(i)}) \in \{1, 2, \ldots, C\} \times \mathbb{R}^4$, where $(x_1^{(i)}, y_1^{(i)})$ and $(x_2^{(i)}, y_2^{(i)})$ are the top-left and bottom-right coordinates of object $i$'s bounding box respectively, and $c^{(i)}$ is the class index, with $C$ being the number of classes. For each location $(x, y)$ on the $s$-strided target feature map, there exists a corresponding location $(\frac{s}{2} + xs, \frac{s}{2} + ys)$ in the original image. Each location can be seen as a separate pixel sample, and a location is considered positive if it lies within a ground truth bounding box; otherwise negative. Naturally, each positive location holds the ground truth class $c^*$ and a regression target $u^* = (l^*, t^*, r^*, b^*)$, where the elements are the ground truth offsets in the left, top, right and bottom directions respectively. Assuming location
$(x, y)$ is a positive location for ground truth $o_i$, the regression target becomes:

$$l^*_{x,y} = \frac{x - x_1^{(i)}}{s}, \quad t^*_{x,y} = \frac{y - y_1^{(i)}}{s}, \quad r^*_{x,y} = \frac{x_2^{(i)} - x}{s}, \quad b^*_{x,y} = \frac{y_2^{(i)} - y}{s} \quad (2.4)$$

If a location falls within two different objects' bounding boxes, the location is denoted as an ambiguous sample, and the smallest object takes priority when constructing the target feature map.
2.2.5.1 Architecture
The network architecture proposed by Tian et al. [35] is a Feature Pyramid Network (FPN) layered on top of a ResNet backbone. An FPN acts as a modular component in a CNN that takes a feature map and generates proportionally sized feature maps at multiple levels. One benefit of this is to alleviate the problem of variance in object scale. The higher resolution feature maps are more detail-aware, whereas the lower resolution feature maps are more context-aware. Fusing happens between adjacent layers so that higher resolution maps benefit from context-aware features. Furthermore, each level in the FPN tunnels into a shared head which outputs a dense prediction map. The benefit of multi-level predictions is twofold. Firstly, the difficulty of identifying objects across a large span of different sizes is alleviated. Secondly, ambiguous samples can be stratified and impose less of a problem. For this to work properly, each output must have a corresponding target feature map where the objects are stratified by size. All targets are initialized the same, but then, for each target pixel of each feature level $i$, a location is converted to a negative location if it satisfies $\max(l^*_{x,y}, t^*_{x,y}, r^*_{x,y}, b^*_{x,y}) > m_i$, where $m_i$ is a predefined threshold on the maximum distance from the location center, upper-bounding the size of objects for a given feature level. For each location in the dense prediction map, the network outputs a prediction for the 4 regression targets, i.e. $u_{x,y} = (l_{x,y}, t_{x,y}, r_{x,y}, b_{x,y})$, and a class prediction vector $c_{x,y}$.
These are stacked along the channel dimension, and the full dense prediction will be $y \in \mathbb{R}^{\frac{h}{s} \times \frac{w}{s} \times (C+4)}$. As such, for each target location, $C$ binary classifiers are trained for the classification task along with 4 regressors. Since the model is trained to perform both classification and bounding box regression, the final activation functions are separated per task. The sigmoid $\sigma(c_{x,y,z}) = \frac{1}{1 + e^{-c_{x,y,z}}}$ is used for each (binary) classification prediction, and the rectified linear unit $\mathrm{ReLU}(a) = \max(0, a)$ is used for the regression predictions.
2.2.5.2 Loss Function
The loss function has two terms due to the duality of the problem and is defined as follows:

$$L((c_{x,y}, u_{x,y}), (c^*_{x,y}, u^*_{x,y})) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}(c_{x,y}, c^*_{x,y}) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{c^*_{x,y} > 0}\, L_{reg}(u_{x,y}, u^*_{x,y}) \quad (2.5)$$

Here, $L_{reg}$ is the GIoU loss, i.e.

$$L_{reg} = 1 - \left( \frac{|b \cap b^*|}{|b \cup b^*|} - \frac{|h \setminus (b \cup b^*)|}{|h|} \right),$$

which describes an inverse metric of the overlap between the predicted bounding box $b$ derived from $u_{x,y}$, the ground truth bounding box $b^*$ derived from $u^*_{x,y}$, and the smallest box $h$ that encloses both $b$ and $b^*$. The first term in the parenthesis is equal to the IoU, an overlap metric described in more detail in 2.2.1.1. Furthermore, $\mathbb{1}_{c^*_{x,y} > 0}$ is an indicator function which is 1 if the current location is a positive location, and $N_{pos}$ is the number of positive locations. $L_{cls}$ is the binary focal loss computed over each entry in the predicted probability vector $c_{x,y} \in \mathbb{R}^C$ against the ground truth class $c^*_{x,y}$. Focal loss is a modification of the regular cross-entropy loss and can be expressed as follows:

$$FL(c_{x,y,z}, c^*_{x,y}) = \begin{cases} -\alpha (1 - c_{x,y,z})^{\gamma} \log(c_{x,y,z}) & \text{if } z = c^*_{x,y} \\ -(1 - \alpha)\, c_{x,y,z}^{\gamma} \log(1 - c_{x,y,z}) & \text{otherwise} \end{cases} \quad (2.6)$$

Here, $\alpha$ is a balancing factor, and $\gamma$ is a hyperparameter that reduces the relative loss for well-classified examples. This gives $L_{cls}(c_{x,y}, c^*_{x,y}) = \sum_{z=1}^{C} FL(c_{x,y,z}, c^*_{x,y})$. Finally, $\lambda$ is just a balancing factor, commonly set to 1.
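Equation 2.6, summed over the C class entries of one location as in $L_{cls}$, can be sketched as follows. This is an illustrative implementation, not the thesis code: we assume 1-based class indices with 0 reserved for negative (background) locations, and post-sigmoid scores strictly inside (0, 1).

```python
import math

def focal_loss(c_pred, c_star, alpha=0.25, gamma=2.0):
    """Binary focal loss of equation 2.6 summed over the C entries of
    one location's class vector.
    c_pred: post-sigmoid class scores in (0, 1);
    c_star: ground-truth class index (1-based; 0 = no positive class)."""
    total = 0.0
    for z, p in enumerate(c_pred, start=1):
        if z == c_star:
            # well-classified positives (p near 1) are down-weighted
            total += -alpha * (1.0 - p) ** gamma * math.log(p)
        else:
            # well-classified negatives (p near 0) are down-weighted
            total += -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
    return total
```

The modulating factors $(1-p)^\gamma$ and $p^\gamma$ are what distinguish focal loss from plain cross-entropy: a confidently correct location contributes almost nothing, so the loss is dominated by hard locations.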
It is worth mentioning that FCOS applies a concept known as centerness to suppress low-quality bounding boxes in the post-processing stage, which adds a new component to the loss and the predictions. This thesis does not employ this concept, in order to keep a cleaner scientific setting with regard to IS, and will not go into further detail about it.
2.3 Importance Sampling
Importance Sampling (IS) is commonly known as a Monte Carlo method that plays a crucial role in the estimation of statistical properties of complex systems [24], [23]. It operates on the premise that certain values of the input random variables carry greater significance for the estimate than others. By placing a higher emphasis on these important values through more frequent sampling, the variance of the estimate can be decreased. Formally, in the continuous case, we are interested in calculating

$$\int p(x) f(x) \, dx = \mathbb{E}_p[f(x)]$$

where $p$ is the target distribution and $f(x)$ is the function of interest. However, if the dimensionality of $x$ is high, the integral may be very difficult to compute. We could instead approximate this expectation numerically by computing the following sample average:

$$\mathbb{E}_p[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) = s, \qquad x_i \sim p$$

Note that since each $x_i$ is drawn from a probability distribution, we could redraw a new sample average $s$, which therefore assumes its own probability distribution. In fact, as $N$ tends to infinity, by the central limit theorem [36], $s$ will follow a Gaussian distribution centered around the true expectation $\mathbb{E}_p[f(x)]$, i.e.

$$s \xrightarrow{dist} \mathcal{N}(\mu, \sigma^2), \qquad \mu = \mathbb{E}_p[f(x)], \qquad \sigma^2 = \frac{1}{N} \mathrm{Var}_p[f(x)]$$

Now, by introducing a proposal distribution $q$, which should be selected heuristically, we have the possibility to bias the sampling while keeping the estimand in question unchanged.
The trick is as follows:

$$\mathbb{E}_p[f(x)] = \int p(x) f(x) \frac{q(x)}{q(x)} \, dx = \int q(x) \left[ \frac{p(x)}{q(x)} f(x) \right] dx = \mathbb{E}_q\!\left[ \frac{p(x)}{q(x)} f(x) \right]$$

We are yet again subject to approximating this new integral numerically, but this time sampling from $q$:

$$\mathbb{E}_q\!\left[ \frac{p(x)}{q(x)} f(x) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)} f(x_i) = r, \qquad x_i \sim q$$

As before, we can conclude that the distribution of $r$ approaches a Gaussian with the following parameters as $N$ approaches infinity:

$$r \xrightarrow{dist} \mathcal{N}(\mu, \sigma^2), \qquad \mu = \mathbb{E}_q\!\left[ \frac{p(x)}{q(x)} f(x) \right], \qquad \sigma^2 = \frac{1}{N} \mathrm{Var}_q\!\left[ \frac{p(x)}{q(x)} f(x) \right]$$

Interestingly, we now have a new, possibly improved variance. The aim is to satisfy

$$\frac{1}{N} \mathrm{Var}_q\!\left[ \frac{p(x)}{q(x)} f(x) \right] < \frac{1}{N} \mathrm{Var}_p[f(x)]$$

as this will effectively decrease the uncertainty in the estimate of $\mathbb{E}_p[f(x)]$. To achieve this, $q(x)$ should be high where $|p(x) f(x)|$ is high. With an appropriately chosen $q$, the sampling effort is concentrated on the regions where the integrand of the target expectation has high values, leading to more accurate results with fewer samples.
Figure 2.11: Comparison of three different choices for q(x). The rightmost plot showcases an example where the choice of q(x) would provide a favorable variance reduction over the uniform density function depicted in the center plot. Conversely, the leftmost plot illustrates a poorly chosen q(x) that would increase the variance of the estimator.
It is crucial to emphasize that the choice of $q$ is an essential but non-trivial consideration. A poorly chosen $q$ would induce a density ratio $\frac{p(x)}{q(x)}$ for which the estimate of the expected value would have a high variance. Consequently, the sample average would have high uncertainty and be dominated by a small portion of the samples. The hope is that we have some understanding of $p(x)$ and $f(x)$ that can guide our selection. For instance, we may know where $f(x)$ is high, or perhaps even know a certain approximation of $p(x)$.
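The variance-reduction argument can be made concrete with a classic toy estimator of the Gaussian tail probability $P(X > 3)$ for $X \sim \mathcal{N}(0, 1)$ (true value $\approx 0.00135$). The setup and names below are illustrative, not from the thesis: with `shift=0` the estimator is plain Monte Carlo, which wastes almost every draw because the integrand is zero below 3; with `shift=4` the proposal $q = \mathcal{N}(4, 1)$ concentrates samples where the integrand is non-zero, and each draw is reweighted by $p(x)/q(x)$ to keep the estimate unbiased.

```python
import math
import random

def normal_pdf(x, mu=0.0):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def estimate_tail(n, shift=0.0, seed=0):
    """Estimate P(X > 3) for X ~ N(0, 1), sampling from q = N(shift, 1)
    and weighting each draw by the density ratio p(x)/q(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(shift, 1.0)
        w = normal_pdf(x) / normal_pdf(x, mu=shift)  # importance weight
        total += w * (1.0 if x > 3.0 else 0.0)       # f(x) = 1{x > 3}
    return total / n
```

With the shifted proposal, a few thousand samples already pin the tail probability down to several significant digits, whereas plain Monte Carlo at the same budget sees only a handful of non-zero draws.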
2.3.1 Importance Sampling in Deep Learning
The intuition for leveraging IS in DL can be motivated by the unknown distribution of the data we are dealing with. To illustrate this, let us consider an abstract example, visualized in figure 2.12, which depicts the unknown data distribution for a binary classification task where we want to distinguish cats from dogs. Note that we have condensed the whole input feature space into a single variable x for visualization purposes. A typical trajectory when training a DL model on a task like this is a fast rate of convergence in the early stages of training. At these stages, the model becomes quite good at correctly predicting easy samples. Easy samples in this context could be samples where cat or dog features are prominent. On rare occasions, the model gets exposed to samples where e.g. a strange-looking cat has almost indistinguishable features from a dog, or samples where the animal's features are rare. These are the more challenging samples and lie in the long-tail regions of the distribution densities. As conveyed by the figure, in these regions the densities are relatively small, and the model will rarely get exposed to these types of samples if sampling uniformly at random. In fact, the model will have to run through far more samples to become good at predicting the challenging regions correctly as opposed to predicting the easy regions correctly. This motivates the utility of importance sampling. By oversampling the challenging regions, the model will be less subject to allocating compute resources to insignificant gradient contributions obtained from uninformative samples, and can instead focus on learning from more informative samples, which induce a greater albeit favorable change in the parameters. As such, the parameters of the model can possibly reach a local optimum with less computation.
This is especially lucrative if the dataset is large and consists of a big proportion of uninformative samples that do not vary significantly in input feature space.
Figure 2.12: An illustration of a toy example: cat vs dog unknown data distributions. Note that the x-axis is an abstract and arbitrary representation of the input feature space. The lower middle animal image depicts a Lykoi, which is a cat breed that resembles a wolf or a dog.
To formally reason about why importance sampling could be beneficial in DL, let $L$ be a loss function, $\theta$ the learnable parameters of a deep learning model, and $D = \{(x_i, y_i)\}_{i=1}^{n}$ the data, of which $D_t$ is a batch from $D$ at timestep $t$. A gradient descent update is expressed as follows:

$$\theta_{t+1} = \theta_t - \eta_t \nabla L_{\theta_t}(D_t) \quad (2.7)$$

where $\eta_t$ is the learning rate at timestep $t$. With this in mind, we can express the error term $h_t$ at time $t$ as the squared distance between the current parameters $\theta_t$ and the optimal solution $\theta^*$:

$$h_t = \|\theta_t - \theta^*\|_2^2 \quad (2.8)$$

The convergence rate can now be defined as the difference in the errors between two consecutive iterations, i.e.

$$\begin{aligned} h_{t+1} - h_t &= \|\theta_{t+1} - \theta^*\|_2^2 - \|\theta_t - \theta^*\|_2^2 \\ &= (\theta_{t+1} + \theta_t - 2\theta^*)^\top (\theta_{t+1} - \theta_t) \\ &= (2\theta_t - 2\theta^* - \eta_t \nabla L_{\theta_t}(D_t))^\top (-\eta_t \nabla L_{\theta_t}(D_t)) \\ &= -2\eta_t (\theta_t - \theta^*)^\top \nabla L_{\theta_t}(D_t) + \eta_t^2 \|\nabla L_{\theta_t}(D_t)\|^2 \end{aligned} \quad (2.9)$$

Taking the expectation over the batches gives us the expected contribution of a single iteration:

$$\begin{aligned} \mathbb{E}_p[h_{t+1} - h_t] &= -2\eta_t (\theta_t - \theta^*)^\top \mathbb{E}_p[\nabla L_{\theta_t}(D_t)] + \eta_t^2\, \mathbb{E}_p[\|\nabla L_{\theta_t}(D_t)\|^2] \\ &= -2\eta_t (\theta_t - \theta^*)^\top \mathbb{E}_p[\nabla L_{\theta_t}(D_t)] + \eta_t^2 \left( \|\mathbb{E}_p[\nabla L_{\theta_t}(D_t)]\|^2 + \mathrm{Var}_p[\nabla L_{\theta_t}(D_t)] \right) \end{aligned} \quad (2.10)$$

Now, reducing $\mathrm{Var}_p[\nabla L_{\theta_t}(D_t)]$ in equation 2.10 implies improving the convergence rate, since the difference in the distances to the optimal solution should ideally be as negative as possible, i.e. we want $h_{t+1} \ll h_t$. This can be achieved by biasing the sampling of $D_t$.
In particular, when the aim is to enhance the convergence rate in the long tails of the data distribution where the data is sparse, it is in our interest to select a heuristic $q$ that can outperform uniform sampling in reducing $\mathrm{Var}_p[\nabla L_{\theta_t}(D_t)]$ when $D_t$ consists of such samples.
It is worth highlighting the challenge of estimating the importance of each sample as the model evolves. Specifically, when dealing with mini-batch gradient descent, the gradient estimates are obtained from mini-batches of data at certain parameter states. IS will enable the training phase to compile mini-batches with a high concentration of important samples for a given state of the parameters. Because only a fraction of the training examples are sampled every iteration, the importance scores retrieved from previous observations may poorly reflect the current state of the model's parameters. It is consequently of interest to leverage an update rule $q_t(x) \rightarrow q_{t+1}(x)$ which accounts for this change in importance, possibly even for non-sampled data points with e.g. smoothing [11]. In practice, this could more or less incur the following overarching procedure:

Algorithm 4 Simple importance sampling strategy in mini-batch gradient-descent training
Data: D
Model parameters: θ
q ∝ U(1, |D|)
for t = 1, ..., T do
    D_t ← n_b data points sampled with q from D
    [s, ∇L] ← ForwardBackward(D_t, θ)
    θ ← θ − η_t · ∇L
    q ← UpdateImportanceScores(q, s)
end for

Note that this pseudo code is overly simplified and far from representative of all IS strategies. In other words, several steps may differ substantially depending on the method, task, and heuristic. The interesting step, however, is the stochastic process where the entries of $D_t$ are sampled with $q$, implying that we can concentrate on the samples that tend to contribute more to the update of $\theta$, consequently accelerating training and spending less computation on less informative samples.
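The procedure of Algorithm 4 can be sketched as a runnable loop. The model and its gradient are stand-ins here (a scalar parameter, a caller-supplied `forward_backward`, and a caller-supplied `update_scores` that may apply e.g. smoothing); none of these names come from the thesis implementation.

```python
import random

def train_with_is(data, n_b, T, theta, eta, forward_backward, update_scores):
    """Mini-batch gradient descent where each batch is drawn from the
    (unnormalised) proposal q instead of uniformly.
    forward_backward(batch, theta) -> (per-sample scores, gradient)
    update_scores(q, idx, scores) -> refreshed proposal q"""
    q = [1.0] * len(data)  # q starts out uniform
    for _ in range(T):
        # sample a batch of n_b indices with probability proportional to q
        idx = random.choices(range(len(data)), weights=q, k=n_b)
        batch = [data[i] for i in idx]
        scores, grad = forward_backward(batch, theta)
        theta = theta - eta * grad              # parameter update
        q = update_scores(q, idx, scores)       # refresh importance scores
    return theta, q
```

Only the sampled entries get fresh scores each iteration, which is exactly the staleness problem discussed above: the rest of `q` reflects older parameter states until those samples are drawn again.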
3 Related Work
Estimating the importance of each sample to enable accelerated training is a concept also present in other techniques such as curriculum learning [6] and core-set selection [2]. Similarly to IS, curriculum learning prioritizes certain samples during training to ultimately enhance the training procedure. The difference, however, is that curriculum learning aims at prioritizing easy samples in the beginning and progressively introduces more challenging samples as the training advances. While the benefits of curriculum learning have been demonstrated by a significant amount of empirical data [37], the method requires defining the order of the samples before training. This is in itself a costly task, as the sample difficulties must be assessed beforehand. As tedious as it is, this is commonly done either manually or by transferring knowledge from a pre-trained model. However, self-paced learning [38] alleviates this problem and reformulates this ad-hoc scheme by utilizing training dynamics to decide whether a sample is yet too challenging, sorting the data according to that heuristic.
On the other end of the spectrum are methods related to core-set selection. These methods aim at finding a subset of the most informative samples in a labeled or unlabeled dataset. Applications include querying the most important samples to label, or pruning data for budgeted and/or accelerated training. Toneva et al. [1] propose pruning data based on the number of forgetting events. The authors show that samples that are easily forgotten by the model during training tend to be more informative, while samples that are rarely forgotten tend to be less important. Excluding rarely forgotten examples and training the model on a fraction of the dataset may therefore improve convergence without degrading the model. Pleiss et al. [39] achieve better final performance by excluding mislabeled or overly ambiguous samples.
This was achieved using a classification-based metric called margin [40], [41], [42], in their case the final layer margin. The final layer margin is computed as the difference between the ground truth logit and the highest logit among the other classes, i.e.

$$M(x, c) = \underbrace{z_c(x)}_{\text{assigned logit}} - \underbrace{\max_{i \neq c} z_i(x)}_{\text{largest other logit}}$$

where $c$ is the index of the ground truth class and $z(x) \in \mathbb{R}^C$ is the pre-softmax output (logits) of the model given input $x$. Paul et al. [4] suggest using the error vector as an importance heuristic to prune data after merely a few epochs of training. Sener and Savarese [2] propose a method which involves optimizing a k-center problem. The key idea is to find a subset of so-called center points in the data that, when trained on, achieve a performance as close as possible to the one that would be obtained when training on the whole dataset. With a distance heuristic, the algorithm tries to choose b center points such that the largest distance between a data point and its nearest center point is minimized. For computing the distance between samples, the authors suggest using the activations of the final fully-connected layer.
While the notion of informative samples is shared between IS and core-set selection approaches, core-set selection typically requires a full training phase, or at least a pretraining stage, before being able to assign sample scores. IS can be seen as a middle-ground approach between the two aforementioned concepts [43]. Like curriculum learning, IS involves prioritizing samples online during training. Like core-set selection, IS needs to assign samples' importance scores dynamically. Various suggestions for measuring importance, and for how to use importance sampling in a training algorithm, have been proposed in the literature. One popular suggestion has been to sample training examples based on the loss.
Loshchilov and Hutter [12] employ a strategy where data points are ranked with respect to their latest loss and then sampled based on an exponentially decaying function applied on top of the ordered samples. Ioannou et al. [22] suggest an approach where, at the end of each epoch, a portion of the lowest-loss samples is replaced with a portion of the highest-loss samples. At the cost of introducing a hyperparameter, the authors also suggest utilizing a momentum component for the importance score, yielding a more balanced sampling scheme and a prominent improvement in convergence speed. The paper states that this approach achieves up to 40% convergence speed-up on CIFAR10 [44]. Katharopoulos and Fleuret [45] propose a strategy that trains a second neural network in parallel to predict the loss of the main model for the sampling procedure. They call this method approximative importance sampling. This study [45] claims a speed-up of 20% to 30% on image classification on MNIST [46] and CIFAR10. However, it is important to note that the authors evaluate the speed-up based on the convergence in training loss, which should be biased in this setting. Katharopoulos and Fleuret [7] later published a study where they leverage an upper bound on the gradient norm instead. In fact, it has been theoretically shown that the optimal proposal distribution should be proportional to the per-sample gradient norm [10]. However, the per-sample gradient norm is computationally prohibitive to compute, so it may be impractical to apply in practice. Nonetheless, the authors suggest that the gradient of the loss with respect to the pre-activation outputs of the last neural network layer gives an upper bound on the gradient norm over all the parameters, and can be utilized in IS to converge faster to a lower test error.
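A generic sketch of loss-rank-based sampling in the spirit of the approaches above can look as follows. This is not any one paper's exact scheme: the function name, the decay base, and the use of sampling with replacement are all illustrative choices.

```python
import random

def rank_based_sampler(latest_losses, k, decay=0.99, seed=None):
    """Sample k indices with replacement, favouring high-loss samples:
    selection weight decays exponentially with the loss rank, so the
    highest-loss sample is the most likely to be drawn."""
    rng = random.Random(seed)
    # rank samples by their most recent loss, highest first
    order = sorted(range(len(latest_losses)),
                   key=lambda i: -latest_losses[i])
    weights = [decay ** rank for rank in range(len(order))]
    return rng.choices(order, weights=weights, k=k)
```

The decay base controls how aggressive the bias is: `decay` close to 1 approaches uniform sampling, while a small value concentrates almost all probability mass on the top-ranked samples.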
In contrast to the aforementioned studies, other studies have shown compelling evidence for importance samplers that concentrate on uncertain samples rather than purely high-loss samples. Chang et al. [11] compared Loshchilov and Hutter [12]'s approach with samplers that, instead of using loss, focus on samples with high prediction variance or threshold-closeness. They concluded that preferring high-loss samples works well when the task is relatively easy (i.e. both training and testing accuracy are high). However, when the task is complex and the dataset is challenging and/or noisy, a preference towards uncertain examples proved more beneficial for the final generalization performance.

Online batch selection A popular method for addressing the problem of a rapidly evolving importance distribution involves doing additional forward passes. Some of the aforementioned studies employ this concept, and several other works have explored this route [7], [9], [12], [47], [48]. They motivate its advantage over the naive sampling from Algorithm 4 when measuring convergence based on wall-clock time instead of the number of forward propagations. The idea is to forward-propagate a large batch, from which a sample heuristic can be used to select a subset for the backward pass. Assuming that the backward pass is the bottleneck of the algorithm, this approach could potentially result in a speed-up. However, many samples then have to be loaded from disk into memory and decoded by the Central Processing Unit (CPU) at a time, which in itself may be a greater bottleneck than the backward pass when working with large images. For this reason, online batch selection does not appear useful in OD for AD and is not investigated in this thesis.

Reducible holdout framework Mindermann et al. [9] proposed Reducible Holdout (RHO), a framework that is used in combination with IS.
It first trains a smaller model on the same task using a held-out part of the dataset, denoted Dho, that is consequently excluded from the main dataset. This smaller model is trained once, after which it performs one forward pass over the whole dataset. Using this forward pass and a heuristic, in their case the loss, a score for each sample is computed. The loss produced by the small model is denoted the irreducible loss. When training the real model, this irreducible loss is subtracted from the normal loss produced by the main model to form a new heuristic called the reducible loss. The formal notation for this importance heuristic is simply

RHO loss = L(x, y; Dt) − L(x, y; Dho), (3.1)

where the first term is the training loss and the second term is the irreducible holdout loss. The general intuition behind RHO loss selection relies on the empirical observation that it avoids easy, noisy, and out-of-distribution samples. Noisy samples are challenging but uninformative. Such samples include samples with incorrect labeling and may contribute to degrading the model's performance. Out-of-distribution samples are, in other words, outliers. Despite not being mislabeled, from the perspective of the test data evaluation, it could be hurtful for the model to put too much emphasis on these samples. The following points are stated in the paper:
• Since easy samples tend to have a low training loss, and the RHO loss is always less than the training loss, such samples are less favored. If the model were to forget an easy sample, the metric would account for this, since the training loss would become high, which increases the chance of revisiting it.
• Noisy samples are harder for the holdout model to generalize to, incurring a high irreducible loss and consequently a low RHO loss.
• Out-of-distribution samples follow a similar reasoning as the noisy samples. Consider an outlier d′ and a non-outlier d.
Since Dho is an unbiased split from D, P(d′) < P(d) holds true for both datasets. Naturally, the holdout model will perform worse on out-of-distribution samples and induce a high irreducible loss and a low RHO loss.

Several of the aforementioned studies suggest that accelerated training can be obtained by exploiting informative samples using importance sampling. However, none of them show any empirical evaluations in the task of OD, which is a seemingly more complex task when compared to image classification. Not only is the complexity of OD higher than the tasks that IS has been reported with, but the dataset [30] used in this thesis is vastly more complex than CIFAR10 and CIFAR100, which are among the popular choices in these studies. Previous studies have investigated other ideas for enhancing training convergence, which also try to address the long-tail problem in the context of DL OD [49]. These methods are generally concerned with some function for selecting the most important regions within images rather than operating on a dataset level, and often require modifications to the optimization problem by introducing some component to the loss or the backpropagation stage. Online Hard Example Mining (OHEM) by Shrivastava et al. [49] has shown to be one of the more prominent methods, where the idea is to consider only the most beneficial regions of interest for the backpropagation. This method is, however, limited to two-stage object detectors, as the regions are parsed from a RoI pooling stage. Based on the idea of OHEM is a method called Loss Rank Mining by Yu et al. [50]. The method can be applied to one-stage detectors and works similarly by filtering out all but the k highest prediction-wise losses from the final feature map. Nevertheless, this approach is mainly tailored to grid-based and template-based detectors and does not translate to a stable solution for detectors where the image-plane stride of the dense prediction map is low.
The culprit is the fixed k, along with the fact that the number of positive target pixels can vary substantially between samples. Another method worth mentioning is Prime Sample Attention (PISA) [51]. It presents an intermediate preference between hard mining and random sampling of image regions. While it does propose an importance-based sampling scheme, it does not, similarly to the other aforementioned methods that are applicable in OD, operate on a dataset level.

4 Methods

This thesis studies the possibility of leveraging importance sampling to accelerate training in DL OD. Given this objective, the project branches into two main areas of practice: 1) the development of a sufficiently performing, low-latency object detection system along with its full data preprocessing, training, evaluation, and inference pipeline, and 2) the development of an IS framework for applying IS methods and conducting relevant experiments. The methods presented in this chapter rely heavily on the theory provided in chapter 2 and the related work from chapter 3. However, due to the absence of literature on IS in OD, related work has been adapted from a predominantly image classification setting to an OD setting. This discrepancy alone has revealed multiple challenges, which, together with other applied topics, are presented in this chapter in more detail. The sections below first cover details about the object detection system used in the project. Subsequently, the IS framework and the design choices for the system as a whole are presented.

4.1 Object Detection

This thesis employs a single-stage anchor-free object detector that closely resembles FCOS [35], presented in section 2.2.5.
Similarly to FCOS, the model used in this project is fully convolutional and directly computes its predictions on dense location-aware feature maps, where all the predictions for classification and regression at a location (x, y) are stacked along the channel axis, as illustrated in figure 4.1.

4.1.1 Backbone

Tian et al. [35] use ResNet50 and ResNet101 as backbones, i.e. ResNet models with 50 and 101 convolutional layers respectively. This thesis uses a ResNet34 where each convolutional layer has only 50% of the channel count of the original model proposed by He et al. [52]. This is due to the constraints of AD, where the model has to be small enough to run in real-time in a car without compromising too much performance, and to accommodate the size of the dataset to minimize the risk of severe overfitting.

4.1.2 Architecture

Tian et al. [35] use an FPN in FCOS in order to handle differently scaled objects. This project instead uses Deep Layer Aggregation (DLA) [53], where the last 3 stages are aggregated to go from 32x to 8x down-sampling. Hence, a single feature map is returned from the model. This keeps the architecture simple and preserves scale invariance in the latent feature space. For this thesis, achieving state-of-the-art performance is deemed secondary, and simplicity is prioritized to have a clean environment for studying IS.

Figure 4.1: An abstract overview of our model's architecture. The ResNet backbone extracts features from the input image, while the Deep Layer Aggregation module aggregates the extracted feature maps to produce the output in the form of a dense prediction map.

4.1.3 Don't-Care Tokens

In a dataset, there may be objects whose class is unknown, so-called "don't-care" objects. This label can for example represent an object where the annotator could tell that there was an object, but could not tell which class it belongs to.
For this kind of object, it is desired to propagate gradients only for the bounding box, but not for the classification part. To accomplish this, a mask is defined during data processing, which marks any pixel that is contained by a don't-care object. The mask is 0 for these pixels and 1 otherwise, and it is pixel-wise multiplied with the classification loss in order to propagate gradients only at pixels where the class is known.

4.1.4 Loss Function

In their paper, Tian et al. [35] use equation 2.5 as the loss function. Instead of using focal loss, or training C binary classifiers (one for each class), this thesis uses a Softmax activation function followed by categorical cross-entropy loss. Softmax is in this context expressed as

σ(z_{x,y,i}) = e^{z_{x,y,i}} / Σ_{j=1}^{C} e^{z_{x,y,j}},

where z_{x,y,i} is the pre-activation logit for the classification prediction at location (x, y) and class index i. As such, the output represents a probability vector, which conforms to the fact that an object can only belong to one of the classes. Categorical cross-entropy loss is expressed as follows:

L_cls(c_{x,y}, c*_{x,y}) = − Σ_{i=1}^{C} 1_{i = c*_{x,y}} · log(c_{x,y,i}).

Additionally, it is worth noting that the original FCOS loss function does not consider don't-care tokens. Consequently, the loss function in this project is as follows:

L((c_{x,y}, u_{x,y}), (c*_{x,y}, u*_{x,y})) = (λ_cls / N_care) Σ_{x,y} 1_{care,x,y} L_cls(c_{x,y}, c*_{x,y}) + (λ_reg / N_pos) Σ_{x,y} 1_{c*_{x,y} > 0} L_reg(u_{x,y}, u*_{x,y}), (4.1)

where
• λ_cls and λ_reg are weights for the classification and bounding box regression losses respectively (and are set to 1 unless otherwise specified).
• 1_{care,x,y} is the don't-care mask value at pixel (x, y), as described in section 4.1.3.
• N_care is the number of positive pixels in the don't-care mask, i.e. Σ_{x,y} 1_{care,x,y}.

4.1.5 Non-Maximum Suppression

Since every pixel in an image predicts a bounding box, which crowds the region proposals, a method is required to filter out the most relevant predictions.
There are two kinds of bounding boxes that need to be filtered away: low-confidence bounding boxes, and bounding boxes predicting the same object as another bounding box. As described in section 2.2.4, NMS takes care of both of these cases and is used during inference in this project. The confidence and IoU thresholds are both set to 0.5.

4.2 Importance Sampling

To make the evaluation of several IS methods convenient, a framework is developed which allows for customization of the strategy. This framework works by keeping a list of importance scores, one for each training example. Whenever the model is going to train, one or more examples are sampled using the importance distribution. The importance scores are all positive, and each example is sampled with a probability proportional to its importance score:

p_i = s_i / Σ_{j=1}^{N} s_j, (4.2)

where s_i is the importance score of sample i and N is the number of training samples. After the model has trained on these examples, the importance scores of the selected examples are updated using the outcome of training on them. The outcome of the training refers to any attribute that can be computed from a single forward and backward pass, e.g. loss, prediction error, gradient norm, etc. How the new importance score is computed from these attributes differs, and is here called a sampling heuristic. Various heuristics are explained in this section. There are also strategies that are invariant to which heuristic is used, but which influence the sampling in other ways.

4.2.1 Baseline

The baseline approach is to not use any importance sampling at all. In this approach, the order of the training examples is randomized before every epoch to remove any bias in how the data is ordered. The model then trains on every training sample exactly once per epoch. This approach serves as a baseline to compare all other approaches against and is referred to as either "baseline" or "sequential" in the rest of the report.
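The sampling step defined by equation 4.2 can be sketched as follows; the function name and NumPy-based interface are our own illustration of the framework's behaviour:

```python
import numpy as np

def sample_indices(scores: np.ndarray, batch_size: int,
                   rng: np.random.Generator) -> np.ndarray:
    """Draw training-example indices with probability proportional to the
    importance scores, i.e. p_i = s_i / sum_j s_j (equation 4.2)."""
    probs = scores / scores.sum()
    return rng.choice(len(scores), size=batch_size, p=probs)
```

After the forward and backward pass, the framework would overwrite the scores at the drawn indices with the values produced by the chosen heuristic, so subsequent draws reflect the updated importance distribution.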
4.2.2 Smoothing

Smoothing is a heuristic-invariant approach that can serve two purposes: 1) induce more exploration, and 2) reduce the effect of stale importance values. It is important to note that smoothing can either be accumulated after each iteration or simply used as a non-accumulating bias term on the importance score. When the probabilities are calculated as in equation 4.2, the resulting distribution is closer to uniform than it would have been otherwise, which induces more exploration, since all samples' importance scores are increased, not just the ones that were selected and trained on. In this thesis, smoothing acts as a non-accumulating bias term, and the value is dynamically set to the mean importance divided by the number of examples in the dataset:

smoothing = (1 / N²) Σ_{i=1}^{N} s_i. (4.3)

4.2.3 Loss-based Importance Sampling

The Loss Importance Sampler (LIS) is the first IS heuristic. The importance of a sample is simply equal to the loss of that sample: s_i = L_θ(x_i, y_i). As conveyed in chapter 3, the loss is a common IS heuristic and has been used by many practitioners. However, when dealing with the loss of an object detector, it is important to be mindful of the differences between the optimization problem and the indirect goal of the importance sampler. Recall that the loss in this project is produced pixel-wise in the dense prediction tensor. Since the aim is to accelerate the training of detecting objects, computing an importance score that is homogeneously reduced per pixel is counterproductive, because the importance scores can then be dominated by large objects (i.e. objects that occupy a great number of pixels) along with negative locations (i.e. background).

4.2.4 Object-wise Reduction

One way of addressing the above-mentioned problem is to reduce the scores per object. This is done by keeping track of which pixels belong to each object and averaging the values in an object-wise manner.
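The per-object averaging just described, followed by an average over the objects (the reduction mainly used in this thesis), can be sketched as below; the helper and its array-based interface are our own illustration, assuming a per-pixel score map and an integer object-id map where the background is treated as one additional object:

```python
import numpy as np

def object_average_reduction(pixel_scores: np.ndarray,
                             object_ids: np.ndarray) -> float:
    """Average per-pixel scores within each object (including the
    background as one extra object), then average the per-object values
    into a single sample-wise importance score."""
    object_means = [pixel_scores[object_ids == oid].mean()
                    for oid in np.unique(object_ids)]
    return float(np.mean(object_means))
```

With this reduction, a small object contributes as much to the sample score as a large one, regardless of its pixel count.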
This yields one importance value per object, which can then be reduced further into a sample-wise value. Some suggestions for the second reduction are as follows:
• Sum: the object-wise importances are summed.
• Average: the object-wise importances are averaged.
• Area-weighted mean: the object-wise importances are scaled by some value proportional to the area of their objects, and then averaged. They can for example be scaled by the square root of the areas.
• Max: only the highest of the object-wise importances is picked to represent the sample.
• Root mean square: the root of the mean of the squares of the object-wise importances is computed.
This thesis mainly considers the average as the object-wise reduction technique, and it is the one referred to unless otherwise stated. An illustration of how the average reduction works can be seen in figure 4.2.

Figure 4.2: Visual representation of the average object reduction. Note that the set of background pixels is also considered to be an additional object.

The average is prioritized in this project because it is the most stable of the reductions, and the other methods have effects that might be undesired. For example, using the sum would generally give high scores to samples with many objects, and using the area-weighted average would generally give high scores to samples with large objects. The average is, however, not perfect, as it risks drowning out important objects if there are many unimportant objects in the same sample. The max reduction would not do this, but it is not as stable and would be susceptible to noise.

4.2.5 Error-based Importance Sampling

The error-based sampler assigns a score that scales with how poorly the model performs in its prediction for each sample. This is similar to how the loss sampler works. One of the original methods is presented by Chang et al. [11]. They compute a score that depends on the history of an error-based heuristic over the epochs.
For classification, the error-based heuristic is (1/M) Σ_{t=1}^{M} (1 − yᵀ ŷ_t), where M is the current epoch. Paul et al. [4] adopt a similar heuristic but compute the score as the norm of the error vector and do not average the score over multiple epochs. The error-based sampler used for this thesis is similar to the approach from Chang et al. [11], but does not consider the history, and is adapted in several ways to include both the multi-modality of OD and the dense output feature map of the FCOS approach. For the classification part, the score for a sample is defined as

(1 / N_care) Σ_{x,y} −log((c*_{x,y})ᵀ c_{x,y}) · 1_{c*_{x,y} > 0}. (4.4)

The score for the bounding box regression task is similar, but uses the per-pixel IoU loss and the regression mask instead of the classification (don't-care) mask. These two components, classification and regression, each being a masked average, are then summed, yielding a single value representing the poorness of the prediction for the sample. In the case that object-wise reduction is desired, it can be used instead of a masked average. Note how the calculations, similarly to cross-entropy, use the negative logarithm to enable an importance score that can range from zero to infinity. In the case of an averaged object-wise reduction, this trick can alleviate the problem of hard, underrepresented classes being drowned out by other, easy classes within the same image.

4.2.5.1 Uncertainty-based Importance Sampler – Decision Boundary Closeness

It can be beneficial for the model to train not just on samples on which it makes big mistakes, but also on samples where the model is uncertain, i.e. where the confidence of a prediction is close to the decision boundary. In this case, the computations of the two components of the error can be modified to instead represent the model's uncertainty. Let a_{x,y} = (c*_{x,y})ᵀ c_{x,y}. Then, replace a_{x,y} with 4 · a_{x,y} · (1 − a_{x,y}).
The classification part of the score for a sample is now

(1 / N_care) Σ_{x,y} −log(4 · a_{x,y} · (1 − a_{x,y})) · 1_{c*_{x,y} > 0}. (4.5)

Since a_{x,y} ∈ [0, 1], a_{x,y} · (1 − a_{x,y}) reaches its highest point when a_{x,y} = 0.5, i.e. at the decision boundary. The expression is multiplied by 4 simply to normalize it back into the range [0, 1]. This technique is applied similarly for the regression task, but with the IoU loss instead, which is also in the range [0, 1]. We denote this sampler the Uncertainty Importance Sampler (UIS); it is analogous to the threshold-closeness-based sampler from Chang et al. [11], which was shown to outperform loss-based approaches. Similarly to the loss-based sampler, UIS uses the object-wise reduction instead of the masked average in equation 4.5.

4.2.6 Reducible Hold-out Loss

The loss value of a noisy or out-of-distribution sample can be very high, so using it directly would give high importance to such training examples, which is undesirable. For this reason, the RHO framework is considered. One discrepancy from the original paper by Mindermann et al. [9] is that they use online batch selection, which this thesis does not. Another difference is how the reducible losses are computed. The authors compute the reducible loss as the difference between the training loss and the irreducible loss, see equation 3.1. In this thesis, the reducible losses are produced in a similar way, but are also shifted and re-scaled: let s = L(x, y; Dt) − L(x, y; Dho); then

ŝ_i = (s_i − min s) / (max s − min s). (4.6)

This is done in order to keep the reducible scores non-negative and compatible with equation 4.2. Finally, the authors sample training examples by picking those with the highest reducible scores. This is problematic when not using online batch selection, as some of the importance scores can be stale. In this project, the training samples are instead sampled with the probabilities given in equation 4.2. Hyperparameters for the holdout training are given in table 4.1 below.
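The shift-and-rescale step in equation 4.6 can be sketched as follows; this is a minimal illustration of our own, assuming the per-sample losses over the dataset are available as arrays:

```python
import numpy as np

def rescaled_rho_scores(train_losses: np.ndarray,
                        irreducible_losses: np.ndarray) -> np.ndarray:
    """Reducible holdout scores: the training loss minus the irreducible
    holdout loss, min-max rescaled into [0, 1] (equation 4.6) so the
    scores are non-negative and usable with probability-proportional
    sampling (equation 4.2)."""
    s = train_losses - irreducible_losses
    return (s - s.min()) / (s.max() - s.min())
```

Note that the sample with the smallest reducible loss is mapped to exactly 0; in practice the smoothing bias from section 4.2.2 keeps such samples from being excluded entirely.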
Table 4.1: Hyperparameters for the holdout-training in our experiments.

Hyperparameter           Value
Backbone                 ResNet34
Feature multiplier       0.25
Num epochs               100
Batch size               128
Holdout training split   0.15

4.2.7 Rejection-based sampling

Estimating each sample's importance while the model's parameters evolve is a challenge that has been pointed out in numerous previous studies [7], [9], [43], [48]. Outdated importance scores may decrease the effectiveness of importance sampling, as the proposal distribution in such a case poorly reflects the target importance distribution. This problem is often addressed using online batch selection, which appears inapplicable for this project for reasons discussed in chapter 3. Apart from smoothing, to further alleviate this issue, we explore a method analogous to a random walk in the dataset. The idea is to accept or reject samples based on a comparison between one of the latest forward-propagated samples and a randomly drawn proposed sample. This is done instead of sampling naively based on the scheme highlighted in algorithm 4. Although the proposed sample may have an outdated importance score, at least the latest forward-propagated sample should be up-to-date. For this approach, there is a lack of theoretical guarantees. It is however an intriguing approach to investigate empirically as a means to explore alternative sampling strategies. The algorithm in question is outlined as follows:
Algorithm 5 Rejection-based sampling scheme as an alternative to the naive sampling scheme in algorithm 4
  Data: D, batch size n_b, model parameters θ
  q_0 ∝ U(1, |D|)
  D_0 ← n_b data points sampled with q from D
  for t = 1, ..., T do
      for i = 1, ..., n_b do
          Sample candidate sample x_c ∼ q
          Get active sample x_a ← D_{t−1,i}
          if q(x_a) > 2 · q(x_c) or q(x_c) < p · (q(x_c) + q(x_a)), where p ∼ U(0, 1), then
              Reject candidate sample: D_{t,i} ← x_a
          else
              Accept candidate sample: D_{t,i} ← x_c
          end if
      end for
      [s, ∇L] ← ForwardBackward(D_t, θ)
      θ ← θ − η_t · ∇L
      q ← UpdateImportanceScores(q, s)
  end for

4.3 Regression-agnostic sampling

As the results in chapter 5 will show, some sampling strategies intrinsically prioritize and improve performance for certain sizes of objects, even when using object-wise reduction. For this reason, it is interesting to completely ignore regression in the importance calculations. In the heuristics presented above, the importance scores are always computed as the sum of a classification part and a regression part. When ignoring regression, the regression part of the equation is simply set to 0.

4.4 Dataset

The ZOD [54] is a recently released open dataset from Zenseact [30]. ZOD includes many modalities useful for AD, e.g. road semantic segmentation, and 2D and 3D bounding box detection in both single frames and sequences. For this thesis, only 2D bounding boxes for dynamic objects in single frames are used. This includes approximately 90,000 training images and 10,000 validation images. As can be seen in figure 4.3, the license plates of cars in ZOD are anonymized, and this is also the case for faces. ZOD provides two types of anonymization: blurring and deepfakes. However, the anonymization is automatic and imperfect, and occasionally leaves images visually corrupted. In order to use as clean a dataset as possible, this thesis uses the original, non-anonymized version of the data, which is not available publicly.
Since ZOD consists of frames with a resolution of 3848 × 2168, the images are cropped in the data preprocessing pipeline to streamline the experiments and minimize data-loading bottlenecks. The Region of Interest (RoI) is approximately the middle third in the x-dimension and the middle sixth in the y-dimension. This crop covers objects that are far away in a typical scenario, so it is called the far RoI, and this version of the dataset is called ZODFAR. Figure 4.3 depicts how the crop is applied.

Figure 4.3: Visualization of a sample with its annotations. The orange box highlights a wide RoI while the pink box highlights a far RoI. Although both RoIs are important in a real scenario, only the far RoI is used in ZODFAR as a means to keep the system simple. The original image has dimensions 3848 × 2168, while the processed image has dimensions 1280 × 384.

There are several reasons why this is the region of interest. Firstly, it reduces the size of the input image to the network by almost a factor of 20, which is necessary to keep the model running in a streamlined setting. Secondly, cropping a small part of the image is preferred over cropping a large part of the image and scaling it down, which would leave many of the bounding boxes extremely small in comparison to the total image size. The number of objects in ZODFAR is stratified by class and size in table 4.2. Small objects are defined as objects with an area of less than 0.33% of the image size, medium objects have an area between 0.33% and 3%, and large objects have an area greater than 3% of the image size. These thresholds are mainly based on COCO's size stratifications [55].

Table 4.2: Number of objects in ZODFAR's dataset stratified by size and class. The number to the left of the parenthesis is the number of objects in the training set, while the number inside the parenthesis is the number of objects in the validation set.
        Pedestrian     Vehicle          VulnerableVehicle  Animal      DontCare
large   2342 (326)     93084 (11435)    2120 (353)         61 (5)      0 (0)
medium  32295 (3936)   217714 (21478)   11315 (1614)       578 (79)    0 (0)
small   66581 (7563)   193465 (24913)   12942 (1736)       1050 (135)  85999 (9168)

4.5 Experiment Details

Each sampling strategy was run several times (10 unless otherwise specified), and all runs of the same kind were averaged at each time step in order to give a robust estimate of the performance. Each run used a batch size of 128 and was trained for 350 epochs with 89,976 training samples on an Nvidia A100 40 GB GPU, taking roughly 40-50 hours per run to complete. Note that since our system runs compute-intensive operations to log and collect certain data for monitoring and summaries of each iteration and validation, 40-50 hours might not be representative of the time it would take in a streamlined setting. Also note that for the runs with importance samplers, the notion of an epoch disappears after the first epoch is complete, since they no longer iterate over the data sequentially. Lastly, all object detectors were trained using the Adam optimizer [18], with an initial learning rate of η = 10^−3, β₁ = 0.9 (the exponential decay rate for the first moment estimates), β₂ = 0.999 (the exponential decay rate for the second moment estimates), and a numerical stability constant ε = 10^−7.

5 Results

In this section, we present the results of our project. As stated, the purpose of this thesis is not to achieve state-of-the-art performance in OD, but to evaluate IS methods in terms of how much they speed up training in comparison to the baseline. As such, the results focus on the performance influenced by importance samplers and their various configurations. The results are presented with different levels of stratification, ranging from general mAP to AP stratified by both object size and class. The AP is computed over all recall values, and with interpolated precision values.
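As a reference, an all-point interpolated AP of the kind described above can be sketched as follows, assuming the (recall, precision) pairs are sorted by increasing recall; this is an illustration of the metric, not the exact evaluation code used in this work:

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP over all recall values with interpolated precision: precision at
    each recall level is replaced by the maximum precision at any recall
    greater than or equal to that level, then integrated over recall."""
    # append sentinels so the envelope and the integration are well defined
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # interpolation: running maximum taken from the right
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # integrate over the points where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

A detector that reaches recall 1.0 at precision 1.0 obtains an AP of 1.0, while lower precision at any recall level reduces the integral accordingly.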
mAP_small, mAP_medium, and mAP_large are also computed, i.e. the same computation as above but filtered on small, medium, and large objects respectively. The performance of an IS method is primarily measured by how much faster, in terms of the number of iterations, it reaches the maximum average mAP of the baseline, as in figure 1 in [9]. Although wall-clock time is what one ultimately wants to decrease, measuring speed-up based on wall-clock time is not meaningful in practice, as it can depend on miscellaneous implementation details. Instead, the speed-up is based on the rate of learning and is calculated as follows:

speed-up = num_iterations_baseline / num_iterations_other_method − 1, (5.1)

where
• num_iterations_baseline is the iteration at which the baseline reaches its maximum mAP,
• num_iterations_other_method is the iteration at which the mAP of the other method surpasses the maximum mAP of the baseline.

Furthermore, as mentioned, and as a means to delve deeper into the detection performance of certain subgroups in the data, we assess how the importance sampler affects performance across sizes and classes. It is especially interesting to observe how IS affects the performance on data points that lie in the low-density regions of the unknown data distribution. Since these types of samples are so scarce in the data, generalizing to them could require significantly more training iterations when sampling the batches uniformly without replacement. It is in our best interest to accelerate training on these types of samples, as they could bottleneck the finish point of training. As a proxy for these types of samples, we will evaluate the performance on the Animal class in greater detail. The Animal class is vastly underrepresented in the data, as shown in table 4.2.
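The speed-up metric from equation 5.1 amounts to a one-line computation (shown here purely for illustration):

```python
def speedup(num_iterations_baseline: int, num_iterations_other_method: int) -> float:
    """Speed-up from equation 5.1. A value of 0.95 (95%) means the other
    method needs roughly half as many iterations as the baseline to reach
    the baseline's maximum mAP."""
    return num_iterations_baseline / num_iterations_other_method - 1.0
```

For example, a baseline that peaks at 200,000 iterations and another method that matches that mAP at 100,000 iterations gives a speed-up of 1.0, i.e. 100%.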
5.1 Loss and Uncertainty Samplers

In figure 5.1 we can see that, averaged over 10 runs, sampling training examples using the mean-object-reduced UIS is beneficial and speeds up training by 95%, meaning that the training time could be almost halved if no further performance is required. On the other hand, sampling based on the mean-object-reduced LIS does not appear to accelerate training. It is worth noting that measuring the speed-up this way may be subject to volatility and is best assessed with a significant number of runs averaged together. In this case, however, the plot in figure 5.1 indicates a consistent and clear superiority of the UIS when compared to the baseline.

Figure 5.1: Comparison of overall mAP on ZODFAR for models trained with Sequential (baseline), LIS, and UIS. The closed black line marks the speed-up for achieving the baseline's highest averaged mAP. The filled-in area marks ±1 standard deviation.

Looking closer at the results, specifically comparing performance on the different sizes of objects in figure 5.2, it is clear that both sampling approaches improve the learning of small objects. This is also where the model performs worst, as indicated by the mAP for small objects being lower than for medium or large ones. For medium objects, the UIS appears to outperform the baseline, which the LIS does not. For large objects, neither sampling strategy reaches the highest mAP of the baseline, so no speed-up can be measured.
Figure 5.2: Comparison of mAP, mAP_small, mAP_medium, and mAP_large on ZODFAR for models trained with Sequential (baseline), LIS, and UIS.

Figure 5.3 reveals the performance of the different samplers stratified by both class and size. The UIS outperforms both the baseline and the LIS in the majority of categories and only degrades the model noticeably in AP_large,VulnerableVehicle. It is outperformed by the LIS in AP_large,Animal. The LIS does, however, degrade the AP in most categories. Specifically, the performance in all large and all medium categories, except for AP_large,Animal, seems to be degraded with the use of the LIS.
Figure 5.3: Comparison of AP on ZODFAR stratified by bounding box size and class for runs with three different sampling heuristics: Sequential (baseline), LIS, and UIS. Each curve is plotted as an exponential moving average with a smoothing factor of 0.8 for visualization purposes. The corresponding runs can be found unsmoothed in figure A.3 in the appendix.

Through closer examination of AP_Animal and the stratified categories AP_small,Animal, AP_medium,Animal, and AP_large,Animal (see figure 5.4), it is apparent t