Predictive Performance and Calibration of Deep Ensembles Spread Over Time

A simple way of limiting the computational load of deep ensembles when applied to sequence data

Master's thesis in Data Science and AI

ALEXANDER BODIN
ISAK MEDING

Department of Electrical Engineering
Division of Signal Processing and Biomedical Engineering
Chalmers University of Technology
Gothenburg, Sweden 2023
www.chalmers.se

Master's Thesis 2023

© ALEXANDER BODIN, 2023.
© ISAK MEDING, 2023.

Examiner: Lennart Svensson, Chalmers University of Technology
Industrial Supervisors at Zenseact: Joakim Johnander, Christoffer Petersson, and Adam Tonderski

Department of Electrical Engineering
Division of Signal Processing and Biomedical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Schematic illustrating the proposed approach – an ensemble spread over time. Here the previous predictions from models that do not run inference on the current timestep also count towards the output of the model, as described in section 1.1.

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2023

Abstract

In recent years, machine learning models that can provide uncertainty estimates that match their observed accuracy have seen increased interest in academia. Such models are called calibrated, a quality essential for the safe application of neural networks in high-stakes situations. However, good calibration is not enough – high predictive performance is also essential. Autonomous driving (AD) is a setting where this combination of model qualities is much needed, with the additional requirement of real-time processing of sensor inputs such as camera video sequences. Deep ensembles (DEs) are state-of-the-art for non-Bayesian uncertainty quantification with high predictive performance. However, their deployment in AD has been limited due to their high computational load. We propose the deep ensemble spread over time (DESOT), a simple modification to DEs that seeks to limit their computational load on image sequence data by letting a single ensemble member perform inference on each frame of the sequence. We apply this proposed system to the problem of traffic sign recognition (TSR), a subfield of AD with a distinctly long-tailed class distribution.
DESOTs display predictive performance competitive with DEs for traffic sign classification, using only a fraction of the computational power. For in-distribution uncertainty performance, DESOTs outperform MC-dropout and perform on par with DEs. We conduct two out-of-distribution (OOD) experiments. First, we show that DESOTs increase calibration robustness to common augmentations compared to single models while matching DEs. Second, we test performance on a completely unseen class, for which all models increase their uncertainty in terms of output distribution entropy. Post-hoc calibration using temperature scaling is also evaluated and is shown to improve the uncertainty quantification performance of DESOTs, both in and out of distribution.

Keywords: Machine learning, artificial intelligence, computer vision, deep ensemble, deep neural network, uncertainty quantification, calibration, traffic sign recognition.

Acknowledgements

First of all, we would like to thank our industrial supervisors Joakim Johnander, Christoffer Petersson, and Adam Tonderski for their never-ending support and encouragement throughout our work with this thesis. Without the interesting discussions we had, working on this project would not have been nearly as fun and rewarding. We would also like to thank Zenseact for allowing us to use their facilities and computational resources, without which this thesis would have taken a lot longer to finish. Lastly, we would like to thank our examiner at Chalmers, Lennart Svensson, for facilitating this thesis project.

Alexander Bodin and Isak Meding, Gothenburg, June 2023

Acronyms

+ T: temperature scaling applied to the model
AD: autonomous driving
CNN: convolutional neural network
DE: deep ensemble
DESOT: deep ensemble spread over time
ECE: expected calibration error
MC-dropout: Monte Carlo dropout
MCE: maximum calibration error
ML: machine learning
NN: neural network
OOD: out-of-distribution
pp: percentage point(s)
px: pixel(s)
SM: single model
SotA: state of the art
TSR: traffic sign recognition
UQ: uncertainty quantification
ZOD: Zenseact open dataset

Contents

List of Acronyms
List of Figures
List of Tables
1 Introduction
  1.1 Background and context
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations
  1.5 Contributions
  1.6 Discussion of ethical and sustainability aspects
2 Theory
  2.1 Convolutional neural networks
    2.1.1 Residual networks
    2.1.2 Evaluating the predictive performance of neural networks
    2.1.3 Interpreting neural network outputs as probabilities for classification tasks
  2.2 Uncertainty quantification and calibration
    2.2.1 Expanding the concepts of predictive uncertainty and calibration to neural networks
    2.2.2 Additional ways to measure calibration
    2.2.3 Temperature scaling
    2.2.4 Aleatoric and epistemic uncertainty
  2.3 Neural network ensembles
    2.3.1 Deep ensembles
    2.3.2 Monte Carlo dropout
3 Methodology
  3.1 The machine learning task at hand
  3.2 Software libraries
  3.3 Data
    3.3.1 Annotations
    3.3.2 Data preprocessing and datasets
    3.3.3 Dataset implementation details
  3.4 Experimental approach
    3.4.1 Formal model definition
    3.4.2 Computational footprint
    3.4.3 Evaluating predictive performance
    3.4.4 Uncertainty quantification and the difficulties of measuring calibration
    3.4.5 Evaluating performance on OOD data
4 Empirical Findings
  4.1 Experimental setup
    4.1.1 Choice of model architecture and size
    4.1.2 Training and evaluation
  4.2 Predictive performance
    4.2.1 Evaluation on classes with few samples
    4.2.2 Discussion of predictive performance
  4.3 In-domain uncertainty quantification
    4.3.1 Discussion on in-domain uncertainty quantification
  4.4 Out-of-distribution uncertainty quantification
    4.4.1 Experiments on gradually augmented OOD data
    4.4.2 Discussion of performance on gradually augmented OOD data
    4.4.3 Experiments on complete OOD data
    4.4.4 Discussion of model performance on complete OOD data
5 Conclusion
  5.1 Future research
Bibliography
A Additional experiments
  A.1 Sequence quality
  A.2 Comparing model architecture size
  A.3 Temperature scaling

List of Figures

1.1 Single model architecture. One model performs inference on the input image at each time step.
1.2 Traditional ensemble architecture. Every ensemble member performs inference on the input image in each time step.
1.3 Schematic illustrating the proposed approach – an ensemble spread over time. Note that the number of forward passes is the same as for the single model in Figure 1.1. The previous predictions from models that do not run inference on this timestep also count towards the model output.
2.1 Example of a reliability diagram. The height of the blue bars is the average accuracy in each bin, and the height of the pink bars is the average confidence in each bin. The dotted gray line is the identity function that a reliable model's predictions follow.
3.1 An example of a frame from the ZOD frames dataset with 2D annotation boxes for traffic signs drawn on the image.
3.2 Random sample of still images from the training data, taken from the Frames subset of the ZOD.
3.3 Random sample of eleven sequences taken from the sequences dataset that is used for comparing the tested models.
3.4 Visualisation of the datasets used for the project. The three datasets train, validation, and test are all mutually exclusive subsets of the ZOD Frames dataset. The sequences dataset is an extension from the frames in the test dataset where the prior frames have been tracked and cropped. Note that ti denotes the ith frame of a sequence, i ∈ {1, ..., 11}, each frame in such a sequence originating from the same video as the corresponding t11 frame.
4.1 Graph comparing the predictive performance on the sequences dataset in terms of accuracy of a 5-member DESOT with a 5-member DE, a single model, as well as an MC-dropout model. The error bars are drawn for ±1 std. The ensemble spread over time (DESOT5) performs on par with the deep ensemble (DE5) despite requiring only 20% as much computation, while outperforming the single model.
4.2 Final-epoch predictive performance on the sequences dataset, comparing DESOT5 with a single model, DE5 and MC-dropout. The DESOT5 performs about as well as the DE5 on accuracy and F1-score, and both of these perform better than the single model and MC-dropout. Again, note that DESOT5 uses the same amount of computation as an SM or MC-dropout.
4.3 Final-epoch predictive performance on a minority class version of the sequences dataset, comparing DESOT5 with a single model, DE5 and MC-dropout. The DESOT5 outperforms the DE5 on both accuracy and F1-score. Additionally, it outperforms single models and MC-dropout by a large margin in both metrics.
4.4 Uncertainty quantification performance for each model on in-distribution data measured in Brier reliability. Lower is better. Temperature scaling seems to significantly improve the calibration for ensembles both on the test dataset and the sequences dataset. For single models, it instead seems to worsen calibration. Overall, MC-dropout is the worst calibrated out of all the models.
4.5 Uncertainty quantification performance for each model on in-distribution data measured in ECE. Lower is better. Temperature scaling seems to significantly improve the calibration for both single models and ensembles on the test dataset. However, for the sequence dataset, temperature scaling seems to increase ECE. Overall, MC-dropout is the worst calibrated out of all the models.
4.6 Illustration of the different augmentations used at various intensities, from no augmentation to maximal intensity.
4.7 Uncertainty quantification performance for each model on augmented data of increasing intensity. The performance is measured in accuracy, Brier reliability, and mean entropy. Lower Brier reliability is better. Tested on the sequences dataset. No model is clearly better or worse, though single models and MC-dropout are outliers in some respects.
4.8 Entropy for OOD data (red) compared to in-distribution data (blue) for the single-frame test and sequence datasets. The tests are run for various ensemble sizes M ∈ {1, 5, 10}, which are differentiated by color shade. Again, note that DE1 and DESOT1 are special cases, which are in effect both a single model (SM). The vertical dashed lines are the mean entropy for the model of the same color.
4.9 Entropy for in-distribution data (top row) compared to entropy for OOD data (bottom row) for the single-frame test dataset (left) and the sequences dataset (right).
4.10 Illustration of the thresholding strategy applied to the entropy of a single model with and without temperature scaling. The solid line is the in-distribution entropy, and the dashed line is the OOD entropy. The vertical gray dotted line is the optimal threshold for that particular model, which is the same as in Table 4.4. All data points to the right of the threshold, inside the red area, are classified as OOD. Note that the increased separation between the in- and out-of-distribution lines for the temperature scaled version allows for higher OOD-detection performance.
4.11 Examples of images with low entropy from the NotListed class. These are very similar to the in-distribution data and may be incorrectly annotated.
4.12 Examples of images in the NotListed class that are similar to in-distribution data. The top row contains samples from the training set (in-distribution) and the bottom row is similar signs that can be found in the NotListed class (out-of-distribution).
4.13 Examples of images with high entropy from the in-distribution and the OOD data. For the in-distribution data, we see the most problematic images for the model to classify. For the OOD data, these are the clearest examples of images of OOD data.
A.1 Accuracy per frame at each timestep of the sequence dataset for a vanilla ResNet18 model and a 5-member ResNet18 ensemble.
A.2 The crop size distribution for frames 1 and 11 across all sequences, plotted with 100 bins. The crop size is defined as the size (in pixels) of the smallest crop dimension.
A.3 Comparing the effects from the different methods of creating a single-frame dataset. The sequence frames are all the frames from the sequence randomly sampled without replacement. The Test frames are the last image in the sequence, which contains the most high-quality information. For the sequences dataset, a DESOT is applied. It seems that average input quality dominates quantity.
A.4 ResNet18 and ResNet50 compared on accuracy on the validation dataset across training epochs.
A.5 The optimal temperature for models trained for different numbers of epochs.
A.6 The reliability plots comparing temperature scaling for a DE5 using two different techniques – individual temperature scaling and joint temperature scaling. The results are shown for the single-frame validation dataset that the models are temperature scaled using.

List of Tables

2.1 Confusion matrix across a set of predictions for a binary classification problem.
4.1 Predictive performance for each model tested on the sequence dataset in terms of accuracy and F1-score. The results include plus and minus one standard deviation of performance between runs.
4.2 Augmentations and values used for OOD data creation. Brightness, saturation, and contrast all gradually decrease from their original values with increasing intensity.
4.3 Mean entropy on OOD data (NotListed class).
4.4 Table of results from applying an entropy threshold for OOD detection on the sequences dataset.

1 Introduction

Recently, deep machine learning has irrevocably changed the landscape for many industries, not least the automotive industry. The quest for autonomous driving (AD) vehicles is part of a big push, where machine learning (ML) techniques are key. A problem with the application of ML is that it has been shown that modern machine learning models are over-confident in their predictions, and have become more so amid the performance developments in recent decades [1].

In safety-critical applications, such as AD, the overconfidence of modern neural networks is particularly troublesome since the model's certainty in its output greatly influences what actions are sensible to take [2]. Naturally, it is then of great importance that the probability estimates that the model reports correspond to the actual predictive performance observed in the model's output over time. This is typically referred to as calibration and is the measure of how well the subjective output probabilities and the observed long-term prediction performance match [3]. A model that closely and precisely assesses its own uncertainty is referred to as well calibrated [4], [3].

There are two main paradigms of machine learning models: Bayesian and non-Bayesian models [5]. A Bayesian model treats model parameters as stochastic variables, each with a probability distribution that is updated during training based on the input data. This enables high-quality posterior distributions over the output space, with great uncertainty quantification performance. One such model is the Bayesian neural network. The disadvantage of this Bayesian approach is that the models are typically difficult to implement and slow to train compared to a normal neural network (NN) [6]. This limits the Bayesian models to smaller networks or various approaches for approximating some other Bayesian model. Therefore, non-Bayesian approaches such as ensembles are more common in practice. For an ensemble, a number of member models are trained and are all applied to each data point at the time of inference. Ensembles have been shown to produce improved predictive performance in machine learning [7], [8], with the disadvantage of increased computational load during training and inference, since it typically scales linearly with the number of ensemble members.

It has long been known that ensembles can quantify uncertainty in their predictions [9]. However, the ability of neural network ensembles to produce uncertainty estimates of a quality that rivals Bayesian models, while also achieving high predictive performance, was first demonstrated in 2017 by Lakshminarayanan et al. [6]. This allowed for a practical and high-performance alternative to the Bayesian approach.
This type of ensemble has since been called a deep ensemble (DE) [10]–[12], and continues to be the state of the art (SotA) in the field of uncertainty quantification (UQ) for machine learning [12], [13]. As previously mentioned, uncertainty estimation performance in terms of good calibration is essential in safety-critical tasks such as AD. Thus, the benefits of DEs are of interest in the AD space, but their large computational load limits deployment.

This thesis proposes a new model that we call a deep ensemble spread over time (DESOT), an augmented version of deep ensembles. This new model limits computational load while aiming to uphold the benefits of deep ensembles concerning predictive efficacy and uncertainty quantification performance.

1.1 Background and context

Autonomous driving systems employ a wide array of sensors that allow the vehicle to perceive its surroundings and take appropriate action [14], [15]. These many sensors are then used for localization and mapping, path planning, decision making, and ultimately vehicle control [14]. One of the most important of these sensors is the camera, and AD systems use a suite of them to gain a full surround view of the environment. These many cameras produce continuous video feeds that the car has to process in real time, which requires a lot of computational power due to the additional requirement of low latency.

One subtask of AD that uses the cameras of the vehicle is traffic sign recognition (TSR), which is all about detecting and classifying the traffic signs that the vehicle encounters on the road [16]. Modern TSR systems take the image sequences produced by the vehicle's cameras and use advanced machine learning models to classify the signs [17], [18]. Just like any deep machine learning system, these systems could benefit from using ensembles to boost performance. However, their application is limited by the computational resources available in the car, which have to be shared across all the functions previously mentioned.

This project is commissioned by Zenseact, a software company developing autonomous driving solutions for major car makers. Zenseact wants to explore the use case for a system that integrates the aforementioned aspects by investigating whether ensembles can be implemented in a way that improves performance compared to individual machine learning models, but which is more resource-efficient than traditional deep ensembles. The machine learning setting that is in focus is that which works with temporal information in the form of sequences of images. Each sequence tracks an object across time, meaning that each frame contains a slightly varying view of the same object.

Figure 1.1: Single model architecture. One model performs inference on the input image at each time step.

Figure 1.2: Traditional ensemble architecture. Every ensemble member performs inference on the input image in each time step.

Figure 1.3: Schematic illustrating the proposed approach – an ensemble spread over time.
Note that the number of forward passes is the same as for the single model in Figure 1.1. The previous predictions from models that do not run inference on this timestep also count towards the model output.

All machine learning models considered in this thesis are trained and produce their outputs in the single-frame setting, but their predictions are then aggregated across time steps in the sequence. In this manner, the typical way to apply a single model is to let it produce outputs for the single frame in each time step of the sequence, and then aggregate the predictions across time (see Figure 1.1). In a similar manner, a deep ensemble would be applied by letting every member produce their prediction for each time step. The outputs across members are then aggregated for the current frame before they are combined across time steps (see Figure 1.2). The proposed system combines the outputs of each member, just as for the traditional ensemble (see Figure 1.2), but only one member runs inference in each time step (see Figure 1.3). This means that the same number of forward passes are made as in the case of the single model approach, a fact that limits the computational load of the system.

1.2 Aim

The aim of this thesis is to implement an ensemble spread over time and evaluate it against high-performing baseline models in uncertainty quantification. Baseline models are established and the predictive performance of ensembles spread over time is compared to the performance of these. The system's uncertainty quantification performance in and out of distribution is also compared to the baselines. Additionally, temperature scaling [1], a post-training calibration method, and its effects on UQ performance are evaluated.

1.3 Research questions

Here follow the research questions that we aim to answer.

(A) Is it possible to achieve the benefits of a traditional deep ensemble in terms of predictive and uncertainty quantification performance using a deep ensemble spread over time?

This is the main research question of the thesis, but it is quite broad. To answer it, we also need to answer the following questions.

(B) How does the predictive performance of deep ensembles spread over time compare to that of the chosen baselines in terms of F1-score and accuracy?

(C) How does the UQ performance of deep ensembles spread over time compare to the chosen baselines in terms of ECE, MCE, and Brier reliability?

(D) Do deep ensembles spread over time successfully adjust their confidence on OOD data by increased entropy, and how does their performance compare to the chosen baselines?

1.4 Delimitations

Since the goal of this thesis is to investigate whether the benefits of using ensembles can be achieved using the proposed ensemble spread over time, only relatively standard machine learning architectures are used. This increases the generalizability of the results, and by extension also means that obtaining SotA predictive performance was not a priority of this thesis.

One of the main motivations of DESOT is to limit the computational load at the time of inference in order to enable the use of deep ensembles in settings with limited computational budgets. However, directly measuring the computational load of different ML models is highly non-trivial since many factors influence performance. Such factors include, but are not limited to, the speed of data transfer, memory capacity, as well as what other processes are running simultaneously.
Because of this, it was decided that the number of model forward passes was to be used as a proxy for the computational load. This proxy is relatively useful since all models tested have the same basic architecture. Before actually employing DESOT or any other ML model in a real system, a thorough performance test would have to be conducted on the machine in question.

While spatiotemporal fusion models such as those proposed by Ji et al. [19] and Tran et al. [20] are promising, they are challenging to design, obtain annotated data for, and train. Due to this, it is common to design detection, segmentation, and classification models for the single-frame setting. Then, the single-frame predictions are combined in some way. This is how the system proposed in this thesis works, and such systems have proved to work fairly well. Furthermore, the spatiotemporal fusion approach is in some ways orthogonal to the ensembling scheme proposed for this thesis, and the two could conceivably be combined. For example, a spatiotemporal convolution similar to that proposed by Tran et al. [20] might be used to create an optimal combination rule between the outputs from individual ensemble members, instead of using simple averaging. For these reasons, we chose to limit the scope of the thesis to models that do not explicitly model the additional information that comes from the temporal aspects of a sequence of frames.

1.5 Contributions

The main contribution of this thesis is to propose the idea of ensembles spread over time, an idea that to the best of our knowledge is novel. The method is compared to some of the most common non-Bayesian alternatives. DESOTs are shown to be competitive with traditional deep ensembles, both on pure predictive performance and on uncertainty quantification performance, while using a fraction of the computations. Experiments on predictive performance for rare classes are also conducted, and this appears to be a particular strength of DESOTs. While more work is needed to investigate whether the results generalize to other tasks, datasets, and model architectures, we believe that this project serves as a strong baseline for further research into ensembles spread over time.

1.6 Discussion of ethical and sustainability aspects

The advent of autonomous vehicles and more intelligent artificial intelligence will radically change our society in a multitude of ways. This project is a small part of the journey to fully autonomous driving cars and, although it might not have radical effects in isolation, the combined work of many related projects will. Therefore, potential issues related to ethics and sustainability must be taken into account.

The safety benefits of AD are a major reason why it is such an important innovation. Around the world, there are 1.35 million fatal road accidents every year [21], and more than 90 % of these are due to human error [22]. Vehicle safety is identified as a key factor in lowering these losses [21]. With AD software, the driving style of the world's safest human can hopefully be replicated and even exceeded, reducing the accident count greatly and in the best case lowering it to zero. The technology is not yet there, but this project might be a step on the way.
A working AD system requires millions of cameras on the road, recording at all times, which comes with the risk of these cameras being used for malevolent purposes. This makes it essential to maintain a significant level of security around the AD software and the data generated [22]. Personal integrity is also a key concern in this regard, as well as the impact on the greater society. This means that companies working in AD must have strict ethical guidelines in order to ensure that the data collected by their systems is handled in a correct manner.

Autonomous vehicles will most likely change the way we travel, providing convenient, fast, and cheap transport [23]. This will most likely come at the cost of having more vehicles on the roads, creating more pollution in the form of air particulates and noise. Part of this issue will be reduced by converting the fleet of cars to electric, but there will still be pollution from the road surface and rubber tires. Vehicle utilization with AD is likely to increase [24] as services for sharing vehicles or utilizing fleets like public transport are created, which will potentially increase the practicality of not owning a private vehicle. The result of this is a better utilization of Earth's limited resources, which will help reduce the environmental strain from the production of new vehicles, ultimately helping the world reach the UN's climate goals.

Equity and equality will be affected greatly around the world, especially in poorer communities. Transportation can be a big financial strain on families with low income, but in the long term with AD, the cost of transportation can come down to as little as a tenth of the cost of owning a private vehicle [24], close to the cost of public transport. This allows for more equity and equality as opportunities for work, education, and leisure become more accessible to everyone.

2 Theory

In this chapter, some of the most relevant previous work and the theory behind the concepts used will be presented, with the aim of placing the thesis into a context in the literature.

2.1 Convolutional neural networks

Convolutional neural networks (CNNs) are used for increasingly complex tasks, which in turn require an increasing number of parameters to be added. The added parameters come with a cost both in terms of memory and in terms of computational footprint for training and inference. However, it has been shown [25] that when the number of parameters increases, using ensembles will speed up training and inference, as well as increase prediction accuracy, compared to individual models with the same total number of parameters.

A CNN model consists of convolutional layers followed by one or more fully connected layers. The outputs from the final fully connected layer in a neural network are called logits. These are real-valued numbers that communicate the network's prediction.

2.1.1 Residual networks

Residual networks (ResNets) [26] are a type of deep convolutional neural network containing residual connections over blocks. A standard deep neural network can be hard to train since vanishing gradients become a problem as depth increases [26]. The vanishing gradient problem is caused by products of gradients of the loss function becoming increasingly small. In a deep network, the gradient may approach zero, leaving the early layers unable to update their weights. This can be mitigated by adding residual connections, which allow the gradient to be passed through the network and thus allow for deeper CNNs.
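To make the idea concrete, the sketch below shows a minimal residual block in PyTorch, the library used later in this thesis. It is an illustrative simplification that keeps the channel count fixed and omits downsampling, not the exact block used in ResNet18 or ResNet50.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: two convolutions plus an identity skip connection."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The identity shortcut lets gradients bypass the convolutions during
        # backpropagation, mitigating vanishing gradients in deep stacks of blocks.
        return self.relu(out + x)
```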
There are many different sizes of ResNets to choose from, for example ResNet18 or ResNet50, which contain 18 and 50 layers respectively.

2.1.2 Evaluating the predictive performance of neural networks

There are many different ways of evaluating the predictive performance of a machine learning model on a classification task. Given a binary classification problem, a model prediction can be characterized as either a true positive (TP), a false negative (FN), a false positive (FP), or a true negative (TN) (see Table 2.1).

Table 2.1: Confusion matrix across a set of predictions for a binary classification problem.

                          Prediction: True     Prediction: False
  Ground truth: True      True positive        False negative
  Ground truth: False     False positive       True negative

Over a data set for which the model produces predictions, the counts for each of these four positions of the confusion matrix can be computed. From them, a number of metrics can be derived. The first, and most commonly used [27], is accuracy, which is defined as

\[
\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}. \tag{2.1}
\]

One well-known problem with accuracy is that it is insensitive to class imbalances in the dataset, which for imbalanced datasets can lead to deceptively high perceived performance. To combat this problem, two additional metrics and a derivative of them might be employed. These are precision and recall, which are defined as [27]

\[
\text{precision} = \frac{TP}{TP + FP} \tag{2.2}
\]

and

\[
\text{recall} = \frac{TP}{TP + FN}. \tag{2.3}
\]

Intuitively, precision is the share of positive predictions that were actually positive, and recall is the share of true positives that were captured in the prediction. From these two metrics, the Fβ-score can be computed. It is defined as

\[
F_\beta = \frac{(\beta^2 + 1) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \tag{2.4}
\]

where β ∈ R, β ≥ 0 [27]. The Fβ-score ranges from 0 to 1, where higher is better. If β = 1, the score weights precision and recall equally and is then called the F1-score, often written without a subscript. This is the most common version of the Fβ-score.

There are some extensions of the binary F1-score to the multiclass classification domain. For these, we use the naming employed in their implementation in scikit-learn. One extension is the F1micro-score, which computes the confusion matrix entries sample-wise in a one-versus-all fashion, such that all samples are equally weighted in the resulting score. This is equivalent to accuracy. Another is F1macro, which applies the same operation class-wise and takes the unweighted average across classes. The former approach allows the majority classes to dominate the score, while the latter relatively over-weights prediction performance on samples belonging to rare classes. F1macro is useful in cases where performance on each class is equally important, and is usually what is meant when referring to the F1-score in a multiclass setting. This is also the naming convention we use.
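As a concrete illustration of the metric variants above, the snippet below computes accuracy, F1micro, and F1macro with scikit-learn, whose naming the thesis follows. The labels are made up for illustration; this is not the evaluation code used in the experiments.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground-truth and predicted class indices for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 2, 2]

accuracy = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")  # identical to accuracy
f1_macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean over classes

print(f"accuracy={accuracy:.3f}, F1micro={f1_micro:.3f}, F1macro={f1_macro:.3f}")
```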
2.1.3 Interpreting neural network outputs as probabilities for classification tasks

A classification neural network is a function that maps an input example to an output vector of dimension C, where C is the number of classes. In order to interpret neural network outputs as probabilities, it is common to use the function that in machine learning is colloquially called the softmax function. It is applied to the final neural network output logit vector z ∈ R^C and transforms each element z_j into the range [0, 1] such that the transformed elements sum to 1 over the C classes. Its mathematical expression was first introduced by Boltzmann [28] in 1868 as the Boltzmann distribution, and later used by Bridle [29] for interpreting neural net outputs as probabilities. Given an output logit vector z, the softmax value for the element z_j ∈ z is computed as

\[
\mathrm{softmax}(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{C} e^{z_k}}. \tag{2.5}
\]

Softmax applied to a vector is defined element-wise. The prediction ŷ for a neural network given an input example x is typically taken to be

\[
\hat{y} = \arg\max_{j} \left\{ \mathrm{softmax}(z_1), \mathrm{softmax}(z_2), \ldots, \mathrm{softmax}(z_j), \ldots, \mathrm{softmax}(z_C) \right\} \tag{2.6}
\]

where j is an index in the logit vector. The model's confidence in that prediction is then simply conf(ŷ) = softmax(z_ŷ). One notable property of the softmax function is that it is scale-variant due to the nature of exponentials, meaning that in general softmax(z) ≠ softmax(z/r) for all r ∈ R, r ≠ 1.
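The following small sketch shows how logits are turned into a categorical distribution, a predicted class, and a confidence score as defined above, using PyTorch; the logits are made up for illustration. Note how dividing the logits by a scalar changes the probabilities but not the argmax, a property exploited by temperature scaling in section 2.2.3.

```python
import torch
import torch.nn.functional as F

# Hypothetical logit vector z for one input example with C = 4 classes.
z = torch.tensor([2.0, -1.0, 0.5, 0.1])

probs = F.softmax(z, dim=0)        # categorical output distribution, sums to 1
prediction = torch.argmax(probs)   # predicted class, Equation 2.6
confidence = probs[prediction]     # probability assigned to the top prediction

# Scale-variance of softmax: rescaling the logits changes the probabilities
# (and hence the confidence) while leaving the predicted class unchanged.
probs_rescaled = F.softmax(z / 2.0, dim=0)

print(probs, probs_rescaled, prediction.item(), confidence.item())
```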
2.2 Uncertainty quantification and calibration

The theory of uncertainty quantification has long been of interest in academic circles, but it came to be more closely studied during the development of weather forecasting, where meteorological forecasters give subjective estimates of the probability of various weather phenomena. As we shall later see, these concepts quite naturally generalize to the machine learning context. One key aspect of the quality of a forecaster's predictions is their calibration, a concept which is commonly defined as the deviation of the forecaster's subjective probability assigned to an event from the long-term empirical probability of that event [3], [4], [30]. Calibration is also commonly referred to as reliability [31].

For the sake of clarity and mathematical correctness, consider an example adapted from DeGroot & Fienberg (1983) [3]. Assume that there are two possible events for a given day: rain, and no rain. For a given day i, a forecaster gives a subjective probability p_i of the chance of rain from a set of K possible values in the set P, such that p_i ∈ {P_1, ..., P_K}. Why limiting the number of possible values is necessary will become apparent later. At the end of the day, he observes the outcome y_i of the event, where y_i is an indicator variable that is 1 in the case of rain and 0 otherwise. Let ν(p) be the long-term distribution of the forecaster's predictions of rain. Next, let ρ(y_i = 1 | p_i) be the long-term empirical probability of rain given the forecaster's prediction of rain p_i. That is, given that the forecaster predicts a probability p_i of rain on day i, what is the actual probability that it will rain, based on historical predictions. With this background, a forecaster is said to be well calibrated if ρ(y = 1 | p) = p for all values of p ∈ P such that ν(p) > 0 [4]. For example, this means that out of 10,000 days on which a well calibrated forecaster claimed that the probability of rain was 0.4, it actually rained on 4,000 of them. This also gives insight into why the number of allowed values that the predictions can take on has to be limited — it is necessary to allow for statistically sound empirical estimates of ν(p) and ρ(y = 1 | p).

Interestingly and importantly, a forecaster can be well calibrated but produce completely uninformative predictions. The forecaster might simply observe the climatological base rate (long-term probability) of rain and give that as his prediction of the chance of rain every day. Using the definition of calibration from earlier, this will lead to well calibrated but useless predictions. Thus, the notion of refinement is given as a complement to calibration [3]. Assume that forecasters A and B are both well calibrated. Let forecaster A give the climatological base rate as his prediction on any given day, and let forecaster B give either p = 0 or p = 1 on any day and always be correct. Then A is least refined, since any well calibrated forecaster is at least as refined as him, and B is most refined. See DeGroot and Fienberg [3] for a more mathematically precise definition of refinement. Using this intuition regarding forecasters A and B, refinement can be seen as a measure of the usability of the forecasts given by the forecaster.

One important point of contention in meteorological circles early in the field's life cycle was how to evaluate the quality of a forecast. This created a need for metrics that the forecaster cannot artificially game by adjusting their forecast to optimize for a higher score [30], such as by simply mimicking the known base rate of the events [3]. This prompted developments in the theory surrounding proper scoring rules, which are reward functions that encourage forecasters to give their actual subjective forecast probabilities rather than trying to game the system. More formally, a scoring rule is one which rewards the forecaster S(p) if it rains and S(1 − p) if it does not, where p is the reported probability of rain. Suppose the forecaster believes the probability of rain to be p_a and wishes to maximize his reward. If the forecaster instead gives p_b as his prediction of rain, contrary to his actual subjective belief p_a, his expected reward is p_a S(p_b) + (1 − p_a) S(1 − p_b). If the scoring rule is proper, the choice p_b = p_a maximizes the forecaster's reward. If the scoring rule is strictly proper, then p_b = p_a is the only prediction that maximizes the reward [3], [32]. Here, we have assumed that the forecaster wishes to maximize his reward, but the same of course holds for the negation of a score that the forecaster wishes to minimize.

Many classification problems, such as weather forecasting, are not binary but multi-class, where an event can fall into any class c ∈ {1, ..., C}. A prediction of subjective probabilities for an event x_i, i ∈ {1, ..., N}, where N is the number of events, is to be given by the forecaster. Following the notation from the previous paragraph, but expanded to a multi-class problem, we denote the subjective probabilities given by the forecaster as a categorical distribution p(x_i) = (p_i1, p_i2, p_i3, ..., p_iC) across the classes, such that for any instance x_i the probabilities sum to 1. With this background, a score was developed by Brier [30] that has later become known as the Brier score and is defined as

\[
\mathrm{BS} = \frac{1}{N} \sum_{j=1}^{C} \sum_{i=1}^{N} (p_{ij} - y_{ij})^2 \tag{2.7}
\]

where y_ij is an indicator variable that is 1 if instance i falls into class j, and 0 otherwise. The Brier score of a perfect forecaster (forecaster B mentioned previously) will be 0. Forecasts of lower quality receive a higher Brier score. An important feature of the Brier score is that it is a strictly proper scoring rule [3]. More precisely, its negative is a strictly proper scoring rule. However, as previously discussed, this makes no difference in practice, since the same argument can be made for the negative of a score that the forecaster wishes to minimize. Murphy [31] showed that the Brier score can be decomposed into three distinct parts,

\[
\mathrm{BS} = \mathrm{Reliability} \underbrace{{} - \mathrm{Resolution} + \mathrm{Uncertainty}}_{\mathrm{Refinement}} \tag{2.8}
\]

and Bröcker [33] later showed more generally that any strictly proper scoring rule can be decomposed in such a way.
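To make Equation 2.7 concrete, here is a small sketch of the multi-class Brier score on made-up predictions; it is an illustration of the definition, not the evaluation code used later in the thesis.

```python
import numpy as np

def brier_score(probs: np.ndarray, labels: np.ndarray) -> float:
    """Multi-class Brier score (Equation 2.7).

    probs:  (N, C) array of predicted categorical distributions.
    labels: (N,) array of ground-truth class indices.
    """
    n, c = probs.shape
    onehot = np.eye(c)[labels]                 # the indicator variables y_ij
    return float(np.sum((probs - onehot) ** 2) / n)

# Hypothetical forecasts for N = 3 events and C = 3 classes.
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.5, 0.3],
                  [1 / 3, 1 / 3, 1 / 3]])
labels = np.array([0, 2, 1])
print(brier_score(probs, labels))  # 0 for a perfect forecaster; higher is worse
```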
The reliability part of the Brier score decomposition is often referred to as Brier reliability and is in itself a measure of calibration and can be used as such. In the binary case, it is defined by DeGroot and Fienberg [3] as

\[
\text{Brier reliability} = \sum_{p \in P} \nu(p)\,(p - \rho(p))^2. \tag{2.9}
\]

Remember from before that P is the set of allowed values that the subjective forecast probabilities p can take on, ρ(p) is the actual probability of rain given the forecaster's prediction, and ν(p) is the long-term distribution of the forecaster's predictions of rain. The Brier reliability score is non-negative, and a perfectly reliable forecaster gets a score of zero.

The Brier reliability can also be expanded to the multi-class setting, in which Brier [30] first defined his score,

\[
\text{Brier reliability} = \sum_{j=1}^{C} \sum_{k=1}^{K} \frac{n_{jk}}{N_j} \left( p_{jk} - \frac{o_{jk}}{n_{jk}} \right)^2 \tag{2.10}
\]

where K is the number of different values that the subjective probabilities can take on, n_jk is the number of times that the kth forecast value was issued for class j, N_j is the number of occurrences of class j, and o_jk is the number of those occasions on which class j actually occurred [34]. Note that Equation 2.10 is the empirical decomposition of the Brier score, which is why the term o_jk/n_jk in Equation 2.10 is used as an estimate of ρ(p). By the same reasoning, the term n_jk/N_j is used as an estimate of ν(p). This is also why the number of different values that the subjective probabilities are allowed to take on must be limited to only K. If the probabilities p_i were allowed to take on any real number in the range [0, 1], it would not be possible to calculate the empirical long-term distributions ν(p) and ρ(p), since the number of observations is finite while the possible output space is continuous and therefore infinite. Thus, in practice, binning of the forecast probabilities into K bins is used for such scenarios.

2.2.1 Expanding the concepts of predictive uncertainty and calibration to neural networks

We introduced the concept of calibration in the meteorological setting in section 2.2, and it is now time to expand it to the neural network setting. When introducing the concept, we used the definition that calibration is "the deviation of the forecaster's subjective probability assigned to an event from the long-term empirical probability of that event". Thus, we need to expand the meaning of subjective probability and long-term empirical probability. In the neural network setting of this thesis, we lean on the theory from subsection 2.1.3, where the softmax output distribution is interpreted as class-wise output probabilities. This is the subjective probability in the definition above. With "the long-term empirical probability of that event", we refer to the actual probability, across the whole dataset, that a given class was the ground-truth class given the probability that was reported in the categorical output distribution.

One interpretation of the uncertainty of the neural network in its output is the entropy of the predictive distribution. Here, entropy is meant in the information theoretical sense, which has intuitive connections to the physical interpretation. First introduced by Shannon [35] for digital communications, the idea of entropy in information theory is closely related to the average information content of the message to be relayed. Assume a message of length N, with C distinct characters, or values. If the fractions of the different characters' prevalence in the message are denoted p_j = n_j/N, j ∈ {1, ..., C}, where n_j is the count of the jth character, the informational entropy of the message can be expressed as

\[
\text{entropy} = -\sum_{j=1}^{C} p_j \log p_j \tag{2.11}
\]

which has later become known as Shannon's entropy. If the logarithm used is of base 2, then the unit is bits. If the natural logarithm is used, the unit is nats [35]. In information theory, Shannon entropy acts as a lower bound on the number of bits, or nats, needed to relay the message in its theoretically most compressed state.

Applying Shannon's entropy to the probability vector produced by the softmax function allows for an interpretation where the entropy of the categorical output distribution p(x_i), for an input example x_i, is a measure of model uncertainty. The intuition is that if the network is uncertain, we expect a flat output distribution across classes, which yields a high entropy. Conversely, if the network is very certain in its prediction, we expect a very sharp distribution, which results in a low entropy. The output distribution is a frequentist point estimate of the distribution over classes, and the uncertainty is therefore not the uncertainty across model parameters in the Bayesian sense.
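As a small illustration of using predictive entropy as an uncertainty measure, the sketch below computes the Shannon entropy (in nats) of softmax outputs for a sharp and a nearly flat logit vector; the numbers are made up.

```python
import torch
import torch.nn.functional as F

def predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (Equation 2.11, in nats) of the softmax distribution, per example."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)

logits = torch.tensor([[8.0, 0.0, 0.0],    # sharp distribution  -> low entropy (certain)
                       [0.1, 0.0, -0.1]])  # nearly flat         -> high entropy (uncertain)
print(predictive_entropy(logits))
```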
2.2.2 Additional ways to measure calibration

One common way to visualize the degree of calibration of a model is a reliability diagram [3], [36] (see Figure 2.1 for an example). It plots the observed accuracy of the model against the self-reported confidence that the model had in its predictions. For a perfectly calibrated model, the reliability diagram would follow the identity function [1]. Typically, the reliability diagram is drawn as a histogram with B equal-width bins, in which the height of any bin b ∈ {1, ..., B} is the expected (average) accuracy in that bin. The interpretation of model prediction and confidence from subsection 2.1.3 can be used to place each sample into the bin corresponding to the confidence of the top prediction for that sample. Letting L_b be the set of indices of the predictions in the bth bin, the accuracy and the confidence for that bin [1], [10] can be defined as

\[
\mathrm{acc}_{\mathrm{avg}}(L_b) = \frac{1}{|L_b|} \sum_{i \in L_b} \mathbb{1}(\hat{y}_i = y_i) \tag{2.12}
\]

and

\[
\mathrm{conf}_{\mathrm{avg}}(L_b) = \frac{1}{|L_b|} \sum_{i \in L_b} \mathrm{conf}(\hat{y}_i). \tag{2.13}
\]

Remember from before that we define model confidence as the probability assigned to the top prediction. Then, two numerical metrics can be computed based on the reliability diagram. These are the expected calibration error (ECE) and the maximum calibration error (MCE) [37]. ECE is computed as

\[
\mathrm{ECE} = \sum_{b=1}^{B} \frac{|L_b|}{N} \left| \mathrm{acc}_{\mathrm{avg}}(L_b) - \mathrm{conf}_{\mathrm{avg}}(L_b) \right| \tag{2.14}
\]

where N is the total number of predictions; it is the weighted average of the calibration gap across all B bins. Intuitively, this is the average deviation of the bins in the reliability diagram from the identity function, weighted by bin count. MCE is computed as

\[
\mathrm{MCE} = \max_{b \in \{1, \ldots, B\}} \left| \mathrm{acc}_{\mathrm{avg}}(L_b) - \mathrm{conf}_{\mathrm{avg}}(L_b) \right| \tag{2.15}
\]

and is the largest gap between the average bin accuracy and the average bin confidence across all B bins.
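The binning computation behind Equations 2.12–2.15 can be sketched as follows; the equal-width binning and the handling of empty bins are assumptions for illustration, not necessarily identical to the evaluation code used in the thesis.

```python
import numpy as np

def ece_mce(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 15):
    """Expected and maximum calibration error over equal-width confidence bins.

    confidences: (N,) probability assigned to each top prediction.
    correct:     (N,) 1 if the corresponding prediction was right, else 0.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce, n = 0.0, 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += (in_bin.sum() / n) * gap   # Equation 2.14
        mce = max(mce, gap)               # Equation 2.15
    return ece, mce

# Hypothetical confidences and correctness indicators for six predictions.
conf = np.array([0.95, 0.90, 0.80, 0.60, 0.55, 0.99])
corr = np.array([1, 1, 0, 1, 0, 1])
print(ece_mce(conf, corr, n_bins=5))
```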
It has since often been used for multi-class problems, and in that case, it only considers the most probable class for calibration evaluation. Vaicenavicius et al. [39] calls this induced binary classification since it relies on an implied true-class-versus-the-rest classification problem. This means that a large part of the information in the output distribution is quietly discarded, leading to a possibly uninformative score. Many practical applications of neural networks also require all probabilities to be calibrated, not just the top prediction [39]. Additionally, the B bins are all equally wide which leads to an unequal distribution of predictions in the bins since most modern NNs are over-confident [1], often leading to a heavy right-skew in the bin count distribution in the reliability diagram. This means that despite large numbers of validation samples the bins of lower confidence might be close to empty, leading to the performance on a few samples largely dominating the score since the absolute difference in all bins is weighted equally. Furthermore, the choice of the number of bins is a hyperparameter to which the reliability diagram is highly sensitive [40], 14 2. Theory 0.0 0.2 0.4 0.6 0.8 1.0 Confidence 0.0 0.2 0.4 0.6 0.8 1.0 Ac cu ra cy ECE: 0.0049 MCE: 0.17 Outputs Gap Figure 2.1: Example of a reliability diagram. The height of the blue bars is the average accuracy in each bin, and the height of the pink bars is the average confidence in each bin. The dotted gray line is the identity function that a reliable model’s predictions follow. making comparisons difficult and leading to a bias-variance tradeoff in the number of bins. Lastly, the predictions within a bin may have a high variance but close to zero mean in confidence, making the reliability diagram deceptive. For example, a bin may have a distribution of samples resembling a U shape or a uniform distribution, giving the bin the same mean but a different variance. For all these reasons, some modifications to ECE are proposed in [38]. These include adaptive bin widths to make the bin distribution uniform and other measures to increase how well the metrics reflect the calibration of the model. With these shortcomings in mind, reliability diagrams, and their derivative metrics remain widely used in literature and practice, perhaps due to their intuitive nature. 2.2.3 Temperature scaling Because of the properties of proper scoring rules outlined earlier in this section, it is to be expected that any neural network that is trained for a classification task with the use of a proper scoring rule as the loss function will be well calibrated by default since this minimizes loss. However, this is not what is observed empirically. Guo et al. [1] show that modern neural networks are over-confident when evaluated on an unseen validation data set. To combat this problem, they propose temperature scal- ing, a simple method where a single scalar T ∈ R, T > 0 is used to scale the output logits before softmax is applied. This affects the entropy of the output distribution, and if used in conjunction with the prediction interpretation in Equation 2.6 adjusts the network’s confidence. Importantly, the class order in the predicted distribution is not altered by this augmentation. In this framework, the temperature T that optimizes calibration is found by minimizing a proper scoring rule with respect to T on a separate validation data set. 15 2. Theory Guo et al. 
Guo et al. [1] claim that temperature scaling is the best-performing calibration method in most cases, while also being the fastest and simplest to implement. It has later been shown that a temperature scaled single model performs poorly on out-of-distribution data [13]. It should be noted that deep ensembles and temperature scaling are not mutually exclusive, but can be combined. Ashukha et al. [12] even go as far as to claim that all models, including ensembles, should be temperature scaled before they are compared on UQ performance, since some models might be uncalibrated by default. This notion is corroborated by Minderer et al. [41], who claim that temperature scaling helps unveil the underlying differences in calibration that are otherwise obscured by simple average under- or overconfidence.

2.2.4 Aleatoric and epistemic uncertainty

The total predictive uncertainty of a model's output is often partitioned into aleatoric uncertainty and epistemic uncertainty. Epistemic uncertainty is that which is due to a lack of knowledge about the world and can therefore be mitigated by the collection of additional information. An example of epistemic uncertainty might be out-of-distribution data, where the network has not been trained on a given class. On the other hand, aleatoric uncertainty is uncertainty that is inherent to the world, usually because of some stochasticity in the data generation, and it is therefore impossible to compensate for in the model [42]. An example of aleatoric uncertainty is the noise generated by a sensor or the inherent stochasticity of rolling dice. The distinction between aleatoric and epistemic uncertainty is useful since it allows for the formalization of which parts of the uncertainty can be reduced by way of model augmentation or extension of the dataset and which cannot [42]. Consequently, this thesis concerns itself with how well the chosen models perform in both the epistemic and the aleatoric case.

2.3 Neural network ensembles

An ensemble is a collection of models whose individual predictions are combined into a single output. Ensembling is a technique used in all fields of machine learning to obtain a higher-performing model from a diverse set of worse-performing ones [8]. This concept has been extended to neural networks, where the same effects can be seen when aggregating the results. There are a multitude of different ensembling techniques available to practitioners [12].

2.3.1 Deep ensembles

One of the many ensemble approaches that serves to promote diversity among ensemble members is random weight initialization of the neural network. This approach is used by Lakshminarayanan et al. [6] in combination with random shuffling of training data in their 2017 paper on uncertainty estimation using ensembles, and it is one of the most commonly used in practice due to its simplicity. This type of model is often called a deep ensemble (DE).

DEs were shown to have good predictive performance on par with other techniques, as well as high UQ performance both for in-distribution and out-of-distribution (OOD) data [6]. As described in the original paper, training DEs is quite simple, with three steps involved: the first is to use a proper scoring rule as the loss function, the second is to optionally use adversarial training to increase robustness, and the last is to train the ensemble using randomized initialization of model parameters to increase variety in the ensemble [6].
Many common loss functions, such as cross-entropy loss, are strictly proper scoring rules and can therefore be used in the deep ensemble framework. In practice, adversarial training is often omitted if improved robustness is not strictly necessary. It has been shown that deep ensembles are some of the best-performing models for uncertainty estimation [6], [12]. Ashukha et al. [12] find that deep ensembles are superior to any other model tested for uncertainty quantification given a fixed test-time budget. The intuition is that, because each member is trained independently, the members find different local minima in the high-dimensional loss landscape, which makes the ensemble's ability to quantify its uncertainty more robust [12]. Lakshminarayanan et al. [6] also demonstrated the attractive property that the ensemble decreases its prediction certainty on out-of-distribution examples, shown by evaluating an ensemble trained on the MNIST dataset on examples from the NotMNIST dataset, which contains letters instead of digits. It has later been verified that DEs are the SotA for UQ on OOD data [12], [13].

Now, let us define deep ensembles more formally. Assume a deep ensemble of M members. Following the notation used earlier, let $p_m(x_i)$ be the output distribution of the $m$th member of the ensemble, $m \in \{1, \ldots, M\}$, on an input example $x_i$ that is to be classified into one of $C$ distinct classes. This distribution is represented as a $C$-dimensional vector. Then, the output distribution $p_{\mathrm{DE}}(x_i)$ of the deep ensemble on input example $x_i$ is the element-wise average over the individual members' output distributions,
$$p_{\mathrm{DE}}(x_i) = \frac{1}{M} \sum_{m=1}^{M} p_m(x_i), \qquad (2.16)$$
i.e., the element-wise average across the categorical output distributions of all $M$ member models for input example $x_i$.

2.3.2 Monte Carlo dropout

Dropout was first introduced by Srivastava et al. [43] as a regularization measure during training to limit overfitting and increase the generalizability of the learned representation. With dropout, each neuron is turned off at random during training according to a pre-specified probability, the dropout rate $p$. This helps the network not to overfit, and therefore to generalize better, as it has to learn a more robust representation when any neuron can be dropped at any time. Recognizing that an ensemble of models is usually beneficial for performance, Srivastava et al. [43] show that dropout can be interpreted as sampling from an exponentially large set of possible smaller models, whose combination yields higher overall performance. Gal and Ghahramani [44] later showed that performing a number of forward passes through a model with dropout enabled and averaging the results can be seen as a Bayesian approximation. They chose to call this Monte Carlo dropout (MC-dropout) and claimed that it enables superior uncertainty estimation in both regression and classification tasks compared to vanilla models. Of note is that, since the introduction of MC-dropout, Lakshminarayanan et al. [6], Ashukha et al. [12], and Ovadia et al. [13] have all claimed that deep ensembles are superior in uncertainty quantification. However, MC-dropout remains widely used due to its simple implementation and general improvement of performance compared to vanilla single models.
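A minimal sketch of MC-dropout inference in PyTorch is given below; it is illustrative only (the implementation used in this thesis is described in chapter 4). Dropout layers are kept active at test time and the softmax outputs of several stochastic forward passes are averaged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def mc_dropout_predict(model, x, n_passes=10):
    # Put the model in eval mode, then re-enable only the dropout layers so
    # that e.g. batch normalization statistics stay fixed.
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()
    # Average the categorical output distributions over the stochastic passes.
    probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_passes)])
    return probs.mean(dim=0)
```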
3 Methodology

In this chapter, the methodology chosen for the study is outlined, beginning with a specification and motivation of the choice of machine learning problem, as well as a more formal definition of the proposed model. Then, the software libraries used are mentioned, along with short motivations. The data used for the training and evaluation of the models is described, along with the data pipeline used for the project.

3.1 The machine learning task at hand

One key consideration for the project was what kind of problem to apply the proposed ensemble approach to. Through discussions with Zenseact, we settled on a classification problem where the model should predict labels for cropped patch sequences of traffic signs. The reasoning behind choosing this problem instead of a more complex one is that the focus of the thesis is to evaluate the potential of ensembles spread over time. Implementing a working system for this purpose and investigating a variety of aspects of the proposed system was prioritized over solving more complex problems that in principle are no more novel than classification, such as object detection or semantic segmentation. Though the problem domain itself is not crucially important for the thesis, it framed the project and influenced some of the choices made during the design of the investigation. It also affects how the results should be interpreted. Therefore, a brief overview of the problem and its domain is given in the rest of this section.

The chosen problem falls into the domain of traffic sign recognition (TSR), a field with a decades-long history of development [45]. The first systems available to private end users were introduced in higher-end vehicles in the late 2000s or early 2010s as an aid for drivers. These systems were often limited to a few different classes of traffic signs [46]. Lately, TSR has become an important part of AD systems, where high and reliable performance is important for safety. There are two main subproblems of TSR: traffic sign detection and traffic sign classification [16]. This project concerns itself with the latter and assumes that regions of interest have already been identified earlier in the ML pipeline (in this specific case, by human annotators). The domain is characterized by a large number of classes with an imbalanced, distinctly long-tailed class distribution. Additionally, variations in illumination, perspective, and occlusion are common [45]. Furthermore, many of the classes are very similar in shape and color but carry important differences in meaning, such as speed limit signs. Deep learning has recently started revolutionizing this domain, with many models achieving accuracies of over 95% in research settings [47]. This means that any differences in predictive performance between the models are likely to be small in absolute terms, and performance benefits might instead lie in how the models handle difficult examples such as short sequences or obscured scenes.

3.2 Software libraries

The implementation of the data handling, the models, and the auxiliary code for the thesis was written in Python. The code for handling the data, training, evaluation, and storage of trained models was built using the PyTorch software library [48], an open-source project for deep learning in Python. This allowed for easily and quickly building modular code for deep learning purposes.
The base deep neural networks used are from the torchvision library, which is also part of the PyTorch project. This means that the networks could swiftly be integrated into the machine learning system. The models were trained on a computational cluster using a container environment, and experiments were tracked using the TensorBoard library for Python. The use of the container environment ensured compatibility across computational devices. Additionally, pandas and NumPy are used for handling tabular data, such as reading annotation files in CSV or JSON format, and for matrix computations.

3.3 Data

The data used in this thesis comes from a subset of the publicly available Zenseact open dataset (ZOD) [49], for which a dedicated development kit is available as a pip install. The data was collected, consolidated, and released by Zenseact in order to further developments and research within AD. It was collected over two years by the company's fleet of cars in 14 countries in Europe. The dataset consists of three separate parts – Frames, Sequences, and Drives. In this thesis, only the Frames were used, and of these, only the images from the 120° field-of-view, front-facing 3848 × 2168 pixel camera. An example of a frame from ZOD can be seen in Figure 3.1, where annotation boxes for the traffic signs have been overlaid. The Frames subset contains 100,000 still images and was employed as training data for the models, as well as for validation of single-frame classification performance.

Figure 3.1: An example of a frame from the ZOD Frames dataset with 2D annotation boxes for traffic signs drawn on the image.

3.3.1 Annotations

Each image in Frames contains annotations for a range of different dynamic and static objects, including 2D and 3D bounding boxes for pedestrians, traffic signs, poles, lane markings, and other relevant objects (see Figure 3.1 for an example). In the case of the Sequences, only the middle frame is annotated; the rest of the frames, both before and after, are not. For the purposes of this thesis, the 2D bounding boxes for objects of the static-object subclass Traffic signs were of interest. In total, there are around 446,000 distinct, annotated traffic signs in the 100,000 images in Frames. The annotations for the traffic signs have been created by a team of professional annotators and should not contain any annotations for signs that are not relevant to traffic, such as advertisements or billboards. The data contains 156 distinct classes of traffic signs, including two specialty classes – NotListed and unclear. The NotListed class is used by the annotators when the traffic sign in question does not fall into any of the other 155 classes; such signs might for example include destination signs. The unclear class is used when the class for some reason is difficult to determine, for example when the sign is heavily occluded or otherwise difficult for a human annotator to classify. The class distribution of the single-frame traffic sign dataset is distinctly long-tailed, where some classes have in excess of 10,000 unique instances, while some classes have fewer than ten examples.

Figure 3.2: Random sample of still images from the training data, taken from the Frames subset of the ZOD.

Figure 3.3: Random sample of eleven sequences taken from the sequences dataset that is used for comparing the tested models.
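As an illustration of how such a class distribution can be inspected, the sketch below counts crops per class from a flat annotation table; the file name and column name are hypothetical, and the actual metadata layout used in the thesis may differ.

```python
import pandas as pd

# Hypothetical metadata table with one row per annotated traffic sign crop.
annotations = pd.read_json("traffic_sign_crops.json")
class_counts = annotations["traffic_sign_class"].value_counts()

print(class_counts.head(10))  # most frequent classes
print(class_counts.tail(10))  # rarest classes in the long tail
```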
3.3.2 Data preprocessing and datasets

In order to make the data from ZOD usable for the purpose of the thesis, the 2D annotation boxes were used to crop out the traffic signs from the single-frame dataset. A padding of ten percent was added to each side of the 2D annotation box, and the crops were then saved in their raw format. In addition to the raw files, a separate JSON file with information on the annotations was created for each data set derived from the ZOD, containing the image and annotation IDs, the traffic sign class, and the height and width of each crop. This is useful further down the data preprocessing pipeline, since it allows for filtering of the data based on different criteria.

The standard 90-10 train-test split from the ZOD development kit was used to divide the Frames dataset into a training and a validation set (see Figure 3.2), with a share of the frames being reserved for a sequence set (see Figure 3.3). The sequences dataset used for the thesis (note the distinction from the ZOD Sequences) was created by taking a subset of the frames from ZOD Frames and extending them using the raw video feed from internal, unpublished, and unannotated data, to create sequences of consecutive images. The annotations from the ZOD single frames were then used with an implementation of ByteTrack [50] to create 2D annotation boxes for each frame in the sequences. These tracked annotations were then processed in the same way as the training and validation sets, by cropping with padding and saving each crop as an image. Care was taken to ensure no overlap between the single frames used for training and validation and the single frames used to extract sequences from the additional internal data. Including the final frame that is part of the ZOD, the tracked sequences are 11 frames long. In total, there are four distinct (sub)sets of data used for the thesis (see Figure 3.4). The tracking used for creating the sequences dataset is not perfect, and on average the tracking quality degrades with increasing distance from the annotated frame, i.e., the last frame in each sequence. A more thorough discussion of data quality is included in section A.1 in the appendix.

There are a few reasons for using the additional internal data instead of the ZOD Sequences. First of all, the ZOD Sequences were not available at the start of the work on the thesis. The use of additional data also increases the number of sequences available for model evaluation from the 1429 independent sequences in the ZOD Sequences to 29,359 independent sequences, meaning more statistically sound estimates of model performance can be made. As previously mentioned, TSR is distinctly long-tailed, which means that more data allows for higher coverage of the space of possible input data.

Figure 3.4: Visualisation of the datasets used for the project. The three datasets train, validation, and test are all mutually exclusive subsets of the ZOD Frames dataset. The sequences dataset is an extension of the frames in the test dataset, where the prior frames have been tracked and cropped. Note that $t_i$ denotes the $i$th frame of a sequence, $i \in \{1, \ldots, 11\}$, with each frame in a sequence originating from the same video as the corresponding $t_{11}$ frame.

3.3.3 Dataset implementation details

The datasets and data loaders were implemented in PyTorch. As previously mentioned, the implementation allows for data filtering.
This is done with pandas during dataset initialization. The filtering is based on explicitly excluded classes, which in practice was mainly used to filter out the data points labeled as unclear, but also classes with too few occurrences. This helped with training, since the unclear class is uninformative by nature, and classes with too few occurrences are uninformative due to a lack of variety. The filtering functionality also allows for excluding crops based on size (in pixels). This was useful when training the models, but also for evaluation when running experiments and diagnostics. When the dataset is queried for a new item by the data loader, a transform is first applied to the crop. For training, all crops were resized to 64 × 64 pixels, and then a random crop was made which reduced the size to 56 × 56 pixels. Then, normalization was applied, mapping the RGB values from their range [0, 255] to the range [−1, 1], which in practice speeds up training. The transform used for testing and validation was the same as for training but without the random cropping, to ensure that the results are deterministic.

3.4 Experimental approach

The core approach of the thesis is to compare a single model and a standard ensemble of M members with an ensemble model of the same number of members, but where inference is spread among the members and images over time. Thus, the thesis extends previous literature conducted on single still frames to sequences of frames and proposes a simple and novel approach to applying ensembles on such sequence data. The single model serves as an expected lower bound on performance and a deep ensemble [6] serves as the performance target, referred to as the upper bound. Do note that the upper bound is not actually tractable to deploy in a car due to the limited computational budget; it is simply included as a theoretical maximum of the performance of DESOT that we might hope to observe. We have chosen to refer to these two bounds as the baselines. The thesis investigates where between the two performance bounds the proposed model falls. In practice, this was done by comparing the DESOT to the two baseline models. MC-dropout [44] was used as an extra model of comparison, chosen because of its simple implementation and common use by practitioners.

3.4.1 Formal model definition

Define a sequence $x \in \mathbb{R}^{T \times H \times W \times 3}$ as a vector of $T$ distinct still images, each with a height of $H$ pixels, a width of $W$ pixels, and three separate color channels. Now, we have a classification problem where a model must produce a categorical output distribution across $C$ classes for such a sequence $x$. Assume that there is a set of $M$ different deep NN models that can each conduct this classification. A single model $m \in \{1, \ldots, M\}$ produces a categorical output distribution $p_m(x_t)$ for each single image $x_t$ at timestep $t \in \{1, \ldots, T\}$ in the sequence $x$. Then, the final output distribution for model $m$ is defined as
$$p_m(x) = \frac{1}{T} \sum_{t=1}^{T} p_m(x_t), \qquad (3.1)$$
which is the element-wise (class-wise) average across the output distributions for each image in the sequence at different time steps. This setup is the lower baseline we use for this thesis – a single model that produces a prediction at each time step (see Figure 1.1 for a visualization). In practice, one would use a window size such that Equation 3.1 constitutes a moving average across some images.
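A minimal sketch of this per-frame averaging (Equation 3.1) for a single model is shown below, assuming `frames` is a tensor of shape (T, 3, H, W); the function name is illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def single_model_sequence_prediction(model, frames):
    # One forward pass per frame, treating the T frames as a batch,
    # followed by an element-wise average of the softmax outputs (Equation 3.1).
    model.eval()
    logits = model(frames)             # shape (T, C)
    probs = F.softmax(logits, dim=-1)  # per-frame categorical distributions
    return probs.mean(dim=0)           # averaged distribution over the sequence
```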
Due to the shortness of our sequences, only 11 frames, we have chosen to average across the entire sequence for a model. Now imagine that all M models are used to classify each image in the sequence, such that the final output distribution for the sequence is
$$p_{\mathrm{DE}}(x) = \frac{1}{M} \sum_{m=1}^{M} p_m(x) = \frac{1}{MT} \sum_{m=1}^{M} \sum_{t=1}^{T} p_m(x_t), \qquad (3.2)$$
which is how we choose to apply deep ensembles [6] to sequences – averaged across the images of the sequence $x$. See Figure 1.2 for a visualization of a DE applied to a sequence of images. DE$_M$ will be used to denote an $M$-member deep ensemble.

The proposed model, which we call deep ensemble spread over time (DESOT), instead uses only one model $m \in \{1, \ldots, M\}$ for any given image $x_t$, $t \in \{1, \ldots, T\}$, to produce a categorical output distribution, but the models are alternated such that any given model $m$ is used on average $T/M$ times for a sequence of $T$ images. Analogously to the notation used for DEs, DESOT$_M$ will denote an $M$-member deep ensemble spread over time. Assume that the order in which the models are used on the sequence is defined by an ordered list $O$, $|O| = T$. For any sequence, the final output distribution of the DESOT across the $C$ classes is
$$p_{\mathrm{DESOT}}(x) = \frac{1}{|O|} \sum_{t=1}^{|O|} p_{O_t}(x_t), \qquad (3.3)$$
where $t$ is a time step, $t \in \{1, \ldots, T\}$, and $p_{O_t}(x_t)$ is the output distribution of the ensemble member indicated at the $t$th position in $O$. This definition of DESOT perhaps trivially, but also importantly, means that it can only be applied to sequences, because it fundamentally relies on alternating the members across neighboring frames. If the DESOT model were to be used on a sequence of length one, it would be equivalent to a standard single model. See Figure 1.3 for a visualization of a DESOT.

3.4.2 Computational footprint

One practical aspect that is key to whether an ML model can be employed at all for a certain problem and situation is the size of its computational footprint. If a model's computational footprint is larger than what the computational capacity of the system running the model can handle within the latency requirements of the specified task, then it simply cannot be used. This is one of the main motivations for DESOT, i.e., that it limits the computational resources needed to run the model compared to a traditional deep ensemble. Assume that running inference on a sequence x requires T computations for a single model. Performing inference on that same input then requires MT computations for a DE with M members, meaning that the computational footprint scales linearly with the number of ensemble members. This is a problem in the AD space, since all computations have to be performed in the car in real time with minimal latency. Furthermore, there are many tasks other than traffic sign recognition that have to be performed by the system at each timestep, which means that each subsystem has even stricter limits on its computational footprint. This motivates investigating whether DESOTs, which only run inference with one member each timestep, can perform well while limiting the computational footprint of the ensemble.

3.4.3 Evaluating predictive performance

In connection to Research questions A and B, the main evaluation criterion for predictive performance is to evaluate the different models on the sequences and compare them in terms of F1-score. As previously mentioned, TSR is a domain that typically has many classes and an unequal class distribution.
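For reference, a macro-averaged F1-score weighs every class equally regardless of how many samples it has, which is one common choice when all classes matter; the sketch below uses scikit-learn, which the thesis does not explicitly name, so treat it as an assumption.

```python
from sklearn.metrics import f1_score

def macro_f1(y_true, y_pred):
    # average="macro" gives each class equal weight, so rare traffic sign
    # classes influence the score as much as the majority classes do.
    return f1_score(y_true, y_pred, average="macro")
```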
Since all classes are important, evaluating the performance of DESOT explicitly on rare classes is of value, since the model performance on these classes is otherwise obscured by the majority classes. Additionally, a key aspect of answering Research question B is the effect of ensemble size and sequence length on the observed performance of the models.

3.4.4 Uncertainty quantification and the difficulties of measuring calibration

Regarding Research question C, a key point of interest is to investigate whether the high UQ performance of deep ensembles that has been noted by a number of authors [6], [12], [13] extends to DESOT. In connection with the same research question, the effects of temperature scaling on the calibration of single models and ensembles are of interest.

A discussion regarding the difficulty of measuring and quantifying calibration is in order. As is accounted for in section 2.2, finding the calibration of a model requires knowing the long-term distribution $\nu(p)$ of predictions $p$ for each class $c$, as well as the long-term distribution $\rho(y = c \mid p)$ of the probability of the class being true given the model prediction. In theory, these distributions should be stationary and an infinite number of observations should have been made. In practice, of course, these assumptions never hold, and so we have to make do with empirical estimates. These estimates are in practice made by binning the predictions into K bins, which is necessary since the number of observations is limited. Metrics based on reliability diagrams, such as ECE or MCE, only take the probability assigned to the top class into account, which we have chosen to refer to as confidence. Good calibration by these metrics only means that the confidence is calibrated, not the probabilities assigned to the other C − 1 classes. The uneven class distribution in TSR means that confidences for minority classes are sparse, so the quality of the empirical estimates of the aforementioned distributions is poor, which only compounds the issue. The equal weighting of confidence buckets despite varying population sizes also worsens the problem. For a review of issues with ECE, see subsection 2.2.2.

For real-life applications such as TSR, higher requirements are usually placed on the outputs of AI models. This means that not only the top-class confidence should be calibrated, but also the probabilities assigned to the other C − 1 classes [39]. Brier reliability, as implemented for this thesis, takes these other C − 1 class probabilities into account. This means that for some classes, the number of samples that can be used for estimating calibration increases from fewer than ten to the full size of the dataset, compared to if only confidence is used for measuring calibration. This raises the validity of the calibration estimates included in this thesis. However, due to the widespread use of ECE, results in ECE are also included, but these should be interpreted warily.

3.4.5 Evaluating performance on OOD data

An essential aspect of any ML task is to have a model that generalizes well enough to handle OOD data. The Zenseact open dataset (ZOD) is used for the project, and more information about it can be found in section 3.3. In the ZOD, there are many examples of traffic signs labeled as NotListed, as well as unclear images that are hard to classify even for a human. These have their own class labels in the ZOD.
This allowed us to qualitatively test how well the models perform on OOD data, in the sense that we would like the models to exhibit high uncertainty, measured using entropy, for OOD examples. It has previously been shown that deep ensembles perform well in this kind of qualitative OOD evaluation [6], [12]. However, their performance has not been tested when applied to sequences of single frames, nor has it been compared to ensembles spread over time. This means that there is a gap in research that is interesting to explore.

Another way of testing the OOD performance of ML models is to use augmentations of varying intensity and observe how the accuracy, confidence, and uncertainty displayed by the models change with the augmentation intensity. This approach is employed by Ovadia et al. [13], who use a set of augmentations including rotations and blur. We have chosen to refer to this kind of OOD generation as progressive OODness, due to the increasing intensity of augmentation, but Ovadia et al. [13] refer to it as shifted OOD data. They show that DEs are SotA on this sort of progressive OOD [13]. This kind of comparison on OOD data is interesting since it mimics some of the edge cases that TSR systems might encounter in the real world, such as rotated signs. It also allows for other comparisons to be made, since one of the conclusions of Ovadia et al. [13] is that in-distribution UQ performance is not a good indicator of OOD UQ performance; they found this to be especially true for MC-dropout and temperature scaling of single models. What they did not test was the performance of temperature scaling applied to ensembles or MC-dropout. We employ the same approach to characterize the behavior of the models on progressive OOD data, as a complement to their performance on complete OOD data. We also extend the literature by bringing the discussion to sequence data and DESOTs.

4 Empirical Findings

To answer the research questions, a set of experiments was performed. In this chapter, the experimental setup for these experiments, as well as the empirical findings, are presented. After the results for each aspect of model performance have been shown, they are discussed.

4.1 Experimental setup

In this section, the experimental setup that was used to obtain the results is introduced. The training and evaluation procedures are also accounted for.

4.1.1 Choice of model architecture and size

All models used for this thesis are based on the ResNet [26] CNN architecture. There were a few reasons for this choice. First of all, most literature testing the performance of deep ensembles includes variants of this model architecture, e.g. [12], [13], [38]. The influential paper by Guo et al. [1], which introduced temperature scaling as a means to combat the overconfidence of modern neural networks, also used a variant of this model to empirically support its claims. Thus, its use in the closest related literature makes it suitable for our investigation. From a practical point of view, the ResNet models have been shown to provide strong performance for image classification on many datasets, including ImageNet and CIFAR-10 [26]. They are also known to work well for TSR [17], [18], [51].
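For illustration, adapting a torchvision ResNet to a traffic-sign classification head can look as in the sketch below; the helper name and the number of classes are placeholders, and the specific variant used in the thesis is motivated next.

```python
import torch.nn as nn
from torchvision import models

def build_resnet_classifier(num_classes, arch="resnet18"):
    # Instantiate a ResNet from torchvision without pretrained weights and
    # replace the final fully connected layer with a traffic sign classifier.
    model = getattr(models, arch)(weights=None)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```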
A small study was conducted in order to decide what size of ResNet model to use (see section A.2 in the appendix), the conclusion of which was that the performance difference between ResNets of different sizes is negligible for the chosen problem. Therefore, the smaller ResNet18 version was chosen as the final base model for all experiments. MC-dropout was implemented on the same ResNet18 architecture as the vanilla models, but with an additional dropout layer after each non-linearity that was kept active during testing. A dropout rate of 0.2 was used. Due to time constraints, the dropout rate was not tuned to achieve optimal performance.

4.1.2 Training and evaluation

All models are trained from scratch on the single-frame training dataset described in subsection 3.3.2, using a base learning rate of 0.0005 with the AdamW optimizer [52]. The PyTorch implementation of cosine annealing, first introduced by Loshchilov and Hutter [53], is used to schedule the learning rate to progressively decrease during training for faster and more stable convergence. A batch size of 256 is used. The NotListed and unclear classes are excluded from the training set, along with classes with fewer than 10 occurrences in total between the training and validation datasets. Crops that are smaller than 16 pixels along the smallest dimension are also excluded from the data. In line with the definition of deep ensembles by Lakshminarayanan et al. [6], all members were trained separately on the same data with random weight initialization. The optional adversarial training was not employed. The cross-entropy loss function was used since it is a proper scoring rule.

In total, 25 independent vanilla ResNet18 models were trained, along with five MC-dropout versions of the same. In order to ensure statistically sound performance estimates, all 25 models were run to obtain single-model performance, and five independent 5-member ensembles were created, as well as five 10-member ensembles. For the evaluation of MC-dropout, all five models were evaluated separately.

Here follows a glossary of the notation used for the different models throughout the rest of the report. When evaluated on the sequences dataset, all models use simple averaging as the rule for combining predictions across frames.

• DE$_M$: An $M$-member deep ensemble operating on an image. When evaluated on the sequences dataset, DE$_M$ means that a full $M$-member deep ensemble is run on each frame.
• DESOT$_M$: An $M$-member deep ensemble spread over time, where each member operates on a different frame in the sequence. For a more formal definition of DESOTs, see subsection 3.4.1.
• Single model (SM): A standard single ResNet18 model. Note that this is conceptually the same as a one-member ensemble, and is therefore equivalent to both a DE$_1$ and a DESOT$_1$.
• MC-dropout: A single ResNet18 model trained and evaluated with an extra dropout layer after each non-linearity. The implementation is similar to a single model in the sense that one forward pass is conducted for each time step in the sequence; the outputs are then averaged across time.
• + T: A suffix added to any of the previous model names to denote that temperature scaling has been applied to that model.

Table 4.1: Predictive performance for each model tested on the sequences dataset in terms of accuracy and F1-score. The results include plus and minus one standard deviation of performance between runs.
Model          Accuracy           F1-score
SM             0.9734 ± 0.0006    0.8112 ± 0.0175
DESOT$_5$      0.9760 ± 0.0003    0.8326 ± 0.0093
DE$_5$         0.9764 ± 0.0001    0.8273 ± 0.0101
MC-dropout     0.9710 ± 0.0009    0.7679 ± 0.0271

Figure 4.1: Graph comparing the predictive performance on the sequences dataset, in terms of accuracy over training epochs, of a 5-member DESOT with a 5-member DE, a single model, and an MC-dropout model. The error bars are drawn for ±1 std. The ensemble spread over time (DESOT$_5$) performs on par with the deep ensemble (DE$_5$) despite requiring only 20% as much computation, while outperforming the single model.

4.2 Predictive performance

We evaluate the predictive performance of the different models on the sequences dataset. Note that temperature scaling does not affect the ordering of the predicted classes, and thus not the final model prediction; therefore, results for temperature scaling are omitted from this part of the results. As can be seen in Figure 4.1, all models reach high accuracy in the later epochs, with DESOT$_5$ and DE$_5$ performing markedly better than the other models at early epochs. Notably, DESOT$_5$ performs on par with the traditional DE, and these two models remain slightly better than the other models even after single-model convergence. Final-epoch performance is summarized in Figure 4.2. Compared using F1-score, the difference between models is greater in absolute terms, with DESOT performing about as well as the DE. For both metrics, MC-dropout performs decidedly worse than the other models.

Figure 4.2: Final-epoch predictive performance (accuracy and F1-score) on the sequences dataset, comparing DESOT$_5$ with a single model, DE$_5$, and MC-dropout. The DESOT$_5$ performs about as well as the DE$_5$ on accuracy and F1-score, and both of these perform better than the single model and MC-dropout. Again, note that DESOT$_5$ uses the same amount of computation as an SM or MC-dropout.

4.2.1 Evaluation on classes with few samples

Because of the high performance of DESOT when measured using F1-score, which weighs performance on all classes equally, an additional experiment on rare classes was conducted. This was done by filtering out any classes with more than 500 occurrences in the training and validation datasets, which resulted in 625 sequences. The remaining classes are what we consider rare, and these were used to evaluate the models on rare-class performance. The predictive performance results for this subset of the sequences dataset are shown in Figure 4.3. DESOT performs very favorably in this comparison, outperforming all other models, including the DE. MC-dropout performs by far the worst. Just as for the predictive performance on the whole sequences dataset, the ensembles decrease the variance of the model performance compared to single models.

Figure 4.3: Final-epoch predictive performance (accuracy and F1-score) on a minority-class version of the sequences dataset, comparing DESOT$_5$ with a single model, DE$_5$, and MC-dropout. The DESOT$_5$ outperforms the DE$_5$ on both accuracy and F1-score.
Additionally, it outperforms the single models and MC-dropout by a large margin in both metrics.

4.2.2 Discussion of predictive performance

For the task of classifying traffic signs, one of the most important aspects is the predictive performance, or how well the model can classify the different signs. The results displayed in Figure 4.1 indicate that TSR might be a task on which it is easy to achieve high levels of predictive performance: all models, including single models, achieve an accuracy exceeding 97%. However, the results also show that our model, DESOT$_5$, performs very well compared to the baseline models. The lower baseline, SM, achieves a slightly lower accuracy than our model. The upper baseline, DE$_5$, achieves a marginally higher accuracy than our model, but within one standard deviation.

Looking at the results in terms of F1-score, where performance on minority classes is weighted more heavily, somewhat changes the story, with larger absolute differences in performance between models. DESOT$_5$ and DE$_5$ both have significantly higher average scores than single models and MC-dropout. The single models display a large variance in performance, while both ensembling techniques have a smaller variance. This seems to suggest that classes that occur rarely in the training data are where the main predictive performance benefits of DESOT over single models might lie. In Figure 4.3, the DESOT model seems to outperform the standard DE, which is unexpected since both models have access to the same information in the sequences. One reason might be the inherent noise of predictions on rare classes: while the full ensemble averages predictions from more inferences per frame, the DESOT prediction is not diluted to the same extent, resulting in higher probabilities for the correct rare class.

MC-dropout is an interesting method to compare against, as the dropout creates a slightly different model on each forward pass, which is a form of ensembling, as shown by Gal and Ghahramani [44]. The MC-dropout model does not perform as well as the other models. This might be due to a too-high dropout rate, a hyperparameter that was not thoroughly tuned to maximize performance.

All in all, these results show that our method performs significantly better than a single model with a similar computational footprint, and on par with a DE$_5$, whose computational footprint is five times larger than our model's. The reason for the increased performance over single models might be that the members of an ensemble together have a more expressive representation of the space of possible traffic signs than a single model. The benefit of DESOTs is that they allow for using this more expressive representation while limiting computation. Still, the extra computations performed by DEs do not seem to benefit predictive performance. Perhaps this is because of the relative simplicity of the traffic sign classification task, which means that the extra computations of the DE yield minimal benefits. If these results can be replicated for other tasks and datasets, this is a significant finding.

4.3 In-domain uncertainty quantification

When considering the results in this section, keep in mind the discussion regarding the difficulties of measuring calibration in subsection 3.4.4. For in-domain uncertainty quantification, all models are not only compared against each other but also against their temperature-scaled counterparts.
The temperature scaling is optimized on the single-frame validation set. In section A.3 in the appendix, additional results and observations about temperature scaling are presented. The results for in-domain calibration measured in Brier reliability are shown in Figure 4.4. The results when measured in ECE are shown in Figure 4.5.

Figure 4.4: Uncertainty quantification performance for each model on in-distribution data, measured in Brier reliability (lower is better). (a) Single-frame test dataset. (b) Sequences dataset. Temperature scaling seems to significantly improve the calibration for ensembles both on the test dataset and the sequences dataset. For single models, it instead seems to worsen calibration. Overall, MC-dropout is the worst calibrated of all the models.

Figure 4.5: Uncertainty quantification performance for each model on in-distribution data, measured in ECE (lower is better). (a) Single-frame test dataset. (b) Sequences dataset. Temperature scaling seems to significantly improve the calibration for both single models and ensembles on the test dataset. However, for the sequences dataset, temperature scaling seems to increase ECE. Overall, MC-dropout is the worst calibrated of all the models.

4.3.1 Discussion on in-domain uncertainty quantification

The ca