Predictive Performance and Calibration of Deep Ensembles Spread Over Time

A simple way of limiting the computational load of deep ensembles when applied to sequence data

Master's thesis in Data Science and AI

ALEXANDER BODIN
ISAK MEDING

Department of Electrical Engineering
Division of Signal Processing and Biomedical Engineering
Chalmers University of Technology
Gothenburg, Sweden 2023
www.chalmers.se

Master's Thesis 2023

© ALEXANDER BODIN, 2023.
© ISAK MEDING, 2023.

Examiner: Lennart Svensson, Chalmers University of Technology
Industrial Supervisors at Zenseact: Joakim Johnander, Christoffer Petersson, and Adam Tonderski

Department of Electrical Engineering
Division of Signal Processing and Biomedical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Schematic illustrating the proposed approach – an ensemble spread over time. Here the previous predictions from models that do not run inference on the current timestep also count towards the output of the model, as described in section 1.1.

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2023

Abstract

In recent years, machine learning models that can provide uncertainty estimates that match their observed accuracy have seen increased interest in academia. Such models are called calibrated, a quality essential for the safe application of neural networks in high-stakes situations. However, good calibration is not enough – high predictive performance is also essential. Autonomous driving (AD) is a setting where this combination of model qualities is much needed, with the additional requirement of real-time processing of sensor inputs such as camera video sequences. Deep ensembles (DEs) are state-of-the-art for non-Bayesian uncertainty quantification with high predictive performance. However, their deployment in AD has been limited due to their high computational load. We propose the deep ensemble spread over time (DESOT), a simple modification to DEs that seeks to limit their computational load on image sequence data by letting a single ensemble member perform inference on each frame of the sequence. We apply this proposed system to the problem of traffic sign recognition (TSR), a subfield of AD with a distinctly long-tailed class distribution.
DESOTs display predictive performance competitive with DEs for traffic sign classification, using only a fraction of the computational power. For in-distribution uncertainty performance, DESOTs outperform MC-dropout and perform on par with DEs. We conduct two out-of-distribution (OOD) experiments. First, we show that DESOTs increase calibration robustness to common augmentations compared to single models while matching DEs. Second, we test performance on a completely unseen class, for which all models increase their uncertainty in terms of output distribution entropy. Post-hoc calibration using temperature scaling is also evaluated and is shown to improve the uncertainty quantification performance of DESOTs, both in and out of distribution.

Keywords: Machine learning, artificial intelligence, computer vision, deep ensemble, deep neural network, uncertainty quantification, calibration, traffic sign recognition.

Acknowledgements

First of all, we would like to thank our industrial supervisors Joakim Johnander, Christoffer Petersson, and Adam Tonderski for their never-ending support and encouragement throughout our work with this thesis. Without the interesting discussions we had, working on this project would not have been nearly as fun and rewarding. We would also like to thank Zenseact for allowing us to use their facilities and computational resources, without which this thesis would have taken a lot longer to finish. Lastly, we would like to thank our examiner at Chalmers, Lennart Svensson, for facilitating this thesis project.

Alexander Bodin and Isak Meding, Gothenburg, June 2023

Acronyms

+ T: temperature scaling applied to the model
AD: autonomous driving
CNN: convolutional neural network
DE: deep ensemble
DESOT: deep ensemble spread over time
ECE: expected calibration error
MC-dropout: Monte Carlo dropout
MCE: maximum calibration error
ML: machine learning
NN: neural network
OOD: out-of-distribution
pp: percentage point(s)
px: pixel(s)
SM: single model
SotA: state of the art
TSR: traffic sign recognition
UQ: uncertainty quantification
ZOD: Zenseact open dataset

Contents

List of Acronyms
List of Figures
List of Tables
1 Introduction
  1.1 Background and context
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations
  1.5 Contributions
  1.6 Discussion of ethical and sustainability aspects
2 Theory
  2.1 Convolutional neural networks
    2.1.1 Residual networks
    2.1.2 Evaluating the predictive performance of neural networks
    2.1.3 Interpreting neural network outputs as probabilities for classification tasks
  2.2 Uncertainty quantification and calibration
    2.2.1 Expanding the concepts of predictive uncertainty and calibration to neural networks
    2.2.2 Additional ways to measure calibration
    2.2.3 Temperature scaling
    2.2.4 Aleatoric and epistemic uncertainty
  2.3 Neural network ensembles
    2.3.1 Deep ensembles
    2.3.2 Monte Carlo dropout
3 Methodology
  3.1 The machine learning task at hand
  3.2 Software libraries
  3.3 Data
    3.3.1 Annotations
    3.3.2 Data preprocessing and datasets
    3.3.3 Dataset implementation details
  3.4 Experimental approach
    3.4.1 Formal model definition
    3.4.2 Computational footprint
    3.4.3 Evaluating predictive performance
    3.4.4 Uncertainty quantification and the difficulties of measuring calibration
    3.4.5 Evaluating performance on OOD data
4 Empirical Findings
  4.1 Experimental setup
    4.1.1 Choice of model architecture and size
    4.1.2 Training and evaluation
  4.2 Predictive performance
    4.2.1 Evaluation on classes with few samples
    4.2.2 Discussion of predictive performance
  4.3 In-domain uncertainty quantification
    4.3.1 Discussion on in-domain uncertainty quantification
  4.4 Out-of-distribution uncertainty quantification
    4.4.1 Experiments on gradually augmented OOD data
    4.4.2 Discussion of performance on gradually augmented OOD data
    4.4.3 Experiments on complete OOD data
    4.4.4 Discussion of model performance on complete OOD data
5 Conclusion
  5.1 Future research
Bibliography
A Additional experiments
  A.1 Sequence quality
  A.2 Comparing model architecture size
  A.3 Temperature scaling

List of Figures

1.1 Single model architecture. One model performs inference on the input image at each time step.
1.2 Traditional ensemble architecture. Every ensemble member performs inference on the input image in each time step.
1.3 Schematic illustrating the proposed approach – an ensemble spread over time. Note that the number of forward passes is the same as for the single model in Figure 1.1. The previous predictions from models that do not run inference on this timestep also count towards the model output.
2.1 Example of a reliability diagram. The height of the blue bars is the average accuracy in each bin, and the height of the pink bars is the average confidence in each bin. The dotted gray line is the identity function that a reliable model's predictions follow.
3.1 An example of a frame from the ZOD frames dataset with 2D annotation boxes for traffic signs drawn on the image.
3.2 Random sample of still images from the training data, taken from the Frames subset of the ZOD.
3.3 Random sample of eleven sequences taken from the sequences dataset that is used for comparing the tested models.
3.4 Visualisation of the datasets used for the project. The three datasets train, validation, and test are all mutually exclusive subsets of the ZOD Frames dataset. The sequences dataset is an extension from the frames in the test dataset where the prior frames have been tracked and cropped. Note that ti denotes the ith frame of a sequence, i ∈ {1, ..., 11}, each frame in such a sequence originating from the same video as the corresponding t11 frame.
4.1 Graph comparing the predictive performance on the sequences dataset in terms of accuracy of a 5-member DESOT with a 5-member DE, a single model, as well as an MC-dropout model. The error bars are drawn for ±1 std. The ensemble spread over time (DESOT5) performs on par with the deep ensemble (DE5) despite requiring only 20% as much computation, while outperforming the single model.
4.2 Final-epoch predictive performance on the sequences dataset, comparing DESOT5 with a single model, DE5 and MC-dropout. The DESOT5 performs about as well as the DE5 on accuracy and F1-score, and both of these perform better than the single model and MC-dropout. Again, note that DESOT5 uses the same amount of computation as an SM or MC-dropout.
4.3 Final-epoch predictive performance on a minority class version of the sequences dataset, comparing DESOT5 with a single model, DE5 and MC-dropout. The DESOT5 outperforms the DE5 on both accuracy and F1-score. Additionally, it outperforms single models and MC-dropout by a large margin in both metrics.
4.4 Uncertainty quantification performance for each model on in-distribution data measured in Brier reliability. Lower is better. Temperature scaling seems to significantly improve the calibration for ensembles both on the test dataset and the sequences dataset. For single models, it instead seems to worsen calibration. Overall, MC-dropout is the worst calibrated out of all the models.
4.5 Uncertainty quantification performance for each model on in-distribution data measured in ECE. Lower is better. Temperature scaling seems to significantly improve the calibration for both single models and ensembles on the test dataset. However, for the sequence dataset, temperature scaling seems to increase ECE. Overall, MC-dropout is the worst calibrated out of all the models.
4.6 Illustration of the different augmentations used at various intensities, from no augmentation to maximal intensity.
4.7 Uncertainty quantification performance for each model on augmented data of increasing intensity. The performance is measured in accuracy, Brier reliability, and mean entropy. Lower Brier reliability is better. Tested on the sequences dataset. No model is clearly better or worse, though single models and MC-dropout are outliers in some respects.
4.8 Entropy for OOD data (red) compared to in-distribution data (blue) for the single-frame test and sequence datasets. The tests are run for various ensemble sizes M ∈ {1, 5, 10}, which are differentiated by color shade. Again, note that DE1 and DESOT1 are special cases, which are in effect both a single model (SM). The vertical dashed lines are the mean entropy for the model of the same color.
4.9 Entropy for in-distribution data (top row) compared to entropy for OOD data (bottom row) for the single-frame test dataset (left) and the sequences dataset (right).
4.10 Illustration of the thresholding strategy applied to the entropy of a single model with and without temperature scaling. The solid line is the in-distribution entropy, and the dashed line is the OOD entropy. The vertical gray dotted line is the optimal threshold for that particular model, which is the same as in Table 4.4. All data points to the right of the threshold, inside the red area, are classified as OOD. Note that the increased separation between the in- and out-of-distribution lines for the temperature scaled version allows for higher OOD-detection performance.
4.11 Examples of images with low entropy from the NotListed class. These are very similar to the in-distribution data and may be incorrectly annotated.
4.12 Examples of images in the NotListed class that are similar to in-distribution data. The top row contains samples from the training set (in-distribution) and the bottom row is similar signs that can be found in the NotListed class (out-of-distribution).
4.13 Examples of images with high entropy from the in-distribution and the OOD data. For the in-distribution data, we see the most problematic images for the model to classify. For the OOD data, these are the clearest examples of images of OOD data.
A.1 Accuracy per frame at each timestep of the sequence dataset for a vanilla ResNet18 model and a 5-member ResNet18 ensemble.
A.2 The crop size distribution for frames 1 and 11 across all sequences, plotted with 100 bins. The crop size is defined as the size (in pixels) of the smallest crop dimension.
A.3 Comparing the effects from the different methods of creating a single-frame dataset. The sequence frames are all the frames from the sequence randomly sampled without replacement. The Test frames are the last image in the sequence, which contains the most high-quality information. For the sequences dataset, a DESOT is applied. It seems that average input quality dominates quantity.
A.4 ResNet18 and ResNet50 compared on accuracy on the validation dataset across training epochs.
A.5 The optimal temperature for models trained for different numbers of epochs.
A.6 The reliability plots comparing temperature scaling for a DE5 using two different techniques – individual temperature scaling and joint temperature scaling. The results are shown for the single-frame validation dataset that the models are temperature scaled using.

List of Tables

2.1 Confusion matrix across a set of predictions for a binary classification problem.
4.1 Predictive performance for each model tested on the sequence dataset in terms of accuracy and F1-score. The results include plus and minus one standard deviation of performance between runs.
4.2 Augmentations and values used for OOD data creation. Brightness, saturation, and contrast all gradually decrease from their original values with increasing intensity.
4.3 Mean entropy on OOD data (NotListed class).
4.4 Table of results from applying an entropy threshold for OOD detection on the sequences dataset.

1 Introduction

Recently, deep machine learning has irrevocably changed the landscape for many industries, not least the automotive industry. The quest for autonomous driving (AD) vehicles is part of a big push, where machine learning (ML) techniques are key. A problem with the application of ML is that it has been shown that modern machine learning models are over-confident in their predictions, and have become more so amid the performance developments in recent decades [1].

In safety-critical applications, such as AD, the overconfidence of modern neural networks is particularly troublesome since the model's certainty in its output greatly influences what actions are sensible to take [2]. Naturally, it is then of great importance that the probability estimates that the model reports correspond to the actual predictive performance observed in the model's output over time. This is typically referred to as calibration and is the measure of how well the subjective output probabilities and the observed long-term prediction performance match [3]. A model that closely and precisely assesses its own uncertainty is referred to as well calibrated [4], [3].

There are two main paradigms of machine learning models: Bayesian and non-Bayesian models [5]. A Bayesian model treats model parameters as stochastic variables, each with a probability distribution that is updated during training based on the input data. This enables high-quality posterior distributions over the output space, with great uncertainty quantification performance. One such model is the Bayesian neural network. The disadvantage of this Bayesian approach is that the models are typically difficult to implement and slow to train compared to a normal neural network (NN) [6]. This limits the Bayesian models to smaller networks or various approaches for approximating some other Bayesian model. Therefore, non-Bayesian approaches such as ensembles are more common in practice. For an ensemble, a number of member models are trained and are all applied to each data point at the time of inference. Ensembles have been shown to produce improved predictive performance in machine learning [7], [8], with the disadvantage of increased computational load during training and inference, since it typically scales linearly with the number of ensemble members.

It has long been known that ensembles can quantify uncertainty in their predictions [9]. However, the ability of neural network ensembles to produce uncertainty estimates of a quality that rivals Bayesian models, while also achieving high predictive performance, was first demonstrated in 2017 by Lakshminarayanan et al. [6]. This allowed for a practical and high-performance alternative to the Bayesian approach.
This type of ensemble has since been called a deep ensemble (DE) [10]–[12], and continues to be the state of the art (SotA) in the field of uncertainty quantification (UQ) for machine learning [12], [13]. As previously mentioned, uncertainty estimation performance in terms of good calibration is essential in safety-critical tasks such as AD. Thus, the benefits of DEs are of interest in the AD space, but their large computational load limits deployment.

This thesis proposes a new model that we call a deep ensemble spread over time (DESOT), an augmented version of deep ensembles. This new model limits computational load while aiming to uphold the benefits of deep ensembles concerning predictive efficacy and uncertainty quantification performance.

1.1 Background and context

Autonomous driving systems employ a wide array of sensors that allow the vehicle to perceive its surroundings and take appropriate action [14], [15]. These many sensors are then used for localization and mapping, path planning, decision making, and ultimately vehicle control [14]. One of the most important of these sensors is the camera, and AD systems use a suite of them to gain a full surround view of the environment. These many cameras produce continuous video feeds that the car has to process in real time, which requires a lot of computational power due to the additional requirement of low latency.

One subtask of AD that uses the cameras of the vehicle is traffic sign recognition (TSR), which is all about detecting and classifying the traffic signs that the vehicle encounters on the road [16]. Modern TSR systems take the image sequences produced by the vehicle's cameras and use advanced machine learning models to classify the signs [17], [18]. Just like any deep machine learning system, these systems could benefit from using ensembles to boost performance. However, their application is limited by the computational resources available in the car, which have to be shared across all the functions previously mentioned.

This project is commissioned by Zenseact, a software company developing autonomous driving solutions for major car makers. Zenseact wants to explore the use case for a system that integrates the aforementioned aspects by investigating whether ensembles can be implemented in a way that improves performance compared to individual machine learning models, but which is more resource-efficient than traditional deep ensembles. The machine learning setting that is in focus is that which works with temporal information in the form of sequences of images. Each sequence tracks an object across time, meaning that each frame contains a slightly varying view of the same object.

Figure 1.1: Single model architecture. One model performs inference on the input image at each time step.

Figure 1.2: Traditional ensemble architecture. Every ensemble member performs inference on the input image in each time step.

Figure 1.3: Schematic illustrating the proposed approach – an ensemble spread over time.
Note that the number of forward passes is the same as for the single model in Figure 1.1. The previous predictions from models that do not run inference on this timestep also count towards the model output.

All machine learning models considered in this thesis are trained and produce their outputs in the single-frame setting, but their predictions are then aggregated across time steps in the sequence. In this manner, the typical way to apply a single model is to let it produce outputs for the single frame in each time step of the sequence, and then aggregate the predictions across time (see Figure 1.1). In a similar manner, a deep ensemble would be applied by letting every member produce their prediction for each time step. The outputs across members are then aggregated for the current frame before they are combined across time steps (see Figure 1.2). The proposed system combines the outputs of each member, just as for the traditional ensemble (see Figure 1.2), but only one member runs inference in each time step (see Figure 1.3). This means that the same number of forward passes are made as in the case of the single model approach, a fact that limits the computational load of the system.

1.2 Aim

The aim of this thesis is to implement an ensemble spread over time and evaluate it against high-performing baseline models in uncertainty quantification. Baseline models are established and the predictive performance of ensembles spread over time is compared to the performance of these. The system's uncertainty quantification performance in and out of distribution is also compared to the baselines. Additionally, temperature scaling [1], a post-training calibration method, and its effects on UQ performance are evaluated.

1.3 Research questions

Here follow the research questions that we aim to answer.

(A) Is it possible to achieve the benefits of a traditional deep ensemble in terms of predictive and uncertainty quantification performance using a deep ensemble spread over time?

This is the main research question of the thesis, but it is quite broad. To answer it, we also need to answer the following questions.

(B) How does the predictive performance of deep ensembles spread over time compare to that of the chosen baselines in terms of F1-score and accuracy?

(C) How does the UQ performance of deep ensembles spread over time compare to the chosen baselines in terms of ECE, MCE, and Brier reliability?

(D) Do deep ensembles spread over time successfully adjust their confidence on OOD data by increased entropy, and how does their performance compare to the chosen baselines?

1.4 Delimitations

Since the goal of this thesis is to investigate whether the benefits of using ensembles can be achieved using the proposed ensemble spread over time, only relatively standard machine learning architectures are used. This increases the generalizability of the results, and by extension also means that obtaining SotA predictive performance was not a priority of this thesis.

One of the main motivations of DESOT is to limit the computational load at the time of inference in order to enable the use of deep ensembles in settings with limited computational budgets. However, directly measuring the computational load of different ML models is highly non-trivial since many factors influence performance. Such factors include, but are not limited to, the speed of data transfer, memory capacity, as well as what other processes are running simultaneously.
Because of this, it was decided that the number of model forward passes was to be used as a proxy for the computational load. This proxy is relatively useful since all models tested have the same basic architecture. Before actually employing DESOT or any other ML model in a real system, a thorough performance test would have to be conducted on the machine in question.

While spatiotemporal fusion models such as those proposed by Ji et al. [19] and Tran et al. [20] are promising, they are challenging to design, obtain annotated data for, and train. Due to this, it is common to design detection, segmentation, and classification models for the single-frame setting. Then, the single-frame predictions are combined in some way. This is how the system proposed in this thesis works, and such systems have proved to work fairly well. Furthermore, the spatiotemporal fusion approach is in some ways orthogonal to the ensembling scheme proposed for this thesis, and the two could conceivably be combined. For example, a spatiotemporal convolution similar to that proposed by Tran et al. [20] might be used to create an optimal combination rule between the outputs from individual ensemble members, instead of using simple averaging. For these reasons, we chose to limit the scope of the thesis to models that do not explicitly model the additional information that comes from the temporal aspects of a sequence of frames.

1.5 Contributions

The main contribution of this thesis is to propose the idea of ensembles spread over time, an idea that to the best of our knowledge is novel. The method is compared to some of the most common non-Bayesian alternatives. DESOTs are shown to be competitive with traditional deep ensembles, both on pure predictive performance and on uncertainty quantification performance, while using a fraction of the computations. Experiments on predictive performance for rare classes are also conducted, and this appears to be a particular strength of DESOTs. While more work is needed to investigate whether the results generalize to other tasks, datasets, and model architectures, we believe that this project serves as a strong baseline for further research into ensembles spread over time.

1.6 Discussion of ethical and sustainability aspects

The advent of autonomous vehicles and more intelligent artificial intelligence will radically change our society in a multitude of ways. This project is a small part of the journey to fully autonomous driving cars and, although it might not have radical effects in isolation, the combined work of many related projects will. Therefore, potential issues related to ethics and sustainability must be taken into account.

The safety benefits of AD are a major reason why it is such an important innovation. Around the world, there are 1.35 million fatal road accidents every year [21], and more than 90 % of these are due to human error [22]. Vehicle safety is identified as a key factor in lowering these losses [21]. With AD software, the driving style of the world's safest human can hopefully be replicated and even exceeded, reducing the accident count greatly and in the best case lowering it to zero. The technology is not yet there, but this project might be a step on the way.
A working AD system requires millions of cameras on the road, recording at all times, which comes with the risk of these cameras being used for malevolent purposes. This makes it essential to maintain a significant level of security around the AD software and the data generated [22]. Personal integrity is also a key concern in this regard, as well as the impact on the greater society. This means that companies working in AD must have strict ethical guidelines in order to ensure that the data collected by their systems is handled in a correct manner.

Autonomous vehicles will most likely change the way we travel, providing convenient, fast, and cheap transport [23]. This will most likely come at the cost of having more vehicles on the roads, creating more pollution in the form of air particulates and noise. Part of this issue will be reduced by converting the fleet of cars to electric, but there will still be pollution from the road surface and rubber tires. Vehicle utilization with AD is likely to increase [24] as services for sharing vehicles or utilizing fleets like public transport are created, which will potentially increase the practicality of not owning a private vehicle. The result of this is a better utilization of Earth's limited resources, which will help reduce the environmental strain from the production of new vehicles, ultimately helping the world reach the UN's climate goals.

Equity and equality will be affected greatly around the world, especially in poorer communities. Transportation can be a big financial strain on families with low income, but in the long term with AD, the cost of transportation can come down to as little as a tenth of the cost of owning a private vehicle [24], close to the cost of public transport. This allows for more equity and equality as opportunities for work, education, and leisure become more accessible to everyone.

2 Theory

In this chapter, some of the most relevant previous work and the theory behind the concepts used will be presented, with the aim of placing the thesis into a context in the literature.

2.1 Convolutional neural networks

Convolutional neural networks (CNNs) are used for increasingly complex tasks, which in turn require an increasing number of parameters to be added. The added parameters come with a cost both in terms of memory and in terms of computational footprint for training and inference. However, it has been shown [25] that when the number of parameters increases, using ensembles will speed up training and inference, as well as increase prediction accuracy, compared to individual models with the same total number of parameters.

A CNN model consists of convolutional layers followed by one or more fully connected layers. The outputs from the final fully connected layer in a neural network are called logits. These are real-valued numbers that communicate the network's prediction.

2.1.1 Residual networks

Residual networks (ResNets) [26] are a type of deep convolutional neural network containing residual connections over blocks. A standard deep neural network can be hard to train since vanishing gradients become a problem as depth increases [26]. The vanishing gradient problem is caused by products of gradients of the loss function becoming increasingly small. In a deep network, the gradient may approach zero, leaving the early layers unable to update their weights. This can be mitigated by adding residual connections, which allow the gradient to be passed through the network and thus allow for deeper CNNs.
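To make the idea concrete, the sketch below shows a minimal residual block in PyTorch, the library used later in this thesis. It is an illustrative simplification that keeps the channel count fixed and omits downsampling, not the exact block used in ResNet18 or ResNet50.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: two convolutions plus an identity skip connection."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The identity shortcut lets gradients bypass the convolutions during
        # backpropagation, mitigating vanishing gradients in deep stacks of blocks.
        return self.relu(out + x)
```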
There are many different sizes of ResNets to choose from, for example ResNet18 or ResNet50, which contain 18 and 50 layers respectively.

2.1.2 Evaluating the predictive performance of neural networks

There are many different ways of evaluating the predictive performance of a machine learning model on a classification task. Given a binary classification problem, a model prediction can be characterized as either a true positive (TP), a false negative (FN), a false positive (FP), or a true negative (TN) (see Table 2.1).

Table 2.1: Confusion matrix across a set of predictions for a binary classification problem.

                          Prediction: True     Prediction: False
  Ground truth: True      True positive        False negative
  Ground truth: False     False positive       True negative

Over a data set for which the model produces predictions, the counts for each of these four positions of the confusion matrix can be computed. From them, a number of metrics can be derived. The first, and most commonly used [27], is accuracy, which is defined as

\[
\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}. \tag{2.1}
\]

One well-known problem with accuracy is that it is insensitive to class imbalances in the dataset, which for imbalanced datasets can lead to deceptively high perceived performance. To combat this problem, two additional metrics and a derivative of them might be employed. These are precision and recall, which are defined as [27]

\[
\text{precision} = \frac{TP}{TP + FP} \tag{2.2}
\]

and

\[
\text{recall} = \frac{TP}{TP + FN}. \tag{2.3}
\]

Intuitively, precision is the share of positive predictions that were actually positive, and recall is the share of true positives that were captured in the prediction. From these two metrics, the Fβ-score can be computed. It is defined as

\[
F_\beta = \frac{(\beta^2 + 1) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \tag{2.4}
\]

where β ∈ R, β ≥ 0 [27]. The Fβ-score ranges from 0 to 1, where higher is better. If β = 1, the score weights precision and recall equally and is then called the F1-score, often written without a subscript. This is the most common version of the Fβ-score.

There are some extensions of the binary F1-score to the multiclass classification domain. For these, we use the naming employed in their implementation in scikit-learn. One extension is the F1micro-score, which computes the confusion matrix entries sample-wise in a one-versus-all fashion, such that all samples are equally weighted in the resulting score. This is equivalent to accuracy. Another is F1macro, which applies the same operation class-wise and takes the unweighted average across classes. The former approach allows the majority classes to dominate the score, while the latter relatively over-weights prediction performance on samples belonging to rare classes. F1macro is useful in cases where performance on each class is equally important, and is usually what is meant when referring to the F1-score in a multiclass setting. This is also the naming convention we use.
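As a concrete illustration of the metric variants above, the snippet below computes accuracy, F1micro, and F1macro with scikit-learn, whose naming the thesis follows. The labels are made up for illustration; this is not the evaluation code used in the experiments.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground-truth and predicted class indices for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 2, 2]

accuracy = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")  # identical to accuracy
f1_macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean over classes

print(f"accuracy={accuracy:.3f}, F1micro={f1_micro:.3f}, F1macro={f1_macro:.3f}")
```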
2.1.3 Interpreting neural network outputs as probabilities for classification tasks

A classification neural network is a function that maps an input example to an output vector of dimension C, where C is the number of classes. In order to interpret neural network outputs as probabilities, it is common to use the function that in machine learning is colloquially called the softmax function. It is applied to the final neural network output logit vector z ∈ R^C and transforms each element z_j into the range [0, 1] such that the transformed elements sum to 1 over the C classes. Its mathematical expression was first introduced by Boltzmann [28] in 1868 as the Boltzmann distribution, and later used by Bridle [29] for interpreting neural net outputs as probabilities. Given an output logit vector z, the softmax value for the element z_j ∈ z is computed as

\[
\mathrm{softmax}(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{C} e^{z_k}}. \tag{2.5}
\]

Softmax applied to a vector is defined element-wise. The prediction ŷ for a neural network given an input example x is typically taken to be

\[
\hat{y} = \arg\max_{j} \left\{ \mathrm{softmax}(z_1), \mathrm{softmax}(z_2), \ldots, \mathrm{softmax}(z_j), \ldots, \mathrm{softmax}(z_C) \right\} \tag{2.6}
\]

where j is an index in the logit vector. The model's confidence in that prediction is then simply conf(ŷ) = softmax(z_ŷ). One notable property of the softmax function is that it is scale-variant due to the nature of exponentials, meaning that in general softmax(z) ≠ softmax(z/r) for all r ∈ R, r ≠ 1.
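The following small sketch shows how logits are turned into a categorical distribution, a predicted class, and a confidence score as defined above, using PyTorch; the logits are made up for illustration. Note how dividing the logits by a scalar changes the probabilities but not the argmax, a property exploited by temperature scaling in section 2.2.3.

```python
import torch
import torch.nn.functional as F

# Hypothetical logit vector z for one input example with C = 4 classes.
z = torch.tensor([2.0, -1.0, 0.5, 0.1])

probs = F.softmax(z, dim=0)        # categorical output distribution, sums to 1
prediction = torch.argmax(probs)   # predicted class, Equation 2.6
confidence = probs[prediction]     # probability assigned to the top prediction

# Scale-variance of softmax: rescaling the logits changes the probabilities
# (and hence the confidence) while leaving the predicted class unchanged.
probs_rescaled = F.softmax(z / 2.0, dim=0)

print(probs, probs_rescaled, prediction.item(), confidence.item())
```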
2.2 Uncertainty quantification and calibration

The theory of uncertainty quantification has long been of interest in academic circles, but it came to be more closely studied during the development of weather forecasting, where meteorological forecasters give subjective estimates of the probability of various weather phenomena. As we shall later see, these concepts quite naturally generalize to the machine learning context. One key aspect of the quality of a forecaster's predictions is their calibration, a concept which is commonly defined as the deviation of the forecaster's subjective probability assigned to an event from the long-term empirical probability of that event [3], [4], [30]. Calibration is also commonly referred to as reliability [31].

For the sake of clarity and mathematical correctness, consider an example adapted from DeGroot & Fienberg (1983) [3]. Assume that there are two possible events for a given day: rain, and no rain. For a given day i, a forecaster gives a subjective probability p_i of the chance of rain from a set of K possible values in the set P, such that p_i ∈ {P_1, ..., P_K}. Why limiting the number of possible values is necessary will become apparent later. At the end of the day, he observes the outcome y_i of the event, where y_i is an indicator variable that is 1 in the case of rain and 0 otherwise. Let ν(p) be the long-term distribution of the forecaster's predictions of rain. Next, let ρ(y_i = 1 | p_i) be the long-term empirical probability of rain given the forecaster's prediction of rain p_i. That is, given that the forecaster predicts a probability p_i of rain on day i, what is the actual probability that it will rain, based on historical predictions. With this background, a forecaster is said to be well calibrated if ρ(y = 1 | p) = p for all values of p ∈ P such that ν(p) > 0 [4]. For example, this means that out of 10,000 days on which a well calibrated forecaster claimed that the probability of rain was 0.4, it actually rained on 4,000 of them. This also gives insight into why the number of allowed values that the predictions can take on has to be limited — it is necessary to allow for statistically sound empirical estimates of ν(p) and ρ(y = 1 | p).

Interestingly and importantly, a forecaster can be well calibrated but produce completely uninformative predictions. The forecaster might simply observe the climatological base rate (long-term probability) of rain and give that as his prediction of the chance of rain every day. Using the definition of calibration from earlier, this will lead to well calibrated but useless predictions. Thus, the notion of refinement is given as a complement to calibration [3]. Assume that forecasters A and B are both well calibrated. Let forecaster A give the climatological base rate as his prediction on any given day, and let forecaster B give either p = 0 or p = 1 on any day and always be correct. Then A is least refined, since any well calibrated forecaster is at least as refined as him, and B is most refined. See DeGroot and Fienberg [3] for a more mathematically precise definition of refinement. Using this intuition regarding forecasters A and B, refinement can be seen as a measure of the usability of the forecasts given by the forecaster.

One important point of contention in meteorological circles early in the field's life cycle was how to evaluate the quality of a forecast. This created a need for metrics that the forecaster cannot artificially game by adjusting their forecast to optimize for a higher score [30], such as by simply mimicking the known base rate of the events [3]. This prompted developments in the theory surrounding proper scoring rules, which are reward functions that encourage forecasters to give their actual subjective forecast probabilities rather than trying to game the system. More formally, a scoring rule is one which rewards the forecaster S(p) if it rains and S(1 − p) if it does not, where p is the reported probability of rain. Suppose the forecaster believes the probability of rain to be p_a and wishes to maximize his reward. If the forecaster instead gives p_b as his prediction of rain, contrary to his actual subjective belief p_a, his expected reward is p_a S(p_b) + (1 − p_a) S(1 − p_b). If the scoring rule is proper, the choice p_b = p_a maximizes the forecaster's reward. If the scoring rule is strictly proper, then p_b = p_a is the only prediction that maximizes the reward [3], [32]. Here, we have assumed that the forecaster wishes to maximize his reward, but the same of course holds for the negation of a score that the forecaster wishes to minimize.

Many classification problems, such as weather forecasting, are not binary but multi-class, where an event can fall into any class c ∈ {1, ..., C}. A prediction of subjective probabilities for an event x_i, i ∈ {1, ..., N}, where N is the number of events, is to be given by the forecaster. Following the notation from the previous paragraph, but expanded to a multi-class problem, we denote the subjective probabilities given by the forecaster as a categorical distribution p(x_i) = (p_i1, p_i2, p_i3, ..., p_iC) across the classes, such that for any instance x_i the probabilities sum to 1. With this background, a score was developed by Brier [30] that has later become known as the Brier score and is defined as

\[
\mathrm{BS} = \frac{1}{N} \sum_{j=1}^{C} \sum_{i=1}^{N} (p_{ij} - y_{ij})^2 \tag{2.7}
\]

where y_ij is an indicator variable that is 1 if instance i falls into class j, and 0 otherwise. The Brier score of a perfect forecaster (forecaster B mentioned previously) will be 0. Forecasts of lower quality receive a higher Brier score. An important feature of the Brier score is that it is a strictly proper scoring rule [3]. More precisely, its negative is a strictly proper scoring rule. However, as previously discussed, this makes no difference in practice, since the same argument can be made for the negative of a score that the forecaster wishes to minimize. Murphy [31] showed that the Brier score can be decomposed into three distinct parts,

\[
\mathrm{BS} = \mathrm{Reliability} \underbrace{{} - \mathrm{Resolution} + \mathrm{Uncertainty}}_{\mathrm{Refinement}} \tag{2.8}
\]

and Bröcker [33] later showed more generally that any strictly proper scoring rule can be decomposed in such a way.
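To make Equation 2.7 concrete, here is a small sketch of the multi-class Brier score on made-up predictions; it is an illustration of the definition, not the evaluation code used later in the thesis.

```python
import numpy as np

def brier_score(probs: np.ndarray, labels: np.ndarray) -> float:
    """Multi-class Brier score (Equation 2.7).

    probs:  (N, C) array of predicted categorical distributions.
    labels: (N,) array of ground-truth class indices.
    """
    n, c = probs.shape
    onehot = np.eye(c)[labels]                 # the indicator variables y_ij
    return float(np.sum((probs - onehot) ** 2) / n)

# Hypothetical forecasts for N = 3 events and C = 3 classes.
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.5, 0.3],
                  [1 / 3, 1 / 3, 1 / 3]])
labels = np.array([0, 2, 1])
print(brier_score(probs, labels))  # 0 for a perfect forecaster; higher is worse
```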
The reliability part of the Brier score decomposition is often referred to as Brier reliability and is in itself a measure of calibration and can be used as such. In the binary case, it is defined by DeGroot and Fienberg [3] as

\[
\text{Brier reliability} = \sum_{p \in P} \nu(p)\,(p - \rho(p))^2. \tag{2.9}
\]

Remember from before that P is the set of allowed values that the subjective forecast probabilities p can take on, ρ(p) is the actual probability of rain given the forecaster's prediction, and ν(p) is the long-term distribution of the forecaster's predictions of rain. The Brier reliability score is non-negative, and a perfectly reliable forecaster gets a score of zero.

The Brier reliability can also be expanded to the multi-class setting, in which Brier [30] first defined his score,

\[
\text{Brier reliability} = \sum_{j=1}^{C} \sum_{k=1}^{K} \frac{n_{jk}}{N_j} \left( p_{jk} - \frac{o_{jk}}{n_{jk}} \right)^2 \tag{2.10}
\]

where K is the number of different values that the subjective probabilities can take on, n_jk is the number of times that the kth forecast value was issued for class j, N_j is the number of occurrences of class j, and o_jk is the number of those occasions on which class j actually occurred [34]. Note that Equation 2.10 is the empirical decomposition of the Brier score, which is why the term o_jk/n_jk in Equation 2.10 is used as an estimate of ρ(p). By the same reasoning, the term n_jk/N_j is used as an estimate of ν(p). This is also why the number of different values that the subjective probabilities are allowed to take on must be limited to only K. If the probabilities p_i were allowed to take on any real number in the range [0, 1], it would not be possible to calculate the empirical long-term distributions ν(p) and ρ(p), since the number of observations is finite while the possible output space is continuous and therefore infinite. Thus, in practice, binning of the forecast probabilities into K bins is used for such scenarios.

2.2.1 Expanding the concepts of predictive uncertainty and calibration to neural networks

We introduced the concept of calibration in the meteorological setting in section 2.2, and it is now time to expand it to the neural network setting. When introducing the concept, we used the definition that calibration is "the deviation of the forecaster's subjective probability assigned to an event from the long-term empirical probability of that event". Thus, we need to expand the meaning of subjective probability and long-term empirical probability. In the neural network setting of this thesis, we lean on the theory from subsection 2.1.3, where the softmax output distribution is interpreted as class-wise output probabilities. This is the subjective probability in the definition above. With "the long-term empirical probability of that event", we refer to the actual probability, across the whole dataset, that a given class was the ground-truth class given the probability that was reported in the categorical output distribution.

One interpretation of the uncertainty of the neural network in its output is the entropy of the predictive distribution. Here, entropy is meant in the information theoretical sense, which has intuitive connections to the physical interpretation. First introduced by Shannon [35] for digital communications, the idea of entropy in information theory is closely related to the average information content of the message to be relayed. Assume a message of length N, with C distinct characters, or values. If the fractions of the different characters' prevalence in the message are denoted p_j = n_j/N, j ∈ {1, ..., C}, where n_j is the count of the jth character, the informational entropy of the message can be expressed as

\[
\text{entropy} = -\sum_{j=1}^{C} p_j \log p_j \tag{2.11}
\]

which has later become known as Shannon's entropy. If the logarithm used is of base 2, then the unit is bits. If the natural logarithm is used, the unit is nats [35]. In information theory, Shannon entropy acts as a lower bound on the number of bits, or nats, needed to relay the message in its theoretically most compressed state.

Applying Shannon's entropy to the probability vector produced by the softmax function allows for an interpretation where the entropy of the categorical output distribution p(x_i), for an input example x_i, is a measure of model uncertainty. The intuition is that if the network is uncertain, we expect a flat output distribution across classes, which yields a high entropy. Conversely, if the network is very certain in its prediction, we expect a very sharp distribution, which results in a low entropy. The output distribution is a frequentist point estimate of the distribution over classes, and the uncertainty is therefore not the uncertainty across model parameters in the Bayesian sense.
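As a small illustration of using predictive entropy as an uncertainty measure, the sketch below computes the Shannon entropy (in nats) of softmax outputs for a sharp and a nearly flat logit vector; the numbers are made up.

```python
import torch
import torch.nn.functional as F

def predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (Equation 2.11, in nats) of the softmax distribution, per example."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)

logits = torch.tensor([[8.0, 0.0, 0.0],    # sharp distribution  -> low entropy (certain)
                       [0.1, 0.0, -0.1]])  # nearly flat         -> high entropy (uncertain)
print(predictive_entropy(logits))
```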
2.2.2 Additional ways to measure calibration

One common way to visualize the degree of calibration of a model is a reliability diagram [3], [36] (see Figure 2.1 for an example). It plots the observed accuracy of the model against the self-reported confidence that the model had in its predictions. For a perfectly calibrated model, the reliability diagram would follow the identity function [1]. Typically, the reliability diagram is drawn as a histogram with B equal-width bins, in which the height of any bin b ∈ {1, ..., B} is the expected (average) accuracy in that bin. The interpretation of model prediction and confidence from subsection 2.1.3 can be used to place each sample into the bin corresponding to the confidence of the top prediction for that sample. Letting L_b be the set of indices of the predictions in the bth bin, the accuracy and the confidence for that bin [1], [10] can be defined as

\[
\mathrm{acc}_{\mathrm{avg}}(L_b) = \frac{1}{|L_b|} \sum_{i \in L_b} \mathbb{1}(\hat{y}_i = y_i) \tag{2.12}
\]

and

\[
\mathrm{conf}_{\mathrm{avg}}(L_b) = \frac{1}{|L_b|} \sum_{i \in L_b} \mathrm{conf}(\hat{y}_i). \tag{2.13}
\]

Remember from before that we define model confidence as the probability assigned to the top prediction. Then, two numerical metrics can be computed based on the reliability diagram. These are the expected calibration error (ECE) and the maximum calibration error (MCE) [37]. ECE is computed as

\[
\mathrm{ECE} = \sum_{b=1}^{B} \frac{|L_b|}{N} \left| \mathrm{acc}_{\mathrm{avg}}(L_b) - \mathrm{conf}_{\mathrm{avg}}(L_b) \right| \tag{2.14}
\]

where N is the total number of predictions; it is the weighted average of the calibration gap across all B bins. Intuitively, this is the average deviation of the bins in the reliability diagram from the identity function, weighted by bin count. MCE is computed as

\[
\mathrm{MCE} = \max_{b \in \{1, \ldots, B\}} \left| \mathrm{acc}_{\mathrm{avg}}(L_b) - \mathrm{conf}_{\mathrm{avg}}(L_b) \right| \tag{2.15}
\]

and is the largest gap between the average bin accuracy and the average bin confidence across all B bins.
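The binning computation behind Equations 2.12–2.15 can be sketched as follows; the equal-width binning and the handling of empty bins are assumptions for illustration, not necessarily identical to the evaluation code used in the thesis.

```python
import numpy as np

def ece_mce(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 15):
    """Expected and maximum calibration error over equal-width confidence bins.

    confidences: (N,) probability assigned to each top prediction.
    correct:     (N,) 1 if the corresponding prediction was right, else 0.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce, n = 0.0, 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += (in_bin.sum() / n) * gap   # Equation 2.14
        mce = max(mce, gap)               # Equation 2.15
    return ece, mce

# Hypothetical confidences and correctness indicators for six predictions.
conf = np.array([0.95, 0.90, 0.80, 0.60, 0.55, 0.99])
corr = np.array([1, 1, 0, 1, 0, 1])
print(ece_mce(conf, corr, n_bins=5))
```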
It has since often been used for multi-class problems, and in that case, it only considers the most probable class for calibration evaluation. Vaicenavicius et al. [39] calls this induced binary classification since it relies on an implied true-class-versus-the-rest classification problem. This means that a large part of the information in the output distribution is quietly discarded, leading to a possibly uninformative score. Many practical applications of neural networks also require all probabilities to be calibrated, not just the top prediction [39]. Additionally, the B bins are all equally wide which leads to an unequal distribution of predictions in the bins since most modern NNs are over-confident [1], often leading to a heavy right-skew in the bin count distribution in the reliability diagram. This means that despite large numbers of validation samples the bins of lower confidence might be close to empty, leading to the performance on a few samples largely dominating the score since the absolute difference in all bins is weighted equally. Furthermore, the choice of the number of bins is a hyperparameter to which the reliability diagram is highly sensitive [40], 14 2. Theory 0.0 0.2 0.4 0.6 0.8 1.0 Confidence 0.0 0.2 0.4 0.6 0.8 1.0 Ac cu ra cy ECE: 0.0049 MCE: 0.17 Outputs Gap Figure 2.1: Example of a reliability diagram. The height of the blue bars is the average accuracy in each bin, and the height of the pink bars is the average confidence in each bin. The dotted gray line is the identity function that a reliable model’s predictions follow. making comparisons difficult and leading to a bias-variance tradeoff in the number of bins. Lastly, the predictions within a bin may have a high variance but close to zero mean in confidence, making the reliability diagram deceptive. For example, a bin may have a distribution of samples resembling a U shape or a uniform distribution, giving the bin the same mean but a different variance. For all these reasons, some modifications to ECE are proposed in [38]. These include adaptive bin widths to make the bin distribution uniform and other measures to increase how well the metrics reflect the calibration of the model. With these shortcomings in mind, reliability diagrams, and their derivative metrics remain widely used in literature and practice, perhaps due to their intuitive nature. 2.2.3 Temperature scaling Because of the properties of proper scoring rules outlined earlier in this section, it is to be expected that any neural network that is trained for a classification task with the use of a proper scoring rule as the loss function will be well calibrated by default since this minimizes loss. However, this is not what is observed empirically. Guo et al. [1] show that modern neural networks are over-confident when evaluated on an unseen validation data set. To combat this problem, they propose temperature scal- ing, a simple method where a single scalar T ∈ R, T > 0 is used to scale the output logits before softmax is applied. This affects the entropy of the output distribution, and if used in conjunction with the prediction interpretation in Equation 2.6 adjusts the network’s confidence. Importantly, the class order in the predicted distribution is not altered by this augmentation. In this framework, the temperature T that optimizes calibration is found by minimizing a proper scoring rule with respect to T on a separate validation data set. 15 2. Theory Guo et al. 
Guo et al. [1] claim that temperature scaling is the best-performing calibration method in most cases, while also being the fastest and simplest to implement. It has later been shown that a temperature scaled single model performs poorly on out-of-distribution data [13]. It should be noted that deep ensembles and temperature scaling are not mutually exclusive, but can be combined. Ashukha et al. [12] even go as far as to claim that all models, including ensembles, should be temperature scaled before they are compared on UQ performance, since some models might be uncalibrated by default. This notion is corroborated by Minderer et al. [41], who claim that temperature scaling helps unveil the underlying differences in calibration that are otherwise obscured by simple average under- or overconfidence.

2.2.4 Aleatoric and epistemic uncertainty

The total predictive uncertainty of a model's output is often partitioned into aleatoric uncertainty and epistemic uncertainty. Epistemic uncertainty is that which is due to a lack of knowledge about the world and can therefore be mitigated by the collection of additional information. An example of epistemic uncertainty might be out-of-distribution data, where the network has not been trained on a given class. On the other hand, aleatoric uncertainty is uncertainty that is inherent to the world, usually because of some stochasticity in the data generation, and it is therefore impossible to compensate for in the model [42]. An example of aleatoric uncertainty is the noise generated by a sensor or the inherent stochasticity of rolling dice. The distinction between aleatoric and epistemic uncertainty is useful since it allows for the formalization of which parts of the uncertainty can be reduced by way of model augmentation or extension of the dataset and which cannot [42]. Consequently, this thesis concerns itself with how well the chosen models perform in both the epistemic and the aleatoric case.

2.3 Neural network ensembles

An ensemble is a collection of models whose individual predictions are combined into a single output. Ensembling is a technique used in all fields of machine learning to obtain a higher-performing model from a diverse set of worse-performing ones [8]. This concept has been extended to neural networks, where the same effects can be seen when aggregating the results. There are a multitude of different ensembling techniques available to practitioners [12].

2.3.1 Deep ensembles

One of the many ensemble approaches that serves to promote diversity among ensemble members is random weight initialization of the neural network. This approach is used by Lakshminarayanan et al. [6] in combination with random shuffling of training data in their 2017 paper on uncertainty estimation using ensembles, and it is one of the most commonly used in practice due to its simplicity. This type of model is often called a deep ensemble (DE).

DEs were shown to have good predictive performance on par with other techniques, as well as high UQ performance both for in-distribution and out-of-distribution (OOD) data [6]. As described in the original paper, training DEs is quite simple, with three steps involved: the first is to use a proper scoring rule as the loss function, the second is to optionally use adversarial training to increase robustness, and the last is to train the ensemble using randomized initialization of model parameters to increase variety in the ensemble [6].
Many common loss functions, such as cross-entropy loss, are strictly proper scoring rules and can therefore be used in the deep ensemble framework. In practice, adversarial training is often omitted if improved robustness is not strictly necessary. It has been shown that deep ensembles are some of the best-performing models for uncertainty estimation [6], [12]. Ashukha et al. [12] find that deep ensembles are superior to any other model tested for uncertainty quantification given a fixed test-time budget. The intuition is that, because each member is trained independently, the members find different local minima in the high-dimensional loss landscape, which makes the ensemble's ability to quantify its uncertainty more robust [12]. Lakshminarayanan et al. [6] also demonstrated the attractive property that the ensemble decreases its prediction certainty on out-of-distribution examples, shown by evaluating an ensemble trained on the MNIST dataset on examples from the NotMNIST dataset, which contains letters instead of digits. It has later been verified that DEs are the SotA for UQ on OOD data [12], [13].

Now, let us define deep ensembles more formally. Assume a deep ensemble of M members. Following the notation used earlier, let $p_m(x_i)$ be the output distribution of the $m$th member of the ensemble, $m \in \{1, \ldots, M\}$, on an input example $x_i$ that is to be classified into one of $C$ distinct classes. This distribution is represented as a $C$-dimensional vector. Then, the output distribution $p_{\mathrm{DE}}(x_i)$ of the deep ensemble on input example $x_i$ is the element-wise average over the individual members' output distributions,
$$p_{\mathrm{DE}}(x_i) = \frac{1}{M} \sum_{m=1}^{M} p_m(x_i), \qquad (2.16)$$
i.e., the element-wise average across the categorical output distributions of all $M$ member models for input example $x_i$.

2.3.2 Monte Carlo dropout

Dropout was first introduced by Srivastava et al. [43] as a regularization measure during training to limit overfitting and increase the generalizability of the learned representation. With dropout, each neuron is turned off at random during training according to a pre-specified probability, the dropout rate $p$. This helps the network not to overfit, and therefore to generalize better, as it has to learn a more robust representation when any neuron can be dropped at any time. Recognizing that an ensemble of models is usually beneficial for performance, Srivastava et al. [43] show that dropout can be interpreted as sampling from an exponentially large set of possible smaller models, whose combination yields higher overall performance. Gal and Ghahramani [44] later showed that performing a number of forward passes through a model with dropout enabled and averaging the results can be seen as a Bayesian approximation. They chose to call this Monte Carlo dropout (MC-dropout) and claimed that it enables superior uncertainty estimation in both regression and classification tasks compared to vanilla models. Of note is that, since the introduction of MC-dropout, Lakshminarayanan et al. [6], Ashukha et al. [12], and Ovadia et al. [13] have all claimed that deep ensembles are superior in uncertainty quantification. However, MC-dropout remains widely used due to its simple implementation and general improvement of performance compared to vanilla single models.
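A minimal sketch of MC-dropout inference in PyTorch is given below; it is illustrative only (the implementation used in this thesis is described in chapter 4). Dropout layers are kept active at test time and the softmax outputs of several stochastic forward passes are averaged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def mc_dropout_predict(model, x, n_passes=10):
    # Put the model in eval mode, then re-enable only the dropout layers so
    # that e.g. batch normalization statistics stay fixed.
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()
    # Average the categorical output distributions over the stochastic passes.
    probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_passes)])
    return probs.mean(dim=0)
```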
3 Methodology

In this chapter, the methodology chosen for the study is outlined, beginning with a specification and motivation of the choice of machine learning problem, as well as a more formal definition of the proposed model. Then, the software libraries used are mentioned, along with short motivations. The data used for the training and evaluation of the models is described, along with the data pipeline used for the project.

3.1 The machine learning task at hand

One key consideration for the project was what kind of problem to apply the proposed ensemble approach to. Through discussions with Zenseact, we settled on a classification problem where the model should predict labels for cropped patch sequences of traffic signs. The reasoning behind choosing this problem instead of a more complex one is that the focus of the thesis is to evaluate the potential of ensembles spread over time. Implementing a working system for this purpose and investigating a variety of aspects of the proposed system was prioritized over solving more complex problems that in principle are no more novel than classification, such as object detection or semantic segmentation. Though the problem domain itself is not crucially important for the thesis, it framed the project and influenced some of the choices made during the design of the investigation. It also affects how the results should be interpreted. Therefore, a brief overview of the problem and its domain is given in the rest of this section.

The chosen problem falls into the domain of traffic sign recognition (TSR), a field with a decades-long history of development [45]. The first systems available to private end users were introduced in higher-end vehicles in the late 2000s or early 2010s as an aid for drivers. These systems were often limited to a few different classes of traffic signs [46]. Lately, TSR has become an important part of AD systems, where high and reliable performance is important for safety. There are two main subproblems of TSR: traffic sign detection and traffic sign classification [16]. This project concerns itself with the latter and assumes that regions of interest have already been identified earlier in the ML pipeline (in this specific case, by human annotators). The domain is characterized by a large number of classes with an imbalanced, distinctly long-tailed class distribution. Additionally, variations in illumination, perspective, and occlusion are common [45]. Furthermore, many of the classes are very similar in shape and color but carry important differences in meaning, such as speed limit signs. Deep learning has recently started revolutionizing this domain, with many models achieving accuracies of over 95% in research settings [47]. This means that any differences in predictive performance between the models are likely to be small in absolute terms, and performance benefits might instead lie in how the models handle difficult examples such as short sequences or obscured scenes.

3.2 Software libraries

The implementation of the data handling, the models, and the auxiliary code for the thesis was written in Python. The code for handling the data, training, evaluation, and storage of trained models was built using the PyTorch software library [48], an open-source project for deep learning in Python. This allowed for easily and quickly building modular code for deep learning purposes.
The base deep neural networks used are from the torchvision library, which is also part of the PyTorch project. This means that the networks could swiftly be integrated into the machine learning system. The models were trained on a computational cluster using a container environment, and experiments were tracked using the TensorBoard library for Python. The use of the container environment ensured compatibility across computational devices. Additionally, pandas and NumPy are used for handling tabular data, such as reading annotation files in CSV or JSON format, and for matrix computations.

3.3 Data

The data used in this thesis comes from a subset of the publicly available Zenseact open dataset (ZOD) [49], for which a dedicated development kit is available as a pip install. The data was collected, consolidated, and released by Zenseact in order to further developments and research within AD. It was collected over two years by the company's fleet of cars in 14 countries in Europe. The dataset consists of three separate parts – Frames, Sequences, and Drives. In this thesis, only the Frames were used, and of these, only the images from the 120° field-of-view, front-facing 3848 × 2168 pixel camera. An example of a frame from ZOD can be seen in Figure 3.1, where annotation boxes for the traffic signs have been overlaid. The Frames subset contains 100,000 still images and was employed as training data for the models, as well as for validation of single-frame classification performance.

Figure 3.1: An example of a frame from the ZOD Frames dataset with 2D annotation boxes for traffic signs drawn on the image.

3.3.1 Annotations

Each image in Frames contains annotations for a range of different dynamic and static objects, including 2D and 3D bounding boxes for pedestrians, traffic signs, poles, lane markings, and other relevant objects (see Figure 3.1 for an example). In the case of the Sequences, only the middle frame is annotated; the rest of the frames, both before and after, are not. For the purposes of this thesis, the 2D bounding boxes for objects of the static-object subclass Traffic signs were of interest. In total, there are around 446,000 distinct, annotated traffic signs in the 100,000 images in Frames. The annotations for the traffic signs have been created by a team of professional annotators and should not contain any annotations for signs that are not relevant to traffic, such as advertisements or billboards. The data contains 156 distinct classes of traffic signs, including two specialty classes – NotListed and unclear. The NotListed class is used by the annotators when the traffic sign in question does not fall into any of the other 155 classes; such signs might for example include destination signs. The unclear class is used when the class for some reason is difficult to determine, for example when the sign is heavily occluded or otherwise difficult for a human annotator to classify. The class distribution of the single-frame traffic sign dataset is distinctly long-tailed, where some classes have in excess of 10,000 unique instances, while some classes have fewer than ten examples.

Figure 3.2: Random sample of still images from the training data, taken from the Frames subset of the ZOD.

Figure 3.3: Random sample of eleven sequences taken from the sequences dataset that is used for comparing the tested models.
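As an illustration of how such a class distribution can be inspected, the sketch below counts crops per class from a flat annotation table; the file name and column name are hypothetical, and the actual metadata layout used in the thesis may differ.

```python
import pandas as pd

# Hypothetical metadata table with one row per annotated traffic sign crop.
annotations = pd.read_json("traffic_sign_crops.json")
class_counts = annotations["traffic_sign_class"].value_counts()

print(class_counts.head(10))  # most frequent classes
print(class_counts.tail(10))  # rarest classes in the long tail
```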
3.3.2 Data preprocessing and datasets

In order to make the data from ZOD usable for the purpose of the thesis, the 2D annotation boxes were used to crop out the traffic signs from the single-frame dataset. A padding of ten percent was added to each side of the 2D annotation box, and the crops were then saved in their raw format. In addition to the raw files, a separate JSON file with information on the annotations was created for each data set derived from the ZOD, containing the image and annotation IDs, the traffic sign class, and the height and width of each crop. This is useful further down the data preprocessing pipeline, since it allows for filtering of the data based on different criteria.

The standard 90-10 train-test split from the ZOD development kit was used to divide the Frames dataset into a training and a validation set (see Figure 3.2), with a share of the frames being reserved for a sequence set (see Figure 3.3). The sequences dataset used for the thesis (note the distinction from the ZOD Sequences) was created by taking a subset of the frames from ZOD Frames and extending them using the raw video feed from internal, unpublished, and unannotated data, to create sequences of consecutive images. The annotations from the ZOD single frames were then used with an implementation of ByteTrack [50] to create 2D annotation boxes for each frame in the sequences. These tracked annotations were then processed in the same way as the training and validation sets, by cropping with padding and saving each crop as an image. Care was taken to ensure no overlap between the single frames used for training and validation and the single frames used to extract sequences from the additional internal data. Including the final frame that is part of the ZOD, the tracked sequences are 11 frames long. In total, there are four distinct (sub)sets of data used for the thesis (see Figure 3.4). The tracking used for creating the sequences dataset is not perfect, and on average the tracking quality degrades with increasing distance from the annotated frame, i.e., the last frame in each sequence. A more thorough discussion of data quality is included in section A.1 in the appendix.

There are a few reasons for using the additional internal data instead of the ZOD Sequences. First of all, the ZOD Sequences were not available at the start of the work on the thesis. The use of additional data also increases the number of sequences available for model evaluation from the 1429 independent sequences in the ZOD Sequences to 29,359 independent sequences, meaning more statistically sound estimates of model performance can be made. As previously mentioned, TSR is distinctly long-tailed, which means that more data allows for higher coverage of the space of possible input data.

Figure 3.4: Visualisation of the datasets used for the project. The three datasets train, validation, and test are all mutually exclusive subsets of the ZOD Frames dataset. The sequences dataset is an extension of the frames in the test dataset, where the prior frames have been tracked and cropped. Note that $t_i$ denotes the $i$th frame of a sequence, $i \in \{1, \ldots, 11\}$, with each frame in a sequence originating from the same video as the corresponding $t_{11}$ frame.

3.3.3 Dataset implementation details

The datasets and data loaders were implemented in PyTorch. As previously mentioned, the implementation allows for data filtering.
This is done with pandas during dataset initialization. The filtering is based on explicitly excluded classes, which in practice was mainly used to filter out the data points labeled as unclear, but also classes with too few occurrences. This helped with training, since the unclear class is uninformative by nature, and classes with too few occurrences are uninformative due to a lack of variety. The filtering functionality also allows for excluding crops based on size (in pixels). This was useful when training the models, but also for evaluation when running experiments and diagnostics. When the dataset is queried for a new item by the data loader, a transform is first applied to the crop. For training, all crops were resized to 64 × 64 pixels, and then a random crop was made which reduced the size to 56 × 56 pixels. Then, normalization was applied, mapping the RGB values from their range [0, 255] to the range [−1, 1], which in practice speeds up training. The transform used for testing and validation was the same as for training but without the random cropping, to ensure that the results are deterministic.

3.4 Experimental approach

The core approach of the thesis is to compare a single model and a standard ensemble of M members with an ensemble model of the same number of members, but where inference is spread among the members and images over time. Thus, the thesis extends previous literature conducted on single still frames to sequences of frames and proposes a simple and novel approach to applying ensembles on such sequence data. The single model serves as an expected lower bound on performance and a deep ensemble [6] serves as the performance target, referred to as the upper bound. Do note that the upper bound is not actually tractable to deploy in a car due to the limited computational budget; it is simply included as a theoretical maximum of the performance of DESOT that we might hope to observe. We have chosen to refer to these two bounds as the baselines. The thesis investigates where between the two performance bounds the proposed model falls. In practice, this was done by comparing the DESOT to the two baseline models. MC-dropout [44] was used as an extra model of comparison, chosen because of its simple implementation and common use by practitioners.

3.4.1 Formal model definition

Define a sequence $x \in \mathbb{R}^{T \times H \times W \times 3}$ as a vector of $T$ distinct still images, each with a height of $H$ pixels, a width of $W$ pixels, and three separate color channels. Now, we have a classification problem where a model must produce a categorical output distribution across $C$ classes for such a sequence $x$. Assume that there is a set of $M$ different deep NN models that can each conduct this classification. A single model $m \in \{1, \ldots, M\}$ produces a categorical output distribution $p_m(x_t)$ for each single image $x_t$ at timestep $t \in \{1, \ldots, T\}$ in the sequence $x$. Then, the final output distribution for model $m$ is defined as
$$p_m(x) = \frac{1}{T} \sum_{t=1}^{T} p_m(x_t), \qquad (3.1)$$
which is the element-wise (class-wise) average across the output distributions for each image in the sequence at different time steps. This setup is the lower baseline we use for this thesis – a single model that produces a prediction at each time step (see Figure 1.1 for a visualization). In practice, one would use a window size such that Equation 3.1 constitutes a moving average across some images.
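A minimal sketch of this per-frame averaging (Equation 3.1) for a single model is shown below, assuming `frames` is a tensor of shape (T, 3, H, W); the function name is illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def single_model_sequence_prediction(model, frames):
    # One forward pass per frame, treating the T frames as a batch,
    # followed by an element-wise average of the softmax outputs (Equation 3.1).
    model.eval()
    logits = model(frames)             # shape (T, C)
    probs = F.softmax(logits, dim=-1)  # per-frame categorical distributions
    return probs.mean(dim=0)           # averaged distribution over the sequence
```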
Due to the shortness of our sequences, only 11 frames, we have chosen to average across the entire sequence for a model. Now imagine that all M models are used to classify each image in the sequence, such that the final output distribution for the sequence is
$$p_{\mathrm{DE}}(x) = \frac{1}{M} \sum_{m=1}^{M} p_m(x) = \frac{1}{MT} \sum_{m=1}^{M} \sum_{t=1}^{T} p_m(x_t), \qquad (3.2)$$
which is how we choose to apply deep ensembles [6] to sequences – averaged across the images of the sequence $x$. See Figure 1.2 for a visualization of a DE applied to a sequence of images. DE$_M$ will be used to denote an $M$-member deep ensemble.

The proposed model, which we call deep ensemble spread over time (DESOT), instead uses only one model $m \in \{1, \ldots, M\}$ for any given image $x_t$, $t \in \{1, \ldots, T\}$, to produce a categorical output distribution, but the models are alternated such that any given model $m$ is used on average $T/M$ times for a sequence of $T$ images. Analogously to the notation used for DEs, DESOT$_M$ will denote an $M$-member deep ensemble spread over time. Assume that the order in which the models are used on the sequence is defined by an ordered list $O$, $|O| = T$. For any sequence, the final output distribution of the DESOT across the $C$ classes is
$$p_{\mathrm{DESOT}}(x) = \frac{1}{|O|} \sum_{t=1}^{|O|} p_{O_t}(x_t), \qquad (3.3)$$
where $t$ is a time step, $t \in \{1, \ldots, T\}$, and $p_{O_t}(x_t)$ is the output distribution of the ensemble member indicated at the $t$th position in $O$. This definition of DESOT perhaps trivially, but also importantly, means that it can only be applied to sequences, because it fundamentally relies on alternating the members across neighboring frames. If the DESOT model were to be used on a sequence of length one, it would be equivalent to a standard single model. See Figure 1.3 for a visualization of a DESOT.

3.4.2 Computational footprint

One practical aspect that is key to whether an ML model can be employed at all for a certain problem and situation is the size of its computational footprint. If a model's computational footprint is larger than what the computational capacity of the system running the model can handle within the latency requirements of the specified task, then it simply cannot be used. This is one of the main motivations for DESOT, i.e., that it limits the computational resources needed to run the model compared to a traditional deep ensemble. Assume that running inference on a sequence x requires T computations for a single model. Performing inference on that same input then requires MT computations for a DE with M members, meaning that the computational footprint scales linearly with the number of ensemble members. This is a problem in the AD space, since all computations have to be performed in the car in real time with minimal latency. Furthermore, there are many tasks other than traffic sign recognition that have to be performed by the system at each timestep, which means that each subsystem has even stricter limits on its computational footprint. This motivates investigating whether DESOTs, which only run inference with one member each timestep, can perform well while limiting the computational footprint of the ensemble.

3.4.3 Evaluating predictive performance

In connection to Research questions A and B, the main evaluation criterion for predictive performance is to evaluate the different models on the sequences and compare them in terms of F1-score. As previously mentioned, TSR is a domain that typically has many classes and an unequal class distribution.
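For reference, a macro-averaged F1-score weighs every class equally regardless of how many samples it has, which is one common choice when all classes matter; the sketch below uses scikit-learn, which the thesis does not explicitly name, so treat it as an assumption.

```python
from sklearn.metrics import f1_score

def macro_f1(y_true, y_pred):
    # average="macro" gives each class equal weight, so rare traffic sign
    # classes influence the score as much as the majority classes do.
    return f1_score(y_true, y_pred, average="macro")
```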
Since all classes are important, evaluating the performance of DESOT explicitly on rare classes is of value, since the model performance on these classes is otherwise obscured by the majority classes. Additionally, a key aspect of answering Research question B is the effect of ensemble size and sequence length on the observed performance of the models.

3.4.4 Uncertainty quantification and the difficulties of measuring calibration

Regarding Research question C, a key point of interest is to investigate whether the high UQ performance of deep ensembles that has been noted by a number of authors [6], [12], [13] extends to DESOT. In connection with the same research question, the effects of temperature scaling on the calibration of single models and ensembles are of interest.

A discussion regarding the difficulty of measuring and quantifying calibration is in order. As is accounted for in section 2.2, finding the calibration of a model requires knowing the long-term distribution $\nu(p)$ of predictions $p$ for each class $c$, as well as the long-term distribution $\rho(y = c \mid p)$ of the probability of the class being true given the model prediction. In theory, these distributions should be stationary and an infinite number of observations should have been made. In practice, of course, these assumptions never hold, and so we have to make do with empirical estimates. These estimates are in practice made by binning the predictions into K bins, which is necessary since the number of observations is limited. Metrics based on reliability diagrams, such as ECE or MCE, only take the probability assigned to the top class into account, which we have chosen to refer to as confidence. Good calibration by these metrics only means that the confidence is calibrated, not the probabilities assigned to the other C − 1 classes. The uneven class distribution in TSR means that confidences for minority classes are sparse, so the quality of the empirical estimates of the aforementioned distributions is poor, which only compounds the issue. The equal weighting of confidence buckets despite varying population sizes also worsens the problem. For a review of issues with ECE, see subsection 2.2.2.

For real-life applications such as TSR, higher requirements are usually placed on the outputs of AI models. This means that not only the top-class confidence should be calibrated, but also the probabilities assigned to the other C − 1 classes [39]. Brier reliability, as implemented for this thesis, takes these other C − 1 class probabilities into account. This means that for some classes, the number of samples that can be used for estimating calibration increases from fewer than ten to the full size of the dataset, compared to if only confidence is used for measuring calibration. This raises the validity of the calibration estimates included in this thesis. However, due to the widespread use of ECE, results in ECE are also included, but these should be interpreted warily.

3.4.5 Evaluating performance on OOD data

An essential aspect of any ML task is to have a model that generalizes well enough to handle OOD data. The Zenseact open dataset (ZOD) is used for the project, and more information about it can be found in section 3.3. In the ZOD, there are many examples of traffic signs labeled as NotListed, as well as unclear images that are hard to classify even for a human. These have their own class labels in the ZOD.
This allowed us to qualitatively test how well the models perform on OOD data, in the sense that we would like the models to exhibit high uncertainty, measured using entropy, for OOD examples. It has previously been shown that deep ensembles perform well in this kind of qualitative OOD evaluation [6], [12]. However, their performance has not been tested when applied to sequences of single frames, nor has it been compared to ensembles spread over time. This means that there is a gap in research that is interesting to explore.

Another way of testing the OOD performance of ML models is to use augmentations of varying intensity and observe how the accuracy, confidence, and uncertainty displayed by the models change with the augmentation intensity. This approach is employed by Ovadia et al. [13], who use a set of augmentations including rotations and blur. We have chosen to refer to this kind of OOD generation as progressive OODness, due to the increasing intensity of augmentation, but Ovadia et al. [13] refer to it as shifted OOD data. They show that DEs are SotA on this sort of progressive OOD [13]. This kind of comparison on OOD data is interesting since it mimics some of the edge cases that TSR systems might encounter in the real world, such as rotated signs. It also allows for other comparisons to be made, since one of the conclusions of Ovadia et al. [13] is that in-distribution UQ performance is not a good indicator of OOD UQ performance; they found this to be especially true for MC-dropout and temperature scaling of single models. What they did not test was the performance of temperature scaling applied to ensembles or MC-dropout. We employ the same approach to characterize the behavior of the models on progressive OOD data, as a complement to their performance on complete OOD data. We also extend the literature by bringing the discussion to sequence data and DESOTs.

4 Empirical Findings

To answer the research questions, a set of experiments was performed. In this chapter, the experimental setup for these experiments, as well as the empirical findings, are presented. After the results for each aspect of model performance have been shown, they are discussed.

4.1 Experimental setup

In this section, the experimental setup that was used to obtain the results is introduced. The training and evaluation procedures are also accounted for.

4.1.1 Choice of model architecture and size

All models used for this thesis are based on the ResNet [26] CNN architecture. There were a few reasons for this choice. First of all, most literature testing the performance of deep ensembles includes variants of this model architecture, e.g. [12], [13], [38]. The influential paper by Guo et al. [1], which introduced temperature scaling as a means to combat the overconfidence of modern neural networks, also used a variant of this model to empirically support its claims. Thus, its use in the closest related literature makes it suitable for our investigation. From a practical point of view, the ResNet models have been shown to provide strong performance for image classification on many datasets, including ImageNet and CIFAR-10 [26]. They are also known to work well for TSR [17], [18], [51].
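For illustration, adapting a torchvision ResNet to a traffic-sign classification head can look as in the sketch below; the helper name and the number of classes are placeholders, and the specific variant used in the thesis is motivated next.

```python
import torch.nn as nn
from torchvision import models

def build_resnet_classifier(num_classes, arch="resnet18"):
    # Instantiate a ResNet from torchvision without pretrained weights and
    # replace the final fully connected layer with a traffic sign classifier.
    model = getattr(models, arch)(weights=None)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```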
A small study was conducted in order to decide what size of ResNet model to use (see section A.2 in the appendix), the conclusion of which was that the performance difference between ResNets of different sizes is negligible for the chosen problem. Therefore, the smaller ResNet18 version was chosen as the final base model for all experiments. MC-dropout was implemented on the same ResNet18 architecture as the vanilla models, but with an additional dropout layer after each non-linearity that was kept active during testing. A dropout rate of 0.2 was used. Due to time constraints, the dropout rate was not tuned to achieve optimal performance.

4.1.2 Training and evaluation

All models are trained from scratch on the single-frame training dataset described in subsection 3.3.2, using a base learning rate of 0.0005 with the AdamW optimizer [52]. The PyTorch implementation of cosine annealing, first introduced by Loshchilov and Hutter [53], is used to schedule the learning rate to progressively decrease during training for faster and more stable convergence. A batch size of 256 is used. The NotListed and unclear classes are excluded from the training set, along with classes with fewer than 10 occurrences in total between the training and validation datasets. Crops that are smaller than 16 pixels along the smallest dimension are also excluded from the data. In line with the definition of deep ensembles by Lakshminarayanan et al. [6], all members were trained separately on the same data with random weight initialization. The optional adversarial training was not employed. The cross-entropy loss function was used since it is a proper scoring rule.

In total, 25 independent vanilla ResNet18 models were trained, along with five MC-dropout versions of the same. In order to ensure statistically sound performance estimates, all 25 models were run to obtain single-model performance, and five independent 5-member ensembles were created, as well as five 10-member ensembles. For the evaluation of MC-dropout, all five models were evaluated separately.

Here follows a glossary of the notation used for the different models throughout the rest of the report. When evaluated on the sequences dataset, all models use simple averaging as the rule for combining predictions across frames.

• DE$_M$: An $M$-member deep ensemble operating on an image. When evaluated on the sequences dataset, DE$_M$ means that a full $M$-member deep ensemble is run on each frame.
• DESOT$_M$: An $M$-member deep ensemble spread over time, where each member operates on a different frame in the sequence. For a more formal definition of DESOTs, see subsection 3.4.1.
• Single model (SM): A standard single ResNet18 model. Note that this is conceptually the same as a one-member ensemble, and is therefore equivalent to both a DE$_1$ and a DESOT$_1$.
• MC-dropout: A single ResNet18 model trained and evaluated with an extra dropout layer after each non-linearity. The implementation is similar to a single model in the sense that one forward pass is conducted for each time step in the sequence; the outputs are then averaged across time.
• + T: A suffix added to any of the previous model names to denote that temperature scaling has been applied to that model.

Table 4.1: Predictive performance for each model tested on the sequences dataset in terms of accuracy and F1-score. The results include plus and minus one standard deviation of performance between runs.
Model          Accuracy           F1-score
SM             0.9734 ± 0.0006    0.8112 ± 0.0175
DESOT$_5$      0.9760 ± 0.0003    0.8326 ± 0.0093
DE$_5$         0.9764 ± 0.0001    0.8273 ± 0.0101
MC-dropout     0.9710 ± 0.0009    0.7679 ± 0.0271

Figure 4.1: Graph comparing the predictive performance on the sequences dataset, in terms of accuracy over training epochs, of a 5-member DESOT with a 5-member DE, a single model, and an MC-dropout model. The error bars are drawn for ±1 std. The ensemble spread over time (DESOT$_5$) performs on par with the deep ensemble (DE$_5$) despite requiring only 20% as much computation, while outperforming the single model.

4.2 Predictive performance

We evaluate the predictive performance of the different models on the sequences dataset. Note that temperature scaling does not affect the ordering of the predicted classes, and thus not the final model prediction; therefore, results for temperature scaling are omitted from this part of the results. As can be seen in Figure 4.1, all models reach high accuracy in the later epochs, with DESOT$_5$ and DE$_5$ performing markedly better than the other models at early epochs. Notably, DESOT$_5$ performs on par with the traditional DE, and these two models remain slightly better than the other models even after single-model convergence. Final-epoch performance is summarized in Figure 4.2. Compared using F1-score, the difference between models is greater in absolute terms, with DESOT performing about as well as the DE. For both metrics, MC-dropout performs decidedly worse than the other models.

Figure 4.2: Final-epoch predictive performance (accuracy and F1-score) on the sequences dataset, comparing DESOT$_5$ with a single model, DE$_5$, and MC-dropout. The DESOT$_5$ performs about as well as the DE$_5$ on accuracy and F1-score, and both of these perform better than the single model and MC-dropout. Again, note that DESOT$_5$ uses the same amount of computation as an SM or MC-dropout.

4.2.1 Evaluation on classes with few samples

Because of the high performance of DESOT when measured using F1-score, which weighs performance on all classes equally, an additional experiment on rare classes was conducted. This was done by filtering out any classes with more than 500 occurrences in the training and validation datasets, which resulted in 625 sequences. The remaining classes are what we consider rare, and these were used to evaluate the models on rare-class performance. The predictive performance results for this subset of the sequences dataset are shown in Figure 4.3. DESOT performs very favorably in this comparison, outperforming all other models, including the DE. MC-dropout performs by far the worst. Just as for the predictive performance on the whole sequences dataset, the ensembles decrease the variance of the model performance compared to single models.

Figure 4.3: Final-epoch predictive performance (accuracy and F1-score) on a minority-class version of the sequences dataset, comparing DESOT$_5$ with a single model, DE$_5$, and MC-dropout. The DESOT$_5$ outperforms the DE$_5$ on both accuracy and F1-score.
Additionally, it outperforms the single models and MC-dropout by a large margin in both metrics.

4.2.2 Discussion of predictive performance

For the task of classifying traffic signs, one of the most important aspects is the predictive performance, or how well the model can classify the different signs. The results displayed in Figure 4.1 indicate that TSR might be a task on which it is easy to achieve high levels of predictive performance: all models, including single models, achieve an accuracy exceeding 97%. However, the results also show that our model, DESOT$_5$, performs very well compared to the baseline models. The lower baseline, SM, achieves a slightly lower accuracy than our model. The upper baseline, DE$_5$, achieves a marginally higher accuracy than our model, but within one standard deviation.

Looking at the results in terms of F1-score, where performance on minority classes is weighted more heavily, somewhat changes the story, with larger absolute differences in performance between models. DESOT$_5$ and DE$_5$ both have significantly higher average scores than single models and MC-dropout. The single models display a large variance in performance, while both ensembling techniques have a smaller variance. This seems to suggest that classes that occur rarely in the training data are where the main predictive performance benefits of DESOT over single models might lie. In Figure 4.3, the DESOT model seems to outperform the standard DE, which is unexpected since both models have access to the same information in the sequences. One reason might be the inherent noise of predictions on rare classes: while the full ensemble averages predictions from more inferences per frame, the DESOT prediction is not diluted to the same extent, resulting in higher probabilities for the correct rare class.

MC-dropout is an interesting method to compare against, as the dropout creates a slightly different model on each forward pass, which is a form of ensembling, as shown by Gal and Ghahramani [44]. The MC-dropout model does not perform as well as the other models. This might be due to a too-high dropout rate, a hyperparameter that was not thoroughly tuned to maximize performance.

All in all, these results show that our method performs significantly better than a single model with a similar computational footprint, and on par with a DE$_5$, whose computational footprint is five times larger than our model's. The reason for the increased performance over single models might be that the members of an ensemble together have a more expressive representation of the space of possible traffic signs than a single model. The benefit of DESOTs is that they allow for using this more expressive representation while limiting computation. Still, the extra computations performed by DEs do not seem to benefit predictive performance. Perhaps this is because of the relative simplicity of the traffic sign classification task, which means that the extra computations of the DE yield minimal benefits. If these results can be replicated for other tasks and datasets, this is a significant finding.

4.3 In-domain uncertainty quantification

When considering the results in this section, keep in mind the discussion regarding the difficulties of measuring calibration in subsection 3.4.4. For in-domain uncertainty quantification, all models are not only compared against each other but also against their temperature-scaled counterparts.
The temperature scaling is optimized on the single-frame validation set. In section A.3 in the appendix, additional results and observations about temperature scaling are presented. The results for in-domain calibration measured in Brier reliability are shown in Figure 4.4. The results when measured in ECE are shown in Figure 4.5.

Figure 4.4: Uncertainty quantification performance for each model on in-distribution data, measured in Brier reliability (lower is better). (a) Single-frame test dataset. (b) Sequences dataset. Temperature scaling seems to significantly improve the calibration for ensembles both on the test dataset and the sequences dataset. For single models, it instead seems to worsen calibration. Overall, MC-dropout is the worst calibrated of all the models.

Figure 4.5: Uncertainty quantification performance for each model on in-distribution data, measured in ECE (lower is better). (a) Single-frame test dataset. (b) Sequences dataset. Temperature scaling seems to significantly improve the calibration for both single models and ensembles on the test dataset. However, for the sequences dataset, temperature scaling seems to increase ECE. Overall, MC-dropout is the worst calibrated of all the models.

4.3.1 Discussion on in-domain uncertainty quantification

The ca