Novel Scenario Detection in Road Traf-
fic Images
A Comparative Evaluation of Novelty Detection Algorithms

Master’s thesis in Complex Adaptive Systems

ERIK KRATZ

Department of Electrical Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2019


Novel Scenario Detection in Road Traffic Images

A Comparative Evaluation of Novelty Detection Algorithms

ERIK KRATZ

Department of Electrical Engineering
Division of Communication and Antenna Systems

Communication Systems group
Chalmers University of Technology

Gothenburg, Sweden 2019


Novel Scenario Detection in Road Traffic Images
A Comparative Evaluation of Novelty Detection Algorithms
ERIK KRATZ

© ERIK KRATZ, 2019.

Supervisor: Roman Sokolovskii, Department of Electrical Engineering
Supervisor: Cristofer Englund, RISE Viktoria
Examiner: Giuseppe Durisi, Department of Electrical Engineering

Department of Electrical Engineering
Division of Communication and Antenna Systems
Communication Systems group
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Images used for novelty detection experiments. Top row: samples from a
simulator-generated dataset. Bottom row: samples from videos captured in road
traffic.

Typeset in LATEX
Gothenburg, Sweden 2019

iv


Novel Scenario Detection in Road Traffic Images
A Comparative Evaluation of Novelty Detection Algorithms
Erik Kratz
Department of Electrical Engineering
Chalmers University of Technology

Abstract
For artificial neural networks to be deployed in safety critical applications, such as
autonomous driving, there is a need for reliable detection and rejection of unfamiliar
inputs, because of the black box nature of such algorithms. This thesis compares
the performance of three recently published convolutional autoencoder-based nov-
elty detection algorithms when applied to road traffic images. The algorithms were
reimplemented for high-resolution images, and tested for detecting two types of pre-
viously unseen scenarios: unseen weather conditions and unseen type of landscape.
Each use case was represented in two datasets: one simulated dataset with low scene
variation, and one dataset captured in real road-traffic. Classification results in terms
of area under receiver operating characteristic and area under precision-recall curve
show that for low variability in the normal scenario, novelties can be reliably de-
tected with two out of three approaches. For the real image dataset, performance
is consistently lower, indicating that more complex and/or more well tuned models
are needed for use in real-world applications.

Keywords: novelty detection, outlier detection, anomaly detection, artifical neural
networks, machine learning, verification

v


Acknowledgements
First and foremost, I would like to thank Cristofer Englund and Boris Duran at RISE
Viktoria and my supervisor Roman Sokolovskii at the Department of Electrical
Engineering, for helping and guiding me throughout the entirety of this project.
A thanks also goes out to Dr. Giuseppe Durisi at the Department of Electrical
Engineering, for taking on the role of examiner and for reassuring me that I was on
the right track during the most difficult part of the project.
I also want to thank everyone at RISE Viktoria and the participants of SMILE

II, for giving me the opportunity to work on this exciting project. It has been both
fun and challenging.
A final thanks to my friends at Chalmers and my family. Without you, my time

at Chalmers would not have been the same, including this thesis project.

Erik Kratz, Gothenburg, March 2019

vii


Contents

List of Figures xi

List of Tables xv

List of Abbreviations xvii

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2.1 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Scope and Delimitations . . . . . . . . . . . . . . . . . . . . . 2

1.3 Social and Ethical Aspects . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Outcome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Theory 5
2.1 One-Class Classification . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Support Vector Data Description . . . . . . . . . . . . . . . . . . . . 5
2.3 Artifical Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 6

2.3.1 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.3 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.4 Mini-Batch Training . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.5 Epoch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . 11
2.4.1 Convolutional Layers . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.2 Fully Connected Layers . . . . . . . . . . . . . . . . . . . . . . 12
2.4.3 Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.4 Max Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.5 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.6 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.7 Typical Convolutional Neural Network Architecture . . . . . . 13

2.5 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.1 Convolutional Autoencoders . . . . . . . . . . . . . . . . . . . 13
2.5.2 Variational Autoencoders . . . . . . . . . . . . . . . . . . . . . 14

2.6 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . 15

ix


Contents

2.7 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7.1 Benchmarking Image Datasets . . . . . . . . . . . . . . . . . . 15
2.7.2 Self-driving Datasets . . . . . . . . . . . . . . . . . . . . . . . 21

3 Literature Review 25
3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.1.1 Finding Articles . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.2 Ground criteria for algorithm selection . . . . . . . . . . . . . 25
3.1.3 Ground criteria for dataset selection . . . . . . . . . . . . . . 26

3.2 Results of Literature Review . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Summary of Current State-of-the-art Novelty Detection . . . . 26
3.2.2 Algorithms Selected for Reimplementation . . . . . . . . . . . 27
3.2.3 Datasets Selected for the Evaluations . . . . . . . . . . . . . . 31

4 Novelty Detection Experiments 33
4.1 Reimplementation of Selected Algorithms . . . . . . . . . . . . . . . . 33
4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.2.1 Dataset Preparation . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.2 Optimization of Novelty Detection Models . . . . . . . . . . . 38
4.2.3 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.4 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.1 Results for Experiments on the Pro-SiVIC Highway Scenario

Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.2 Results for Experiments on the Dr(eye)ve Dataset . . . . . . . 48

5 Discussion 55
5.1 Implications of Experiment Results . . . . . . . . . . . . . . . . . . . 55

5.1.1 Comparison of Evaluated Algorithms . . . . . . . . . . . . . . 55
5.1.2 Relation to Similar Work . . . . . . . . . . . . . . . . . . . . . 56

5.2 Validity of Experiment Results . . . . . . . . . . . . . . . . . . . . . . 57
5.3 Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

6 Conclusions 59

Bibliography 61

x


List of Figures

1.1 Examples of high confidence misclassification of previously unseen
objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.1 Two dimensional example illustrating the difference between an OCC
problem and a binary classification problem. . . . . . . . . . . . . . . 6

2.2 Example of a MLP with one hidden layer. The nodes represent neu-
rons and the arrows represent connections, or weights. . . . . . . . . . 7

2.3 Plots of various ANN layer activation functions. . . . . . . . . . . . . 8
2.4 BCE and MSE plotted as functions of p, for a fixed y = 0.5. For the

MSE function, this is the special case where n = 1. . . . . . . . . . . 10
2.5 Two examples of a two-dimensional convolutional filter with filter size

k = 2 in both dimensions, applied to an input image of size 4×4. The
filter is multiplied element-wise with different sections of the input
image and the sum of the products is assigned to the corresponding
position of the output map. The output size is determined not only
by the input and filter shape but also the stride s, which is different
for the two examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.6 Example of a convolutional transpose layer, with input of size 2, filter
of size k = 3 and stride s = 2 in both dimensions. The two subfigures
show the difference between the first and second position of the filter,
which is moved in steps of s in the output image instead of the input
image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.7 Example images from the MNIST database of handwritten digits. . . 16
2.8 Example images from the Fashion-MNIST database. . . . . . . . . . . 17
2.9 Example images from the CIFAR-10 dataset. . . . . . . . . . . . . . . 18
2.10 Example images from the Caltech-256 dataset. Each row shows a set

of examples from one category, resized to a square image. . . . . . . . 19
2.11 Example images from the COIL-100 dataset. Each row represents a

new object, in a number of different angles. . . . . . . . . . . . . . . . 20
2.12 Example images from the Berkeley DeepDrive image dataset. . . . . . 21
2.13 Example frames from a subset of the Dr(eye)ve dataset videos. Each

image is the first frame of the corresponding video. . . . . . . . . . . 22
2.14 Example images from the Pro-SiVIC highway scenario dataset. . . . . 23

4.1 Resized image samples from the normal class and the two novelty
scenarios in the Pro-SiVIC highway scenario dataset. . . . . . . . . . 36

xi


List of Figures

4.2 Resized image samples from the normal class and the two novelty
scenarios in the Dr(eye)ve dataset. . . . . . . . . . . . . . . . . . . . 37

4.3 CAE architectures used for experiments with the respective datasets.
In both cases, an extra batch normalization layer was added after the
output layer for the ALOCC algorithm, as this proved to improve
results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.4 Architectures used for the discriminators in the ALOCC algorithm.
Left: architecture for the Pro-SiVIC highway scenario dataset. Right:
architecture for the Dr(eye)ve dataset. . . . . . . . . . . . . . . . . . 39

4.5 Architectures used for the discriminators in the GPND algorithm.
Left: architecture for the Pro-SiVIC highway scenario dataset. Right:
architecture for the Dr(eye)ve dataset. . . . . . . . . . . . . . . . . . 39

4.6 ROC and PRCs for the ALOCC algorithm, on the Pro-SiVIC highway
scenario dataset, with unseen weather as novelties. . . . . . . . . . . . 42

4.7 Histograms of novelty scores for the ALOCC algorithm, on the Pro-
SiVIC highway scenario dataset, with unseen weather as novelties. . . 42

4.8 ROC and PRCs for the ALOCC algorithm, on the Pro-SiVIC highway
scenario dataset, with unseen landscape as novelties. . . . . . . . . . 43

4.9 Histograms of novelty scores for the ALOCC algorithm, on the Pro-
SiVIC highway scenario dataset, with unseen landscape as novelties. . 43

4.10 ROC and PRCs for the DSVDD algorithm, on the Pro-SiVIC highway
scenario dataset, with unseen weather as novelties. . . . . . . . . . . . 44

4.11 Histograms of novelty scores for the DSVDD algorithm, on the Pro-
SiVIC highway scenario dataset, with unseen weather as novelties. . . 45

4.12 ROC and PRCs for the DSVDD algorithm, on the Pro-SiVIC highway
scenario dataset, with unseen landscape as novelties. . . . . . . . . . 45

4.13 Histograms of novelty scores for the DSVDD algorithm, on the Pro-
SiVIC highway scenario dataset, with unseen landscape as novelties. . 46

4.14 ROC and PRCs for the GPND algorithm, on the Pro-SiVIC highway
scenario dataset, with unseen weather as novelties. . . . . . . . . . . . 46

4.15 Histograms of novelty scores for the GPND algorithm, on the Pro-
SiVIC highway scenario dataset, with unseen weather as novelties. . . 47

4.16 ROC and PRCs for the GPND algorithm, on the Pro-SiVIC highway
scenario dataset, with unseen landscape as novelties. . . . . . . . . . 47

4.17 Histograms of novelty scores for the GPND algorithm, on the Pro-
SiVIC highway scenario dataset, with unseen landscape as novelties. . 47

4.18 ROC and PRCs for the ALOCC algorithm, on the Dr(eye)ve dataset,
with unseen weather as novelties. . . . . . . . . . . . . . . . . . . . . 48

4.19 Histograms of novelty scores for the ALOCC algorithm, on the Dr(eye)ve
dataset, with unseen weather as novelties. . . . . . . . . . . . . . . . 48

4.20 ROC and PRCs for the ALOCC algorithm, on the Dr(eye)ve dataset,
with unseen landscape as novelties. . . . . . . . . . . . . . . . . . . . 49

4.21 Histograms of novelty scores for the ALOCC algorithm, on the Dr(eye)ve
dataset, with unseen landscape as novelties. . . . . . . . . . . . . . . 49

4.22 ROC and PRCs for the DSVDD algorithm, on the Dr(eye)ve dataset,
with unseen weather as novelties. . . . . . . . . . . . . . . . . . . . . 49

xii


List of Figures

4.23 Histograms of novelty scores for the DSVDD algorithm, on the Dr(eye)ve
dataset, with unseen weather as novelties. . . . . . . . . . . . . . . . 50

4.24 ROC and PRCs for the DSVDD algorithm, on the Dr(eye)ve dataset,
with unseen landscape as novelties. . . . . . . . . . . . . . . . . . . . 50

4.25 Histograms of novelty scores for the DSVDD algorithm, on the Dr(eye)ve
dataset, with unseen landscape as novelties. . . . . . . . . . . . . . . 51

4.26 ROC and PRCs for the GPND algorithm, on the Dr(eye)ve dataset,
with unseen weather as novelties. . . . . . . . . . . . . . . . . . . . . 52

4.27 Histograms of novelty scores for the GPND algorithm, on the Dr(eye)ve
dataset, with unseen weather as novelties. . . . . . . . . . . . . . . . 52

4.28 ROC and PRCs for the GPND algorithm, on the Dr(eye)ve dataset,
with unseen landscape as novelties. . . . . . . . . . . . . . . . . . . . 53

4.29 Histograms of novelty scores for the GPND algorithm, on the Dr(eye)ve
dataset, with unseen landscape as novelties. . . . . . . . . . . . . . . 53

xiii


List of Figures

xiv


List of Tables

4.1 Programming frameworks used in the implementations of the evalu-
ated algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.2 Number of images in the dataset splits. The two values in the Pro-
SiVIC highway scenario test set refer to the novelty scenarios weath-
er/landscape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.3 Attributes used for splitting the Dr(eye)ve dataset into normal sce-
nario and novelty scenarios, and the videos each set of images was
sampled from . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.4 CAE architectures and training settings . . . . . . . . . . . . . . . . . 39
4.5 A confusion matrix, showing the relation between classifier prediction

and true value of normal class membership . . . . . . . . . . . . . . . 40
4.6 ND performance metrics for all experiments . . . . . . . . . . . . . . 42

xv


List of Tables

xvi


Abbreviations

AAE adversarial autoencoder
AE autoencoder
ALOCC adversarially learned one-class classifier for novelty

detection
ANN artificial neural network
AUPRC area under precision-recall curve
AUROC area under receiver operating characteristic

BCE binary cross entropy

CAE convolutional autoencoder
CNN convolutional neural network

DNN deep neural network
DSVDD deep support vector data-description

FFNN feed-forward neural network
FN false negatives
FP false positives
FPR false positive rate

GPND generative probabilistic novelty detection

ML machine learning
MLP multilayer perceptron
MNIST Modified National Institute of Standards and

Technology
MSE mean squared error

ND novelty detection

OCC one-class classification

PRC precision-recall curve

ReLU rectified linear unit

xvii


Abbreviations

ROC receiver operating characteristic

SMILE II Safety analysis and verification/validation of Ma-
chIne LEarning based systems

SVDD support vector data-description

TN true negatives
TP true positives
TPR true positive rate

VAE variational autoencoder

xviii


1
Introduction

1.1 Background

Artificial neural networks (ANNs) have in recent years become the state-of-the-art
within pattern recognition and classification. They have successfully been used for
tasks such as image classification [1, 2, 3], speech recognition [4], and even generation
of realistic images from drawings [5].

The potential of finding complex mappings between input data, such as images
from a front-facing vehicle camera, and desired outputs make ANNs, and specifically
deep neural networks (DNNs), promising for use in autonomous driving applications.
Examples of existing results include learning to output the vehicle steering angle [6]
or a set of currently feasible actions [7] based on raw image input.

The power of DNNs lies in the high number of connections. However, the very
same property also means that the flow of information inside a DNN is difficult to
follow. This black box nature of DNNs is a severe drawback. Even networks that
achieve high accuracy, in both training and testing, can misclassify inputs, some-
times with high confidence, and knowing what causes the errors is often practically
impossible. This unreliability effectively disqualifies DNNs for use in safety-critical
situations, such as autonomous driving, where an undetected misclassification can
have dire conseqeunces.

Misclassifications can occur for a number of reasons, including the DNN being
overfitted to training data and the DNN being poorly trained. Another reason are
the recently discovered adversarial examples: small carefully designed perturbations
of an input that would be classified correctly, which result in a severe misclassi-
fication. A situation where misclassifications will always occur is when there is a
discrepancy between the distribution of the training data and the distribution of
the test data. A property of DNN classifiers is that they will classify any input into
one of the classes seen during training, simply because this is what they were de-
signed to do. A simple example can be seen in Fig. 1.1, which shows a set of images
with corresponding classifications generated by a classifier trained on the CIFAR-
10 dataset [8]. The ground truth for all images is a type of novelty, i.e., an object
class not present in CIFAR-10. For each of these inputs, the network provides high
confidence classifications in one of the CIFAR-10 classes.

The preferred behaviour of the above classifier, if integrated in a safety-critical
system, would be to detect the example images as unknown and subsequently pass
control to another part of the system, e.g., a default safe behaviour which does not
require knowledge of the image content. The research project Safety analysis and
verification/validation of MachIne LEarning based systems (SMILE II) [9] aims to

1


1. Introduction

(a) A lamb, classified as a bird with
probability 0.995.

(b) A goldfish, classified as an air-
plane with probability 0.992.

(c) Some flowers, classified as a
frog with probability 0.999.

Figure 1.1: Examples of high confidence misclassification of previously unseen objects.

solve this problem for the case of vehicular perception sensor inputs, and is led by
the research institute RISE Viktoria. This thesis is conducted as a part of SMILE
II, and will focus on the novelty detection (ND) step, where inputs that are not part
of the training data distribution are detected and rejected.

1.2 Problem Description

The problem addressed in this thesis is that of performing ND in sets of raw pixel
images from on-road traffic. More specifically, the goal is finding one or several al-
gorithms capable of modelling a normal scenario class consisting of a set of images
from a front-facing vehicle camera driving in a certain type of conditions. The ob-
jective during testing is to detect images taken in other driving conditions and mark
them as novel, while still accepting normal images. A successful ND algorithm effec-
tively becomes a safety cage for any DNN built for application using the normal class
dataset, since the ND algorithm removes inputs not seen during training of the DNN.
This would make the DNN more suited for use in safety-critical applications.

1.2.1 Aim

The aim of this project is to identify a set of existing state-of-the-art novelty de-
tection algorithms that perform well on front camera input to self-driving systems.
Algorithms were reimplemented, evaluated, and compared using relevant classifica-
tion evaluation measures.

1.2.2 Scope and Delimitations

The following delimitations serve to further limit the scope of the project:
• For selected algorithms to be extendable to work with any type of sensor input

and with large amounts of data, we demand that they are unsupervised: i.e.,
given a set of training inputs, there should not be any requirement for further
annotation, such as subclasses or object labels.

• ND will be performed on single image frames. Algorithms analyzing video
sequences are not considered in this thesis.

2


1. Introduction

• Whole, raw pixel images will be used as inputs, so as to learn to detect an
entire scenario as either normal or novel. No object detection or other type of
image segmentation will be performed explicitly.

• Editing of selected algorithms is limited to making them compatible with the
chosen dataset(s) and tuning of existing parameters and settings.

• When possible, open source code will be used.
• Hardware requirements for testing in real-time will not be taken into account

when selecting algorithms for evaluation. However, these properties may be
discussed when comparing the selected algorithms after evaluation.

1.3 Social and Ethical Aspects
This section reflects on possible negative social and ethical impacts of the completion
of this thesis project.

1.3.1 Implementation
Our view is that there are no ethical or social problems related with the implemen-
tation of this project. The main activities will be a literature review and computer
simulations, neither of which is going to have any unwanted impact on other peo-
ple.

1.3.2 Outcome
However large or small the contribution, the goal of this project is to help in achiev-
ing safe autonomous driving applications. This goal likely would have a big impact
on society if realized; one negative impact is the loss of employment for profes-
sional drivers. However, this issue is clearly outweighed by the expected benefits of
increased traffic safety and reduced greenhouse gas emissions, making the project
worth implementing.

1.4 Thesis Outline
First, theory relevant for understanding the problem and the proposed solutions is
presented in Section 2. Then, the literature review conducted to find relevant self-
driving image datasets and state-of-the-art ND algorithms is presented in Section 3.
The reimplementation of selected algorithms and experimental results are presented
in Section 4. The experimental results are then discussed in Section 5. Finally, the
main conclusions of the thesis project are given in Section 6.

3


1. Introduction

4


2
Theory

This chapter presents the basic theoretical concepts needed to understand the method-
ology and experiments in this thesis.

2.1 One-Class Classification
ND is often approached as a one-class classification (OCC) problem. In OCC, the
aim is to determine whether an input belongs to the normal class or not. More
formally, for a given point x ∈ Rn , the goal is to determine if x ∈ A ⊂ Rn, or if
x ∈ Rn \ A, where the only available information is a set of N known members of
A, Â = {a1, . . . , an} ⊂ A. This is different from binary classification and multiclass
classification, where all known classes have samples, and every possible input is as-
sumed to belong to one of the known classes. The difference is illustrated in Fig. 2.1.
In binary classification, it is enough to find a boundary separating the closest sam-
ples of different classes to separate all samples. In one-class classification there is no
information about points outside the normal class, which makes it harder to find a
suitable decision boundary.

2.2 Support Vector Data Description
Support vector data-description (SVDD) [10] is an OCC algorithm for describing a
set of n-dimensional column vector inputs xi ∈ Rn, i = 1, . . . , N . The aim is to find
a hypersphere of minimum radius R enclosing feature space points Φ(xi), where Φ
is some mapping from the input space Rn to a feature space. In the simplest case,
Φ is the identity mapping, so that Φ(xi) = xi, but to obtain more flexible solutions,
other functions can be used, e.g., kernel functions mapping inputs into a feature
space of higher dimension m > n [10]. For a given Φ, the SVDD objective can be
defined as

min
R,c,ξ

R2 + 1
νN

N∑
i=1

ξi, (2.1)

with constraints

||Φ(xi)− c||2 ≤ R2 + ξi, ξi ≥ 0, i = 1, . . . , N. (2.2)

The variables ξi enable a soft boundary such that all feature points need not lie
within distance R from the hypersphere center c, and the parameter ν determines
the trade-off between minimizing R and ξi. The above notation is the same as in
[11], which is further elaborated on in Section 3.2.2.

5


2. Theory

x

y

(a) Example of a binary classification problem. The
straight boundary separating the two well sampled
classes is relatively easy to model.

y

x

(b) Example of an OCC problem. Since only the nor-
mal class is well sampled, it is relatively hard to model
an optimal decision boundary.

Figure 2.1: Two dimensional example illustrating the difference between an OCC problem and a binary classification
problem.

2.3 Artifical Neural Networks

An ANN is, as the name suggests, a network where the basic unit is an artificial
version of the biological neuron. Each artificial neuron can be connected to other
neurons via both incoming and outgoing connections. These connections correspond
to the synapses in the brain. The basic action of a neuron is to:

1. Receive input signals from other neurons through incoming connections.
2. Compute an output signal, called activation, based on the inputs.
3. Send the activation signal to other neurons through outgoing connections.
A feed-forward neural network (FFNN) is an ANN where information flows one

way: from the input in one end to the output in the other end. A common type of
FFNN is the multilayer perceptron (MLP). In a MLP, the input signal is propagated
through one layer of neurons at a time. Neurons in a given layer have incoming
connections only from neurons in the previous layer and outgoing connections only
to the subsequent layer. The activation x

(l)
j of neuron j in layer l is determined

as

x
(l)
j = g

(∑
i

w
(l)
ij x

(l−1)
i + b

(l)
j

)
, (2.3)

that is, a weighted sum of the previous layer activations x(l−1)
i , with weights w(l)

ij

and bias b(l)
j . The function g is called the activation function and is normally chosen

depending on the type of network and layer, see Section 2.3.1. The bias can also be
included as an extra neuron with constant activation equal to 1, and can therefore
be excluded without loss of generality. In a MLP, the nodes representing the input
signal are called the input layer, the nodes representing the output signal are called
the output layer, while all intermediate layers are called hidden layers. An example

6


2. Theory

x
(1)
1 x

(2)
1 x

(3)
1

x
(1)
i x

(2)
j x

(3)
k

x(3)
n3

w(2)
ij

x(2)
n2x(1)

n1

Figure 2.2: Example of a MLP with one hidden layer. The nodes represent neurons and the arrows represent
connections, or weights.

of a MLP with n1-dimensional input, a single hidden layer with n2 neurons, and
n3-dimensional output can be seen in Fig. 2.2.

2.3.1 Activation Functions
Below are some common activation functions for ANNs. They are also plotted in
Fig. 2.3.

Sigmoid

Sigmoid functions are defined by their S-shape, and are monotonically increasing
functions between two real values. Often the output range is (−1, 1) or (0, 1). Ex-
amples are the logistic function,

g(x) = 1
1 + e−x

∈ (0, 1), ∀x ∈ (−∞,∞), (2.4)

and hyperbolic tangent function,

g(x) = ex − e−x

ex + e−x
∈ (−1, 1), ∀x ∈ (−∞,∞). (2.5)

Because of its monotonic mapping and (0, 1) output range, the logistic function (2.4)
is normally used in layers where the output is interpreted as a probability.

7


2. Theory

(a) Plots of the logistic function and the hyperbolic
tangent function.

(b) Plots of the ReLU function and the leaky ReLU
function.

Figure 2.3: Plots of various ANN layer activation functions.

ReLU

The rectified linear unit (ReLU) activation is defined by

g(x) = max(0, x), (2.6)

and is very popular in deep learning applications since it has been demonstrated
empirically [12] to lead to faster learning compared with traditional activations,
such as sigmoids. A possible reason for this is that ReLU combines non-linearity
with being computationally inexpensive. With ReLU, the problem of vanishing and
exploding gradients is also avoided, since the derivative is always 1 for x > 0.
A known problem with the ReLU activation is that neurons become permanently
inactive if they at some point get negative input for all training set samples. Since
their gradient is 0 for x < 0, ReLU neurons cannot recover in such situations and
will output 0 indefinitely. This is known as the dying ReLU problem.

2.3.1.1 Leaky ReLU

The leaky ReLU function is defined by

g(x) =
{

x, x ≥ 0
αx, x < 0,

(2.7)

where α > 0 is a small number. Leaky ReLU is an option used to avoid the dying
ReLU problem, since its gradient is nonzero also for x < 0.

Softmax

The softmax function is a generalization of the logistic function (2.4), outputting a
vector of numbers in the range (0, 1). It is defined by

gj(x) = exj∑N
i=1 e

xi
. (2.8)

8


2. Theory

which by definition yields gj such that gj > 0, j = 1, . . . , N and ∑N
j=1 gj = 1. The

softmax layer is commonly used for categorical probability distributions as the final
output of multi-class classifiers, where the output gj signifies the probability that
the input belongs to class j.

2.3.2 Loss Function
The loss function is the optimization objective during training of an ANN. It is
normally a measure of the difference between the output prediction p of the ANN
and a corresponding target output y and is preferably differentiable w.r.t. the ANN’s
weights and biases. This way, ANN optimization can be performed by minimizing
the loss function using, e.g., gradient descent. The loss functions used in this thesis
are binary cross entropy (BCE),

BCE(y, p) = −y log p− (1− y) log (1− p) (2.9)

and mean squared error (MSE),

MSE(y,p) = 1
n

n∑
i=1

(yi − pi)2, (2.10)

where the MSE is computed over a mini-batch of n predictions p = p1, . . . , pn and
corresponding targets y = y1, . . . , yn. Both functions are differentiable everywhere
and have a global minimum at y = p, which make them suitable for regression. Plots
for both BCE and MSE for a fixed y = 0.5 can be seen in Fig. 2.4.

2.3.3 Backpropagation
Backpropagation refers to the process of computing the loss in the output layer and
then using this error to compute a similar loss for the previous layer. This process is
then repeated until the input layer is reached. The parameters of all layers can hence
be updated by propagating the output loss backwards through the ANN, yielding
the name backpropagation.

2.3.4 Mini-Batch Training
Mini-batch training refers to processing a subset, called a mini-batch, of the training
data between each update of the ANN parameters. For each input sample in a
mini-batch, the loss and corresponding parameter update is computed separately.
The parameters are then updated just once per mini-batch, using the sum of all
parameter updates computed for the current mini-batch. The word batch refers to
the entire set of training inputs, also called the training set.

2.3.5 Epoch
In machine learning, one training epoch is said to have passed each time all input
samples in the training set have been processed one time and the ANN parameters
have been updated according to the used optimization objective.

9


2. Theory

Figure 2.4: BCE and MSE plotted as functions of p, for a fixed y = 0.5. For the MSE function, this is the special
case where n = 1.

10


2. Theory

0.9 0.1 0.0 0.2

0.5 0.5 0.7 1.0

0.8 0.4 0.6 0.5

0.7 0.2 0.1 0.9

0.41 0.96 1.36

0.53 0.43 0.28

0.18 -0.120.57

-0.4 -0.3

1.00.6

(a) Convolution with stride s = 1 in both dimensions.

0.9 0.1 0.0 0.2

0.5 0.5 0.7 1.0

0.8 0.4 0.6 0.5

0.7 0.2 0.1 0.9

0.41 1.36

0.18 0.57

-0.4 -0.3

1.00.6

(b) Convolution with stride s = 2 in both dimensions.

Figure 2.5: Two examples of a two-dimensional convolutional filter with filter size k = 2 in both dimensions,
applied to an input image of size 4 × 4. The filter is multiplied element-wise with different sections of the input
image and the sum of the products is assigned to the corresponding position of the output map. The output size is
determined not only by the input and filter shape but also the stride s, which is different for the two examples.

2.4 Convolutional Neural Networks

Digital image inputs are represented with one or more scalar values per pixel (usu-
ally three for color images). This means the input dimensionality is high, even for
relatively small images. In a MLP, there are (n + 1)m weights and biases between
an input of dimension n and a subsequent layer of dimension m, which leads to a
large number of parameters for MLPs built for processing images. Moreover, with
unique weights for each pixel, the features learned by the MLP are local, meaning
that objects seen during training will only be recognized if they are in the same
part of the input image as they were in the corresponding training sample(s). These
problems are either reduced or avoided with convolutional neural networks (CNNs),
a type of FFNNs which use convolutional layers as part of the network. Below, the
different parts of a typical CNN are described. In addition to CNNs, they are also
applicable in similar types of networks, such as convolutional autoencoders (CAEs),
described in Section 2.5.1.

2.4.1 Convolutional Layers
In a convolutional layer, a set of k × k filters is cross-correlated with the input
image. Cross-correlation means sliding the filter across the input in steps of size s
and performing element-wise multiplication between the filter and the input points
currently covered by the filter. For each filter position, the k2 products are added
to yield a single data point in the output. The result is a new image, called an
activation map or a feature map. The activation map has high pixel values in areas
where the input is similar to the filter and low values where it is not. This way, a
filter works as a feature extractor. Typically, a number of filters is used in each layer,
to extract different features. The number of filters is equal to the number of output
maps, and is called the depth of the layer. The shape of the output maps depend
on the input shape, the filter size k, the stride s and the dilation d. The stride s
is the step size used when moving filters across the image and can be different in
different dimensions. The dilation d determines the spacing between the pixels the
filter is applied to. The difference between s = 1 and s = 2 can be seen in Fig. 2.5.
Setting s > 1 or d > 1 is a way of reducing the size of the output image, thereby
performing dimensionality reduction.

11


2. Theory

2.4.2 Fully Connected Layers
Fully connected layers, also called dense layers, are equivalent to layers in a MLP.
Each neuron in the input is connected to each neuron in the output. Fully connected
layers have the advantage of being able to model complex mappings, and are thus
often used after feature extraction, as a mapping from high level features to the
desired output shape, e.g., a class prediction. The drawback with fully connected
layers, compared with convolutional layers, is the high number of parameters (i.e.,
weights and biases) which can make them inconvenient and difficult to train.

2.4.3 Padding
Padding is the process modifying the shape of an image by adding pixels to it.
Padding is normally applied before or after convolutional layers, since their output
shape can not be explicitly specified.

2.4.4 Max Pooling
Pooling is used for dimensionality reduction, often after convolutional layers. A
cluster of neurons in the input is represented as one neuron in the output. In max
pooling, the maximum value of the cluster is chosen.

2.4.5 Dropout
In a dropout layer, each neuron is deactivated, meaning it outputs zero activa-
tion, with some nonzero probability. This means the total layer activation becomes
less dependant on the outputs of specific neurons, which improves network robust-
ness.

2.4.6 Batch Normalization
By using a batch normalization [13] layer, the activation of the preceding ANN layer,
which is the input to the batch normalization layer, is normalized by having its mean
and variance kept fixed during training. The mean and variance is computed over
all samples in the batch, but independently for each point in the input. The details
of the normalization process will not be covered here, however they can be found in
the original paper [13].

The purpose of batch normalization is to reduce internal covariate shift, which is
when the probability density functions of inputs to the intermediate layers of a DNN
change during training, as the network parameters change. When the distribution of
a layer activation changes, the learning is slowed down, since the subsequent layer
has to learn to match a new input distribution to the training targets. Through
batch normalization, internal covariate shift is reduced, which in turn reduces the
total training time. The vanishing gradient problem, which refers to when the gra-
dient of the loss function becomes so small that learning is effectively stopped, is
also eliminated when using batch normalization. For common activation functions,

12


2. Theory

such as sigmoids, the gradient vanishes for inputs far from zero, something which
normalizing the inputs to zero mean and unit variance counteracts.

2.4.7 Typical Convolutional Neural Network Architecture
In a typical CNN classifier, the above layers are combined in two parts. The first is
a feature extraction part: this consists of a set of convolutional layers, each followed
by one or several of the pooling, dropout and batch normalization layers. After the
feature extraction, a classifier is used to model the mapping from feature space to
desired output space. For a multiclass classifier, this is usually one or several fully
connected layers, i.e., a MLP, followed by a final softmax layer. The activations in
the final fully connected layer are real valued and are typically called logits, and
the softmax layer converts the logits to probabilities, which are used as a class
membership prediction. This is true also for a one-class classifier, with the softmax
layer reducing to a simple sigmoid activation and the probability vector reducing to
a scalar which denotes the probability of the input being a member of the normal
class.

During training, backpropagation can be used for the whole network, however
the convolutional layers require a special type of backpropagation, which will not
be outlined here. For more information on CNNs, including a a derivation of back-
propagation rules, the interested reader is referred to [14].

2.5 Autoencoders
An autoencoder (AE) is a FFNN which reproduces its inputs after first compressing
them into a space of lower dimension. An AE has two parts: the encoder network
and the decoder network. The encoder compresses the input into a latent space, and
the decoder maps the latent representation back into input space. The decoder is
normally a mirrored version of the encoder, meaning the autocoder architecture is
symmetric. An AE is normally trained using backpropagation, using the reconstruc-
tion error, or reconstruction loss, i.e., some measure to evaluate how different the
reconstruction is from the original input, as loss function. Since the target output is
the input, no annotation of the training data is required, and thus the training can be
said to be unsupervised. This property of AEs is useful for feature extraction when
no labels for further classifying the input data are available, e.g., in OCC.

2.5.1 Convolutional Autoencoders
As the name suggests, a CAE is an AE with convolutional layers. Similarly to a CNN,
it is suitable for image inputs. Where the encoder network contains convolutional
layers, the decoder contains convolutional transpose layers, sometimes also called
deconvolutional layers or fractionally strided convolutional layers.

Since the encoder reduces the size of the input image, the decoder needs to in-
crease it to retain the original image shape in the output layer. Just as dimensionality
reduction, dimensionality increase can be performed in different ways. Two of them
are presented below.

13


2. Theory

0.12 0.2

1.4

0.46 0.6 0.3

0.1 0.0 0.12

0.2 0.6

0.16 0.12 0.92 0.42

0.24 0.4 1.04 1.4

0.12 0.06 0.64 0.180.4
0.20.1 0.0

0.9

0.5

0.4 0.3

1.00.6
1.26

0.7

0.82

0.02 0.0

(a) Convolutional transpose operation on the first data
point in the input.

0.46 0.6 0.3

0.1 0.0 0.12

0.2 0.6

0.16 0.12 0.92 0.42

0.24 0.4 1.04 1.4

0.12 0.06 0.64 0.18

0.12 0.2

0.4 1.4
0.20.1 0.0

0.9

0.5

0.4 0.3

1.00.6
1.26

0.7

0.82

0.02 0.0

(b) Convolutional transpose operation on the second
data point in the input, where the filter has been moved
s = 2 steps in the horizontal dimension of the output,
compared with the first operation in Fig. 2.6a.

Figure 2.6: Example of a convolutional transpose layer, with input of size 2, filter of size k = 3 and stride s = 2 in
both dimensions. The two subfigures show the difference between the first and second position of the filter, which
is moved in steps of s in the output image instead of the input image.

Unpooling layers

If the encoder network contains pooling layers, the decoder normally contains un-
pooling or upscaling layers in the corresponding position of the decoder. Max pooling
is not reversible, but the index of the maximum value during the pooling operation
can be stored. During the unpooling of a pixel, the pixel value is assigned to the out-
put pixel corresponding to the stored index, while the rest of the unpooling output
cluster is set to some constant value.

Upsampling by transposed convolution

If the encoder has convolutional stride s > 1 for dimensionality reduction, upsam-
pling in the decoder can be performed in the convolutional transpose layers. In short,
with a filter of size k×k, 1 point in the input corresponds to k2 points in the output,
instead of the opposite, which is the case with convolutional layers. The stride s of
a convolutional transpose layer refers to the step size in the output image and not
the input. Another way of viewing this stride is that moving 1 point in the out-
put corresponds to moving 1/s points in the input image, giving the optional name
fractionally strided convolution.

Upsampling with convolutional transpose layers has the advantage of allowing
more parameter optimization than unpooling, i.e., the upsampling operation is learn-
able. An example of a convolutional transpose layer can be seen in Figure 2.6. For
a more detailed explanation of convolutional transpose layers, the interested reader
is referred to [15].

2.5.2 Variational Autoencoders
In a variational autoencoder (VAE), a prior distribution is imposed on the latent
space representations. This means that apart from training the encoder to extract
relevant features for reconstruction and the decoder to reproduce the input, the
encoder is trained to map inputs onto a specific distribution, typically a normal
distribution, in latent space. Instead of encoding the input x into one latent vector
z(x), two vectors µ(x) and σ(x) are generated. The latent representation fed to the

14


2. Theory

decoder is then sampled from the normal distribution N (µ(x), σ(x)). This means
that during training, the same input x will generate different latent representations,
making the latent space more well sampled than with deterministic encodings. The
result is a latent space distribution which is continuous. This makes it possible to
randomly sample latent space representations, feed them to the trained decoder,
and retrieve new outputs which are similar to the training data: the VAE is said to
have generative properties.

For a more thorough explanation of VAEs, we refer to [16].

2.6 Generative Adversarial Networks
A generative adversarial network consists of a generator G and a discriminator D.
G generates outputs, e.g., images, from inputs z that are sampled from some distri-
bution f(z). The purpose of D is to distinguish outputs of G from real images from
some training dataset. G and D are trained jointly, but have separate loss functions:
G is rewarded for tricking D into believing its outputs are from the training set
and D is rewarded for correctly classifying inputs as either training set members or
outputs of G. Since the structure of G is similar to the decoder part of a VAE, G
can be initialized as the decoder of a VAE pretrained on the training set.

2.7 Datasets
This section presents a list of publically available image datasets. Section 2.7.1
presents the image datasets that were used in the original experiments of algorithms
reimplemented for this thesis. Self-driving datasets are presented in Section 2.7.2.
The self-driving datasets used for experiments in this thesis are further discussed in
Section 3.2.3.

2.7.1 Benchmarking Image Datasets

Modified National Institute of Standards and Technology

The Modified National Institute of Standards and Technology (MNIST) database
of handwritten digits [17] is a frequently used dataset for benchmarking image clas-
sification algorithms. It consists of 70 000 images of handwritten digits: 60 000 in a
training set and 10 000 in a test set, distributed across all 10 digit classes. The im-
ages are 28× 28 pixels in grayscale. Examples of images from the MNIST database
can be seen in Fig. 2.7.

Fashion-MNIST

The Fashion-MNIST database [18] was created to be a more challenging alternative
to the original MNIST database and consists of images of garments instead of hand-
written digits. It has the same format as MNIST: 70 000 grayscale images of size
28× 28. Example images can be seen in Fig. 2.8.

15


2. Theory

Figure 2.7: Example images from the MNIST database of handwritten digits.

16


2. Theory

Figure 2.8: Example images from the Fashion-MNIST database.

17


2. Theory

Figure 2.9: Example images from the CIFAR-10 dataset.

CIFAR-10

The CIFAR-10 dataset [8] is a set of small color images, 32×32 pixels, in 10 different
object categories. The images are downscaled from larger images of various sizes and
aspect ratios. The object classes are airplane, automobile (but not truck or pickup
truck), bird, cat, deer, dog, frog, horse, ship, and truck (but not pickup truck). The
images depict different instances of each object class, on different backgrounds, and
from different viewpoints. Each category has 6 000 samples. Example images can be
seen in Fig. 2.9.

Caltech-256 Object Category Dataset

The Caltech-256 Object Category Dataset [19] is a dataset of 30 607 images of
different sizes and aspect ratios, sorted into 257 different object categories. The
257th category is called "clutter" and is included to represent novelty samples. The

18


2. Theory

Figure 2.10: Example images from the Caltech-256 dataset. Each row shows a set of examples from one category,
resized to a square image.

number of images per category varies from 80 to 807 with a mean of 119. Examples
can be seen in Fig. 2.10.

Columbia Object Image Library (COIL-100)

The COIL-100 database [20] consists of 7 200 images distributed evenly across 100
different objects. Each object is depicted in 72 different angles, with a 5 degree
difference between each image. Examples of 10 of the objects in 10 different poses
can be seen in Fig. 2.11.

19


2. Theory

Figure 2.11: Example images from the COIL-100 dataset. Each row represents a new object, in a number of
different angles.

20


2. Theory

Figure 2.12: Example images from the Berkeley DeepDrive image dataset.

2.7.2 Self-driving Datasets

Berkeley DeepDrive

The Berkeley DeepDrive database [21] is a large video database for self-driving
applications, consisting of 100 000 high definition videos in a range of locations and
driving conditions. Each video is roughly 40 seconds long at 30 frames/s, resulting in
120 000 000 images. For each video sequence, there is metadata annotation including
time of day, weather conditions and type of landscape. There is also a separate image
dataset with one frame from each video sequence which, in addition to the metadata,
is provided with annotation for tasks such as object detection and drivable area
segmentation. Example images can be seen in Fig. 2.12.

Dr(eye)ve

The Dr(eye)ve dataset [22] consists of 74 video sequences of 5 minutes each at
25 frames/s, totalling 555 000 frames. The videos were captured with a vehicle-

21


2. Theory

Figure 2.13: Example frames from a subset of the Dr(eye)ve dataset videos. Each image is the first frame of the
corresponding video.

mounted, front-facing camera. Each video sequence is annotated with time of day
(morning, evening, night), weather (sunny, cloudy, rainy) and type of landscape
(highway, countryside, downtown). The data annotation also includes driver’s gaze
fixation, as the dataset was originally created for tasks regarding driver attention.
Example frames can be seen in Fig. 2.13.

Pro-SiVIC highway scenario dataset

The Pro-SiVIC highway scenario dataset was created for the project SMILE II by
QRTECH AB, and consists of images generated in the simulator ESI Pro-SiVIC™
[23]. The dataset is divided into three scenarios with different types of conditions:
a highway in sunny weather, the same highway in heavy fog, and an urban setting
in sunny weather. Example images from each of the three conditions can be seen in
Fig. 2.14.

22


2. Theory

(a) Images from the Pro-SiVIC highway scenario dataset. The scenario is defined by sunny weather and a highway
landscape.

(b) Images from the Pro-SiVIC highway scenario dataset. The scenario is defined by foggy weather and a highway
landscape.

(c) Images from the Pro-SiVIC highway scenario dataset. The scenario is defined by sunny weather and an urban
landscape.

Figure 2.14: Example images from the Pro-SiVIC highway scenario dataset.

23


2. Theory

24


3
Literature Review

This section presents the literature review which was conducted to get an under-
standing of the current state-of-the-art in ND in image data, and to identify the most
suitable ND algorithms for evaluation in this thesis project. First, we present the
applied method for finding and selecting ND algorithms and self-driving datasets.
Afterwards, the results of the review are presented.

3.1 Methodology
The methodology for the literature review is presented in the following order: first,
the principle for finding and selecting articles to read is presented. Then, the criteria
for selecting algorithms for reimplementation and criteria for selecting datasets for
evaluation are presented.

3.1.1 Finding Articles
The first articles were found through a keyword search, such as "novelty detection",
using the search engine Google Scholar. Articles were chosen based on title and
filtered after reading the abstracts. After fully reading the first round of articles,
an associative search method was mainly used, inspired by [24]: new articles were
found through association with those already analyzed, i.e., by being references in
read articles or by appearing in the related or recommended articles section of the
database page of a read article. This process replaced the earlier keyword search,
and new articles were again filtered, first by title and then by abstract. During the
reading process, other relevant keywords were encountered, such as synonyms for
ND. The process would then start over with keyword search. Throughout the article
selection process, filtering was based on the criteria in Section 3.1.2, where all articles
were given the benefit of the doubt: if the article was not obviously irrelevant, it was
deemed potentially relevant and would go on to the next stage.

The search for new articles was concluded when all of the articles selected through
associative search had already been processed before, meaning they had either been
read or discarded, indicating that the scientific field in question had been examined
to a such an extent that publications presenting current state-of-the-art algorithms
were unlikely to have been overlooked.

3.1.2 Ground criteria for algorithm selection
Algorithms were selected for reimplementation subject to the following criteria:

25


3. Literature Review

A1 Each selected algorithm must yield good results, in terms of area under receiver
operating characteristic (AUROC) (see Section 4.2.3), for at least one well-
known image dataset.

A2 Each selected algorithm must have source code readily available and licensed
for use in research purposes.

A3 No selected algorithm should require the normal class to have labeled sub-
classes: they should be able to model a single normal class.

A4 The number of selected algorithms should be at least 3, so that a comparison
can be made.

A5 The number of selected algorithms should not be too large so as to allow for
the selected algorithms to be reimplemented and tested within the limited
time of the project.

3.1.3 Ground criteria for dataset selection
Self-driving datasets were selected for use in this thesis subject to the following
critera:
D1 All selected datasets must be image datasets for self-driving applications, with

images taken in the forward direction of a vehicle in road traffic.
D2 Each selected dataset must have metadata available, e.g., weather conditions,

such that the data can be divided into at least one normal class and at least
one novelty class.

D3 Each dataset should represent a different level of novelty detection difficulty,
i.e., if more than one dataset is selected they should have different variability
in the elements of the normal class.

D4 It is preferable if all selected datasets can be evaluated with similar differences
between the normal scenarios and the novelty scenarios.

D5 The number of selected datasets should not be larger than such that all selected
algorithms can be tested with each dataset within the limited time of the
project.

3.2 Results of Literature Review
The implemented method for finding relevant articles, presented in Section 3.1.1,
resulted in a total of 17 which were deemed relevant enough for this thesis. In this
section, they are presented as follows: first, there is a summary of the read mate-
rial. Then, the three algorithms selected for reimplementation and experiments are
explained in more detail. Finally, there are some remarks on interesting algorithms
which, for various reasons, were dismissed.

3.2.1 Summary of Current State-of-the-art Novelty Detec-
tion

The latest complete review of the field of ND was made in 2014 by Pimentel et al.
[25]. The majority of papers reviewed here were published later, however it serves

26


3. Literature Review

as a basis for understanding the broader field of ND. In [25], the authors sort ND
algorithms into 6 different categories: probabilistic, distance-based, reconstruction-
based, domain-based and information-theoretic ND. They further state that in the
application domain of image processing, there exist algorithms in all but the last of
these categories.

Since we concern ourselves only with ND in image data, the most common ap-
proach among those investigated is to use a CAE for feature extraction in some way.
This includes both using regular CAEs [11, 26, 27, 28, 29] and using generative ad-
versarial networks with convolutional encoder and generator [30, 31, 32, 33]. Other
approaches use non-convolutional deep AEs, either by preprocessing the image data
and thereby reducing its dimensionality [34] or by using the entire image in a fully
connected AE [35]. Some algorithms use no AE at all, but instead use CNNs for
feature extraction, either with transfer learning from pre-trained models [36, 37], or
by only working on normal classes with subclasses in their original implementation
[38, 39].

The main difference between algorithms with similar feature extraction methods
is how the trained ANN is used for assigning novelty scores to testing inputs. Com-
mon ways of doing this are using CAE reconstruction error, using the discriminator
output in a generative adversarial network, or applying a separate OCC algorithm
to the feature space representation of a CAE.

In [40], the authors compare 20 different VAE statistics as novelty scores, and
show that though some metrics yield higher AUROC than the most typical one,
which is the output layer reconstruction error, there is no large variation between
the top 10 AUROC scores, which are all in the range [0.86, 0.881]. Although for a
VAE and not a regular CAE, it indicates that for the same trained model, different
internal statistics of the model contain the same amount of information about the
normal class features.

3.2.2 Algorithms Selected for Reimplementation
The algorithms selected for reimplementation and evaluation in this thesis are pre-
sented and motivated below. Each algorithm is presented under an acronym, pri-
marily the one used by the original authors. If there was no such acronym, one has
been devised here.

Adversarially learned one-class classifier for novelty detection (ALOCC)

Sabokrou et al. [32] train an adversarial autoencoder (AAE), which means the train-
ing objective is a weighted combination between reconstruction loss and an adver-
sarial loss. The setup consists of two networks; the autoencoder, denoted R, and a
discriminator D. R maps the input image x as

R : x̃ = (x ∼ pt) + (η ∼ Nσ) −→ x′ ∼ pt, (3.1)

where Nσ = N (0, σ2I) is normally distributed noise with zero mean and variance
σ2. Note that the reconstructed image x′ is mapped to the same distribution pt as

27


3. Literature Review

the input x, making R a denoising AE. The discriminator D maps the output of R
as

D : R(x̃) −→ p ∈ (0, 1), (3.2)
min
R

max
D
LR+D, (3.3)

where
LR+D = Ex∼pt [log (D(x))] + Ex̃∼pt∗Nσ [log (1−D(R(x̃)))]. (3.4)

An intuitive way to explain (3.4) is that both terms train D to distinguish training
images x ∼ pt from reconstructed images R(x̃), since outputting D(x) = 1 and
D(R(x̃)) = 0 maximizes (3.4). The R network is only affected by the second term,
where (3.4) is minimized for D(R(x̃)) = 1, meaning that R successfully tricks D
that R(x̃) belongs to the training dataset distribution pt.

The R network is also trained with a reconstruction loss

LR = ||x− x′||2 (3.5)

giving the complete training objective

min
R

{
max
D
LR+D + λLR

}
(3.6)

where λ > 0 is a tradeoff hyperparameter.
During testing of an image x̄, the normal class likelihood Snormal(x) = D(R(x̄))

is used to detect novelties. Note that since D outputs higher values for normal class
images, the score is a normalcy score, so that low scores signify higher probability
of x̄ being a novelty.

In the original paper [32], the algorithm is tested on the MNIST database as well
as the Caltech-256 Object Category Dataset, both presented in Section 2.7.1. The
authors compare D(x̄) and D(R(x̄)) as scoring functions, obtaining AUROC= 0.932
and AUROC= 0.942 respectively for the Caltech dataset, using one object class as
normal class and 50% novelties, sampled from the 257th class, "clutter".

Deep support vector data-description (DSVDD)

Ruff et al. [11] pre-train a CAE for learning normal class features, and then use the
encoder network as initialization for a CNN used for feature extraction. A SVDD
classifier is attached to the final layer of the CNN to perform ND, and the resulting
method is called DSVDD. The two parts of the new network are then trained jointly,
with objectives designed to optimize network parameters W so that the CNN learns
to map samples x from the normal class into a hypersphere of radius R. Two algo-
rithms are proposed: soft-boundary DSVDD and one-class DSVDD. Soft-boundary
DSVDD has the objective

min
R,W

{
R2 + 1

νn

n∑
i=1

max{0, ‖φ(xi;W )− c‖2 −R2}+ λ

2

L∑
l=1
‖Wl‖2

F

}
, (3.7)

and one-class DSVDD has the objective

min
W

{
1
n

n∑
i=1
|φ(xi;W )− c‖2 + λ

2

L∑
l=1
‖Wl‖2

F

}
. (3.8)

28


3. Literature Review

In both (3.7) and (3.8), || · ||F denotes the Frobenius norm

‖Wl‖F =
√∑

i,j

W l
ij

2
, (3.9)

i.e., the root square sum of all network parameters.
The soft-boundary DSVDD objective (3.7) optimizes parameters W and hyper-

sphere radius R jointly. The first term aims to minimize the hypersphere volume.
The second term penalizes the network for all points lying outside of the sphere, since
the max operator sets the second term to zero for all points within the hypersphere.
The hyperparameter ν controls the tradeoff between the two terms.

The one-class DSVDD objective (3.8) penalizes the distance of any feature point
φ(x;W ) to the center c, implicitly minimizing the radius of the smallest hypersphere
enclosing all feature space representations. In both (3.7) and (3.8), the last term is
a regularizer with hyperparameter λ, serving as a type of weight decay.

In the source code [41] for DSVDD algorithm, there is the option to optimize
both (3.7) and (3.8) w.r.t. to the center c as well as R and W . However, the authors
recommend not to do so, since it increases the risk for what they call hypersphere
collapse: when the network learns the trivial solution to set R = 0 and a constant
mapping φ(x;W ) = c0 for any x.

The novelty score of a new input x̄ is assigned in a similar way for both soft-
boundary DSVDD and one-class DSVDD. For one-class DSVDD, it is simply the
distance from the feature space point to the hypersphere center:

Snovelty(x̄) = s(x̄) = ||φ(x̄;W )− c||2. (3.10)

For soft-boundary DSVDD, the novelty score is set as Snovelty(x̄) = s(x̄) − R, as
to get negative scores for normal class inputs and positive scores for novelties. In
the original paper [11], the algorithm is benchmarked on, among others, the MNIST
database and the CIFAR-10 dataset, see Section 2.7.1. For both datasets, one class
at a time is used as normal class with samples from the all other classes used as
novelties. For MNIST, the average AUROC is 0.935 and 0.948 for the soft-boundary
and one-class methods, respectively, while the corresponding values for CIFAR-10
are 0.633 and 0.648.

Generative probabilistic novelty detection (GPND)

Pidhorskyi et al. [30] propose a probabilistic novelty score based on learning fea-
tures of the normal class in an AAE. It is assumed that all normal class samples
xi ∈ Rm, i = 1, . . . , N , are sampled from a manifold M of dimension n < m, such
that

xi = f(zi) + ξi, (3.11)
where zi ∈ Ω ⊂ Rn and ξi denotes noise. The manifoldM is then defined by

M≡ f(Ω). (3.12)

The authors further assume that f : Ω→ Rm is smooth and invertible with inverse
g : Rm → Rn such that xi = f(g(xi)), i = 1, . . . , N . By linearizing f onM, using a

29


3. Literature Review

first order Taylor expansion, they express the probability pX(x̄) that a new input x̄
is sampled fromM in terms of entities whose computation only require numerical
estimates of f and g. The full derivation of this probability estimation will not be
covered here, but can be found in the original paper [30].

The mappings f and g are approximated using an AAE: the encoder network
mapping of input x to latent representation z represents g, while the decoder network
represents f . The encoder-decoder network is also trained in a way similar to a
VAE, that is, a prior distribution is imposed on the latent space Ω. In this case it
is a standard normal distribution N (0, 1). In addition to the VAE objective, the
adversarial setup consists of two discriminators, Dz and Dx, discriminating upon
the latent space representation and the reconstructed image, respectively. The loss
Ladv−dz for Dz is defined as

Ladv−dz(x, g,Dz) = E [log (Dz(N (0, 1)))] + E [log (1−Dz(g(x)))] . (3.13)

Minimizing (3.13) w.r.t. the parameters of g trains g to map x onto z following
the prior distribution N (0, 1). Maximizing it w.r.t. the parameters of Dz trains
Dz to distinguish between the mappings of x and random samples from the prior
distribution.

The loss Ladv−dx for Dx is defined as

Ladv−dx(x, f,Dx) = E [log (Dx(x))] + E [log (1−Dx(f(N (0, 1))))] . (3.14)

Minimizing (3.14) w.r.t. the parameters of f trains f to map samples from N (0, 1)
to reconstructed images that resemble the input images x. Maximizing it w.r.t. the
parameters of Dx trains Dx to distinguish between the generated images f(N (0, 1))
and x.

To approximate the manifoldM well, a reconstruction loss Lerror is imposed on
g and f :

Lerror(x, g, f) = E [BCEimage(x, f(g(x)))] , (3.15)
where

BCEimage(x, f(g(x))) =
m∑
j=1

BCE(yj, pj), (3.16)

where yj and pj are data points in the images x and f(g(x)), respectively, and
BCE(pj, yj) is given by (2.9).

The AAE is trained using stochastic gradient descent, updating the networks in
the following order:

1. Maximize Ladv−dx w.r.t. parameters of Dx.
2. Minimize Ladv−dx w.r.t. parameters of f .
3. Maximize Ladv−dz w.r.t. parameters of Dz.
4. Minimize Ladv−dz + λLerror w.r.t. parameters of g and f ,

where λ is a hyperparameter controlling the tradeoff between adversarial loss and
reconstruction loss.

After completed training, the approximations of f and g are used to estimate
pX (x̄), which is in turn used as the normalcy score Snormal(x) for any input x̄. In the
original paper [30], the algorithm is benchmarked on the MNIST, Fashion-MNIST
and COIL-100 datasets. AUROC results, using one object class as normal class

30


3. Literature Review

and 50% novelties sampled from the other classes, are 0.932 for MNIST, 0.901 for
Fasion-MNIST, and 0.968 for COIL-100.

Dismissed algorithms

All ND articles and related algorithms from the literature review were judged based
on the conditions presented in Section 3.1.2. The algorithms selected for reimple-
mentation were mainly selected for their source code availability and their relative
difference in approach. Several [27, 28, 31, 35, 36, 37, 38] of the considered algorithms
met most of the conditions on individual algorithms, and are worth investigating in
future work.

3.2.3 Datasets Selected for the Evaluations
Three datasets were considered for use in experiments in this thesis: Berkeley Deep-
Drive, Dr(eye)ve and the Pro-SiVIC highway scenario dataset, all described in Sec-
tion 2.7.2.

The datasets chosen for ND experiments were the Dr(eye)ve dataset and the Pro-
SiVIC highway scenario dataset. The Pro-SiVIC highway scenario dataset provides
a simple test case, with relatively low scene variation even though elements such as
vehicles, bridges and buildings on the side of the road gives the scenes some complex-
ity. The Dr(eye)ve dataset provides an increase in scene variation and complexity,
since it consists of real world images and the images are captured across different
runs, in different locations.

Dismissed datasets

The time frame of this thesis did not allow for three datasets to be used, specified
in the final criterion in Section 3.1.3. The Berkeley DeepDrive database was omit-
ted from experiments because it was deemed, through initial testing, to present a
higher level of difficulty than the two selected datasets. The increased difficulty is
likely caused by the number of runs used for the different datasets: a single run
for the Pro-SiVIC highway scenario dataset, 74 runs for the Dr(eye)ve dataset, and
100 000 runs for the Berkeley DeepDrive dataset. The high number of runs in Berkely
DeepDrive causes an increase in the variation within the training dataset due to,
e.g., an increased number of landscape types and a larger variation in the camera
angle.

31


3. Literature Review

32


4
Novelty Detection Experiments

This chapter presents all the experiments performed with the algorithms selected in
Section 3.2.2 and datasets selected in Section 3.2.3. First, the reimplementation of
the selected algorithms is detailed. Then, the experimental setup is outlined. Finally,
the results of all experiments are reported.

4.1 Reimplementation of Selected Algorithms
All selected algorithms have full source code available online, which greatly facili-
tated reimplementation. Each algorithm was implemented using the original source
code, with modifications. All three algorithms were originally implemented in Python,
but in different Python versions and different machine learning (ML) frameworks,
shown in Table 4.1. Any resulting differences regarding model performance, given
identical ANN architectures and optimization settings, were assumed to be negligi-
ble.

Since all the selected algorithms were originally implemented for MNIST and
datasets with images of approximately the same size, the main modification was
to adapt each of them to larger input images. Since larger images contain more
information, CAE architectures with more convolutional layers were needed in order
to extract enough relevant features from the training datasets.

For all three selected algorithms a method was implemented, which enabled to
easily change the number of convolutional layers and filters in each CAE, in order
to evaluate how much different hyperparameters affected the ability of the CAEs to
encode meaningful latent representations of the normal class samples. Since training
a CAE until convergence with the full datasets took on a timescale of hours, a
subset of 100 images was used for testing different architectures. The investigated
hyperparameters were:

• Convolutional filter size: k ∈ {4, 5}.
• Number of convolutional layers: 2 ≤ nconv ≤ 7.
• Number of fully connected layers: nfc ∈ {0, 1}.
• Number of filters in the first convolutional layer: c1 ∈ {8, 16, 32, 64}.
• Dimensionality of the latent representation: cz ∈ {256, 512, 1 024, 2 048}.

Table 4.1: Programming frameworks used in the implementations of the evaluated algorithms

Algorithm Python version ML framework
ALOCC 3.5.2 Keras 2.2.4
DSVDD 2.7.12 Lasagne 0.2.dev1
GPND 3.5.2 Torch 0.4.1

33


4. Novelty Detection Experiments

Table 4.2: Number of images in the dataset splits. The two values in the Pro-SiVIC highway scenario test set refer
to the novelty scenarios weather/landscape

Dataset Pro-SiVIC highway scenario Dr(eye)ve
Number of training images 6 785 6 000
Number of validation images 840 600
Number of test normal samples 500/488 600
Number of test novelties 500/488 600

• Learning rate: η ∈ {0.0001, 0.001, 0.01}.
The number of convolutional layers nconv and fully connected layers nfc refer to either
of the encoder and the decoder, meaning that the whole CAE had nconv convolutional
layers, 2nfc fully connected layers, and nconv transposed convolution layers. Settings
which were kept constant are 2×2 stride, to get feature map dimensionality reduction
in each convolutional layer, and that the number of filters was doubled for each new
convolutional layer: e.g., with nconv = 3 and c1 = 8 the number of filters in each of the
layers would be {8, 16, 32}. A full grid search was not done, as the hyperparameter
options listed above yield 1152 combinations. Instead, one hyperparameter at a
time was varied while keeping the others constant. Some combinations, such as
nconv = 7, c1 ≥ 32, yielded models too large to fit into the memory of the used
hardware (see Section 4.2.4), and could not be evaluated. It was further assumed
that the feature extraction capability of a CAE model with a certain architecture
would be the same for all three algorithm implementations, so that testing different
architectures for only one of them would suffice to determine a common architecture
to be used with all three. The implementation used for this was ALOCC, because
of the user-friendly functionality for changing the CAE architecture provided by the
Keras ML framework.

For each hyperparameter setting, a CAE was trained until the average recon-
struction error over the training data subset was no longer decreasing. Using the
training set reconstruction error for this increases the risk for overfitted models.
However, this risk was deemed low in practice, based on initial experiments with
the DSVDD algorithm, where the average error for the validation and training sets
started diverging after a considerably larger number of epochs than what was used
in the experiments presented here. Settings to be used for the presented experiments
were chosen using a tradeoff between the number of epochs needed for convergence
and the lowest reconstruction error reached: a model with more layers and filters
could, in theory, extract and reconstruct more complex and detailed features due to
the increased number of neurons, but the weight search space is larger and it might
not be practically feasible to find the optimal weight configuration.

4.2 Experimental Setup

4.2.1 Dataset Preparation

The number of images in the training, validation and testing subsets of each dataset
are shown in Table 4.2. Below is a description of how images were arranged into
normal and novelty sets for the respective datasets.

34


4. Novelty Detection Experiments

Table 4.3: Attributes used for splitting the Dr(eye)ve dataset into normal scenario and novelty scenarios, and the
videos each set of images was sampled from

Scenario Weather Landscape Sampled videos
Normal Sunny Higway or countryside 23, 25, 34, 45, 55
Novel weather Rainy Highway or countryside 14, 17, 22, 31, 32, 37, 44,

50, 63
Novel landscape Sunny Downtown 06, 40, 65

Pro-SiVIC highway scenario dataset splits

Since the Pro-SiVIC highway scenario dataset was created for the SMILE II project,
the normal and novelty classes were designed specifically for the purpose of novel
scenario detection. The normal class is set on a highway in sunny weather conditions.
The two types of novelty scenarios are, in relation to the normal class:

1. same landscape, but with foggy weather conditions,
2. same weather conditions, but in urban conditions.

Example images from the normal class and the two novelty scenarios are shown in
Fig. 4.1.

Dr(eye)ve dataset splits

To get a normal class scenario and novel scenarios which are similar for both datasets,
a subset of the videos in the Dr(eye)ve dataset were selected based on the metadata
attributes provided: time of day, weather and landscape.

Since labels for time of day were "morning", "evening" and "night", while the re-
quirement for these experiments were only for all images to be in daylight conditions,
an extra filtering of data was required. The relatively low number of videos, 74, in
the Dr(eye)ve dataset, allowed for manual inspection of all videos. This resulted in
a relabeling of each video as either having daylight conditions, eligible for use in this
thesis, or being too dark and consequently left out of experiments. Out of the videos
with daylight conditions, sets of weather and landscape attributes were chosen as
similar as possible to the scenarios for the Pro-SiVIC highway scenario dataset. The
attributes chosen for the different data subsets are shown in Table 4.3. Each of the
5 minute videos selected for the respective scenarios was then sampled at a rate of
5 frames/s. The images extracted from the normal class videos were randomly sorted
into training, validation and testing splits. When the number of videos matching the
attribute description of a scenario was so large that 5 frames/s sampling generated
more images than required, a subset of the extracted frames was randomly selected.
Example images from each of the three scenarios can be seen in Fig. 4.2.

Common preprocessing of both datasets

Images in both datasets were preprocessed by resizing to 256 × 256 pixels using
OpenCV’s resize function with INTER_AREA interpolation option, and rescaling
all pixel values to the range [0, 1]. The image size 256× 256 was chosen as a tradeoff
between reducing the input dimensionality while still keeping much of the detail in
the images.

35


4. Novelty Detection Experiments

(a) Resized images from the Pro-SiVIC highway scenario dataset normal scenario.

(b) Resized images from the Pro-SiVIC highway scenario dataset novel weather scenario.

(c) Resized images from the Pro-SiVIC highway scenario dataset novel landscape scenario.

Figure 4.1: Resized image samples from the normal class and the two novelty scenarios in the Pro-SiVIC highway
scenario dataset.

36


4. Novelty Detection Experiments

(a) Resized images from the Dr(eye)ve dataset normal scenario.

(b) Resized images from the Dr(eye)ve dataset novel weather scenario.

(c) Resized images from the Dr(eye)ve dataset novel landscape scenario.

Figure 4.2: Resized image samples from the normal class and the two novelty scenarios in the Dr(eye)ve dataset.

37


4. Novelty Detection Experiments

16x16x128

128x128x16

64x64x32

32x32x64

256x256x3

8x8x256

512

8x8x256

16x16x128

32x32x64

64x64x32

128x128x16

256x256x3

B B C C C C D

ConvTranspose2D, 5x5 filter, 2x2 stride + BatchNorm + LeakyRelu
ConvTranspose2D, 5x5 filter, 2x2 stride + Logistic

Conv2D, 5x5 filter, 2x2 stride + BatchNorm + LeakyRelu
Fully connected + BatchNorm + LeakyRelu

C
B
A

D

AAAAA

(a) Pro-SiVIC highway scenario CAE architecture.

4x4x512 4x4x512

16x16x128

128x128x16
64x64x32

32x32x64

256x256x3

8x8x256
512

8x8x256
16x16x128

32x32x64
64x64x32

128x128x16
256x256x3

B B C C C C C DA

ConvTranspose2D, 5x5 filter, 2x2 stride + BatchNorm + LeakyRelu
ConvTranspose2D, 5x5 filter, 2x2 stride + Logistic

Conv2D, 5x5 filter, 2x2 stride + BatchNorm + LeakyRelu
Fully connected + BatchNorm + LeakyRelu

C
B
A

D

AAAAA

(b) Dr(eye)ve CAE architecture.

Figure 4.3: CAE architectures used for experiments with the respective datasets. In both cases, an extra batch
normalization layer was added after the output layer for the ALOCC algorithm, as this proved to improve results.

4.2.2 Optimization of Novelty Detection Models
For each dataset and algorithm, CAE optimization was performed with the algo-
rithm specific optimization objective, presented in Section 3.2.2, and dataset specific
architecture and training settings, which are listed in Table 4.4. The CAE architec-
tures used for the two datasets are also depicted in Fig. 4.3. The discriminator
architectures used with the ALOCC and GPND are shown in Figs. 4.4–4.5. For the
GPND algorithm, the architecture is identical to the CAE encoder network for the
respective datasets, except for the output being a single scalar. For the ALOCC
algorithm, an additional convolutional layer is used. The reason for the difference is
that in the original implementations, the GPND discriminator has the same depth
as the CAE encoder, while the ALOCC discriminator has one more convolutional
layer.

4.2.3 Evaluation metrics
For evaluating ND experiments as a type of OCC, we define the prediction novelty
to be a positive prediction in this thesis, since that is what the algorithms are
aiming to detect. Consequently, normal class membership is labeled as a negative
prediction. For each input x in a testing set, each algorithm assign a novelty score
Snovelty(x). For the two algorithms which in their original implementation output a

38


4. Novelty Detection Experiments

4x4x512

16x16x128

128x128x16
64x64x32

32x32x64

256x256x3

8x8x256
1

BAAAAAA

Conv2D, 5x5 filter, 2x2 stride + BatchNorm + LeakyRelu
Fully connected + BatchNorm + LogisticB

A

BA

2x2x1024

A

4x4x512

16x16x128

128x128x16
64x64x32

32x32x64

256x256x3

8x8x256
1

AAAAA

Figure 4.4: Architectures used for the discriminators in the ALOCC algorithm. Left: architecture for the Pro-SiVIC
highway scenario dataset. Right: architecture for the Dr(eye)ve dataset.

Conv2D, 5x5 filter, 2x2 stride + BatchNorm + LeakyRelu
Fully connected + LogisticB

A

16x16x128

128x128x16
64x64x32

32x32x64

256x256x3

8x8x256
1

BAAAAA

4x4x512

16x16x128

128x128x16
64x64x32

32x32x64

256x256x3

8x8x256
1

BAAAAAA

Figure 4.5: Architectures used for the discriminators in the GPND algorithm. Left: architecture for the Pro-SiVIC
highway scenario dataset. Right: architecture for the Dr(eye)ve dataset.

Table 4.4: CAE architectures and training settings

Architecture
Dataset Pro-SiVIC highway scenario Dr(eye)ve
k 5 5
s 2 2
nconv 5 6
nfc 1 1
c1 16 16
cz 512 512

Training settings
Dataset Pro-SiVIC highway scenario Dr(eye)ve
η 0.001 → 0.0001 0.001 → 0.0001
Epochs 500 500
η change epoch 250 250
Mini-batch size 64 64
Optimizer adam adam
Weight initialization xavier uniform xavier uniform

39


4. Novelty Detection Experiments

Table 4.5: A confusion matrix, showing the relation between classifier prediction and true value of normal class
membership

Prediction
Positive Negative

Actual
value

Positive True positive False negative
Negative False positive True negative

normalcy probability Snormal(x), which is higher for normal samples, we simply use
the probability complement as novelty score, so that Snovelty(x) = 1−Snormal(x). For
a given threshold τ , all inputs for which Snovelty(x) ≥ τ are classified as novelties,
i.e., a positive result. All inputs for which Snovelty(x) < τ are classified as normal,
i.e., a negative result. This allows us to define:

• true positives (TP): number of actual novelties correctly classified as novelties,
• true negatives (TN): number of actual normal samples correctly classified as

normal,
• false positives (FP): number of actual normal samples wrongly classified as

novelties,
• false negatives (FN): number of actual novelties wrongly classified as normal.
The total number of each of the above can be represented in the confusion matrix

of a classifier, and is shown in Table 4.5. From this, we can define several other
classification metrics. The true positive rate (TPR), also called recall, is defined
as

TPR = TP
TP + FN , (4.1)

which in this context means the fraction of all actual novelties that were correctly
detected. The false positive rate (FPR) is defined as

FPR = FP
FP + TN , (4.2)

which in this context means the fraction of all actual normal samples that were
wrongly classified as novelties. The precision is defined as

precision = TP
TP + FP , (4.3)

which in this context means the fraction of all inputs classified as novelties that are
actually novelties.

Results for all experiments in this thesis are presented in Section 4.3, using
two types of binary classification evaluation curves: receiver operating character-
istic (ROC) and precision-recall curve (PRC), as well as their corresponding area
under curve measures: AUROC for the ROC and area under precision-recall curve
(AUPRC) for the PRC.

The ROC curve is obtained by plotting the TPR against the FPR of an experi-
ment for all possible threshold levels τ for the novelty score Snovelty(x). Similarly, the
PRC is obtained by plotting the precision against the recall for all possible values
of τ . The AUROC measure can be viewed as the average probability that a posi-
tive sample, in this case a novelty, is also classified as a positive. This means that
a perfect classifier will yield AUROC = 1, while the opposite case, predicting all

40


4. Novelty Detection Experiments

positives as negatives and vice versa, will yield AUROC = 0. A random baseline
classifier, classifying any input as positive or negative with equal probability, would
yield AUROC = 0.5.

Since the AUROC measure is independent of the threshold τ , it is convenient for
comparison of different classifiers. A drawback with ROC is that the curve remains
unchanged for unbalanced datasets, i.e., when the number of negative samples N is
significantly larger than number of positive samples P , or vice versa. For a high N ,
the lowest threshold τ , corresponding to correctly classifying all positives, meaning
FN = 0 and thereby TPR = 1, might still lead to a relatively low FP, resulting in
a low FPR.

In such cases, the AUPRC measure is more suitable. AUPRC is also threshold
independent, and furthermore, the baseline AUPRC value of a random classifier is
equal to the fraction P/(P + N). Since P = N in this thesis, the baseline value is
0.5 for both AUROC and AUPRC.

Results are also presented as histograms of the novelty scores for each experiment.
Histograms allow a more close examination of the difference between the scores for
normal samples and the scores for novelty samples.

4.2.4 Hardware
All experiments, including training and testing of all models, were performed using
a single Nvidia GeForce GTX 1080 Ti with 11GB memory.

4.3 Experimental Results
Results for all experiments, in terms of AUROC and AUPRC, are shown in Ta-
ble 4.6. Plots of ROC curves and PRCs as well as histograms of Snovelty are shown
in separate subsections for each combination of dataset and implemented algorithm.
For visibility, ROC curves and PRCs are also plotted in separate windows for each
type of novel scenario. In each plot, the classification performance using the respec-
tive AE reconstruction error as novelty score is shown, in addition to the algorithm
specific novelty score. For the DSVDD algorithm, both the soft-boundary and one-
class scores are provided. Note that the vertical scale is logarithmic in all histograms,
to show scores of low frequency more clearly. To enable comparison between exper-
iments, the novelty scores for each experiment were scaled and shifted to the range
[0, 1] before the creation of the corresponding histogram.

4.3.1 Results for Experiments on the Pro-SiVIC Highway
Scenario Dataset

ALOCC

Classification results for the ALOCC algorithm on the Pro-SiVIC highway scenario
dataset are shown in Figs. 4.6–4.7 for the novel weather scenario and Figs. 4.8–4.9
for the novel landscape scenario. For the novel weather scenario, the ALOCC nov-
elty score D(R(x)) yields almost perfect separation of the two classes, with AUROC

41


4. Novelty Detection Experiments

Table 4.6: ND performance metrics for all experiments

Dataset Pro-SiVIC highway scenario Dr(eye)ve
Outlier type Weather Landscape Weather Landscape

Algorithm
Metric AUROC AUPRC AUROC AUPRC AUROC AUPRC AUROC AUPRC

ALOCC D(R(x)) 0.998 0.999 0.999 0.999 0.498 0.748 0.507 0.741
ALOCC AAE 0.330 0.443 0.999 0.999 0.560 0.519 0.705 0.641
DSVDD soft-boundary 0.970 0.904 0.994 0.992 0.808 0.747 0.781 0.679
DSVDD one-class 0.977 0.926 0.992 0.990 0.807 0.747 0.781 0.680
DSVDD CAE 0.969 0.906 0.997 0.996 0.748 0.671 0.948 0.921
GPND pX(x) 0.955 0.954 0.021 0.308 0.427 0.434 0.385 0.434
GPND AAE 0.516 0.531 0.543 0.530 0.514 0.499 0.486 0.497

(a) ROC curves. (b) PRCs.

Figure 4.6: ROC and PRCs for the ALOCC algorithm, on the Pro-SiVIC highway scenario dataset, with unseen
weather as novelties.

(a) Histograms displaying the distributions of
D(R(x)) scores for normal class and novelties.

(b) Histograms displaying the distributions of AAE
scores for normal class and novelties.

Figure 4.7: Histograms of novelty scores for the ALOCC algorithm, on the Pro-SiVIC highway scenario dataset,
with unseen weather as novelties.

42


4. Novelty Detection Experiments

(a) ROC curves. (b) PRCs.

Figure 4.8: ROC and PRCs for the ALOCC algorithm, on the Pro-SiVIC highway scenario dataset, with unseen
landscape as novelties.

(a) Histograms displaying the distributions of
D(R(x)) scores for normal class and novelties.

(b) Histograms displaying the distributions of AAE
scores for normal class and novelties.

Figure 4.9: Histograms of novelty scores for the ALOCC algorithm, on the Pro-SiVIC highway scenario dataset,
with unseen landscape as novelties.

43


4. Novelty Detection Experiments

(a) ROC curves. (b) PRCs.

Figure 4.10: ROC and PRCs for the DSVDD algorithm, on the Pro-SiVIC highway scenario dataset, with unseen
weather as novelties.

= 0.998 and AUPRC = 0.999. The distributions of AAE reconstruction errors are
somewhat overlapping, yielding lower errors than the normal class for some novelties
and higher errors for others. This is also reflected in the classification metrics AU-
ROC = 0.330 and AUPRC = 0.443, which is worse than a random classifier.

For the novel landscape scenario, both the D(R(x)) score and reconstruction
error yield near perfect separation of novelties.

DSVDD

Classification results for the DSVDD algorithm on the Pro-SiVIC highway scenario
dataset are shown in Figs. 4.10–4.11 for the novel weather scenario and Figs. 4.12–4.13
for the novel landscape scenario. The three score types, soft-boundary DSVDD, one-
class DSVDD and CAE reconstruction error, yield very similar score distributions
for the novel weather scenario, with the one-class classifier having a slight edge in
terms of both AUROC and AUPRC. The separation is better in the novel landscape
scenario for all three score types, but instead the CAE reconstruction error yields
higher AUROC and AUPRC.

GPND

Classification results for the GPND algorithm on the Pro-SiVIC highway scenario
dataset are shown in Figs. 4.14–4.15 for the novel weather scenario and Figs. 4.16–4.17
for the novel landscape scenario. For the novel weather scenario, the GPND novelty
score pX(x) yields almost completely separated distributions, with AUROC= 0.955
and AUPRC = 0.954. The AAE reconstruction error yields heavily overlapping con-
tributions, resulting in AUROC and AUPRC near those of a random classifier.

For the novel landscape scenario, the pX(x) score again yields good separation
of distributions, however scoring most novelties lower than normal class samples,
yielding AUROC = 0.021. The reconstruction error contains very little information
also in this case, resembling a random classifier.

44


4. Novelty Detection Experiments

(a) Histograms displaying the distributions of soft-
boundary DSVDD scores for normal class and nov-
elties.

(b) Histograms displaying the distributions of one-
class DSVDD scores for normal class and novelties.

(c) Histograms displaying the distributions of CAE
scores for normal class and novelties.

Figure 4.11: Histograms of novelty scores for the DSVDD algorithm, on the Pro-SiVIC highway scenario dataset,
with unseen weather as novelties.

(a) ROC curves. (b) PRCs.

Figure 4.12: ROC and PRCs for the DSVDD algorithm, on the Pro-SiVIC highway scenario dataset, with unseen
landscape as novelties.

45


4. Novelty Detection Experiments

(a) Histograms displaying the distributions of soft-
boundary DSVDD scores for normal class and nov-
elties.

(b) Histograms displaying the distributions of one-
class DSVDD scores for normal class and novelties.

(c) Histograms displaying the distributions of CAE
scores for normal class and novelties.

Figure 4.13: Histograms of novelty scores for the DSVDD algorithm, on the Pro-SiVIC highway scenario dataset,
with unseen landscape as novelties.

(a) ROC curves. (b) PRCs.

Figure 4.14: ROC and PRCs for the GPND algorithm, on the Pro-SiVIC highway scenario dataset, with unseen
weather as novelties.

46


4. Novelty Detection Experiments

(a) Histograms displaying the distributions of
pX(x) scores for normal class and novelties.

(b) Histograms displaying the distributions of AAE
scores for normal class and novelties.

Figure 4.15: Histograms of novelty scores for the GPND algorithm, on the Pro-SiVIC highway scenario dataset,
with unseen weather as novelties.

(a) ROC curves. (b) PRCs.

Figure 4.16: ROC and PRCs for the GPND algorithm, on the Pro-SiVIC highway scenario dataset, with unseen
landscape as novelties.

(a) Histograms displaying the distributions of
pX(x) scores for normal class and novelties.

(b) Histograms displaying the distributions of AAE
scores for normal class and novelties.

Figure 4.17: Histograms of novelty scores for the GPND algorithm, on the Pro-SiVIC highway scenario dataset,
with unseen landscape as novelties.

47


4. Novelty Detection Experiments

(a) ROC curves. (b) PRCs.

Figure 4.18: ROC and PRCs for the ALOCC algorithm, on the Dr(eye)ve dataset, with unseen weather as novelties.

(a) Histograms displaying the distributions of
D(R(x)) scores for normal class and novelties.

(b) Histograms displaying the distributions of AAE
scores for normal class and novelties.

Figure 4.19: Histograms of novelty scores for the ALOCC algorithm, on the Dr(eye)ve dataset, with unseen weather
as novelties.

4.3.2 Results for Experiments on the Dr(eye)ve Dataset

ALOCC

Classification results for the ALOCC algorithm on the Dr(eye)ve dataset are shown
in Figs. 4.18–4.19 for the novel weather scenario and Figs. 4.20–4.21 for the novel
landscape scenario. For the novel weather scenario, the ALOCC novelty score
D(R(x)) is very similar for almost all samples, yielding nearly completely over-
lapping distributions and AUROC = 0.498. The AAE reconstruction error performs
slightly better, but still does not manage to separate novelties from the normal class
to a large extent.

For the novel landscape scenario, the results are very similar, with the D(R(x))
score being almost constant for all inputs, and the AAE reconstruction error con-
taining little relevant information.

48


4. Novelty Detection Experiments

(a) ROC curves. (b) PRCs.

Figure 4.20: ROC and PRCs for the ALOCC algorithm, on the Dr(eye)ve dataset, with unseen landscape as
novelties.

(a) Histograms displaying the distributions of
D(R(x)) scores for normal class and novelties.

(b) Histograms displaying the distributions of AAE
scores for normal class and novelties.

Figure 4.21: Histograms of novelty scores for the ALOCC algorithm, on the Dr(eye)ve dataset, with unseen
landscape as novelties.

(a) ROC curves. (b) PRCs.

Figure 4.22: ROC and PRCs for the DSVDD algorithm, on the Dr(eye)ve dataset, with unseen weather as novelties.

49


4. Novelty Detection Experiments

(a) Histograms displaying the distributions of soft-
boundary DSVDD scores for normal class and nov-
elties.

(b) Histograms displaying the distributions of one-
class DSVDD scores for normal class and novelties.

(c) Histograms displaying the distributions of CAE
scores for normal class and novelties.

Figure 4.23: Histograms of novelty scores for the DSVDD algorithm, on the Dr(eye)ve dataset, with unseen weather
as novelties.

(a) ROC curves. (b) PRCs.

Figure 4.24: ROC and PRCs for the DSVDD algorithm, on the Dr(eye)ve dataset, with unseen landscape as
novelties.

50


4. Novelty Detection Experiments

(a) Histograms displaying the distributions of soft-
boundary DSVDD scores for normal class and nov-
elties.

(b) Histograms displaying the distributions of one-
class DSVDD scores for normal class and novelties.

(c) Histograms displaying the distributions of CAE
scores for normal class and novelties.

Figure 4.25: Histograms of novelty scores for the DSVDD algorithm, on the Dr(eye)ve dataset, with unseen
landscape as novelties.

51


4. Novelty Detection Experiments

(a) ROC curves. (b) PRCs.

Figure 4.26: ROC and PRCs for the GPND algorithm, on the Dr(eye)ve dataset, with unseen weather as novelties.

(a) Histograms displaying the distributions of
pX(x) scores for normal class and novelties.

(b) Histograms displaying the distributions of AAE
scores for normal class and novelties.

Figure 4.27: Histograms of novelty scores for the GPND algorithm, on the Dr(eye)ve dataset, with unseen weather
as novelties.

DSVDD

Classification results for the DSVDD algorithm on the Dr(eye)ve dataset are shown
in Figs. 4.22–4.23 for the novel weather scenario and Figs. 4.24–4.25 for the novel
landscape scenario. The three DSVDD score types give similar results also for
the Dr(eye)ve dataset. The soft-boundary DSVDD and the one-class DSVDD are
almost indistinguishable from each other, for both the novel weather scenario and the
novel landscape scenario. As seen in Figs. 4.22a–4.22b, both methods yield clearly
better results than the CAE reconstruction error for the novel weather scenario,
with AUROC ≈ 0.8, the highest out of all experiments on that scenario. For the
novel landscape scenario, the reconstruction error yields the best results out of all
experiments, with AUROC = 0.948 and AUPRC = 0.921.

GPND

Classification results for the GPND algorithm on the Dr(eye)ve dataset are shown
in Figs. 4.26–4.27 for the novel weather scenario and Figs. 4.28–4.29 for the novel

52


4. Novelty Detection Experiments

(a) ROC curves. (b) PRCs.

Figure 4.28: ROC and PRCs for the GPND algorithm, on the Dr(eye)ve dataset, with unseen landscape as
novelties.

(a) Histograms displaying the distributions of
pX(x) scores for normal class and novelties.

(b) Histograms displaying the distributions of AAE
scores for normal class and novelties.

Figure 4.29: Histograms of novelty scores for the GPND algorithm, on the Dr(eye)ve dataset, with unseen landscape
as novelties.

53


4. Novelty Detection Experiments

landscape scenario. For both novelty scenarios, both the GPND score pX(x) and
the reconstruction error yield results close to or worse than a random classifier, with
all values of AUROC and AUPRC in the range [0.38, 0.52].

54


5
Discussion

In this section, the results of the thesis project are discussed, in terms of validity and
relevance. Then, possible improvements to the experiments are discussed. Finally,
possible further work is proposed.

5.1 Implications of Experiment Results

5.1.1 Comparison of Evaluated Algorithms
On the Pro-SiVIC highway scenario dataset, all algorithms managed to extract
normal class features well enough to separate it from the novelty scenarios: using
the original algorithm novelty score types, the lowest AUROC was 0.955 for the
novel weather scenario and 0.021 for the novel landscape scenario, both yielded
by the GPND algorithm pX(x) score. We note that AUROC = 0.021 is an almost
perfectly bad score, i.e., if all predictions were inverted it would produce an AUROC
of 0.979. This shows that, although the model learns relevant features, the GPND
scoring method did not produce higher novelty scores for novelties, but rather the
opposite. Since it assigns higher scores than the normal class for one type of novelties
and lower scores for another, it is not safe to assume that it would perform well in
a safety-critical situation, where novelties can come from a much larger distribution
than those tested here.

For the more difficult Dr(eye)ve dataset, there is a notable difference between the
three models, seen in Table 4.6: all the DSVDD score types yield higher AUROC and
AUPRC than the other two algorithms. The only exception is that that the ALOCC