Camera-LiDAR Active Learning for Object Detecting Deep Neural Networks
Master's thesis in Computer Science - algorithms, languages and logic

ALFRED GUNNARGÅRD
DANIEL ODIN

Department of Electrical Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2020

© Alfred Gunnargård, Daniel Odin, 2020.

Academic Supervisor and Examiner: Lennart Svensson, Electrical Engineering
Industrial Supervisors: Dónal Scanlan, Erik Werner, Adam Tonderski, Zenuity

Master's Thesis 2020
Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Bird's eye view of 3D detections in a LiDAR point cloud. Left: from a network trained on images. Right: from a network trained on LiDAR point clouds, which are more accurate in position.

Typeset in LaTeX
Gothenburg, Sweden 2020

Abstract

Deep Neural Networks (DNNs) processing images and other types of high-dimensional data need to train on large annotated datasets to reach usable performance. Retrieving annotations is expensive due to the required involvement of humans. To reduce annotation costs, it is desirable to annotate only those samples that help the DNN perform better, typically samples on which it currently performs poorly. Finding such samples is not trivial: in many cases, the output of a DNN is best evaluated by comparing it to an annotation made by a human, and since that annotation does not exist yet, some other type of metric is needed.

The process of choosing samples to annotate based on the performance of the DNN is called Active Learning (AL). We study AL for the task of 3D Object Detection (3DOD) in images, i.e. identifying and classifying objects in 3D space. Furthermore, we assume that all images are accompanied by a corresponding LiDAR point cloud. A LiDAR is a rotating sensor that builds a 3D model of its surroundings by measuring the time it takes for laser beams to reflect. The reflected beams result in multiple points in 3D space, referred to as a point cloud.

We quantify the performance of our DNN (referred to as the primary DNN) by using a secondary DNN that detects objects in point clouds. We then compare the detections from our primary DNN to the detections of the secondary DNN and score each sample according to a discrepancy measure. The discrepancy measure quantifies how much the detections from the two DNNs differ from each other for a specific sample, i.e. a pair of image and point cloud. The idea is that, because the secondary DNN is more accurate in three-dimensional position, a sample with high discrepancy indicates that the primary DNN performs poorly. We verify our approach using annotations instead of detections from a secondary DNN. According to some important metrics, we are able to perform slightly better than or on par with selecting samples randomly.
However, the improved performance over random selection comes from selecting samples that contain many objects, and that raises the annotation cost. We show that a slightly modified discrepancy measure is able to perform on par with random selection while maintaining the same annotation cost. Further research on how to properly measure this type of discrepancy is concluded to be necessary for the method to be of use in any real-case scenario. Such research includes experiments on an additional dataset. Keywords: Machine Learning, Active Learning, 3D Object Detection, LiDAR, Deep Neural Network v Acknowledgements We want to thank our industrial supervisors Dónal Scanlan, Erik Werner, and Adam Tonderski for their support throughout the thesis. Specifically, we want to thank Adam for his patience during technical questions and Erik for his brilliant ideas and feedback that helped us focus on the right things. We also want to thank Lennart Svensson, our academic supervisor, for his sup- port. We thank Lennart specifically for the support towards the end of the thesis when everything seemed dark and hopeless. Furthermore, we thank Zenuity for letting us do our Master’s Thesis at their company. During our first weeks, we were warmly welcomed and many people offered us help to get going and showed interest in our thesis. More specifically, we want to thank Brendan Barrick for his never-ending patience helping us with IT issues and Mats Nordlund for helping us to find a place to sit at AI Innovation of Sweden (until Covid-19 locked us into our homes). We also want to thank the people at AI Innovation for providing us with a nice working environment and delicious pastries (fika). Finally, we thank Niklas Gustafsson and Gustav Rödström that did their Master’s Thesis at Zenuity at the same time we did. They have kept us company during the time at Zenuity, been supportive, and provided feedback as well as suggestions to our study. Furthermore, they also provided a great opposition to our thesis with many interesting discussion subjects. Alfred Gunnargård, Daniel Odin, Gothenburg, June 2020 vii Contents List of Figures xi List of Tables xv 1 Introduction 1 1.1 Thesis Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Theory 5 2.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Examples of Informativeness Scores . . . . . . . . . . . . . . . 10 2.2.2 Evaluating an Active Learning Algorithm . . . . . . . . . . . . 11 2.3 LiDAR Sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.5 2D Assignment Problem . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.5.1 Hungarian Method . . . . . . . . . . . . . . . . . . . . . . . . 15 2.6 Matching Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.7 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
18 2.7.1 F1 Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.7.2 Average Precision . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.7.3 Mean Absolute Error . . . . . . . . . . . . . . . . . . . . . . . 21 3 Method 23 3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 Primary Object Detection Network . . . . . . . . . . . . . . . . . . . 24 3.3 Secondary Object Detection Network . . . . . . . . . . . . . . . . . . 25 3.3.1 Simulated High Performance Network . . . . . . . . . . . . . . 25 3.3.2 Camera-LiDAR Fusion Network . . . . . . . . . . . . . . . . . 25 3.4 Modified Active Learning Algorithm . . . . . . . . . . . . . . . . . . 25 3.5 Main Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.5.1 Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 ix Contents 3.5.2 Discrepancy between a Pair of Matched Detections . . . . . . 29 3.5.3 Discrepancy for a Sample . . . . . . . . . . . . . . . . . . . . 31 3.5.4 Main Approach Examples . . . . . . . . . . . . . . . . . . . . 32 3.6 Alternative Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.7 Discrepancy Measure Development Process . . . . . . . . . . . . . . . 35 3.8 Selection Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.8.1 Standard Selection . . . . . . . . . . . . . . . . . . . . . . . . 37 3.8.2 Camera Angle Selection . . . . . . . . . . . . . . . . . . . . . 37 4 Experiment Settings 39 4.1 Experiment Data Selection . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.3 Active Learning Setting . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.4 Discrepancy Measure Parameter Setting . . . . . . . . . . . . . . . . 41 5 Results 43 5.1 Annotation Discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.1.1 Selection Strategy . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.1.2 Unmatched Detections . . . . . . . . . . . . . . . . . . . . . . 44 5.1.3 Alternative Approach . . . . . . . . . . . . . . . . . . . . . . . 46 5.1.4 Position-only Discrepancy Measure . . . . . . . . . . . . . . . 48 5.1.5 Baseline Comparison . . . . . . . . . . . . . . . . . . . . . . . 50 5.2 Fusion Network Discrepancy . . . . . . . . . . . . . . . . . . . . . . . 54 6 Discussion 57 6.1 Dataset and Selection Strategy . . . . . . . . . . . . . . . . . . . . . 57 6.2 The Discrepancy Method . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.2.1 Main and Alternative Approach . . . . . . . . . . . . . . . . . 58 6.2.2 Unmatched Detections . . . . . . . . . . . . . . . . . . . . . . 58 6.2.3 F1 Score Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 6.2.4 Position-only Discrepancy Measure . . . . . . . . . . . . . . . 60 6.2.5 Discrepancy Aggregation for Matched Detections . . . . . . . 60 6.3 Entropy Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.4 Annotation Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.5 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 7 Conclusion 63 Bibliography 65 A Appendix 1 I A.1 Matching Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . I A.2 Architecture of Camera-LiDAR Fusion Network . . . . . . . . . . . . II A.3 Modifications in Camera-LiDAR Fusion Network . . . . . . . . . . . . 
IV x List of Figures 2.1 A visualisation of a neuron. Each input xi is multiplied with a weight wi and summed together. A bias term b is added to the sum. The final sum is the input for a non-linear activation function a to produce the final output ŷ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 An example of a fully connected feedforward network consisting of input layer x ∈ R5 and output layer ŷ ∈ R3. The hidden layer consists of 7 neurons. Each connection corresponds to a learnable weight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Overview of a pool-based AL algorithm for DNNs. . . . . . . . . . . . 9 2.4 Example of LiDAR point cloud in bird’s eye view with detections of objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.5 Example of 2D object detection. . . . . . . . . . . . . . . . . . . . . . 13 2.6 Example of 3D object detection. Predicted location together with dimension and facing direction in 3D space is visualized in bird’s eye view to the right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.7 An example of a bipartite graph. The 2D assignment problem consists of finding edges such that each vertex in V (1-4) is connected to exactly one vertex in U (5-9) and the sum of weights is minimized. . . 14 2.8 A visual equation for Intersection over Union (Jaccard Index) [1]. . . 18 2.9 Illustration of precision and recall which are common components for evaluation metrics in object detection [2]. . . . . . . . . . . . . . . . . 19 3.1 An illustration of a 3D bounding box with axis directions. . . . . . . 24 3.2 Illustration of rotation difference between two bounding boxes. The difference is largest when the bounding boxes are orthogonal. . . . . . 30 3.3 Example of sample where the discrepancy is small. There is only a small difference in depth between the matched detections and the largest difference are detections far away from the ego vehicle, mean- ing small discrepancy since the depth difference is relative. . . . . . . 33 3.4 Example of sample where the discrepancy is large. MF is more ac- curate in position than MP as seen in the BEV point cloud to the right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.5 Example images used when developing the discrepancy measure. Dis- crepancy values for each pair of matched detections are shown to the left. The dark boxes in the bottom left of the image show unweighted and weighted mean discrepancy values. . . . . . . . . . . . . . . . . . 36 xi List of Figures 5.1 AP. Comparison between SS and CAS as well as linear and exponen- tial growth for unmatched detections. . . . . . . . . . . . . . . . . . . 43 5.2 Position MAE. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. . . . . . . . . . . . 44 5.3 F1 score for vehicles. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. . . . . . . . 45 5.4 F1 score for pedestrians. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. . . . . . . 45 5.5 F1 score for bicyclists. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. . . . . . . . 45 5.6 AP. Comparison between MA and AA. . . . . . . . . . . . . . . . . . 46 5.7 Position MAE. Comparison between MA and AA. . . . . . . . . . . . 46 5.8 F1 score for vehicles. Comparison between MA and AA. . . . . . . . 47 5.9 F1 score for pedestrians. 
Comparison between MA and AA. . . . . . 47 5.10 F1 score for bicyclists. Comparison between MA and AA. . . . . . . . 47 5.11 AP. Comparison between full discrepancy and position-only discrep- ancy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.12 Position MAE. Comparison between full discrepancy and position- only discrepancy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.13 AP per annotated object. Comparison between full discrepancy and position-only discrepancy. . . . . . . . . . . . . . . . . . . . . . . . . 49 5.14 Position MAE per annotated object. Comparison between full dis- crepancy and position-only discrepancy. . . . . . . . . . . . . . . . . . 49 5.15 AP. Baselines comparison. . . . . . . . . . . . . . . . . . . . . . . . . 50 5.16 AP per annotated object. Baselines comparison. . . . . . . . . . . . . 50 5.17 Position MAE. Baselines comparison. . . . . . . . . . . . . . . . . . . 51 5.18 Position MAE per annotated 3D object. Baselines comparison. . . . . 51 5.19 F1 score for pedestrians. Baselines comparison. . . . . . . . . . . . . 52 5.20 F1 score for bicyclists. Baselines comparison. . . . . . . . . . . . . . . 52 5.21 F1 score for vehicles. Baselines comparison. . . . . . . . . . . . . . . 52 5.22 F1 score for pedestrians, per annotated pedestrian. Baselines com- parison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.23 F1 score for bicyclists, per annotated bicyclist. Baselines comparison. 53 5.24 AP per epoch in one AL iteration. Comparison between Camera- LiDAR fusion, only LiDAR and Annotations. . . . . . . . . . . . . . 54 5.25 AP after last epoch normalized per 100000 annotated objects. Com- parison between Camera-LiDAR fusion, only LiDAR and Annota- tions. Results from two different random seeds are shown. . . . . . . 55 5.26 Position MAE per epoch in one AL iteration. Comparison between Camera-LiDAR fusion, only LiDAR and Annotations. . . . . . . . . . 55 6.1 Plots of the linear and exponential functions for unmatched detections used in the experiments. . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.2 AP. Comparison between three different aggregation functions for the discrepancy of matched detections. . . . . . . . . . . . . . . . . . . . 61 xii List of Figures A.1 Example of depth map. Left: Input image. Center: Target as pro- jected LiDAR points in image. Right: Estimated depth map. Blue indicates points close to the ego vehicle and red indicates points fur- ther away. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II A.2 Example of outputted MF detections. Top: BEV with LiDAR and pseudo point cloud. Bottom: Camera view. Red boxes are detections byMF and green boxes are the annotated objects. . . . . . . . . . . III xiii List of Figures xiv List of Tables 2.1 An example of calculating recall and precision over a small dataset containing a total of 10 positives to be predicted. . . . . . . . . . . . 20 xv List of Tables xvi 1 Introduction The potential of self-driving cars cannot be denied. They have both promising social and environmental benefits. Self-driving cars can be safer than cars with human drivers due to shorter reaction times and constant perception around the whole vehicle. They might also be more energy-efficient, e.g. by using optimal speeds and driving behaviors, planning routes, and avoiding traffic jams by utilizing data from other cars. The challenge of developing self-driving cars is demanding in many ways. 
There are many vital parts in an autonomous driving system and one of them is the perception of the surroundings. Such perception is achieved through different types of sensors mounted all around the vehicle, such as cameras and LiDAR sensors. A LiDAR sensor (Light Detection and Ranging) builds a 3D model of its sur- rounding by measuring the time it takes for a laser beam to be reflected to the sensor. The generated 3D model consists of a large number of unordered points in 3D space which is why it is often referred to as a point cloud. Different types of sensors have different algorithms for interpreting sensor data. To extract useful information from images and point clouds, it is common to use Machine Learning (ML) algorithms. ML algorithms are trained to recognize patterns by looking at multiple examples. If the examples are annotated, the ML algorithm learns by predicting the annotation with the purpose of being able to perform well on unseen and non-annotated data. When applying ML to images, it is common to use Deep Neural Networks (DNN) which requires a tremendous amount of annotated data in order to achieve useful performance. The cost of annotating this data can get quite large due to the required involvement of humans. It is therefore desirable to get as much information out of every annotated image as possible. One way of doing this is to use Active Learning (AL). B. Settles manages to summarize the key idea of AL in one sentence: The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer labeled training instances if it is allowed to choose the training data from which it learns [3]. In most cases, only a subset of all available data has likely been annotated and some ML model has been trained on this annotated subset. It is then up to the AL algorithm to decide, based on the current state of the trained ML model, what subset of the non-annotated data is most informative and should be sent for annotation. The ML model in our case is a 3D object detecting DNN. The task of 3D Object Detection (3DOD) consists of these sub-tasks: 1 1. Introduction 1. Detect all relevant objects. 2. Predict their dimensions; height, width, and length. 3. Predict their positions in 3D space. 4. Classify the objects, e.g. car, pedestrian or bicyclist. 3DOD can be done on images and other types of data such as point clouds. When doing 3DOD on a point cloud, the resulting detections have a larger accuracy in position than detections in an image. The explanation is the fact that point clouds are actual 3D representations whereas images lack explicit data on distance to objects. However, a point cloud is often sparse at distances far away from the vehicle. Thus, 3DOD on images will find objects further away than what 3DOD on point clouds can handle. Now, suppose a 3DOD DNN makes perfect predictions on a non-annotated image. Annotating the image will be unnecessary since the DNN will not learn anything new from the annotation. Therefore, images, where the DNN is inaccurate, are more rewarding to annotate. However, finding images where the DNN is inaccurate is a non-trivial task. Comparing with predictions from DNNs that use data from other sensors, where the predictions are more accurate, could help to find the images where the DNN is inaccurate. Intuitively, since images are 2D representations of the world, it is harder to predict an object’s position in 3D space by looking at an image than looking at a point cloud. 
Therefore, a DNN trained on point clouds can help find images where the DNN trained on images is inaccurate. 1.1 Thesis Objective The objective of this thesis is to investigate the possibility of utilizing the difference between LiDAR sensors and cameras to do AL for 3DOD in images. In particular, by having each sample in the dataset consisting of an image and a point cloud captured at the same time, we can compare the detections obtained from two different DNNs. We investigate how to use a primary DNN for 3DOD in images in parallel with a secondary DNN for 3DOD in point clouds. We call the primary DNNMP (where P stands for primary) and the secondary DNNMS (where S stands for secondary). The detections from MP are compared to the detections from MS in order to calculate a camera-LiDAR detection discrepancy measure. This measure is then used to score images. In AL, the score is used to select images for annotation that have a large discrepancy, in order to improve the performance of MP . The intuition behind why this is worth investigating is that the samples with the largest discrepancy value are assumed to be those whereMP is wrong or uncertain. 2 1. Introduction 1.2 Limitations The main goal of the thesis is to develop an AL algorithm for 3DOD and not a DNN architecture. Therefore, we limit ourselves to use existing implementations of architectures. The implementations that we use are provided by Zenuity and we only go into brief details regarding the architectures. 1.3 Contributions We contribute with a definition of a discrepancy measure that quantifies how two sets of detected objects differ from each other. Furthermore, we present results showing that the proposed discrepancy measure can be used for scoring samples to do AL. The method performs better than or on par with selecting samples at random but raises the annotation cost. The presented results further contribute with the analysis that the proposed discrepancy measure can be modified to maintain the annotation cost. However, the modified measure fails to do so while performing better than random selection. Instead, the modified measure only performs on par with random selection. Discussed are ways to improve the method with the goal of outperforming random selection while maintaining the annotation cost, which is crucial for the method to be useful in a real-case scenario. 1.4 Related Work When doing AL for ML in general, the most straight forward approach is to use some kind of uncertainty sampling [3]. Uncertainty sampling is a method where the samples selected for annotation are those where the model is least confident in its predictions. The samples selected are often the samples where the model is not confident in classification. The approach is feasible for 3DOD, but the classification is just one part of a prediction in 3DOD, as explained above. Therefore, it would be beneficial to include the other parts of a prediction in the selection process. The uncertainty for the other parts of a 3DOD prediction is difficult to find since 3DOD DNNs do not exhibit uncertainties for those parts. This thesis proposes a method with the potential to include samples where the model is wrong or uncertain for the other components as well. Another well-established approach is Query by Committee where samples are selected based on the disagreement between a committee of models where all models are trained on the same dataset [4]. One could see our approach as a committee of two models. 
However, even though both models train on samples captured at the same moment, the samples are not of the same format. According to the definition of Query by Committee, the models should train on the same dataset in the same format. However, in our case,MP is using images, andMS is using point clouds. Our approach can also be compared to [5] where a View Divergence Score is proposed as a score of informativeness in AL for semantic segmentation. The View Divergence Score determines what samples are the most informative based on how 3 1. Introduction the predictions differ between different views of the same scene. The different views, in this case, are images taken from different angles. This is similar to what we do but since the scope of [5] is semantic segmentation, it does not cover how to measure the discrepancy between detections. Neither do they use point clouds but rather only images. Another approach uses AL for training an OD DNN for images and proposes two metrics for the informativeness of a sample [6]. The two metrics are designed to model the uncertainty of the detections. However, these metrics do not compare multiple predictions coming from different sources (or views) to do AL, as our dis- crepancy measure does. Furthermore, [6] focus solely on 2DOD while our method is developed for 3DOD. Using both LiDAR and cameras in combination with AL has been done by [7]. However, the actual AL algorithm in [7] does not benefit from the differences between the two types of sensors, as ours does. Furthermore, only the DNN for OD in the point clouds is trained using the proposed AL algorithm. We do the opposite - train the DNN for OD in images. 1.5 Thesis Outline Chapter 2 explains the theory and all the necessary components to understand the thesis. The chapter includes a short description of ML and DNN, the fundamental concept of AL as well as a description of the LiDAR and OD, the Hungarian method for solving the 2D Assignment Problem, and evaluation metrics for OD. In Chapter 3, our method is described including a description of the dataset and the models. The discrepancy measure between detections from two DNNsMP andMS is described in detail. Chapter 4 contains the settings used in the experiments whose results are presented in Chapter 5. The results are discussed in Chapter 6 with a conclusion in Chapter 7. 4 2 Theory This chapter covers the technical details necessary to understand the conducted research. We begin by shortly describing Machine Learning and Artificial Neural Networks in Section 2.1 to motivate the reason for Active Learning. The concept of Active Learning and how traditional Active Learning algorithms work is described in Section 2.2. Then, the LiDAR sensor is described in Section 2.3 followed by 2D and 3D Object Detection in Section 2.4. We also describe the key parts of our method which are the Hungarian Method for solving the 2D Assignment Problem in Section 2.5 and Intersection over Union in Section 2.6. Lastly, evaluation metrics for Object Detection are explained in Section 2.7. 2.1 Machine Learning Machine Learning (ML) is a common technique for solving complex tasks such as detecting objects in images. The task is solved by letting the ML algorithm learn by training on multiple samples. If the samples are annotated, the task is called supervised learning. In supervised learning, the annotation is the desired output of the ML algorithm. 
The goal in supervised learning is to learn the output y for a specific input x such that the ML algorithm can accurately predict the output for unseen and non-annotated data. More specifically, the ML algorithm tries to approximate a function y = f*(x) with f(x; θ) by adjusting some parameters θ. A common way of approximating such a function is by Artificial Neural Networks.

2.1.1 Artificial Neural Networks

Inspired by biological neural networks, e.g. a human brain, Artificial Neural Networks can solve advanced computational tasks by using a large number of connected processors, referred to as artificial neurons. The neurons serve as single units where each unit acts as a function taking a vector as input and outputting a scalar [8]. All the connected neurons together build up a complex function ŷ = f(x; θ) that approximates y = f*(x). The parameters θ often consist of weights w and biases b.

Figure 2.1: A visualisation of a neuron. Each input x_i is multiplied with a weight w_i and summed together. A bias term b is added to the sum. The final sum is the input for a non-linear activation function a to produce the final output ŷ.

A visualization of a single neuron can be seen in Figure 2.1. A neuron takes a vector x = (x_1, ..., x_n)^T as input and outputs a single value ŷ. The mathematical definition of a neuron is

\hat{y} = a\left( \sum_{i=1}^{n} w_i x_i + b \right)   (2.1)

where w = (w_1, ..., w_n)^T and b are learnable parameters referred to as weights and bias, and a is a non-linear activation function. Commonly used activation functions are

Rectified Linear Unit: \mathrm{ReLU}(x) = \begin{cases} 0 & \text{if } x \le 0 \\ x & \text{otherwise} \end{cases}   (2.2)

Sigmoid: \sigma(x) = \frac{1}{1 + e^{-x}}   (2.3)

Hyperbolic tangent: \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.   (2.4)

Using a large number of neurons, the neural network can approximate many complex continuous functions. The activation functions allow the network to learn non-linearities. Without the activation functions, the network is just one linear operation regardless of how many neurons the network consists of.

The parameters θ are trained by minimizing the total loss over the whole training dataset using a loss function L, i.e.

\min_{\theta} \sum_{i} L(y_i, f(x_i; \theta))   (2.5)

where y_i is the desired output and ŷ_i = f(x_i; θ) is the output from the neural network for sample i. The loss essentially measures how far the prediction ŷ_i is from the truth y_i. The gradient of L with respect to θ is used to optimize θ by taking small steps in the opposite direction of the gradient. We refer to [8] for details.

The most fundamental network is the fully connected feedforward network [8]. In a fully connected feedforward network, the neurons are connected in several layers, where all neurons are connected between consecutive layers. An example of a fully connected feedforward network can be seen in Figure 2.2. The word feedforward means the input x is evaluated through all the layers in f, as an ordinary function, without any intermediate feedback to itself.

Figure 2.2: An example of a fully connected feedforward network consisting of input layer x ∈ R^5 and output layer ŷ ∈ R^3. The hidden layer consists of 7 neurons. Each connection corresponds to a learnable weight.

The fully connected feedforward network consists of an input layer that constitutes the input x, up to several hidden layers, and an output layer that constitutes the output ŷ. The output of each layer is the input to the next layer.
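To make the forward computation concrete, the following is a minimal NumPy sketch (illustrative, not taken from the thesis) of the neuron in equation (2.1) and the 5-7-3 network of Figure 2.2. The layer sizes follow the figure; the weights are random placeholders rather than trained parameters.

import numpy as np

def relu(x):
    # Rectified Linear Unit, equation (2.2).
    return np.maximum(0.0, x)

def sigmoid(x):
    # Sigmoid activation, equation (2.3).
    return 1.0 / (1.0 + np.exp(-x))

def dense_layer(x, W, b, activation):
    # One layer of neurons: row i of W holds the weights of neuron i, so the
    # layer computes a(Wx + b) for every neuron at once, cf. equation (2.1).
    return activation(W @ x + b)

rng = np.random.default_rng(seed=0)

# The 5-7-3 network of Figure 2.2: x in R^5, one hidden layer with 7 neurons,
# output y_hat in R^3. The weights below are random placeholders; training
# would adjust them by minimizing the loss in equation (2.5).
W1, b1 = rng.normal(size=(7, 5)), np.zeros(7)
W2, b2 = rng.normal(size=(3, 7)), np.zeros(3)

x = rng.normal(size=5)                        # one input sample
hidden = dense_layer(x, W1, b1, relu)         # hidden layer output, shape (7,)
y_hat = dense_layer(hidden, W2, b2, sigmoid)  # network output, shape (3,)
print(y_hat.shape, y_hat)

Training would adjust W1, b1, W2, b2 by gradient descent on the loss in equation (2.5).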
The word hidden layer comes from that the training data does not show the desired output from these kinds of layers. The number of layers is often referred to as the depth of the neural network. For complex tasks, such as detecting objects in images, it is common to use neural networks with large depth, i.e. a tremendous amount of layers. Due to the large depth, these kinds of networks are called Deep Neural Networks. 7 2. Theory 2.1.2 Deep Neural Networks A Deep Neural Network (DNN) can consist of hundreds of layers which is crucial for approximating complicated functions. For example, detecting objects in images is demanding in many ways. The DNN needs to learn what the different objects look like as well as what environments the objects interact with. Therefore, to perform the object detection task well, a tremendous amount of features need to be learned. An image can consist of millions of pixels. Due to the millions of pixels, if an image is used as input to a fully connected feedforward network, the input layer (see Figure 2.2) will consist of millions of neurons. Therefore, the number of parameters will be huge since each connection between neurons has an associated weight. Instead, one can use a convolutional neural network [8]. In a convolutional layer, only nearby neurons, i.e. nearby pixels for images, are connected in the neural network. The same weights are used in all neighborhoods, called weight sharing. However, even for convolutional networks, the number of parameters can still be large due to many layers. Many popular DNNs for solving image tasks, as well as the DNNs used in this thesis, are built upon an efficient DNN called ResNet. ResNet is a convolutional neural network using a successful technique called residual learning. We refer to [9] for details. ResNet-152 consists of 152 layers and millions of learnable parameters. To optimize all the millions of parameters, a great number of annotated training data is needed. The annotation process is tedious, expensive, and requires a human. If a DNN makes a perfect prediction on a non-annotated sample, it is most likely not infor- mative to the DNN to train on. Since a large number of annotated training data is needed, it is important to annotate samples that are informative for the DNN. To efficiently choose data for annotation which is believed to be informative, Active Learning can be used. 2.2 Active Learning The core of Active Learning (AL) is to choose what data to use for training an ML model. Based on some informativeness score, the AL algorithm (also known as the learner) can select data which the model believes is informative. The informative- ness score attempts to measure how beneficial a non-annotated sample would be to annotate. If the non-annotated sample is considered informative, the learner queries for an annotation and adds the newly annotated sample to the training set. There are mainly three different types of AL:membership query synthesis, stream- based selective sampling and pool-based active learning [3]. Membership query syn- thesis lets the learner create its own data which it believes the ML model will benefit the most from. Thereafter the learner queries for annotations of the newly cre- ated data. Stream-based selective sampling sequentially analyses a non-annotated dataset. For each sample, the learner makes a decision based on the informative- ness score whether to keep it (i.e. query for its annotation) or discard it. 
Lastly, pool-based active learning is much like stream-based selective sampling with the difference that it analyses the whole non-annotated dataset before deciding what 8 2. Theory samples to annotate [3]. Membership query synthesis does not suit complex image tasks, such as 3DOD, since synthesizing realistic images can be difficult. In stream-based selective sam- pling, only one sample is drawn at a time. In many real-world cases, large collections of non-annotated data can be collected at once which makes it possible to analyze all the samples. In this thesis, we assume a large collection of non-annotated data can be collected at once and added to a pool of non-annotated data. We also assume that a batch of non-annotated samples are selected and sent for annotation at the same time and added to an existing pool of annotated data. Therefore, we base our method on pool- based active learning. Figure 2.3 shows an overview of how pool-based AL works. The target model in the figure is a DNN. The generic framework for pool-based active learning consists of two datasets A and N where the first has annotations and the second does not. We call them the annotated and non-annotated datasets respectively. We have a budget of b samples and denote the target ML model (the model that is trained by the learner) byM. The AL algorithm is explained step by step below and each step has a reference to the arrows in Figure 2.3. 1. TrainM on A (purple arrow). 2. ApplyM on N and store the output (yellow arrow). 3. Score the output from step 2 using some informativeness score (yellow arrow). 4. Pick the b samples from N with highest score (yellow arrow) and send for annotation (blue arrow). 5. Add the annotated data to A and remove it from N (turquoise arrow). 6. Repeat from step 1. Deep Neural Network Annotator (human or machine) Non-annotated data Annotated data Active Learning Train the network Select samples based on some query Add annotated samples to training dataset Send selected samples for annotation Figure 2.3: Overview of a pool-based AL algorithm for DNNs. 9 2. Theory The main difficulty for AL is to design an informativeness score that correlates well with what the model needs to learn. This is highly dependent on both the task that should be solved and the available underlying data. Since designing an infor- mativeness score is not trivial, we look at a few examples of how the informativeness of a sample can be measured in the next subsection. 2.2.1 Examples of Informativeness Scores One of the most straight forward query approaches is Uncertainty Sampling [3]. In Uncertainty Sampling, the samples that the model is most uncertain about are chosen for annotation. The most uncertain samples are given a large score and assumed to be the most informative. The uncertain samples are samples where the model is not confident in its predictions. For classification tasks, the output is a list of class probabilities. To measure uncertainty in classification tasks, entropy is suitable [3]. The samples with large entropy are selected for annotation. For a random variable Y with a set {yi} of possible outcomes, entropy is defined as Entropy(Y ) = − ∑ i P (Y = yi) logP (Y = yi) (2.6) where P (Y = yi) is the probability that yi occur. For a random variable, the entropy is largest if the probability for all outcomes is equal which means that we cannot confidently tell what the outcome will be. Therefore, the outcome is uncertain. Consider the task of classifying images of cats and dogs as an example. 
Suppose the output of the model is 52% probability that an image contains a cat and 48% probability that the image contains a dog. Even though the probability of being a cat is larger than being a dog, the similarity indicates that the classification model is uncertain. There could be unseen features in the image that makes it uncertain to predict the class. Therefore, the model could benefit from training on the sample in order to learn new cat and dog features. By contrast, if the probability of a dog is 99% for a decently trained ML algorithm, the image likely contains a dog, in which case annotating this image does not significantly improve the performance of the algorithm. Another query approach is Query by Committee [3]. A committee of models trains on the same annotated data and the samples selected are based on the disagreement between the models’ outputs. If the disagreement between the models’ output is large, the informativeness score is large. For classification tasks, the disagreement can be measured using vote entropy [3]. For a set {yi} of possible classes and C number of committee models, vote entropy for sample x is defined as VoteEntropy(x) = − ∑ i V (yi;x) C log V (yi;x) C (2.7) where V (yi;x) is the number of votes, i.e. number of models predicting class yi for sample x. The disagreement is large if the votes are evenly distributed over 10 2. Theory the different classes. Evenly distributed votes lead to all V (yi;x) C terms being similar which corresponds to large entropy, as explained above. Compared to Uncertainty Sampling and Query by Committee, a more different query approach is Expected Model Change. Expected Model Change measures how much impact the sample has on the current model if it is included in the training set. The sample is assumed to be informative if it makes a great impact on the model, i.e. make large changes to its parameters. An example of Expected Model Change used for classification tasks is Expected Gradient Length. Expected Gradient Length calculates the length of the loss func- tion gradient. Since the sample is non-annotated, we calculate the average length of the gradient over all possible classes. Let {yi} be a set of all possible classes and ∇L(yi, f(x; θ)) be the gradient for the loss between yi and the prediction for f(x; θ) using the current model parameters θ. Expected Gradient Length for sample x is defined as ExpectedGradientLength(x) = ∑ i P (yi|x; θ)||∇L(yi, f(x; θ))||2 (2.8) where P (yi|x; θ) is the probability of x being class yi and || · ||2 is the L2-norm. Empirical studies show that Expected Gradient Length works well for some tasks, but can be computationally expensive if both the number of possible classes and the feature space is large [3]. 2.2.2 Evaluating an Active Learning Algorithm An AL algorithm should be evaluated before used in any real-case scenario due to the annotation procedure being both time-consuming and expensive. The selected samples sent to the annotator must be of some value for the model, otherwise, the time spent on annotation is wasted. When evaluating an AL algorithm, the human annotator is commonly simulated. Instead of having a human annotator, the non- annotated dataset N is not used and the annotated dataset is divided into two parts. The first part is unchanged and interpreted as the annotated dataset A. The second part acts as N by hiding the annotations. The human annotator is simulated by revealing the annotations for some data from the second part and moving it to the first part. 
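Before moving on, here is a small Python sketch (illustrative, not from the thesis) of two of the informativeness scores from Section 2.2.1: the entropy score in equation (2.6) and the vote entropy score in equation (2.7). The class probabilities and vote counts below are hypothetical.

import numpy as np

def entropy(class_probs):
    # Uncertainty sampling score, equation (2.6): the entropy is largest when
    # the predicted class probabilities are close to uniform.
    p = np.asarray(class_probs, dtype=float)
    p = p[p > 0]                     # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log(p)))

def vote_entropy(votes, num_models):
    # Query by Committee score, equation (2.7): `votes` maps each class to the
    # number of committee members that predicted it for the sample.
    terms = [(v / num_models) * np.log(v / num_models)
             for v in votes.values() if v > 0]
    return float(-sum(terms))

# Hypothetical model outputs for the cat/dog example in the text.
print(entropy([0.52, 0.48]))   # near log(2): uncertain, informative sample
print(entropy([0.99, 0.01]))   # near 0: confident, not very informative
print(vote_entropy({"cat": 3, "dog": 2}, num_models=5))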
2.3 LiDAR Sensor Our AL method measures the informativeness of a sample by using predictions based on data from a LiDAR sensor. We refer to Chapter 3 for details on our method. A LiDAR sensor (Light Detection and Ranging) can be mounted on a self-driving car to achieve perception around the vehicle while moving. The LiDAR measures the distance to objects by measuring the time it takes for a laser beam to be reflected to the sensor. The laser beams build a 3D model of the surroundings, consisting of a large number of unordered points. Due to the large number of unordered points, the 3D model is referred to as point cloud. 11 2. Theory An example of a point cloud can be seen in Figure 2.4. The point cloud is visualized in bird’s eye view, meaning every point is projected onto the ground and shown from above. How high the points are above the ground is not shown in the figure. Objects in the point cloud are clearly marked with blue and orange boxes, where blue indicates a car and orange a truck. The objects are detected by an Object Detection DNN using point clouds as input. Predicting the position in 3D space using a point cloud is less difficult than predicting the position using an image. An image lack explicit data on distance to objects, whereas a point cloud is an actual 3D representation. Figure 2.4: Example of LiDAR point cloud in bird’s eye view with detections of objects. 2.4 Object Detection Object Detection (OD) is a technique used for detecting objects of certain classes, such as cars and/or pedestrians. It is commonly applied to images, but can also be used in other models of the world, such as point clouds. Objects can be detected both in 2D (images) and 3D (images and point clouds). An example of 2DOD can be seen in Figure 2.5. A detected object is bounded in a box (a bounding box) and the object in the box is predicted with a class [10]. The 2D bounding box expresses where the object is located in the image. 12 2. Theory In 3DOD [11], by contrast to 2DOD, the bounding box is three-dimensional and expresses where the object is located in world coordinates, relative to the ego vehicle. The ego vehicle is the vehicle where all the sensors are mounted and is the vehicle from where the image is captured. As in 2DOD, the object is predicted with a class. The 3D bounding box is defined in 3D space and is commonly expressed by position (center point coordinates of bounding box bottom plane), dimension (height, width, length), and rotation around the three axes. An example of 3DOD can be seen in Figure 2.6. Since the boxes are defined in 3D space, this definition can apply to 3DOD in both point clouds and images. Figure 2.5: Example of 2D object detection. Figure 2.6: Example of 3D object detection. Predicted location together with dimension and facing direction in 3D space is visualized in bird’s eye view to the right. 13 2. Theory 2.5 2D Assignment Problem The idea of this thesis is to find images where a DNN trained on images gives different predictions than a DNN trained on point clouds. We compare detections from a primary DNNMP (where P stands for primary) trained on images with detections from a secondary DNNMS (where S stands for secondary) trained on point clouds. The comparison is done by calculating a discrepancy between detections. To calculate a discrepancy measure between detections outputted fromMP and MS, one approach is to calculate the discrepancy per object. 
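As an illustration, the 2DA can also be solved with an off-the-shelf routine. The sketch below uses scipy.optimize.linear_sum_assignment, which solves the same minimum-cost assignment problem (with a different algorithm than the Hungarian method described next); SciPy is used here only as an example and is not necessarily what the thesis uses. The running-time matrix for the jobs-and-machines example is hypothetical.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical running times (hours): rows are jobs (V), columns machines (U).
# With fewer jobs than machines, the extra machines are simply left
# unassigned, which corresponds to padding the matrix with dummy zero rows.
running_time = np.array([
    [9.0, 2.0, 7.0, 8.0],
    [6.0, 4.0, 3.0, 7.0],
    [5.0, 8.0, 1.0, 8.0],
])

rows, cols = linear_sum_assignment(running_time)  # minimizes the total weight
for job, machine in zip(rows, cols):
    print(f"job {job} -> machine {machine} ({running_time[job, machine]} h)")
print("total running time:", running_time[rows, cols].sum())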
If the discrepancy is calculated per object, each detection fromMP needs to be matched to a detection inMS and vice versa. Both detections in a match are associated to the same real- world object. The matching can be solved by solving the 2D Assignment Problem (2DA) [12]. Let G = 〈V, U ;E〉 be a weighted bipartite graph where V and U consist of vertices, E consist of edges and each edge e ∈ E is assigned a weight. Furthermore, let |V | ≤ |U |. An example of a bipartite graph can be seen in Figure 2.7. The 2DA consists of finding the minimum sum of weights such that each vertex v ∈ V is connected to exactly one vertex u ∈ U . For example, if V correspond to jobs, U correspond to machines and each edge e ∈ E has a running time, we want to find the matchings of jobs and machines where the total running time is minimized, all jobs are run and one machine is assigned exactly one job. An approach to solving the 2DA is the Hungarian method [13]. 1 2 3 4 5 6 7 8 9 V U Figure 2.7: An example of a bipartite graph. The 2D assignment problem consists of finding edges such that each vertex in V (1-4) is connected to exactly one vertex in U (5-9) and the sum of weights is minimized. 14 2. Theory 2.5.1 Hungarian Method The Hungarian method can solve the 2DA in O(n3) time [14], where n = |U |. The 2DA can be expressed with a matrix M where the elements are the weights of E. The rows correspond to V and the columns correspond to U . Therefore, index (i, j) of M contains weight mij for connecting the ith vertex in V to the jth vertex in U . If |V | < |U |, |U |−|V | dummy rows with zeros need to be added such that |V | = |U |. After the dummy rows are added, the 2DA can be expressed as minimize ∑ (i,j)∈E yijmij (2.9a) subject to n∑ i=1 yij = 1, ∀j ∈ {1, ..., n}, (2.9b) n∑ j=1 yij = 1, ∀i ∈ {1, ..., n}, (2.9c) yij ∈ {0, 1}, ∀i, j ∈ {1, ..., n} (2.9d) where yij = 1 if the ith vertex in V is assigned the jth vertex in U . The constraints in (2.9b) and (2.9c) ensures that each vertex v ∈ V is assigned exactly one vertex u ∈ U . The constraint (2.9d) ensures that an assignment cannot be split. In the example with machines and jobs, the constraint ensures that one job cannot be split and run by two machines. We will now describe how the Hungarian method is executed. After that, we will take a look at the mathematical details as to why the solution is optimal. We use M = 4 1 5 7 5 6 8 5 8   (2.10) as an example problem instance. Let V = {v1, v2, v3} and let U = {u1, u2, u3}. 1. (a) For each row r in M : Subtract the minimum value of r from every element in r. Minimum values are marked with a circle. 4 1 5 7 5 6 8 5 8   ← −1 ← −5 ← −5 ⇒ 3 0 4 2 0 1 3 0 3   (2.11) (b) For each column c in M : Subtract the minimum value of c from every element in c. Minimum values are marked with a circle. 15 2. Theory 3 0 4 2 0 1 3 0 3   −2 ↓ −0 ↓ −1 ↓ ⇒ 1 0 3 0 0 0 1 0 2   (2.12) 2. (a) Cover all zeros in M using minimum number of vertical and horizontal lines. 1 0 3 0 0 0 1 0 2   (2.13) (b) If the number of lines is equal to number of rows in M , go to step 3, otherwise proceed. (c) Add the minimum value of all the uncovered (marked with circles) ele- ments to all elements covered by both a vertical and a horizontal line (marked with square). 1 0 3 0 0 0 1 0 2   +1 ↓ ⇒ 1 0 3 0 1 0 1 0 2   (2.14) (d) Subtract the minimum value of all the uncovered elements from all the uncovered elements (marked with circles). 
1 0 3 0 1 0 1 0 2   −1 ↓ −1 ↓ ⇒ 0 0 2 0 1 0 0 0 1   (2.15) (e) Remove the lines and go to step 2. 16 2. Theory 3. Find the solution by systematically picking zeros covered by either a vertical or a horizontal line but not both. Also, only pick one element from each row and column. If the solution is valid, the solution is also optimal. The lines in 2.16 comes from doing step 2a again. A valid solution is marked with squares at indices (1, 1), (2, 3) and (3, 2). Therefore, the resulting match- ings are (v1, u1), (v2, u3) and (v3, u2). 0 0 2 0 1 0 0 0 1   ⇒ 0 0 2 0 1 0 0 0 1   (2.16) M = 4 1 5 7 5 6 8 5 8   ⇒ total sum of weights: 4 + 6 + 5 = 15 (2.17) The Hungarian method modifies M in order to find an optimal solution. The mathematical reasoning to why the Hungarian method gives an optimal solution is as follows. Let {ai}ni=1 and {bj}nj=1 be values that modifies M . The values modify M without changing (2.9a) as ∑ (i,j)∈E yijmij = ∑ (i,j)∈E (mij − ai − bj + ai + bj)yij (2.18) = ∑ (i,j)∈E (mij − ai − bj)yij + n∑ i=1 ai n∑ j=1 yij + n∑ j=1 bj n∑ i=1 yij (2.19) = ∑ (i,j)∈E (mij − ai − bj)yij + n∑ i=1 ai · 1 + n∑ j=1 bj · 1 (2.20) = ∑ (i,j)∈E (mij − ai − bj)yij + n∑ i=1 ai + n∑ j=1 bj (2.21) where (2.20) is obtained by the constraints (2.9b) and (2.9c). The Hungarian method finds an optimal solution by modifyingM with {ai}ni=1 and {bj}nj=1, without violating the constraints, such that mij − ai − bj = 0 for (i, j) where yij = 1. The first term,∑ (i,j)∈E(mij − ai − bj)yij, evaluates to 0, since yij = 0 for all other (i, j), meaning the resulting objective value is ∑n i=1 ai +∑n j=1 bj. The solution is optimal since the resulting objective value is independent of yij and the term including yij is minimal since the term is evaluated to 0. We refer to [15] and [16] for details. 17 2. Theory 2.6 Matching Score In order to match detections betweenMP andMS, a score for a potential match is needed in order to create M used by the Hungarian method to solve the 2DA. We define a good match as a large Intersection over Union (IoU) between 2D bounding box detections. The IoU between two sets A and B is defined as IoU(A,B) = |A ∩B| |A ∪B| (2.22) and measures the similarity between two sets. In 2DOD, IoU between bounding boxes measures area of overlap (in pixels) divided by area of union (in pixels). A visualization can be seen in Figure 2.8. In 3DOD, IoU is calculated with volume (in cubic meters) instead of area. Figure 2.8: A visual equation for Intersection over Union (Jaccard Index) [1]. 2.7 Evaluation Metrics Evaluation metrics are used to evaluate and compare different AL algorithms. To evaluate an AL algorithm for OD, two commonly used metrics for evaluating both 2DOD and 3DOD are F1 score and Average Precision (AP). Both metrics are cal- culated using p = TP TP + FP r = TP TP + FN TP = Number of true positives FP = Number of false positives FN = Number of false negatives (2.23) 18 2. Theory where p and r denote precision and recall, respectively. In OD, precision measures how accurate the detections are by the percentage of correct detections (true posi- tives). IoU in 2DOD and 3DOD is often used to determine whether a detection is a true positive or false positive by comparing to the annotated object. If IoU is above a certain threshold, e.g. 0.5, the detection counts as a true positive [17]. Recall is a measure of completeness. In OD, recall measures what percentage of all the annotated objects are detected. 
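The following Python sketch (illustrative, not from the thesis) computes the IoU of equation (2.22) for axis-aligned 2D boxes and the precision and recall of (2.23) from true positive, false positive, and false negative counts. The corner-coordinate box format and the example numbers are assumptions made for the sketch.

def iou_2d(box_a, box_b):
    # Intersection over Union, equation (2.22), for axis-aligned 2D boxes
    # given as (x1, y1, x2, y2) corner coordinates in pixels.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    # Equation (2.23): precision = TP / (TP + FP), recall = TP / (TP + FN).
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# A detection counts as a true positive if its IoU with an annotated box
# exceeds a threshold, e.g. 0.5.
print(iou_2d((0, 0, 10, 10), (5, 0, 15, 10)))  # ~0.33, below a 0.5 threshold
print(precision_recall(tp=3, fp=2, fn=7))      # (0.6, 0.3), cf. rank 5 in Table 2.1 below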
An illustration of precision and recall can be seen in Figure 2.9. relevant elements selected elements false positivestrue positives false negatives true negatives Precision = Recall = How many selected items are relevant? How many relevant items are selected? relevant elements selected elements false positivestrue positives false negatives true negatives Precision = Recall = How many selected items are relevant? How many relevant items are selected? Figure 2.9: Illustration of precision and recall which are common components for evaluation metrics in object detection [2]. 2.7.1 F1 Score F1 score is defined as the harmonic mean between precision and recall, i.e. F1 = 2 (1/p) + (1/r) = 2 · p · r p+ r (2.24) and ranges between 0 and 1 [18]. With perfect precision and recall, F1-score reaches its best value, 1. 19 2. Theory 2.7.2 Average Precision To calculate AP, precision values over the recall range 0 to 1 need to be calculated. In other words, a precision function p(r) taking recall value as input should output precision value. The function p(r) is called the precision-recall curve and is estimated from the predictions for all the samples in the data set. All the predictions are sorted by their confidence level in descending order. In OD, the confidence level corresponds to the class prediction of the detection and is the probability of being a particular class. When all the predictions are sorted, each prediction is determined whether it is a true positive or not, which means using an IoU threshold between the detection and the annotated object in OD. Each prediction is assigned a rank depending on their confidence level. Precision and recall are calculated for each rank, where only the current rank and lower rank predictions are taken into account. In the example in Table 2.1, there are 10 positives to be predicted in the whole dataset. All predictions are sorted by their confidence level in descending order. Rank 1 refers to the prediction with largest confidence. Each prediction is deter- mined whether it is a true or false positive. Up to rank 5, only 3 of the predictions are true positives, which means the precision value is 3/5 = 0.6 for that rank. The recall value depends on the total number of positives in the dataset and would in this example be 3/10 = 0.3 for rank 5. Confidence level Rank Correct prediction Precision p(r) Recall r 0.98 1 True 1 0.1 0.95 2 False 0.5 0.1 0.93 3 False 0.33 0.1 0.9 4 True 0.5 0.2 0.86 5 True 0.6 0.3 Table 2.1: An example of calculating recall and precision over a small dataset containing a total of 10 positives to be predicted. AP measures average precision over the whole recall range 0 to 1 [17], defined by the area under the precision-recall curve, i.e. AP = ∫ 1 0 p(r)dr. (2.25) In reality, the precision-recall curve is not continuous. Instead, the precision-recall curve is estimated as above and AP is commonly calculated by interpolation. A commonly used OD benchmark is the VOC2007 challenge. We refer to [19] for details. In the challenge, AP is calculated as the mean interpolated precision at r = 0, 0.1, ..., 1, i.e. AP = 1 11 ∑ r∈{0,0.1,...,1} pinterpolated(r) (2.26) 20 2. Theory where the interpolated precision is pinterpolated(r) = max r′≥r p(r′). (2.27) The IoU threshold for determining true positives in VOC2007 is set to 0.5. Also, if there are several detections for the same object, only the detection with the largest confidence is counted as a true positive and the rest as false positives. 
When there are several classes in the dataset, it is common to calculate AP for every class individually. The class AP’s are averaged to obtain Mean Average Precision [17]. Even though it could be confusing, mAP is often referred to as AP. 2.7.3 Mean Absolute Error To evaluate the accuracy of the 3D bounding box, Mean Absolute Error (MAE) can be used. The general definition for MAE is MAE = 1 N N∑ i=1 |yi − ŷi| (2.28) where ŷ1, ..., ŷN are predictions and y1, ..., yN are targets. For 3DOD, the predictions are all the true positives and compared to their corresponding annotations. MAE can be used to see the accuracy of each 3D bounding box attribute separately, for example the accuracy of the bounding box height. 21 2. Theory 22 3 Method This chapter introduces how we select the samples for annotation in AL. The sam- ples are selected based on the discrepancy between detections outputted from two different DNN’s, the primary DNN MP and the secondary DNN MS. The objec- tive is to increase the performance of MP by selecting samples that are assumed to be those where MP is wrong or uncertain. Samples selected could be samples where MP is wrong in any of detection, position, dimension, and classification. By contrast, entropy (as described in Section 2.2.1) only take class prediction into account. The dataset is described in Section 3.1. In Section 3.2, the primary DNN,MP , is presented. The primary DNN is trained only on annotated images and is the DNN we try to improve using AL. In Section 3.3, we present two different secondary DNN’s,MS. The AL algorithm using discrepancy measure as informativeness score is introduced in Section 3.4 and the calculations of the actual discrepancy are laid out in Section 3.5 and 3.6. The development process of the discrepancy measure is explained in Section 3.7. Lastly, two different ways of selecting samples based on the discrepancy are described in Section 3.8. 3.1 Dataset The dataset we use is data provided by Zenuity. The primary motivation behind the choice of data is to make the thesis relevant to the company. Another benefit is thatMP already is compatible with the dataset, i.e. no modifications forMP are needed. Also, the dataset is large which makes it suitable to use for AL experiments. The dataset D is a set of N samples, D = {X1, ...,XN} (3.1) where each sample Xi = (p,x, A) (3.2) consists of a point cloud p, image x, and a set of annotations A. All coordinates in A are defined in the metric system with the ego vehicle as the origin. For each sample, p and x are captured approximately at the same time. The image x is captured from one of the cameras mounted around the ego vehicle. The annotations in A consists of classified 2D and 3D bounding boxes. There are ten different classes, where background is one of them. Since the point clouds and the images are captured at the same moment, it is possible to train one model with images and one model with point clouds with the same annotations. 23 3. Method 3.2 Primary Object Detection Network As the primary DNN, we will use a DNN based on ResNet [9] that is provided by Zenuity. For simplicity, we call the DNNMP where P stands for primary. The primary DNNMP is trained solely on annotated images and outputs both predicted 2D and 3D bounding boxes. The 2D boxes are defined on the format (x, y, w, h)T where (x, y)T defines the center point (in pixels) and (w, h)T defines the dimension (width and height in pixels) of the box. 
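Before moving on to the 3D box format described next, here is a minimal sketch (illustrative, not from the thesis) of the 2D center-format box just described and its conversion to corner coordinates, which is the form typically used when computing the 2D IoU for matching in Section 3.5.1. The class name Box2D and the numbers are hypothetical.

from dataclasses import dataclass

@dataclass
class Box2D:
    # Center-format 2D box as output by the primary network: (x, y) is the
    # center point in pixels, (w, h) the width and height in pixels.
    x: float
    y: float
    w: float
    h: float

    def to_corners(self):
        # Convert to (x1, y1, x2, y2) corner format, e.g. for IoU computation.
        return (self.x - self.w / 2, self.y - self.h / 2,
                self.x + self.w / 2, self.y + self.h / 2)

print(Box2D(x=320.0, y=240.0, w=100.0, h=60.0).to_corners())
# -> (270.0, 210.0, 370.0, 270.0)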
The 3D boxes are defined on the format (x, y, z, w, h, l, r)^T, where (x, y, z)^T defines the location (center point), (w, h, l)^T defines the dimensions (width, height and length in meters) and r defines the rotation around the y-axis (in radians), also referred to as yaw. Rotation around the x-axis and rotation around the z-axis are assumed to be negligible. An illustration of a 3D bounding box can be seen in Figure 3.1.

Figure 3.1: An illustration of a 3D bounding box with axis directions. The coordinate system is in meters and the position of the camera is used as the origin.

In the camera's coordinate system, the x-axis points to the right, the y-axis points down and the z-axis points forward (see Figure 3.1). For example, an object located at (x, y, z) = (10, −1, 10) is 10 meters to the right of, 1 meter above, and 10 meters in front of the camera. The rotation is defined relative to the ego vehicle. To exemplify, r = 0 means the object is facing in the same direction as the ego vehicle, r = π/2 means the object is facing to the right, and r = π means the object is facing in the opposite direction of the ego vehicle.

3.3 Secondary Object Detection Network

The secondary DNN should in principle be a DNN with high performance in 3DOD. A DNN with point clouds as input is preferable in order to get high performance on the 3D parameters, as described in Section 2.3. We call the secondary DNN M_S, where S stands for secondary. Two different versions of M_S are used to evaluate the performance of the discrepancy measure. The two DNNs are described in the following subsections.

3.3.1 Simulated High Performance Network

To mimic a DNN with high performance, we use the annotations as if they were detections from M_S. Using the annotations to select samples for human annotation is of course not possible in practice, since the annotations do not exist before they are created by a human. However, the results can still show that the discrepancy measure is worth using if a high-performance DNN is used as M_S. We call the simulated high-performance DNN M_A, where A stands for annotation.

3.3.2 Camera-LiDAR Fusion Network

To evaluate the practical potential of the discrepancy measure, a real camera-LiDAR fusion DNN is used. The DNN is based on a DNN developed in a previous thesis at Zenuity [20]. It is a fusion DNN, which in this case means that it is trained on both images and point clouds. We call the DNN M_F, where F stands for fusion. M_F is developed to be robust against sensor degradation, i.e. if one of the sensors (camera or LiDAR) stops working, the DNN should still be able to perform object detection. In other words, M_F still functions using solely point clouds as input, which makes it possible to investigate whether the fusion part, i.e. using both images and point clouds, is important for calculating discrepancy in AL. The architecture of M_F is explained in Section A.2 and modifications to M_F are explained in Section A.3.

3.4 Modified Active Learning Algorithm

We base our experiments on the pool-based AL algorithm described in Section 2.2. As mentioned in Section 2.2, the budget is denoted b, the dataset with annotations is denoted A and the dataset without annotations is denoted N. We also add a new annotated set V, which is our validation set. Apart from using M_P as M, a few other modifications are made.
These modifications are underlined in the following algorithm.

1. Train M_P and M_S on A.
2. Evaluate M_P on V.
3. Apply M_P and M_S on N and store the output.
4. Score the output from step 3 using a discrepancy measure between the detections of M_P and M_S.
5. Pick b samples from N according to some selection strategy based on discrepancy and send them for annotation.
6. Add the annotated data to A and remove it from N.
7. Repeat from step 1.

First, we train not only M_P on A, but also M_S. After each training, we evaluate the performance of M_P on V for comparison between AL iterations. An AL iteration is one iteration of steps 1-6. In step 4, we use our discrepancy measure as informativeness score. Lastly, we select samples based on discrepancy using some selection strategy.

3.5 Main Approach

Two different approaches to quantifying discrepancy are tested. We call them Main Approach (MA) and Alternative Approach (AA). Both approaches need to

1. match detections from M_P and M_S (as mentioned in Section 2.5),
2. quantify the discrepancy between the matched detections, and
3. include unmatched detections in the measure.

This section describes in detail how MA matches detections and calculates the total discrepancy for a sample. AA is heavily dependent on MA, which is why it is important to first read and understand this section before reading Section 3.6.

3.5.1 Matching

Let X_P = {X_{P,1}, ..., X_{P,n}} and X_S = {X_{S,1}, ..., X_{S,k}} be two sets of detections for sample X coming from M_P and M_S, respectively. Before using the Hungarian method to find the best matchings, some detections are disregarded. Since M_S is supposed to detect objects in point clouds, there could be detections that are not visible in the corresponding image. M_P, however, only uses images and will never be able to detect anything that is not in the image. Therefore, any detection in X_S that does not project into the image plane is left out.

We now define three matrices, M_IoU, M_obj and M_depth, that are used to calculate the matrix M used by the Hungarian method to solve the 2DA. We define

M_IoU = \begin{pmatrix} IoU_{2D}(X_{P,1}, X_{S,1}) & \cdots & IoU_{2D}(X_{P,1}, X_{S,k}) \\ \vdots & \ddots & \vdots \\ IoU_{2D}(X_{P,n}, X_{S,1}) & \cdots & IoU_{2D}(X_{P,n}, X_{S,k}) \end{pmatrix}    (3.3)

to be the n × k IoU matrix that contains the 2D IoU scores between all possible combinations of the detections in X_P and X_S. If M_S does not output 2D boxes, the 2D box is created by bounding the 3D box projected into the image plane.

Furthermore, we define p_0 to be the probability of a detection being background, i.e. the probability that the detection is none of the 9 other classes. The objectness probability is defined as 1 − p_0, i.e. the probability of the detection being any of the 9 other classes. Then M_obj is the n × k objectness matrix between X_P and X_S, defined as

M_obj = \begin{pmatrix} 1 - p_{0,P,1} \\ \vdots \\ 1 - p_{0,P,n} \end{pmatrix} \begin{pmatrix} 1 - p_{0,S,1} & \cdots & 1 - p_{0,S,k} \end{pmatrix}    (3.4)

i.e. the element at position (i, j) is the product of the objectness probabilities of detection i from X_P and detection j from X_S. Furthermore, we define the relative depth difference between detections X_{P,i} and X_{S,j} to be

\Delta^z_{ij} = \frac{\min(z_{P,i}, z_{S,j})}{\max(z_{P,i}, z_{S,j})}    (3.5)

where z_{P,i} and z_{S,j} are the center point z-coordinates of detections X_{P,i} and X_{S,j}, respectively. M_depth is then defined as

M_depth = \begin{pmatrix} \Delta^z_{11} & \cdots & \Delta^z_{1k} \\ \vdots & \ddots & \vdots \\ \Delta^z_{n1} & \cdots & \Delta^z_{nk} \end{pmatrix}.    (3.6)

The three matrices are multiplied together using the Hadamard product ∘ (element-wise multiplication) to form the score matrix

M = M_IoU ∘ M_obj ∘ M_depth.    (3.7)
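As a concrete illustration of (3.3)-(3.7), the sketch below assembles M_IoU, M_obj and M_depth from cached detections and combines them with the Hadamard product. The array layout and the helper iou_2d are illustrative assumptions and not part of the thesis codebase.

```python
import numpy as np

def iou_2d(box_a, box_b):
    """2D IoU between boxes given as (x_center, y_center, w, h) in pixels."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def score_matrix(boxes_p, boxes_s, p0_p, p0_s, z_p, z_s):
    """Builds M = M_IoU ∘ M_obj ∘ M_depth for one sample.

    boxes_p, boxes_s: (n, 4) and (k, 4) arrays of 2D boxes
    p0_p, p0_s:       background probabilities of each detection
    z_p, z_s:         center-point z-coordinates (depth) of each detection
    """
    m_iou = np.array([[iou_2d(bp, bs) for bs in boxes_s] for bp in boxes_p])   # (3.3)
    m_obj = np.outer(1.0 - np.asarray(p0_p), 1.0 - np.asarray(p0_s))           # (3.4)
    m_depth = np.minimum.outer(z_p, z_s) / np.maximum.outer(z_p, z_s)          # (3.5)-(3.6)
    return m_iou * m_obj * m_depth, m_iou, m_depth                             # (3.7)
```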
The purpose of the IoU matrix M_IoU is to make matched detections close to each other in position, which is the most crucial part. The objectness matrix M_obj aims to ensure that both detections in a match have high confidence of detecting a real-world object; otherwise, there is a risk of matching a false positive. Finally, M_depth prevents the depth difference between detections in a match from being unrealistically large, which is possible if matching with respect to 2D IoU only.

The final step before solving the 2DA with the Hungarian method on M is to filter out any matchings that are not feasible. This is determined using two thresholds, τ_IoU and τ_depth. If an element at position (i, j) in M_IoU is smaller than τ_IoU, the corresponding element m_ij in M is set to 0. The same applies to M_depth using τ_depth. Before running the Hungarian method, dummy rows or columns are added as described in Section 2.5.1 if M is not a square matrix. Let n' = max(n, k). The 2DA is originally formulated to minimize the total cost of the problem (see equation (2.9)). In this case, we want to maximize the total matching score, i.e.

maximize   \sum_{i=1}^{n'} \sum_{j=1}^{n'} y_{ij} m_{ij}    (3.8a)
subject to \sum_{i=1}^{n'} y_{ij} = 1,  ∀ j ∈ {1, ..., n'},    (3.8b)
           \sum_{j=1}^{n'} y_{ij} = 1,  ∀ i ∈ {1, ..., n'},    (3.8c)
           y_{ij} ∈ {0, 1},  ∀ i, j ∈ {1, ..., n'}    (3.8d)

where y_ij = 1 if the ith object in X_P is assigned to the jth object in X_S. To be able to use the Hungarian method for finding the best matchings, we need to convert the problem into a minimization problem. The objective function in (3.8a) is converted to

minimize \sum_{i=1}^{n'} \sum_{j=1}^{n'} y_{ij} \left( \big(\max_{(i,j)} m_{ij}\big) - m_{ij} \right).    (3.9)

Now, the previously largest value in M has become the smallest and the previously smallest values have become the largest. From the Hungarian method, we receive pairs of indices of the form (i, j), where i is the index of a detection from X_P and j the index of a detection from X_S. Any pair where i > n or j > k is a match involving dummy rows or dummy columns and is discarded. Furthermore, three conditions need to hold for a match (i, j) to be valid:

1. The element m_ij in M needs to be > 0.
2. The element at position (i, j) in M_IoU needs to be ≥ τ_IoU.
3. The element at position (i, j) in M_depth needs to be ≥ τ_depth.

Any match that does not fulfill these conditions is discarded. The matchings that are not discarded are put in a set

X_match = {(X_{P,a}, X_{S,b}), ..., (X_{P,c}, X_{S,d})}    (3.10)

where a and b are indices of detections from X_P and X_S, respectively, that are matched to each other.

Since M_S uses point clouds, which have a finite range, there might be detections from M_P that can never be matched to anything from M_S because their position is beyond the range of the point cloud. These unmatched detections from M_P would lead to a non-constructive discrepancy, which is why they are left out. This is done by filtering out any unmatched detections from M_P that have a z-coordinate further away than δ_z, where δ_z is set to the maximum range of the LiDAR sensor. The numbers of detections from X_P and X_S that remain unmatched after this filtering are denoted µ_P and µ_S, respectively. µ_P is interpreted as the number of objects that M_P detects but M_S does not, and µ_S as the number of objects that M_S detects but M_P does not. Both µ_P and µ_S are included in the discrepancy measure, under the assumption that µ_P is the number of false detections and µ_S is the number of detections that M_P missed.
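The matching step itself can be sketched with an off-the-shelf assignment solver. The following is a minimal sketch using SciPy's linear_sum_assignment; the threshold values are placeholders, not the thesis settings, and maximize=True replaces the explicit conversion in (3.9) while SciPy handles rectangular matrices internally, so no dummy rows or columns are needed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(m, m_iou, m_depth, tau_iou=0.1, tau_depth=0.5):
    """Solves the 2DA on the score matrix M and filters the matches.

    m, m_iou, m_depth: (n, k) matrices, e.g. from score_matrix() above
    tau_iou, tau_depth: feasibility thresholds (placeholder values)
    Returns a list of valid (i, j) index pairs.
    """
    m = m.copy()
    m[(m_iou < tau_iou) | (m_depth < tau_depth)] = 0.0    # infeasible pairs get score 0

    rows, cols = linear_sum_assignment(m, maximize=True)  # Hungarian-style assignment

    matches = []
    for i, j in zip(rows, cols):
        # Conditions 1-3 for a valid match.
        if m[i, j] > 0 and m_iou[i, j] >= tau_iou and m_depth[i, j] >= tau_depth:
            matches.append((i, j))
    return matches
```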
By letting µ_P affect the discrepancy measure, we intend to select samples that increase precision performance. Similarly, we intend to select samples that increase recall performance by letting µ_S affect the discrepancy measure. We refer to Section 3.5.3 for details regarding how the discrepancy measure is affected by µ_P and µ_S.

3.5.2 Discrepancy between a Pair of Matched Detections

We have two detections, X_P and X_S, coming from M_P and M_S, respectively. The discrepancies in attributes, i.e. position, dimension, and rotation, are taken into account by calculating the absolute differences between the attributes of X_P and X_S. For a detection X, we define the attributes of X as

X^attr = (x, y, z, h, w, l, r)^T    (3.11)

where (x, y, z) is the center point, (h, w, l) the dimensions and r the rotation around the y-axis (yaw) of X (as described in Section 3.2). The rotation is modified as r := r mod π such that r ≥ 0. The absolute differences between the attributes of X_P and X_S are then

|X^attr_P − X^attr_S| = (Δx, Δy, Δz, Δh, Δw, Δl, Δr)^T.    (3.12)

To avoid collisions, objects closer to the ego vehicle are often more important to predict accurately than objects further away. Therefore, we define the relative depth difference as Δz_rel = Δz / max(z_P, z_S). Since the output from M_S could be invariant to heading direction, as in M_F (see (A.8)), the rotation difference needs to be invariant as well. We define the invariant rotation difference as Δr_inv = min(Δr, π − Δr). The maximum difference in rotation is therefore π/2, as seen in Figure 3.2. We define the difference between the attributes as

Δ^attr = (Δx, Δy, Δz_rel, Δh, Δw, Δl, Δr_inv)^T.    (3.13)

Figure 3.2: Illustration of the rotation difference between two bounding boxes. The difference is largest when the bounding boxes are orthogonal.

The different components of Δ^attr differ in magnitude, which would make it hard to combine them into a single measure. Therefore, we normalize Δ^attr using a vector

β = (β_x, β_y, β_z, β_h, β_w, β_l, β_r)^T    (3.14)

containing maximum difference values that are set based on observations in the dataset D. The normalized vector of attribute differences is the attribute discrepancy vector

d^attr = Δ^attr ⊘ β = (d_x, d_y, d_z, d_h, d_w, d_l, d_r)^T    (3.15)

where ⊘ is the Hadamard (element-wise) division operator.

Furthermore, to include the discrepancy in class prediction, we calculate the L1 norm of the difference between the probability vectors of X_P and X_S. The probability vector of a detection X is

X^prob = (p_0, p_1, ..., p_9)^T    (3.16)

where p_0 is the probability of X being background and p_1, ..., p_9 are the probabilities for the 9 other classes. The L1 norm of the difference (the probability discrepancy) is

d_prob = ||X^prob_P − X^prob_S||_1.    (3.17)

Combined into one column vector, the attribute discrepancies and the probability discrepancy are denoted

d = (d^attr, d_prob)^T = (d_x, d_y, d_z, d_h, d_w, d_l, d_r, d_prob)^T.    (3.18)
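For one matched pair, the discrepancy vector in (3.18) can be computed as in the minimal sketch below. The function name and argument layout are assumptions made for illustration; the normalization constants beta correspond to the vector β in (3.14).

```python
import numpy as np

def pair_discrepancy(attr_p, attr_s, prob_p, prob_s, beta):
    """Discrepancy vector d for one matched pair, following (3.11)-(3.18).

    attr_p, attr_s: length-7 arrays (x, y, z, h, w, l, r) of the two detections
    prob_p, prob_s: length-10 class probability vectors (background first)
    beta:           length-7 normalization constants (3.14)
    """
    attr_p, attr_s = np.asarray(attr_p, float), np.asarray(attr_s, float)
    diff = np.abs(attr_p - attr_s)                                    # (3.12)

    # Relative depth difference and heading-invariant rotation difference (3.13).
    diff[2] = diff[2] / max(attr_p[2], attr_s[2])
    d_r = abs(attr_p[6] % np.pi - attr_s[6] % np.pi)
    diff[6] = min(d_r, np.pi - d_r)

    d_attr = diff / np.asarray(beta, float)                           # (3.15)
    d_prob = np.abs(np.asarray(prob_p) - np.asarray(prob_s)).sum()    # L1 norm, (3.17)
    return np.append(d_attr, d_prob)                                  # (3.18)
```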
3.5.3 Discrepancy for a Sample

Let

k' = |X_match|    (3.19)

be the total number of pairs of matched detections for sample X. Then, let

D = (d_1 ... d_{k'})    (3.20)

be the matrix containing all column vector discrepancies (3.18) for the matched detections of X.

We want to avoid selecting only samples that contain many objects, since they take more time to annotate. Samples with many objects are therefore often more expensive to annotate than samples with fewer objects. To prevent over-selection of samples where k' is large, we take the row-wise average of D. If a row-wise sum were used instead, the discrepancy of a sample would be directly correlated with the number of objects, and samples that are expensive to annotate would be selected first. Thereby, let rowavg be the function that calculates the column vector containing the row-wise average values of a matrix, and let

d̄ = rowavg(D).    (3.21)

Now, let

w = (w_x, w_y, w_z, w_h, w_w, w_l, w_r, w_prob, w_P, w_S)^T    (3.22)

be a vector of weights. The weighted discrepancy vector is then calculated as

d_w = w ∘ (d̄^T, f(µ_P), f(µ_S))^T    (3.23)

where f is a function determining the discrepancy for unmatched objects. We consider w and f to be parameters that can be set to change the behavior of the algorithm. The parameter settings we use in our experiments are explained in Section 4.4. The final step is to combine all the components into a single scalar value, which is done using a squared L2 norm, i.e.

d = ||d_w||_2^2    (3.24)

where d denotes the total discrepancy value for sample X.

3.5.4 Main Approach Examples

Figure 3.3 shows an example of a sample with a low discrepancy value and Figure 3.4 an example of a sample with a higher discrepancy value. In Appendix A.1, the matrices M_IoU, M_obj and M_depth corresponding to the problem instance in Figure 3.3 are shown, as well as the matrix M used for solving the 2DA with the Hungarian method.

3.6 Alternative Approach

AA is based on the assumption that a good match between detections is a match with low discrepancy. Instead of matching the detections using M_IoU ∘ M_obj ∘ M_depth, as in MA, the detections are essentially matched using the discrepancy measure explained in Section 3.5.2. The Hungarian method is used to match detections such that the total discrepancy is minimized.

Let X_P = {X_{P,1}, ..., X_{P,n}} and X_S = {X_{S,1}, ..., X_{S,k}} be two sets of detections for sample X coming from M_P and M_S, respectively. Then

d_ij = ||w_matched ∘ d_ij||_2^2    (3.25)

is the discrepancy between detections X_{P,i} and X_{S,j}, where d_ij on the right-hand side is the vector of attribute and probability discrepancies between X_{P,i} and X_{S,j} according to (3.18). w_matched is defined as w from (3.22) without the last two elements, w_P and w_S. We then define the discrepancy matrix to be

M_disc = \begin{pmatrix} d_{11} & \cdots & d_{1k} \\ \vdots & \ddots & \vdots \\ d_{n1} & \cdots & d_{nk} \end{pmatrix}.    (3.26)

Figure 3.3: Example of a sample where the discrepancy is small (detections from M_P and M_F). There is only a small difference in depth between the matched detections, and the largest differences are for detections far away from the ego vehicle, meaning small discrepancy since the depth difference is relative.

Figure 3.4: Example of a sample where the discrepancy is large (detections from M_P and M_F). M_F is more accurate in position than M_P, as seen in the BEV point cloud to the right.

For solving the 2DA using the Hungarian method, we then construct M by

M = M_disc ⊘ (1 − M_IoU)    (3.27)

where M_IoU is the IoU matrix as described in Section 3.5.1. Since we want to minimize the total discrepancy and since we use 1 − M_IoU instead of M_IoU, the 2DA on M is a minimization problem. Similar to Section 3.5.1, if an element at position (i, j) in M_IoU or M_depth is smaller than τ_IoU or τ_depth, respectively, the corresponding element m_ij in M is changed. Since M now corresponds to a minimization problem, the values are not set to 0 as in Section 3.5.1 but instead to a very large value, e.g. 99999.
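The construction of the AA cost matrix in (3.25)-(3.27) can be sketched as below. It reuses the per-pair vectors from (3.18) and the M_IoU and M_depth matrices from Section 3.5.1; the threshold values and the element-wise division in (3.27), taken as the ⊘ operator defined earlier, are assumptions of this sketch rather than verified thesis settings.

```python
import numpy as np

BIG = 99999.0  # cost for infeasible pairs, as in the text

def aa_cost_matrix(d_vectors, m_iou, m_depth, w_matched,
                   tau_iou=0.1, tau_depth=0.5):
    """Cost matrix M for the Alternative Approach, following (3.25)-(3.27).

    d_vectors: (n, k, 8) array of per-pair discrepancy vectors from (3.18)
    m_iou, m_depth: (n, k) matrices from Section 3.5.1
    w_matched: length-8 weight vector (w without w_P and w_S)
    tau_iou, tau_depth: placeholder thresholds
    """
    m_disc = np.sum((d_vectors * w_matched) ** 2, axis=-1)        # (3.25), squared L2 norm
    m = m_disc / np.clip(1.0 - m_iou, 1e-6, None)                 # (3.27), avoid divide-by-zero
    m[(m_iou < tau_iou) | (m_depth < tau_depth)] = BIG            # infeasible pairs get a large cost
    return m
```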
Note that M is changed for elements where M_depth is smaller than τ_depth even though M_depth is not used to calculate M. Dummy rows and columns are added if necessary before using the Hungarian method to solve the 2DA on M. The returned pairs of indices from the Hungarian method are filtered in the same way as in Section 3.5.1. Let E_matched be the set containing the filtered and matched indices for the detections from X_P and X_S. That is, (i, j) ∈ E_matched means that the detection with index i in X_P is matched to the detection with index j in X_S. The discrepancy of matched detections, d_matched, is then

d_matched = \frac{1}{k'} \sum_{(i,j) \in E_matched} m_{ij}    (3.28)

where k' = |E_matched| and m_ij is the element at position (i, j) in M. The discrepancy of unmatched detections, d_unmatched, is

d_unmatched = ||(w_P, w_S)^T ∘ (µ_P, µ_S)^T||_2^2.    (3.29)

The total discrepancy for sample X is then the sum of d_matched and d_unmatched,

d = d_matched + d_unmatched.    (3.30)

3.7 Discrepancy Measure Development Process

Sections 3.5 and 3.6 describe two approaches to quantifying discrepancy. Both are built around the same central measure of discrepancy. The different components of the discrepancy measure, as well as their parameters, are chosen through a development process that relies heavily on subjective decisions. The process involves a small set of (∼400) randomly selected samples and, for each sample, two sets of cached detections coming from M_P and M_F, respectively. The samples and their detections are then all processed using MA or AA (depending on what is being developed) and each sample is assigned a discrepancy. By manually looking at the images (example images are shown in Figure 3.5) and subjectively relating their discrepancy to how valuable they are believed to be for M_P to train on, design choices are made and parameter values are set. The goal is for the discrepancy measure to assign a large discrepancy to samples that M_P is believed to benefit from training on.

Figure 3.5: Example images (detections from M_P and M_F) used when developing the discrepancy measure. Discrepancy values for each pair of matched detections are shown to the left. The dark boxes in the bottom left of the image show unweighted and weighted mean discrepancy values.

3.8 Selection Strategy

After the discrepancy for each sample in the non-annotated dataset has been calculated, the samples to send for annotation are selected. We compare two different selection strategies, described in the following subsections.

3.8.1 Standard Selection

First, N is sorted according to the discrepancy. The b samples from N with the largest discrepancy are selected for annotation, as in Section 2.2. We refer to this selection strategy as Standard Selection (SS). By only considering the largest discrepancy when selecting, the resulting set of selected samples is not necessarily diverse. For example, suppose we evaluate on a dataset consisting solely of front camera images. If the selected set of samples consists solely of back camera images, the DNN will most likely perform poorly. However, selecting only front camera images is not optimal either in this case, as there could still be valuable information in the other camera angles. A more diverse set of samples can be achieved by selecting samples from each camera angle, as described in the next subsection.

3.8.2 Camera Angle Selection

As in SS, Camera Angle Selection sorts N according to discrepancy. Still, b samples are selected for annotation, but with restrictions.
Only a fraction of b can be selected for each camera angle, and the fractions are fixed and determined before the AL run begins. For example, suppose b must contain 50% front camera images and 50% back camera images. The b samples will then consist of the b/2 front camera images with the largest discrepancy and the b/2 back camera images with the largest discrepancy. We refer to this selection strategy as Camera Angle Selection (CAS).

4 Experiment Settings

This chapter explains in detail how the experiments are set up. We select subsets of the whole dataset to speed up experiments; this is explained in Section 4.1. Furthermore, we describe how the method is evaluated and compared to other methods in Section 4.2. We give details on the settings of the AL algorithm in Section 4.3, and the parameter settings for the discrepancy measure are explained in Section 4.4.

4.1 Experiment Data Selection

Our experiments are done on D (see Section 3.1), which contains around 80 000 annotated samples. We use a 90/10 train/validation split when creating the training set D_T and the validation set D_V. The split is based on the date when the data was gathered, instead of splitting randomly, to avoid getting too similar images in training and validation. The training set D_T contains images taken from several camera angles. However, D_V contains images taken only from front cameras. The reason is that front cameras are more important than any other camera angle in a self-driving car.¹

All data in D have annotations. For the experiments, some of that data is treated as if it does not have annotations, as described in Section 2.2.2. In that case, the annotations are simply hidden, i.e. not used by the algorithm, and are revealed upon the AL algorithm's request. All experiments are initialized with N as a random subset of D_T of size 30000. The reduction in size speeds up the experiments and makes it easier to iterate over different methods. A random subset of N of size 3000 is then moved to A (and the annotations are revealed) as an initial training set for the AL algorithm. Furthermore, 1000 randomly selected samples from D_V are moved to a set V, which is used in the experiments as the validation set.

The random selections that produce the initial data for the non-annotated dataset N, the annotated dataset A, and the validation dataset V are provided with a random seed. This enables us to reproduce experiments with the same initial datasets by using identical random seeds. It also enables cross-evaluation of our methods with different initial datasets using different random seeds.

¹ Just consider the fraction of the time you are looking forward while driving.

4.2 Evaluation

We evaluate our method by measuring Key Performance Indicator (KPI) gains of M_P when evaluating on V. Our main KPIs are AP and F1 score. AP and F1 score both depend on IoU, and the IoU between 3D bounding boxes is often close to zero since the position is not always accurately predicted by M_P. Therefore, both KPIs are calculated using IoU between 2D bounding boxes. AP is calculated as in the VOC2007 challenge, i.e. (2.26), but with 101 recall points, meaning the interpolated precision is evaluated at r = 0, 0.01, 0.02, ..., 0.99, 1. The confidence is calculated as 1 − p_0, meaning any class that is not background is taken into account. The F1 score is calculated for 101 different confidence thresholds: 0, 0.01, 0.02, ..., 0.99, 1.
A detection is only counted if its confidence is above the threshold. The maximum F1 score over all 101 thresholds is presented.

Since AP and F1 score are insensitive to 3D position, it is of interest to see whether the performance in position improves. As mentioned in Section 2.3, predicting an object's position in a point cloud is less difficult than in an image. Therefore, M_S most likely predicts the position of a detection more accurately than M_P, and a large discrepancy in position could expose samples where M_P is inaccurate. We measure the Mean Absolute Error (MAE) in center point coordinates between detections and the annotated objects. As objects closer to the ego vehicle are more important to predict accurately, we calculate a relative position error. For a given detection X and an annotation Y with positions

X^pos = (x^X, y^X, z^X)^T,  Y^pos = (x^Y, y^Y, z^Y)^T    (4.1)

respectively, we have the relative position error

ε = \frac{||Y^pos − X^pos||_2}{||Y^pos||_2}    (4.2)

where the denominator is the distance from the ego vehicle to the annotated object, since the ego vehicle is the origin (see Section 3.1). We calculate the MAE in position as

MAE_pos = \frac{1}{N} \sum_{i=1}^{N} ε_i    (4.3)

where N is the total number of true positives.

4.2.1 Baselines

We compare our method to a main baseline based on selecting samples randomly. The main baseline is defined as the average of 3 independent runs (with the same initial data as described in Section 4.1). We also compare to uncertainty sampling with entropy (see Section 2.2). Each sample is scored according to the formula

− \sum_{i=0}^{n} \sum_{j=0}^{9} p_{j,P,i} \log(p_{j,P,i})    (4.4)

where p_{j,P,i} denotes the probability of class j for object i. The samples with the largest entropy are selected for annotation.

4.3 Active Learning Setting

The budget b is set to 3000 samples, with the motivation that a sufficient number of samples need to be added in order to make a difference to the model parameters. The experiment is run for 5 AL iterations, meaning M_P is first trained on the initial dataset A and then continues training on 6000, 9000, 12000, 15000 and lastly 18000 samples. Each AL iteration trains for 100 epochs, including the training on the initial training set. 100 epochs do not guarantee total convergence, but they keep the experiment time low and we believe they are enough to measure the performance of an AL iteration. Evaluation, i.e. KPI calculation, on V runs every 5 epochs. For each KPI, the value for an AL iteration is the mean value of the last 5 evaluations, i.e. the last 25 epochs.

4.4 Discrepancy Measure Parameter Setting

Weights are set to define the importance of each attribute. By looking at examples like Figure 3.5, as described in Section 3.7, weights can be set to determine whether the discrepancy for a specific sample should be small or large. Furthermore, by setting an attribute weight to zero, we can investigate whether the attribute has a positive or negative effect when selecting samples. The parameters are set to

β = (β_x, β_y, β_z, β_h, β_w, β_l, β_r)^T = (80, 10, 1, 5, 5, 50, π/2)^T,
w = (w_x, w_y, w_z, w_h, w_w, w_l, w_r, w_prob, w_P, w_S)^T = (1, 0.125, 0.6, 1, 1, 1, 0.25, 0.1, 0.005, 0.016)^T,
f(µ) = µ,  δ_z = 160.    (4.5)

An object in D captured in an image has its center point approximately located in a range of [−40, 40] meters in the x-coordinate and [−5, 5] meters in the y-coordinate.
Since Δz is relative and already ranges between 0 and 1, we set β_z = 1. The maximum difference values for the center point are therefore set to (β_x, β_y, β_z) = (80, 10, 1). For the maximum difference in dimensions, (β_h, β_w, β_l) = (5, 5, 50) is the theoretical maximum size of a large object in D, e.g. a train. The maximum difference in rotation is β_r = π/2, as explained in Section 3.5.2.

The choice of weights depends on the discrepancy between M_P and M_F detections. There is a larger difference in the z-coordinate, rotation r, and probability than in the other attributes. Therefore, we set w_z, w_r and w_prob smaller in order to put more focus on the other attributes. Almost all objects are located at the same height relative to the ego vehicle, which makes the y-coordinate quite easy to predict. Since the y-coordinate is not as important to improve as the other coordinates, we set w_y to be small. To have a linear growth for unmatched objects, we set f(µ) = µ. We set w_S > w_P since we value recall higher than precision. Furthermore, we set δ_z = 160, i.e. the maximum range of the LiDAR sensor, because M_F does not detect objects where z > 160.

We also conduct experiments using f(µ) = 1.1^µ to investigate how an exponential function affects the performance of the method. When the exponential function is tested, the weights w_P = 0.01 and w_S = 0.05 are used.

5 Results

This chapter presents our results, beginning with using annotations as outputs for M_S in Section 5.1 and continuing with using M_F in a real-case scenario in Section 5.2.

5.1 Annotation Discrepancy

As described in Section 3.3.1, the first experiments measure the discrepancy between the detections from M_P and M_A, i.e. the annotations, to verify the approach before using a real DNN as M_S. In Sections 5.1.1 to 5.1.4, the Main Approach (MA) is tested and different parameter settings are compared. Then, in Section 5.1.3, MA and AA are compared. Finally, the discrepancy method is compared to the entropy and random baselines in Section 5.1.5. The presented KPIs are AP, F1 score for pedestrians, bicyclists, and vehicles, as well as position MAE. All figures contain two plots, showing the results for two different random seeds for the initial data selection. Each plotted line is the average of 3 independent runs with equal settings. The lines are accompanied by limits indicating the maximum and minimum values observed for the 3 runs.

Figure 5.1: AP vs. number of annotated samples. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. (a) Random seed 1; (b) Random seed 2.

5.1.1 Selection Strategy

As described in Section 3.8, two different selection strategies are tested, Camera Angle Selection (CAS) and Standard Selection (SS). Looking at AP (Figure 5.1), it is quite clear that CAS contributes to better performance than SS. Position MAE (Figure 5.2) and F1 score for vehicles (Figure 5.3) also indicate better performance using CAS than SS. Furthermore, the F1 scores for pedestrians (Figure 5.4) and bicyclists (Figure 5.5) fluctuate too much to give a clear picture of the differences between CAS and SS.
We can, however, note that during the last 2 AL iterations, CAS is better than SS on F1 score for pedestrians.

5.1.2 Unmatched Detections

Besides comparing CAS to SS, we also compare using f(µ) = µ to using f(µ) = 1.1^µ, i.e. linear or exponential growth with the number of unmatched objects. AP (Figure 5.1) shows almost no difference between a linear and an exponential f. Position MAE (Figure 5.2) shows similar results. The average values of position MAE hint that a linear f using CAS is the best approach; however, that observation is considered to be within the margin of error. We consider the exponential function to be the better choice, with the note that the differences are minimal.

Figure 5.2: Position MAE vs. number of annotated samples. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. (a) Random seed 1; (b) Random seed 2.

Figure 5.3: F1 score for vehicles vs. number of annotated samples. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. (a) Random seed 1; (b) Random seed 2.

Figure 5.4: F1 score for pedestrians vs. number of annotated samples. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. (a) Random seed 1; (b) Random seed 2.

Figure 5.5: F1 score for bicyclists vs. number of annotated samples. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. (a) Random seed 1; (b) Random seed 2.

5.1.3 Alternative Approach

Experiments with exponential growth for unmatched detections and CAS were made using AA to compare its performance to MA. As seen in Figure 5.6, the AP when using AA is similar to that when using MA. The same holds for position MAE, according to Figure 5.7, even though the results are spread out. Figure 5.8 shows a slightly better performance on F1 score for vehicles using MA. The F1 score for pedestrians (Figure 5.9) shows better performance using MA towards the last AL iterations, while the results for bicyclists (Figure 5.10) are rather ambiguous.
Figure 5.6: AP vs. number of annotated samples. Comparison between MA and AA. (a) Random seed 1; (b) Random seed 2.

Figure 5.7: Position MAE vs. number of annotated samples. Comparison between MA and AA. (a) Random seed 1; (b) Random seed 2.

Figure 5.8: F1 score for vehicles vs. number of annotated samples. Comparison between MA and AA. (a) Random seed 1; (b) Random seed 2.

Figure 5.9: F1 score for pedestrians vs. number of annotated samples. Comparison between MA and AA. (a) Random seed 1; (b) Random seed 2.

Figure 5.10: F1 score for bicyclists vs. number of annotated samples. Comparison between MA and AA. (a) Random seed 1; (b) Random seed 2.

5.1.4 Position-only Discrepancy Measure

In Section 4.4 we mention that weights can be set to zero to evaluate their importance for the discrepancy measure. One such experiment has been conducted where all weights except w_x, w_y, w_z are set to 0. In this configuration, dimension and rotation are ignored, and neither probability discrepancy nor unmatched detections are measured. The experiment aims to investigate whether the performance of position estimation increases if we only consider samples where the position discrepancy is large. It also indicates the influence that the other components, such as dimension and unmatched detections, have on the performance gain. Figure 5.11 shows that the AP performance is reduced marginally when using the position-only discrepancy measure, compared to when using the full discrepancy measure, i.e. where no weights are set to 0. Figure 5.12 shows similar results for position MAE.

Figure 5.11: AP vs. number of annotated samples. Comparison between full discrepancy and position-only discrepancy. (a) Random seed 1; (b) Random seed 2.
Figure 5.12: Position MAE vs. number of annotated samples. Comparison between full discrepancy and position-only discrepancy. (a) Random seed 1; (b) Random seed 2.

Figures 5.11 and 5.12 show performance with respect to the number of annotated samples. However, if the total annotation cost is to be reduced, it is also relevant to look at performance with respect to the number of annotated objects. Annotating a sample that contains many objects takes more time than annotating a sample with few objects. Therefore, Figure 5.13 shows AP per annotated 2D object (which includes 3D objects as well) and Figure 5.14 shows position MAE per annotated 3D object. These plots show that the discrepancy measure based solely on position achieves better performance with fewer annotated objects, i.e. most likely at a lower cost, than the full discrepancy measure.