Camera-LiDAR Active Learning for Object Detecting Deep Neural Networks
Master's thesis in Computer Science - algorithms, languages and logic

ALFRED GUNNARGÅRD
DANIEL ODIN

Department of Electrical Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2020

© Alfred Gunnargård, Daniel Odin, 2020.

Academic Supervisor and Examiner: Lennart Svensson, Electrical Engineering
Industrial Supervisors: Dónal Scanlan, Erik Werner, Adam Tonderski, Zenuity

Master's Thesis 2020
Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Bird's eye view of 3D detections in a LiDAR point cloud. Left: from a network trained on images. Right: from a network trained on LiDAR point clouds, which are more accurate in position.

Typeset in LaTeX
Gothenburg, Sweden 2020

Abstract

Deep Neural Networks (DNNs) processing images and other types of high-dimensional data need to train on large annotated datasets to reach usable performance. Retrieving annotations is expensive due to the required involvement of humans. To reduce annotation costs, it is desirable to annotate only those samples that help the DNN perform better, typically samples on which it currently performs poorly. Finding such samples is not trivial: in many cases, the output of a DNN is best evaluated by comparing it to an annotation made by a human, and since that annotation does not exist yet, some other type of metric is needed.

The process of choosing samples to annotate based on the performance of the DNN is called Active Learning (AL). We study AL for the task of 3D Object Detection (3DOD) in images, i.e. identifying and classifying objects in 3D space. Furthermore, we assume that all images are accompanied by a corresponding LiDAR point cloud. A LiDAR is a rotating sensor that builds a 3D model of its surroundings by measuring the time it takes for laser beams to reflect. The reflected beams result in multiple points in 3D space, referred to as a point cloud.

We quantify the performance of our DNN (referred to as the primary DNN) by using a secondary DNN that detects objects in point clouds. We then compare the detections from our primary DNN to the detections of the secondary DNN and score each sample according to a discrepancy measure. The discrepancy measure quantifies how much the detections from the two DNNs differ from each other for a specific sample, i.e. a pair of image and point cloud. The idea is that, because the secondary DNN is more accurate in three-dimensional position, a sample with high discrepancy indicates that the primary DNN performs poorly. We verify our approach using annotations instead of detections from a secondary DNN. According to some important metrics, we are able to perform slightly better than or on par with selecting samples randomly.
However, the improved performance over random selection comes from selecting samples that contain many objects, and that raises the annotation cost. We show that a slightly modified discrepancy measure is able to perform on par with random selection while maintaining the same annotation cost. Further research on how to properly measure this type of discrepancy is concluded to be necessary for the method to be of use in any real-case scenario. Such research includes experiments on an additional dataset. Keywords: Machine Learning, Active Learning, 3D Object Detection, LiDAR, Deep Neural Network v Acknowledgements We want to thank our industrial supervisors Dónal Scanlan, Erik Werner, and Adam Tonderski for their support throughout the thesis. Specifically, we want to thank Adam for his patience during technical questions and Erik for his brilliant ideas and feedback that helped us focus on the right things. We also want to thank Lennart Svensson, our academic supervisor, for his sup- port. We thank Lennart specifically for the support towards the end of the thesis when everything seemed dark and hopeless. Furthermore, we thank Zenuity for letting us do our Master’s Thesis at their company. During our first weeks, we were warmly welcomed and many people offered us help to get going and showed interest in our thesis. More specifically, we want to thank Brendan Barrick for his never-ending patience helping us with IT issues and Mats Nordlund for helping us to find a place to sit at AI Innovation of Sweden (until Covid-19 locked us into our homes). We also want to thank the people at AI Innovation for providing us with a nice working environment and delicious pastries (fika). Finally, we thank Niklas Gustafsson and Gustav Rödström that did their Master’s Thesis at Zenuity at the same time we did. They have kept us company during the time at Zenuity, been supportive, and provided feedback as well as suggestions to our study. Furthermore, they also provided a great opposition to our thesis with many interesting discussion subjects. Alfred Gunnargård, Daniel Odin, Gothenburg, June 2020 vii Contents List of Figures xi List of Tables xv 1 Introduction 1 1.1 Thesis Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Theory 5 2.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Examples of Informativeness Scores . . . . . . . . . . . . . . . 10 2.2.2 Evaluating an Active Learning Algorithm . . . . . . . . . . . . 11 2.3 LiDAR Sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.5 2D Assignment Problem . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.5.1 Hungarian Method . . . . . . . . . . . . . . . . . . . . . . . . 15 2.6 Matching Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.7 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
18 2.7.1 F1 Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.7.2 Average Precision . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.7.3 Mean Absolute Error . . . . . . . . . . . . . . . . . . . . . . . 21 3 Method 23 3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 Primary Object Detection Network . . . . . . . . . . . . . . . . . . . 24 3.3 Secondary Object Detection Network . . . . . . . . . . . . . . . . . . 25 3.3.1 Simulated High Performance Network . . . . . . . . . . . . . . 25 3.3.2 Camera-LiDAR Fusion Network . . . . . . . . . . . . . . . . . 25 3.4 Modified Active Learning Algorithm . . . . . . . . . . . . . . . . . . 25 3.5 Main Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.5.1 Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 ix Contents 3.5.2 Discrepancy between a Pair of Matched Detections . . . . . . 29 3.5.3 Discrepancy for a Sample . . . . . . . . . . . . . . . . . . . . 31 3.5.4 Main Approach Examples . . . . . . . . . . . . . . . . . . . . 32 3.6 Alternative Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.7 Discrepancy Measure Development Process . . . . . . . . . . . . . . . 35 3.8 Selection Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.8.1 Standard Selection . . . . . . . . . . . . . . . . . . . . . . . . 37 3.8.2 Camera Angle Selection . . . . . . . . . . . . . . . . . . . . . 37 4 Experiment Settings 39 4.1 Experiment Data Selection . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.3 Active Learning Setting . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.4 Discrepancy Measure Parameter Setting . . . . . . . . . . . . . . . . 41 5 Results 43 5.1 Annotation Discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.1.1 Selection Strategy . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.1.2 Unmatched Detections . . . . . . . . . . . . . . . . . . . . . . 44 5.1.3 Alternative Approach . . . . . . . . . . . . . . . . . . . . . . . 46 5.1.4 Position-only Discrepancy Measure . . . . . . . . . . . . . . . 48 5.1.5 Baseline Comparison . . . . . . . . . . . . . . . . . . . . . . . 50 5.2 Fusion Network Discrepancy . . . . . . . . . . . . . . . . . . . . . . . 54 6 Discussion 57 6.1 Dataset and Selection Strategy . . . . . . . . . . . . . . . . . . . . . 57 6.2 The Discrepancy Method . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.2.1 Main and Alternative Approach . . . . . . . . . . . . . . . . . 58 6.2.2 Unmatched Detections . . . . . . . . . . . . . . . . . . . . . . 58 6.2.3 F1 Score Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 6.2.4 Position-only Discrepancy Measure . . . . . . . . . . . . . . . 60 6.2.5 Discrepancy Aggregation for Matched Detections . . . . . . . 60 6.3 Entropy Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.4 Annotation Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.5 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 7 Conclusion 63 Bibliography 65 A Appendix 1 I A.1 Matching Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . I A.2 Architecture of Camera-LiDAR Fusion Network . . . . . . . . . . . . II A.3 Modifications in Camera-LiDAR Fusion Network . . . . . . . . . . . . 
IV x List of Figures 2.1 A visualisation of a neuron. Each input xi is multiplied with a weight wi and summed together. A bias term b is added to the sum. The final sum is the input for a non-linear activation function a to produce the final output ŷ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 An example of a fully connected feedforward network consisting of input layer x ∈ R5 and output layer ŷ ∈ R3. The hidden layer consists of 7 neurons. Each connection corresponds to a learnable weight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Overview of a pool-based AL algorithm for DNNs. . . . . . . . . . . . 9 2.4 Example of LiDAR point cloud in bird’s eye view with detections of objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.5 Example of 2D object detection. . . . . . . . . . . . . . . . . . . . . . 13 2.6 Example of 3D object detection. Predicted location together with dimension and facing direction in 3D space is visualized in bird’s eye view to the right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.7 An example of a bipartite graph. The 2D assignment problem consists of finding edges such that each vertex in V (1-4) is connected to exactly one vertex in U (5-9) and the sum of weights is minimized. . . 14 2.8 A visual equation for Intersection over Union (Jaccard Index) [1]. . . 18 2.9 Illustration of precision and recall which are common components for evaluation metrics in object detection [2]. . . . . . . . . . . . . . . . . 19 3.1 An illustration of a 3D bounding box with axis directions. . . . . . . 24 3.2 Illustration of rotation difference between two bounding boxes. The difference is largest when the bounding boxes are orthogonal. . . . . . 30 3.3 Example of sample where the discrepancy is small. There is only a small difference in depth between the matched detections and the largest difference are detections far away from the ego vehicle, mean- ing small discrepancy since the depth difference is relative. . . . . . . 33 3.4 Example of sample where the discrepancy is large. MF is more ac- curate in position than MP as seen in the BEV point cloud to the right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.5 Example images used when developing the discrepancy measure. Dis- crepancy values for each pair of matched detections are shown to the left. The dark boxes in the bottom left of the image show unweighted and weighted mean discrepancy values. . . . . . . . . . . . . . . . . . 36 xi List of Figures 5.1 AP. Comparison between SS and CAS as well as linear and exponen- tial growth for unmatched detections. . . . . . . . . . . . . . . . . . . 43 5.2 Position MAE. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. . . . . . . . . . . . 44 5.3 F1 score for vehicles. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. . . . . . . . 45 5.4 F1 score for pedestrians. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. . . . . . . 45 5.5 F1 score for bicyclists. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. . . . . . . . 45 5.6 AP. Comparison between MA and AA. . . . . . . . . . . . . . . . . . 46 5.7 Position MAE. Comparison between MA and AA. . . . . . . . . . . . 46 5.8 F1 score for vehicles. Comparison between MA and AA. . . . . . . . 47 5.9 F1 score for pedestrians. 
Comparison between MA and AA. . . . . . 47 5.10 F1 score for bicyclists. Comparison between MA and AA. . . . . . . . 47 5.11 AP. Comparison between full discrepancy and position-only discrep- ancy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.12 Position MAE. Comparison between full discrepancy and position- only discrepancy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.13 AP per annotated object. Comparison between full discrepancy and position-only discrepancy. . . . . . . . . . . . . . . . . . . . . . . . . 49 5.14 Position MAE per annotated object. Comparison between full dis- crepancy and position-only discrepancy. . . . . . . . . . . . . . . . . . 49 5.15 AP. Baselines comparison. . . . . . . . . . . . . . . . . . . . . . . . . 50 5.16 AP per annotated object. Baselines comparison. . . . . . . . . . . . . 50 5.17 Position MAE. Baselines comparison. . . . . . . . . . . . . . . . . . . 51 5.18 Position MAE per annotated 3D object. Baselines comparison. . . . . 51 5.19 F1 score for pedestrians. Baselines comparison. . . . . . . . . . . . . 52 5.20 F1 score for bicyclists. Baselines comparison. . . . . . . . . . . . . . . 52 5.21 F1 score for vehicles. Baselines comparison. . . . . . . . . . . . . . . 52 5.22 F1 score for pedestrians, per annotated pedestrian. Baselines com- parison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.23 F1 score for bicyclists, per annotated bicyclist. Baselines comparison. 53 5.24 AP per epoch in one AL iteration. Comparison between Camera- LiDAR fusion, only LiDAR and Annotations. . . . . . . . . . . . . . 54 5.25 AP after last epoch normalized per 100000 annotated objects. Com- parison between Camera-LiDAR fusion, only LiDAR and Annota- tions. Results from two different random seeds are shown. . . . . . . 55 5.26 Position MAE per epoch in one AL iteration. Comparison between Camera-LiDAR fusion, only LiDAR and Annotations. . . . . . . . . . 55 6.1 Plots of the linear and exponential functions for unmatched detections used in the experiments. . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.2 AP. Comparison between three different aggregation functions for the discrepancy of matched detections. . . . . . . . . . . . . . . . . . . . 61 xii List of Figures A.1 Example of depth map. Left: Input image. Center: Target as pro- jected LiDAR points in image. Right: Estimated depth map. Blue indicates points close to the ego vehicle and red indicates points fur- ther away. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II A.2 Example of outputted MF detections. Top: BEV with LiDAR and pseudo point cloud. Bottom: Camera view. Red boxes are detections byMF and green boxes are the annotated objects. . . . . . . . . . . III xiii List of Figures xiv List of Tables 2.1 An example of calculating recall and precision over a small dataset containing a total of 10 positives to be predicted. . . . . . . . . . . . 20 xv List of Tables xvi 1 Introduction The potential of self-driving cars cannot be denied. They have both promising social and environmental benefits. Self-driving cars can be safer than cars with human drivers due to shorter reaction times and constant perception around the whole vehicle. They might also be more energy-efficient, e.g. by using optimal speeds and driving behaviors, planning routes, and avoiding traffic jams by utilizing data from other cars. The challenge of developing self-driving cars is demanding in many ways. 
There are many vital parts in an autonomous driving system and one of them is the perception of the surroundings. Such perception is achieved through different types of sensors mounted all around the vehicle, such as cameras and LiDAR sensors. A LiDAR sensor (Light Detection and Ranging) builds a 3D model of its sur- rounding by measuring the time it takes for a laser beam to be reflected to the sensor. The generated 3D model consists of a large number of unordered points in 3D space which is why it is often referred to as a point cloud. Different types of sensors have different algorithms for interpreting sensor data. To extract useful information from images and point clouds, it is common to use Machine Learning (ML) algorithms. ML algorithms are trained to recognize patterns by looking at multiple examples. If the examples are annotated, the ML algorithm learns by predicting the annotation with the purpose of being able to perform well on unseen and non-annotated data. When applying ML to images, it is common to use Deep Neural Networks (DNN) which requires a tremendous amount of annotated data in order to achieve useful performance. The cost of annotating this data can get quite large due to the required involvement of humans. It is therefore desirable to get as much information out of every annotated image as possible. One way of doing this is to use Active Learning (AL). B. Settles manages to summarize the key idea of AL in one sentence: The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer labeled training instances if it is allowed to choose the training data from which it learns [3]. In most cases, only a subset of all available data has likely been annotated and some ML model has been trained on this annotated subset. It is then up to the AL algorithm to decide, based on the current state of the trained ML model, what subset of the non-annotated data is most informative and should be sent for annotation. The ML model in our case is a 3D object detecting DNN. The task of 3D Object Detection (3DOD) consists of these sub-tasks: 1 1. Introduction 1. Detect all relevant objects. 2. Predict their dimensions; height, width, and length. 3. Predict their positions in 3D space. 4. Classify the objects, e.g. car, pedestrian or bicyclist. 3DOD can be done on images and other types of data such as point clouds. When doing 3DOD on a point cloud, the resulting detections have a larger accuracy in position than detections in an image. The explanation is the fact that point clouds are actual 3D representations whereas images lack explicit data on distance to objects. However, a point cloud is often sparse at distances far away from the vehicle. Thus, 3DOD on images will find objects further away than what 3DOD on point clouds can handle. Now, suppose a 3DOD DNN makes perfect predictions on a non-annotated image. Annotating the image will be unnecessary since the DNN will not learn anything new from the annotation. Therefore, images, where the DNN is inaccurate, are more rewarding to annotate. However, finding images where the DNN is inaccurate is a non-trivial task. Comparing with predictions from DNNs that use data from other sensors, where the predictions are more accurate, could help to find the images where the DNN is inaccurate. Intuitively, since images are 2D representations of the world, it is harder to predict an object’s position in 3D space by looking at an image than looking at a point cloud. 
Therefore, a DNN trained on point clouds can help find images where the DNN trained on images is inaccurate. 1.1 Thesis Objective The objective of this thesis is to investigate the possibility of utilizing the difference between LiDAR sensors and cameras to do AL for 3DOD in images. In particular, by having each sample in the dataset consisting of an image and a point cloud captured at the same time, we can compare the detections obtained from two different DNNs. We investigate how to use a primary DNN for 3DOD in images in parallel with a secondary DNN for 3DOD in point clouds. We call the primary DNNMP (where P stands for primary) and the secondary DNNMS (where S stands for secondary). The detections from MP are compared to the detections from MS in order to calculate a camera-LiDAR detection discrepancy measure. This measure is then used to score images. In AL, the score is used to select images for annotation that have a large discrepancy, in order to improve the performance of MP . The intuition behind why this is worth investigating is that the samples with the largest discrepancy value are assumed to be those whereMP is wrong or uncertain. 2 1. Introduction 1.2 Limitations The main goal of the thesis is to develop an AL algorithm for 3DOD and not a DNN architecture. Therefore, we limit ourselves to use existing implementations of architectures. The implementations that we use are provided by Zenuity and we only go into brief details regarding the architectures. 1.3 Contributions We contribute with a definition of a discrepancy measure that quantifies how two sets of detected objects differ from each other. Furthermore, we present results showing that the proposed discrepancy measure can be used for scoring samples to do AL. The method performs better than or on par with selecting samples at random but raises the annotation cost. The presented results further contribute with the analysis that the proposed discrepancy measure can be modified to maintain the annotation cost. However, the modified measure fails to do so while performing better than random selection. Instead, the modified measure only performs on par with random selection. Discussed are ways to improve the method with the goal of outperforming random selection while maintaining the annotation cost, which is crucial for the method to be useful in a real-case scenario. 1.4 Related Work When doing AL for ML in general, the most straight forward approach is to use some kind of uncertainty sampling [3]. Uncertainty sampling is a method where the samples selected for annotation are those where the model is least confident in its predictions. The samples selected are often the samples where the model is not confident in classification. The approach is feasible for 3DOD, but the classification is just one part of a prediction in 3DOD, as explained above. Therefore, it would be beneficial to include the other parts of a prediction in the selection process. The uncertainty for the other parts of a 3DOD prediction is difficult to find since 3DOD DNNs do not exhibit uncertainties for those parts. This thesis proposes a method with the potential to include samples where the model is wrong or uncertain for the other components as well. Another well-established approach is Query by Committee where samples are selected based on the disagreement between a committee of models where all models are trained on the same dataset [4]. One could see our approach as a committee of two models. 
However, even though both models train on samples captured at the same moment, the samples are not of the same format. According to the definition of Query by Committee, the models should train on the same dataset in the same format. However, in our case,MP is using images, andMS is using point clouds. Our approach can also be compared to [5] where a View Divergence Score is proposed as a score of informativeness in AL for semantic segmentation. The View Divergence Score determines what samples are the most informative based on how 3 1. Introduction the predictions differ between different views of the same scene. The different views, in this case, are images taken from different angles. This is similar to what we do but since the scope of [5] is semantic segmentation, it does not cover how to measure the discrepancy between detections. Neither do they use point clouds but rather only images. Another approach uses AL for training an OD DNN for images and proposes two metrics for the informativeness of a sample [6]. The two metrics are designed to model the uncertainty of the detections. However, these metrics do not compare multiple predictions coming from different sources (or views) to do AL, as our dis- crepancy measure does. Furthermore, [6] focus solely on 2DOD while our method is developed for 3DOD. Using both LiDAR and cameras in combination with AL has been done by [7]. However, the actual AL algorithm in [7] does not benefit from the differences between the two types of sensors, as ours does. Furthermore, only the DNN for OD in the point clouds is trained using the proposed AL algorithm. We do the opposite - train the DNN for OD in images. 1.5 Thesis Outline Chapter 2 explains the theory and all the necessary components to understand the thesis. The chapter includes a short description of ML and DNN, the fundamental concept of AL as well as a description of the LiDAR and OD, the Hungarian method for solving the 2D Assignment Problem, and evaluation metrics for OD. In Chapter 3, our method is described including a description of the dataset and the models. The discrepancy measure between detections from two DNNsMP andMS is described in detail. Chapter 4 contains the settings used in the experiments whose results are presented in Chapter 5. The results are discussed in Chapter 6 with a conclusion in Chapter 7. 4 2 Theory This chapter covers the technical details necessary to understand the conducted research. We begin by shortly describing Machine Learning and Artificial Neural Networks in Section 2.1 to motivate the reason for Active Learning. The concept of Active Learning and how traditional Active Learning algorithms work is described in Section 2.2. Then, the LiDAR sensor is described in Section 2.3 followed by 2D and 3D Object Detection in Section 2.4. We also describe the key parts of our method which are the Hungarian Method for solving the 2D Assignment Problem in Section 2.5 and Intersection over Union in Section 2.6. Lastly, evaluation metrics for Object Detection are explained in Section 2.7. 2.1 Machine Learning Machine Learning (ML) is a common technique for solving complex tasks such as detecting objects in images. The task is solved by letting the ML algorithm learn by training on multiple samples. If the samples are annotated, the task is called supervised learning. In supervised learning, the annotation is the desired output of the ML algorithm. 
The goal in supervised learning is to learn the output y for a specific input x such that the ML algorithm can accurately predict the output for unseen and non-annotated data. More specifically, the ML algorithm tries to approximate a function y = f*(x) with f(x; θ) by adjusting some parameters θ. A common way of approximating such a function is by Artificial Neural Networks.

2.1.1 Artificial Neural Networks

Inspired by biological neural networks, e.g. a human brain, Artificial Neural Networks can solve advanced computational tasks by using a large number of connected processors, referred to as artificial neurons. The neurons serve as single units where each unit acts as a function taking a vector as input and outputting a scalar [8]. All the connected neurons together build up a complex function ŷ = f(x; θ) that approximates y = f*(x). The parameters θ often consist of weights w and biases b.

Figure 2.1: A visualisation of a neuron. Each input x_i is multiplied with a weight w_i and summed together. A bias term b is added to the sum. The final sum is the input for a non-linear activation function a to produce the final output ŷ.

A visualization of a single neuron can be seen in Figure 2.1. A neuron takes a vector x = (x_1, ..., x_n)^T as input and outputs a single value ŷ. The mathematical definition of a neuron is

\hat{y} = a\left( \sum_{i=1}^{n} w_i x_i + b \right)   (2.1)

where w = (w_1, ..., w_n)^T and b are learnable parameters referred to as weights and bias, and a is a non-linear activation function. Commonly used activation functions are

Rectified Linear Unit: \mathrm{ReLU}(x) = \begin{cases} 0 & \text{if } x \le 0 \\ x & \text{otherwise} \end{cases}   (2.2)

Sigmoid: \sigma(x) = \frac{1}{1 + e^{-x}}   (2.3)

Hyperbolic tangent: \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.   (2.4)

Using a large number of neurons, the neural network can approximate many complex continuous functions. The activation functions allow the network to learn non-linearities. Without the activation functions, the network is just one linear operation regardless of how many neurons the network consists of.

The parameters θ are trained by minimizing the total loss over the whole training dataset using a loss function L, i.e.

\min_{\theta} \sum_{i} L(y_i, f(x_i; \theta))   (2.5)

where y_i is the desired output and ŷ_i = f(x_i; θ) is the output from the neural network for sample i. The loss essentially measures how far the prediction ŷ_i is from the truth y_i. The gradient of L with respect to θ is used to optimize θ by taking small steps in the opposite direction of the gradient. We refer to [8] for details.

The most fundamental network is the fully connected feedforward network [8]. In a fully connected feedforward network, the neurons are connected in several layers, where all neurons are connected between consecutive layers. An example of a fully connected feedforward network can be seen in Figure 2.2. The word feedforward means the input x is evaluated through all the layers in f, as an ordinary function, without any intermediate feedback to itself.

Figure 2.2: An example of a fully connected feedforward network consisting of input layer x ∈ R^5 and output layer ŷ ∈ R^3. The hidden layer consists of 7 neurons. Each connection corresponds to a learnable weight.

The fully connected feedforward network consists of an input layer that constitutes the input x, up to several hidden layers, and an output layer that constitutes the output ŷ. The output of each layer is the input to the next layer.
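To make the forward computation concrete, the following is a minimal NumPy sketch (illustrative, not taken from the thesis) of the neuron in equation (2.1) and the 5-7-3 network of Figure 2.2. The layer sizes follow the figure; the weights are random placeholders rather than trained parameters.

import numpy as np

def relu(x):
    # Rectified Linear Unit, equation (2.2).
    return np.maximum(0.0, x)

def sigmoid(x):
    # Sigmoid activation, equation (2.3).
    return 1.0 / (1.0 + np.exp(-x))

def dense_layer(x, W, b, activation):
    # One layer of neurons: row i of W holds the weights of neuron i, so the
    # layer computes a(Wx + b) for every neuron at once, cf. equation (2.1).
    return activation(W @ x + b)

rng = np.random.default_rng(seed=0)

# The 5-7-3 network of Figure 2.2: x in R^5, one hidden layer with 7 neurons,
# output y_hat in R^3. The weights below are random placeholders; training
# would adjust them by minimizing the loss in equation (2.5).
W1, b1 = rng.normal(size=(7, 5)), np.zeros(7)
W2, b2 = rng.normal(size=(3, 7)), np.zeros(3)

x = rng.normal(size=5)                        # one input sample
hidden = dense_layer(x, W1, b1, relu)         # hidden layer output, shape (7,)
y_hat = dense_layer(hidden, W2, b2, sigmoid)  # network output, shape (3,)
print(y_hat.shape, y_hat)

Training would adjust W1, b1, W2, b2 by gradient descent on the loss in equation (2.5).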
The word hidden layer comes from that the training data does not show the desired output from these kinds of layers. The number of layers is often referred to as the depth of the neural network. For complex tasks, such as detecting objects in images, it is common to use neural networks with large depth, i.e. a tremendous amount of layers. Due to the large depth, these kinds of networks are called Deep Neural Networks. 7 2. Theory 2.1.2 Deep Neural Networks A Deep Neural Network (DNN) can consist of hundreds of layers which is crucial for approximating complicated functions. For example, detecting objects in images is demanding in many ways. The DNN needs to learn what the different objects look like as well as what environments the objects interact with. Therefore, to perform the object detection task well, a tremendous amount of features need to be learned. An image can consist of millions of pixels. Due to the millions of pixels, if an image is used as input to a fully connected feedforward network, the input layer (see Figure 2.2) will consist of millions of neurons. Therefore, the number of parameters will be huge since each connection between neurons has an associated weight. Instead, one can use a convolutional neural network [8]. In a convolutional layer, only nearby neurons, i.e. nearby pixels for images, are connected in the neural network. The same weights are used in all neighborhoods, called weight sharing. However, even for convolutional networks, the number of parameters can still be large due to many layers. Many popular DNNs for solving image tasks, as well as the DNNs used in this thesis, are built upon an efficient DNN called ResNet. ResNet is a convolutional neural network using a successful technique called residual learning. We refer to [9] for details. ResNet-152 consists of 152 layers and millions of learnable parameters. To optimize all the millions of parameters, a great number of annotated training data is needed. The annotation process is tedious, expensive, and requires a human. If a DNN makes a perfect prediction on a non-annotated sample, it is most likely not infor- mative to the DNN to train on. Since a large number of annotated training data is needed, it is important to annotate samples that are informative for the DNN. To efficiently choose data for annotation which is believed to be informative, Active Learning can be used. 2.2 Active Learning The core of Active Learning (AL) is to choose what data to use for training an ML model. Based on some informativeness score, the AL algorithm (also known as the learner) can select data which the model believes is informative. The informative- ness score attempts to measure how beneficial a non-annotated sample would be to annotate. If the non-annotated sample is considered informative, the learner queries for an annotation and adds the newly annotated sample to the training set. There are mainly three different types of AL:membership query synthesis, stream- based selective sampling and pool-based active learning [3]. Membership query syn- thesis lets the learner create its own data which it believes the ML model will benefit the most from. Thereafter the learner queries for annotations of the newly cre- ated data. Stream-based selective sampling sequentially analyses a non-annotated dataset. For each sample, the learner makes a decision based on the informative- ness score whether to keep it (i.e. query for its annotation) or discard it. 
Lastly, pool-based active learning is much like stream-based selective sampling with the difference that it analyses the whole non-annotated dataset before deciding what 8 2. Theory samples to annotate [3]. Membership query synthesis does not suit complex image tasks, such as 3DOD, since synthesizing realistic images can be difficult. In stream-based selective sam- pling, only one sample is drawn at a time. In many real-world cases, large collections of non-annotated data can be collected at once which makes it possible to analyze all the samples. In this thesis, we assume a large collection of non-annotated data can be collected at once and added to a pool of non-annotated data. We also assume that a batch of non-annotated samples are selected and sent for annotation at the same time and added to an existing pool of annotated data. Therefore, we base our method on pool- based active learning. Figure 2.3 shows an overview of how pool-based AL works. The target model in the figure is a DNN. The generic framework for pool-based active learning consists of two datasets A and N where the first has annotations and the second does not. We call them the annotated and non-annotated datasets respectively. We have a budget of b samples and denote the target ML model (the model that is trained by the learner) byM. The AL algorithm is explained step by step below and each step has a reference to the arrows in Figure 2.3. 1. TrainM on A (purple arrow). 2. ApplyM on N and store the output (yellow arrow). 3. Score the output from step 2 using some informativeness score (yellow arrow). 4. Pick the b samples from N with highest score (yellow arrow) and send for annotation (blue arrow). 5. Add the annotated data to A and remove it from N (turquoise arrow). 6. Repeat from step 1. Deep Neural Network Annotator (human or machine) Non-annotated data Annotated data Active Learning Train the network Select samples based on some query Add annotated samples to training dataset Send selected samples for annotation Figure 2.3: Overview of a pool-based AL algorithm for DNNs. 9 2. Theory The main difficulty for AL is to design an informativeness score that correlates well with what the model needs to learn. This is highly dependent on both the task that should be solved and the available underlying data. Since designing an infor- mativeness score is not trivial, we look at a few examples of how the informativeness of a sample can be measured in the next subsection. 2.2.1 Examples of Informativeness Scores One of the most straight forward query approaches is Uncertainty Sampling [3]. In Uncertainty Sampling, the samples that the model is most uncertain about are chosen for annotation. The most uncertain samples are given a large score and assumed to be the most informative. The uncertain samples are samples where the model is not confident in its predictions. For classification tasks, the output is a list of class probabilities. To measure uncertainty in classification tasks, entropy is suitable [3]. The samples with large entropy are selected for annotation. For a random variable Y with a set {yi} of possible outcomes, entropy is defined as Entropy(Y ) = − ∑ i P (Y = yi) logP (Y = yi) (2.6) where P (Y = yi) is the probability that yi occur. For a random variable, the entropy is largest if the probability for all outcomes is equal which means that we cannot confidently tell what the outcome will be. Therefore, the outcome is uncertain. Consider the task of classifying images of cats and dogs as an example. 
Suppose the output of the model is 52% probability that an image contains a cat and 48% probability that the image contains a dog. Even though the probability of being a cat is larger than being a dog, the similarity indicates that the classification model is uncertain. There could be unseen features in the image that makes it uncertain to predict the class. Therefore, the model could benefit from training on the sample in order to learn new cat and dog features. By contrast, if the probability of a dog is 99% for a decently trained ML algorithm, the image likely contains a dog, in which case annotating this image does not significantly improve the performance of the algorithm. Another query approach is Query by Committee [3]. A committee of models trains on the same annotated data and the samples selected are based on the disagreement between the models’ outputs. If the disagreement between the models’ output is large, the informativeness score is large. For classification tasks, the disagreement can be measured using vote entropy [3]. For a set {yi} of possible classes and C number of committee models, vote entropy for sample x is defined as VoteEntropy(x) = − ∑ i V (yi;x) C log V (yi;x) C (2.7) where V (yi;x) is the number of votes, i.e. number of models predicting class yi for sample x. The disagreement is large if the votes are evenly distributed over 10 2. Theory the different classes. Evenly distributed votes lead to all V (yi;x) C terms being similar which corresponds to large entropy, as explained above. Compared to Uncertainty Sampling and Query by Committee, a more different query approach is Expected Model Change. Expected Model Change measures how much impact the sample has on the current model if it is included in the training set. The sample is assumed to be informative if it makes a great impact on the model, i.e. make large changes to its parameters. An example of Expected Model Change used for classification tasks is Expected Gradient Length. Expected Gradient Length calculates the length of the loss func- tion gradient. Since the sample is non-annotated, we calculate the average length of the gradient over all possible classes. Let {yi} be a set of all possible classes and ∇L(yi, f(x; θ)) be the gradient for the loss between yi and the prediction for f(x; θ) using the current model parameters θ. Expected Gradient Length for sample x is defined as ExpectedGradientLength(x) = ∑ i P (yi|x; θ)||∇L(yi, f(x; θ))||2 (2.8) where P (yi|x; θ) is the probability of x being class yi and || · ||2 is the L2-norm. Empirical studies show that Expected Gradient Length works well for some tasks, but can be computationally expensive if both the number of possible classes and the feature space is large [3]. 2.2.2 Evaluating an Active Learning Algorithm An AL algorithm should be evaluated before used in any real-case scenario due to the annotation procedure being both time-consuming and expensive. The selected samples sent to the annotator must be of some value for the model, otherwise, the time spent on annotation is wasted. When evaluating an AL algorithm, the human annotator is commonly simulated. Instead of having a human annotator, the non- annotated dataset N is not used and the annotated dataset is divided into two parts. The first part is unchanged and interpreted as the annotated dataset A. The second part acts as N by hiding the annotations. The human annotator is simulated by revealing the annotations for some data from the second part and moving it to the first part. 
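Before moving on, here is a small Python sketch (illustrative, not from the thesis) of two of the informativeness scores from Section 2.2.1: the entropy score in equation (2.6) and the vote entropy score in equation (2.7). The class probabilities and vote counts below are hypothetical.

import numpy as np

def entropy(class_probs):
    # Uncertainty sampling score, equation (2.6): the entropy is largest when
    # the predicted class probabilities are close to uniform.
    p = np.asarray(class_probs, dtype=float)
    p = p[p > 0]                     # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log(p)))

def vote_entropy(votes, num_models):
    # Query by Committee score, equation (2.7): `votes` maps each class to the
    # number of committee members that predicted it for the sample.
    terms = [(v / num_models) * np.log(v / num_models)
             for v in votes.values() if v > 0]
    return float(-sum(terms))

# Hypothetical model outputs for the cat/dog example in the text.
print(entropy([0.52, 0.48]))   # near log(2): uncertain, informative sample
print(entropy([0.99, 0.01]))   # near 0: confident, not very informative
print(vote_entropy({"cat": 3, "dog": 2}, num_models=5))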
2.3 LiDAR Sensor Our AL method measures the informativeness of a sample by using predictions based on data from a LiDAR sensor. We refer to Chapter 3 for details on our method. A LiDAR sensor (Light Detection and Ranging) can be mounted on a self-driving car to achieve perception around the vehicle while moving. The LiDAR measures the distance to objects by measuring the time it takes for a laser beam to be reflected to the sensor. The laser beams build a 3D model of the surroundings, consisting of a large number of unordered points. Due to the large number of unordered points, the 3D model is referred to as point cloud. 11 2. Theory An example of a point cloud can be seen in Figure 2.4. The point cloud is visualized in bird’s eye view, meaning every point is projected onto the ground and shown from above. How high the points are above the ground is not shown in the figure. Objects in the point cloud are clearly marked with blue and orange boxes, where blue indicates a car and orange a truck. The objects are detected by an Object Detection DNN using point clouds as input. Predicting the position in 3D space using a point cloud is less difficult than predicting the position using an image. An image lack explicit data on distance to objects, whereas a point cloud is an actual 3D representation. Figure 2.4: Example of LiDAR point cloud in bird’s eye view with detections of objects. 2.4 Object Detection Object Detection (OD) is a technique used for detecting objects of certain classes, such as cars and/or pedestrians. It is commonly applied to images, but can also be used in other models of the world, such as point clouds. Objects can be detected both in 2D (images) and 3D (images and point clouds). An example of 2DOD can be seen in Figure 2.5. A detected object is bounded in a box (a bounding box) and the object in the box is predicted with a class [10]. The 2D bounding box expresses where the object is located in the image. 12 2. Theory In 3DOD [11], by contrast to 2DOD, the bounding box is three-dimensional and expresses where the object is located in world coordinates, relative to the ego vehicle. The ego vehicle is the vehicle where all the sensors are mounted and is the vehicle from where the image is captured. As in 2DOD, the object is predicted with a class. The 3D bounding box is defined in 3D space and is commonly expressed by position (center point coordinates of bounding box bottom plane), dimension (height, width, length), and rotation around the three axes. An example of 3DOD can be seen in Figure 2.6. Since the boxes are defined in 3D space, this definition can apply to 3DOD in both point clouds and images. Figure 2.5: Example of 2D object detection. Figure 2.6: Example of 3D object detection. Predicted location together with dimension and facing direction in 3D space is visualized in bird’s eye view to the right. 13 2. Theory 2.5 2D Assignment Problem The idea of this thesis is to find images where a DNN trained on images gives different predictions than a DNN trained on point clouds. We compare detections from a primary DNNMP (where P stands for primary) trained on images with detections from a secondary DNNMS (where S stands for secondary) trained on point clouds. The comparison is done by calculating a discrepancy between detections. To calculate a discrepancy measure between detections outputted fromMP and MS, one approach is to calculate the discrepancy per object. 
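As an illustration, the 2DA can also be solved with an off-the-shelf routine. The sketch below uses scipy.optimize.linear_sum_assignment, which solves the same minimum-cost assignment problem (with a different algorithm than the Hungarian method described next); SciPy is used here only as an example and is not necessarily what the thesis uses. The running-time matrix for the jobs-and-machines example is hypothetical.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical running times (hours): rows are jobs (V), columns machines (U).
# With fewer jobs than machines, the extra machines are simply left
# unassigned, which corresponds to padding the matrix with dummy zero rows.
running_time = np.array([
    [9.0, 2.0, 7.0, 8.0],
    [6.0, 4.0, 3.0, 7.0],
    [5.0, 8.0, 1.0, 8.0],
])

rows, cols = linear_sum_assignment(running_time)  # minimizes the total weight
for job, machine in zip(rows, cols):
    print(f"job {job} -> machine {machine} ({running_time[job, machine]} h)")
print("total running time:", running_time[rows, cols].sum())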
If the discrepancy is calculated per object, each detection fromMP needs to be matched to a detection inMS and vice versa. Both detections in a match are associated to the same real- world object. The matching can be solved by solving the 2D Assignment Problem (2DA) [12]. Let G = 〈V, U ;E〉 be a weighted bipartite graph where V and U consist of vertices, E consist of edges and each edge e ∈ E is assigned a weight. Furthermore, let |V | ≤ |U |. An example of a bipartite graph can be seen in Figure 2.7. The 2DA consists of finding the minimum sum of weights such that each vertex v ∈ V is connected to exactly one vertex u ∈ U . For example, if V correspond to jobs, U correspond to machines and each edge e ∈ E has a running time, we want to find the matchings of jobs and machines where the total running time is minimized, all jobs are run and one machine is assigned exactly one job. An approach to solving the 2DA is the Hungarian method [13]. 1 2 3 4 5 6 7 8 9 V U Figure 2.7: An example of a bipartite graph. The 2D assignment problem consists of finding edges such that each vertex in V (1-4) is connected to exactly one vertex in U (5-9) and the sum of weights is minimized. 14 2. Theory 2.5.1 Hungarian Method The Hungarian method can solve the 2DA in O(n3) time [14], where n = |U |. The 2DA can be expressed with a matrix M where the elements are the weights of E. The rows correspond to V and the columns correspond to U . Therefore, index (i, j) of M contains weight mij for connecting the ith vertex in V to the jth vertex in U . If |V | < |U |, |U |−|V | dummy rows with zeros need to be added such that |V | = |U |. After the dummy rows are added, the 2DA can be expressed as minimize ∑ (i,j)∈E yijmij (2.9a) subject to n∑ i=1 yij = 1, ∀j ∈ {1, ..., n}, (2.9b) n∑ j=1 yij = 1, ∀i ∈ {1, ..., n}, (2.9c) yij ∈ {0, 1}, ∀i, j ∈ {1, ..., n} (2.9d) where yij = 1 if the ith vertex in V is assigned the jth vertex in U . The constraints in (2.9b) and (2.9c) ensures that each vertex v ∈ V is assigned exactly one vertex u ∈ U . The constraint (2.9d) ensures that an assignment cannot be split. In the example with machines and jobs, the constraint ensures that one job cannot be split and run by two machines. We will now describe how the Hungarian method is executed. After that, we will take a look at the mathematical details as to why the solution is optimal. We use M = 4 1 5 7 5 6 8 5 8   (2.10) as an example problem instance. Let V = {v1, v2, v3} and let U = {u1, u2, u3}. 1. (a) For each row r in M : Subtract the minimum value of r from every element in r. Minimum values are marked with a circle. 4 1 5 7 5 6 8 5 8   ← −1 ← −5 ← −5 ⇒ 3 0 4 2 0 1 3 0 3   (2.11) (b) For each column c in M : Subtract the minimum value of c from every element in c. Minimum values are marked with a circle. 15 2. Theory 3 0 4 2 0 1 3 0 3   −2 ↓ −0 ↓ −1 ↓ ⇒ 1 0 3 0 0 0 1 0 2   (2.12) 2. (a) Cover all zeros in M using minimum number of vertical and horizontal lines. 1 0 3 0 0 0 1 0 2   (2.13) (b) If the number of lines is equal to number of rows in M , go to step 3, otherwise proceed. (c) Add the minimum value of all the uncovered (marked with circles) ele- ments to all elements covered by both a vertical and a horizontal line (marked with square). 1 0 3 0 0 0 1 0 2   +1 ↓ ⇒ 1 0 3 0 1 0 1 0 2   (2.14) (d) Subtract the minimum value of all the uncovered elements from all the uncovered elements (marked with circles). 
1 0 3 0 1 0 1 0 2   −1 ↓ −1 ↓ ⇒ 0 0 2 0 1 0 0 0 1   (2.15) (e) Remove the lines and go to step 2. 16 2. Theory 3. Find the solution by systematically picking zeros covered by either a vertical or a horizontal line but not both. Also, only pick one element from each row and column. If the solution is valid, the solution is also optimal. The lines in 2.16 comes from doing step 2a again. A valid solution is marked with squares at indices (1, 1), (2, 3) and (3, 2). Therefore, the resulting match- ings are (v1, u1), (v2, u3) and (v3, u2). 0 0 2 0 1 0 0 0 1   ⇒ 0 0 2 0 1 0 0 0 1   (2.16) M = 4 1 5 7 5 6 8 5 8   ⇒ total sum of weights: 4 + 6 + 5 = 15 (2.17) The Hungarian method modifies M in order to find an optimal solution. The mathematical reasoning to why the Hungarian method gives an optimal solution is as follows. Let {ai}ni=1 and {bj}nj=1 be values that modifies M . The values modify M without changing (2.9a) as ∑ (i,j)∈E yijmij = ∑ (i,j)∈E (mij − ai − bj + ai + bj)yij (2.18) = ∑ (i,j)∈E (mij − ai − bj)yij + n∑ i=1 ai n∑ j=1 yij + n∑ j=1 bj n∑ i=1 yij (2.19) = ∑ (i,j)∈E (mij − ai − bj)yij + n∑ i=1 ai · 1 + n∑ j=1 bj · 1 (2.20) = ∑ (i,j)∈E (mij − ai − bj)yij + n∑ i=1 ai + n∑ j=1 bj (2.21) where (2.20) is obtained by the constraints (2.9b) and (2.9c). The Hungarian method finds an optimal solution by modifyingM with {ai}ni=1 and {bj}nj=1, without violating the constraints, such that mij − ai − bj = 0 for (i, j) where yij = 1. The first term,∑ (i,j)∈E(mij − ai − bj)yij, evaluates to 0, since yij = 0 for all other (i, j), meaning the resulting objective value is ∑n i=1 ai +∑n j=1 bj. The solution is optimal since the resulting objective value is independent of yij and the term including yij is minimal since the term is evaluated to 0. We refer to [15] and [16] for details. 17 2. Theory 2.6 Matching Score In order to match detections betweenMP andMS, a score for a potential match is needed in order to create M used by the Hungarian method to solve the 2DA. We define a good match as a large Intersection over Union (IoU) between 2D bounding box detections. The IoU between two sets A and B is defined as IoU(A,B) = |A ∩B| |A ∪B| (2.22) and measures the similarity between two sets. In 2DOD, IoU between bounding boxes measures area of overlap (in pixels) divided by area of union (in pixels). A visualization can be seen in Figure 2.8. In 3DOD, IoU is calculated with volume (in cubic meters) instead of area. Figure 2.8: A visual equation for Intersection over Union (Jaccard Index) [1]. 2.7 Evaluation Metrics Evaluation metrics are used to evaluate and compare different AL algorithms. To evaluate an AL algorithm for OD, two commonly used metrics for evaluating both 2DOD and 3DOD are F1 score and Average Precision (AP). Both metrics are cal- culated using p = TP TP + FP r = TP TP + FN TP = Number of true positives FP = Number of false positives FN = Number of false negatives (2.23) 18 2. Theory where p and r denote precision and recall, respectively. In OD, precision measures how accurate the detections are by the percentage of correct detections (true posi- tives). IoU in 2DOD and 3DOD is often used to determine whether a detection is a true positive or false positive by comparing to the annotated object. If IoU is above a certain threshold, e.g. 0.5, the detection counts as a true positive [17]. Recall is a measure of completeness. In OD, recall measures what percentage of all the annotated objects are detected. 
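The following Python sketch (illustrative, not from the thesis) computes the IoU of equation (2.22) for axis-aligned 2D boxes and the precision and recall of (2.23) from true positive, false positive, and false negative counts. The corner-coordinate box format and the example numbers are assumptions made for the sketch.

def iou_2d(box_a, box_b):
    # Intersection over Union, equation (2.22), for axis-aligned 2D boxes
    # given as (x1, y1, x2, y2) corner coordinates in pixels.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    # Equation (2.23): precision = TP / (TP + FP), recall = TP / (TP + FN).
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# A detection counts as a true positive if its IoU with an annotated box
# exceeds a threshold, e.g. 0.5.
print(iou_2d((0, 0, 10, 10), (5, 0, 15, 10)))  # ~0.33, below a 0.5 threshold
print(precision_recall(tp=3, fp=2, fn=7))      # (0.6, 0.3), cf. rank 5 in Table 2.1 below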
An illustration of precision and recall can be seen in Figure 2.9. relevant elements selected elements false positivestrue positives false negatives true negatives Precision = Recall = How many selected items are relevant? How many relevant items are selected? relevant elements selected elements false positivestrue positives false negatives true negatives Precision = Recall = How many selected items are relevant? How many relevant items are selected? Figure 2.9: Illustration of precision and recall which are common components for evaluation metrics in object detection [2]. 2.7.1 F1 Score F1 score is defined as the harmonic mean between precision and recall, i.e. F1 = 2 (1/p) + (1/r) = 2 · p · r p+ r (2.24) and ranges between 0 and 1 [18]. With perfect precision and recall, F1-score reaches its best value, 1. 19 2. Theory 2.7.2 Average Precision To calculate AP, precision values over the recall range 0 to 1 need to be calculated. In other words, a precision function p(r) taking recall value as input should output precision value. The function p(r) is called the precision-recall curve and is estimated from the predictions for all the samples in the data set. All the predictions are sorted by their confidence level in descending order. In OD, the confidence level corresponds to the class prediction of the detection and is the probability of being a particular class. When all the predictions are sorted, each prediction is determined whether it is a true positive or not, which means using an IoU threshold between the detection and the annotated object in OD. Each prediction is assigned a rank depending on their confidence level. Precision and recall are calculated for each rank, where only the current rank and lower rank predictions are taken into account. In the example in Table 2.1, there are 10 positives to be predicted in the whole dataset. All predictions are sorted by their confidence level in descending order. Rank 1 refers to the prediction with largest confidence. Each prediction is deter- mined whether it is a true or false positive. Up to rank 5, only 3 of the predictions are true positives, which means the precision value is 3/5 = 0.6 for that rank. The recall value depends on the total number of positives in the dataset and would in this example be 3/10 = 0.3 for rank 5. Confidence level Rank Correct prediction Precision p(r) Recall r 0.98 1 True 1 0.1 0.95 2 False 0.5 0.1 0.93 3 False 0.33 0.1 0.9 4 True 0.5 0.2 0.86 5 True 0.6 0.3 Table 2.1: An example of calculating recall and precision over a small dataset containing a total of 10 positives to be predicted. AP measures average precision over the whole recall range 0 to 1 [17], defined by the area under the precision-recall curve, i.e. AP = ∫ 1 0 p(r)dr. (2.25) In reality, the precision-recall curve is not continuous. Instead, the precision-recall curve is estimated as above and AP is commonly calculated by interpolation. A commonly used OD benchmark is the VOC2007 challenge. We refer to [19] for details. In the challenge, AP is calculated as the mean interpolated precision at r = 0, 0.1, ..., 1, i.e. AP = 1 11 ∑ r∈{0,0.1,...,1} pinterpolated(r) (2.26) 20 2. Theory where the interpolated precision is pinterpolated(r) = max r′≥r p(r′). (2.27) The IoU threshold for determining true positives in VOC2007 is set to 0.5. Also, if there are several detections for the same object, only the detection with the largest confidence is counted as a true positive and the rest as false positives. 
When there are several classes in the dataset, it is common to calculate AP for every class individually. The class AP’s are averaged to obtain Mean Average Precision [17]. Even though it could be confusing, mAP is often referred to as AP. 2.7.3 Mean Absolute Error To evaluate the accuracy of the 3D bounding box, Mean Absolute Error (MAE) can be used. The general definition for MAE is MAE = 1 N N∑ i=1 |yi − ŷi| (2.28) where ŷ1, ..., ŷN are predictions and y1, ..., yN are targets. For 3DOD, the predictions are all the true positives and compared to their corresponding annotations. MAE can be used to see the accuracy of each 3D bounding box attribute separately, for example the accuracy of the bounding box height. 21 2. Theory 22 3 Method This chapter introduces how we select the samples for annotation in AL. The sam- ples are selected based on the discrepancy between detections outputted from two different DNN’s, the primary DNN MP and the secondary DNN MS. The objec- tive is to increase the performance of MP by selecting samples that are assumed to be those where MP is wrong or uncertain. Samples selected could be samples where MP is wrong in any of detection, position, dimension, and classification. By contrast, entropy (as described in Section 2.2.1) only take class prediction into account. The dataset is described in Section 3.1. In Section 3.2, the primary DNN,MP , is presented. The primary DNN is trained only on annotated images and is the DNN we try to improve using AL. In Section 3.3, we present two different secondary DNN’s,MS. The AL algorithm using discrepancy measure as informativeness score is introduced in Section 3.4 and the calculations of the actual discrepancy are laid out in Section 3.5 and 3.6. The development process of the discrepancy measure is explained in Section 3.7. Lastly, two different ways of selecting samples based on the discrepancy are described in Section 3.8. 3.1 Dataset The dataset we use is data provided by Zenuity. The primary motivation behind the choice of data is to make the thesis relevant to the company. Another benefit is thatMP already is compatible with the dataset, i.e. no modifications forMP are needed. Also, the dataset is large which makes it suitable to use for AL experiments. The dataset D is a set of N samples, D = {X1, ...,XN} (3.1) where each sample Xi = (p,x, A) (3.2) consists of a point cloud p, image x, and a set of annotations A. All coordinates in A are defined in the metric system with the ego vehicle as the origin. For each sample, p and x are captured approximately at the same time. The image x is captured from one of the cameras mounted around the ego vehicle. The annotations in A consists of classified 2D and 3D bounding boxes. There are ten different classes, where background is one of them. Since the point clouds and the images are captured at the same moment, it is possible to train one model with images and one model with point clouds with the same annotations. 23 3. Method 3.2 Primary Object Detection Network As the primary DNN, we will use a DNN based on ResNet [9] that is provided by Zenuity. For simplicity, we call the DNNMP where P stands for primary. The primary DNNMP is trained solely on annotated images and outputs both predicted 2D and 3D bounding boxes. The 2D boxes are defined on the format (x, y, w, h)T where (x, y)T defines the center point (in pixels) and (w, h)T defines the dimension (width and height in pixels) of the box. 
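Before moving on to the 3D box format described next, here is a minimal sketch (illustrative, not from the thesis) of the 2D center-format box just described and its conversion to corner coordinates, which is the form typically used when computing the 2D IoU for matching in Section 3.5.1. The class name Box2D and the numbers are hypothetical.

from dataclasses import dataclass

@dataclass
class Box2D:
    # Center-format 2D box as output by the primary network: (x, y) is the
    # center point in pixels, (w, h) the width and height in pixels.
    x: float
    y: float
    w: float
    h: float

    def to_corners(self):
        # Convert to (x1, y1, x2, y2) corner format, e.g. for IoU computation.
        return (self.x - self.w / 2, self.y - self.h / 2,
                self.x + self.w / 2, self.y + self.h / 2)

print(Box2D(x=320.0, y=240.0, w=100.0, h=60.0).to_corners())
# -> (270.0, 210.0, 370.0, 270.0)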
The 3D boxes are defined on the format (x, y, z, w, h, l, r)^T, where (x, y, z)^T defines the location (center point), (w, h, l)^T defines the dimensions (width, height and length in meters) and r defines the rotation around the y-axis (in radians), also referred to as yaw. Rotation around the x-axis and rotation around the z-axis are assumed to be negligible. An illustration of a 3D bounding box can be seen in Figure 3.1.

Figure 3.1: An illustration of a 3D bounding box with axis directions. The coordinate system is in meters and the position of the camera is used as the origin.

In the camera's coordinate system, the x-axis points to the right, the y-axis points down and the z-axis points forward (see Figure 3.1). For example, an object located at (x, y, z) = (10, −1, 10) is 10 meters to the right of, 1 meter above, and 10 meters in front of the camera. The rotation is defined relative to the ego vehicle. To exemplify, r = 0 means the object is facing in the same direction as the ego vehicle, r = π/2 means the object is facing to the right, and r = π means the object is facing in the opposite direction of the ego vehicle.

3.3 Secondary Object Detection Network

The secondary DNN should in principle be a DNN with high performance in 3DOD. A DNN with point clouds as input is preferable in order to get high performance on the 3D parameters, as described in Section 2.3. We call the secondary DNN M_S, where S stands for secondary. Two different versions of M_S are used to evaluate the performance of the discrepancy measure. The two DNNs are described in the following subsections.

3.3.1 Simulated High Performance Network

To mimic a DNN with high performance, we use the annotations as if they were detections from M_S. Using the annotations to select samples for human annotation is of course not possible in practice, since the annotations do not exist before they are created by a human. However, the results can still show that the discrepancy measure is worth using if a high-performance DNN is used as M_S. We call the simulated high-performance DNN M_A, where A stands for annotation.

3.3.2 Camera-LiDAR Fusion Network

To evaluate the practical potential of the discrepancy measure, a real camera-LiDAR fusion DNN is used. The DNN is based on a DNN developed in a previous thesis at Zenuity [20]. It is a fusion DNN, which in this case means that it is trained on both images and point clouds. We call the DNN M_F, where F stands for fusion. M_F is developed to be robust against sensor degradation, i.e. if one of the sensors (camera or LiDAR) stops working, the DNN should still be able to perform object detection. In other words, M_F still functions using solely point clouds as input, which makes it possible to investigate whether the fusion part, i.e. using both images and point clouds, is important for calculating discrepancy in AL. The architecture of M_F is explained in Section A.2 and modifications to M_F are explained in Section A.3.

3.4 Modified Active Learning Algorithm

We base our experiments on the pool-based AL algorithm described in Section 2.2. As mentioned in Section 2.2, the budget is denoted b, the dataset with annotations is denoted A and the dataset without annotations is denoted N. We also add a new annotated set V, which is our validation set. Apart from using M_P as M, a few other modifications are made.
These modifications are underlined in the following algorithm.

1. Train M_P and M_S on A.
2. Evaluate M_P on V.
3. Apply M_P and M_S on N and store the output.
4. Score the output from step 3 using a discrepancy measure between the detections of M_P and M_S.
5. Pick b samples from N according to some selection strategy based on discrepancy and send them for annotation.
6. Add the annotated data to A and remove it from N.
7. Repeat from step 1.

First, we train not only M_P on A, but also M_S. After each training, we evaluate the performance of M_P on V for comparison between AL iterations. An AL iteration is one iteration of steps 1-6. In step 4, we use our discrepancy measure as informativeness score. Lastly, we select samples based on discrepancy using some selection strategy.

3.5 Main Approach

Two different approaches to quantifying discrepancy are tested. We call them Main Approach (MA) and Alternative Approach (AA). Both approaches need to

1. match detections from M_P and M_S (as mentioned in Section 2.5),
2. quantify the discrepancy between the matched detections, and
3. include unmatched detections in the measure.

This section describes in detail how MA matches detections and calculates the total discrepancy for a sample. AA is heavily dependent on MA, which is why it is important to first read and understand this section before reading Section 3.6.

3.5.1 Matching

Let X_P = {X_{P,1}, ..., X_{P,n}} and X_S = {X_{S,1}, ..., X_{S,k}} be two sets of detections for sample X coming from M_P and M_S, respectively. Before using the Hungarian method to find the best matchings, some detections are disregarded. Since M_S is supposed to detect objects in point clouds, there could be detections that are not visible in the corresponding image. M_P, however, only uses images and will never be able to detect anything that is not in the image. Therefore, any detection in X_S that does not project into the image plane is left out.

We now define three matrices, M_IoU, M_obj and M_depth, that are used to calculate the matrix M used by the Hungarian method to solve the 2DA. We define

M_IoU = \begin{pmatrix} IoU_{2D}(X_{P,1}, X_{S,1}) & \cdots & IoU_{2D}(X_{P,1}, X_{S,k}) \\ \vdots & \ddots & \vdots \\ IoU_{2D}(X_{P,n}, X_{S,1}) & \cdots & IoU_{2D}(X_{P,n}, X_{S,k}) \end{pmatrix}    (3.3)

to be the n × k IoU matrix that contains the 2D IoU scores between all possible combinations of the detections in X_P and X_S. If M_S does not output 2D boxes, the 2D box is created by bounding the 3D box projected into the image plane.

Furthermore, we define p_0 to be the probability of a detection being background, i.e. the probability that the detection is none of the 9 other classes. The objectness probability is defined as 1 − p_0, i.e. the probability of the detection being any of the 9 other classes. Then M_obj is the n × k objectness matrix between X_P and X_S, defined as

M_obj = \begin{pmatrix} 1 - p_{0,P,1} \\ \vdots \\ 1 - p_{0,P,n} \end{pmatrix} \begin{pmatrix} 1 - p_{0,S,1} & \cdots & 1 - p_{0,S,k} \end{pmatrix}    (3.4)

i.e. the element at position (i, j) is the product of the objectness probabilities of detection i from X_P and detection j from X_S. Furthermore, we define the relative depth difference between detections X_{P,i} and X_{S,j} to be

\Delta^z_{ij} = \frac{\min(z_{P,i}, z_{S,j})}{\max(z_{P,i}, z_{S,j})}    (3.5)

where z_{P,i} and z_{S,j} are the center point z-coordinates of detections X_{P,i} and X_{S,j}, respectively. M_depth is then defined as

M_depth = \begin{pmatrix} \Delta^z_{11} & \cdots & \Delta^z_{1k} \\ \vdots & \ddots & \vdots \\ \Delta^z_{n1} & \cdots & \Delta^z_{nk} \end{pmatrix}.    (3.6)

The three matrices are multiplied together using the Hadamard product ∘ (element-wise multiplication) to form the score matrix

M = M_IoU ∘ M_obj ∘ M_depth.    (3.7)
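As a concrete illustration of (3.3)-(3.7), the sketch below assembles M_IoU, M_obj and M_depth from cached detections and combines them with the Hadamard product. The array layout and the helper iou_2d are illustrative assumptions and not part of the thesis codebase.

```python
import numpy as np

def iou_2d(box_a, box_b):
    """2D IoU between boxes given as (x_center, y_center, w, h) in pixels."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def score_matrix(boxes_p, boxes_s, p0_p, p0_s, z_p, z_s):
    """Builds M = M_IoU ∘ M_obj ∘ M_depth for one sample.

    boxes_p, boxes_s: (n, 4) and (k, 4) arrays of 2D boxes
    p0_p, p0_s:       background probabilities of each detection
    z_p, z_s:         center-point z-coordinates (depth) of each detection
    """
    m_iou = np.array([[iou_2d(bp, bs) for bs in boxes_s] for bp in boxes_p])   # (3.3)
    m_obj = np.outer(1.0 - np.asarray(p0_p), 1.0 - np.asarray(p0_s))           # (3.4)
    m_depth = np.minimum.outer(z_p, z_s) / np.maximum.outer(z_p, z_s)          # (3.5)-(3.6)
    return m_iou * m_obj * m_depth, m_iou, m_depth                             # (3.7)
```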
The purpose of the IoU matrix M_IoU is to make matched detections close to each other in position, which is the most crucial part. The objectness matrix M_obj aims to ensure that both detections in a match have high confidence of detecting a real-world object; otherwise, there is a risk of matching a false positive. Finally, M_depth prevents the depth difference between detections in a match from being unrealistically large, which is possible if matching with respect to 2D IoU only.

The final step before solving the 2DA with the Hungarian method on M is to filter out any matchings that are not feasible. This is determined using two thresholds, τ_IoU and τ_depth. If an element at position (i, j) in M_IoU is smaller than τ_IoU, the corresponding element m_ij in M is set to 0. The same applies to M_depth using τ_depth. Before running the Hungarian method, dummy rows or columns are added as described in Section 2.5.1 if M is not a square matrix. Let n' = max(n, k). The 2DA is originally formulated to minimize the total cost of the problem (see equation (2.9)). In this case, we want to maximize the total matching score, i.e.

maximize   \sum_{i=1}^{n'} \sum_{j=1}^{n'} y_{ij} m_{ij}    (3.8a)
subject to \sum_{i=1}^{n'} y_{ij} = 1,  ∀ j ∈ {1, ..., n'},    (3.8b)
           \sum_{j=1}^{n'} y_{ij} = 1,  ∀ i ∈ {1, ..., n'},    (3.8c)
           y_{ij} ∈ {0, 1},  ∀ i, j ∈ {1, ..., n'}    (3.8d)

where y_ij = 1 if the ith object in X_P is assigned to the jth object in X_S. To be able to use the Hungarian method for finding the best matchings, we need to convert the problem into a minimization problem. The objective function in (3.8a) is converted to

minimize \sum_{i=1}^{n'} \sum_{j=1}^{n'} y_{ij} \left( \big(\max_{(i,j)} m_{ij}\big) - m_{ij} \right).    (3.9)

Now, the previously largest value in M has become the smallest and the previously smallest values have become the largest. From the Hungarian method, we receive pairs of indices of the form (i, j), where i is the index of a detection from X_P and j the index of a detection from X_S. Any pair where i > n or j > k is a match involving dummy rows or dummy columns and is discarded. Furthermore, three conditions need to hold for a match (i, j) to be valid:

1. The element m_ij in M needs to be > 0.
2. The element at position (i, j) in M_IoU needs to be ≥ τ_IoU.
3. The element at position (i, j) in M_depth needs to be ≥ τ_depth.

Any match that does not fulfill these conditions is discarded. The matchings that are not discarded are put in a set

X_match = {(X_{P,a}, X_{S,b}), ..., (X_{P,c}, X_{S,d})}    (3.10)

where a and b are indices of detections from X_P and X_S, respectively, that are matched to each other.

Since M_S uses point clouds, which have a finite range, there might be detections from M_P that can never be matched to anything from M_S because their position is beyond the range of the point cloud. These unmatched detections from M_P would lead to a non-constructive discrepancy, which is why they are left out. This is done by filtering out any unmatched detections from M_P that have a z-coordinate further away than δ_z, where δ_z is set to the maximum range of the LiDAR sensor. The numbers of detections from X_P and X_S that remain unmatched after this filtering are denoted µ_P and µ_S, respectively. µ_P is interpreted as the number of objects that M_P detects but M_S does not, and µ_S as the number of objects that M_S detects but M_P does not. Both µ_P and µ_S are included in the discrepancy measure, under the assumption that µ_P is the number of false detections and µ_S is the number of detections that M_P missed.
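The matching step itself can be sketched with an off-the-shelf assignment solver. The following is a minimal sketch using SciPy's linear_sum_assignment; the threshold values are placeholders, not the thesis settings, and maximize=True replaces the explicit conversion in (3.9) while SciPy handles rectangular matrices internally, so no dummy rows or columns are needed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(m, m_iou, m_depth, tau_iou=0.1, tau_depth=0.5):
    """Solves the 2DA on the score matrix M and filters the matches.

    m, m_iou, m_depth: (n, k) matrices, e.g. from score_matrix() above
    tau_iou, tau_depth: feasibility thresholds (placeholder values)
    Returns a list of valid (i, j) index pairs.
    """
    m = m.copy()
    m[(m_iou < tau_iou) | (m_depth < tau_depth)] = 0.0    # infeasible pairs get score 0

    rows, cols = linear_sum_assignment(m, maximize=True)  # Hungarian-style assignment

    matches = []
    for i, j in zip(rows, cols):
        # Conditions 1-3 for a valid match.
        if m[i, j] > 0 and m_iou[i, j] >= tau_iou and m_depth[i, j] >= tau_depth:
            matches.append((i, j))
    return matches
```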
By letting µ_P affect the discrepancy measure, we intend to select samples that increase precision performance. Similarly, we intend to select samples that increase recall performance by letting µ_S affect the discrepancy measure. We refer to Section 3.5.3 for details regarding how the discrepancy measure is affected by µ_P and µ_S.

3.5.2 Discrepancy between a Pair of Matched Detections

We have two detections, X_P and X_S, coming from M_P and M_S, respectively. The discrepancies in attributes, i.e. position, dimension, and rotation, are taken into account by calculating the absolute differences between the attributes of X_P and X_S. For a detection X, we define the attributes of X as

X^attr = (x, y, z, h, w, l, r)^T    (3.11)

where (x, y, z) is the center point, (h, w, l) the dimensions and r the rotation around the y-axis (yaw) of X (as described in Section 3.2). The rotation is modified as r := r mod π such that r ≥ 0. The absolute differences between the attributes of X_P and X_S are then

|X^attr_P − X^attr_S| = (Δx, Δy, Δz, Δh, Δw, Δl, Δr)^T.    (3.12)

To avoid collisions, objects closer to the ego vehicle are often more important to predict accurately than objects further away. Therefore, we define the relative depth difference as Δz_rel = Δz / max(z_P, z_S). Since the output from M_S could be invariant to heading direction, as in M_F (see (A.8)), the rotation difference needs to be invariant as well. We define the invariant rotation difference as Δr_inv = min(Δr, π − Δr). The maximum difference in rotation is therefore π/2, as seen in Figure 3.2. We define the difference between the attributes as

Δ^attr = (Δx, Δy, Δz_rel, Δh, Δw, Δl, Δr_inv)^T.    (3.13)

Figure 3.2: Illustration of the rotation difference between two bounding boxes. The difference is largest when the bounding boxes are orthogonal.

The different components of Δ^attr differ in magnitude, which would make it hard to combine them into a single measure. Therefore, we normalize Δ^attr using a vector

β = (β_x, β_y, β_z, β_h, β_w, β_l, β_r)^T    (3.14)

containing maximum difference values that are set based on observations in the dataset D. The normalized vector of attribute differences is the attribute discrepancy vector

d^attr = Δ^attr ⊘ β = (d_x, d_y, d_z, d_h, d_w, d_l, d_r)^T    (3.15)

where ⊘ is the Hadamard (element-wise) division operator.

Furthermore, to include the discrepancy in class prediction, we calculate the L1 norm of the difference between the probability vectors of X_P and X_S. The probability vector of a detection X is

X^prob = (p_0, p_1, ..., p_9)^T    (3.16)

where p_0 is the probability of X being background and p_1, ..., p_9 are the probabilities for the 9 other classes. The L1 norm of the difference (the probability discrepancy) is

d_prob = ||X^prob_P − X^prob_S||_1.    (3.17)

Combined into one column vector, the attribute discrepancies and the probability discrepancy are denoted

d = (d^attr, d_prob)^T = (d_x, d_y, d_z, d_h, d_w, d_l, d_r, d_prob)^T.    (3.18)
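For one matched pair, the discrepancy vector in (3.18) can be computed as in the minimal sketch below. The function name and argument layout are assumptions made for illustration; the normalization constants beta correspond to the vector β in (3.14).

```python
import numpy as np

def pair_discrepancy(attr_p, attr_s, prob_p, prob_s, beta):
    """Discrepancy vector d for one matched pair, following (3.11)-(3.18).

    attr_p, attr_s: length-7 arrays (x, y, z, h, w, l, r) of the two detections
    prob_p, prob_s: length-10 class probability vectors (background first)
    beta:           length-7 normalization constants (3.14)
    """
    attr_p, attr_s = np.asarray(attr_p, float), np.asarray(attr_s, float)
    diff = np.abs(attr_p - attr_s)                                    # (3.12)

    # Relative depth difference and heading-invariant rotation difference (3.13).
    diff[2] = diff[2] / max(attr_p[2], attr_s[2])
    d_r = abs(attr_p[6] % np.pi - attr_s[6] % np.pi)
    diff[6] = min(d_r, np.pi - d_r)

    d_attr = diff / np.asarray(beta, float)                           # (3.15)
    d_prob = np.abs(np.asarray(prob_p) - np.asarray(prob_s)).sum()    # L1 norm, (3.17)
    return np.append(d_attr, d_prob)                                  # (3.18)
```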
3.5.3 Discrepancy for a Sample

Let

k' = |X_match|    (3.19)

be the total number of pairs of matched detections for sample X. Then, let

D = (d_1 ... d_{k'})    (3.20)

be the matrix containing all column vector discrepancies (3.18) for the matched detections of X.

We want to avoid selecting only samples that contain many objects, since they take more time to annotate. Samples with many objects are therefore often more expensive to annotate than samples with fewer objects. To prevent over-selection of samples where k' is large, we take the row-wise average of D. If a row-wise sum were used instead, the discrepancy of a sample would be directly correlated with the number of objects, and samples that are expensive to annotate would be selected first. Thereby, let rowavg be the function that calculates the column vector containing the row-wise average values of a matrix, and let

d̄ = rowavg(D).    (3.21)

Now, let

w = (w_x, w_y, w_z, w_h, w_w, w_l, w_r, w_prob, w_P, w_S)^T    (3.22)

be a vector of weights. The weighted discrepancy vector is then calculated as

d_w = w ∘ (d̄^T, f(µ_P), f(µ_S))^T    (3.23)

where f is a function determining the discrepancy for unmatched objects. We consider w and f to be parameters that can be set to change the behavior of the algorithm. The parameter settings we use in our experiments are explained in Section 4.4. The final step is to combine all the components into a single scalar value, which is done using a squared L2 norm, i.e.

d = ||d_w||_2^2    (3.24)

where d denotes the total discrepancy value for sample X.

3.5.4 Main Approach Examples

Figure 3.3 shows an example of a sample with a low discrepancy value and Figure 3.4 an example of a sample with a higher discrepancy value. In Appendix A.1, the matrices M_IoU, M_obj and M_depth corresponding to the problem instance in Figure 3.3 are shown, as well as the matrix M used for solving the 2DA with the Hungarian method.

3.6 Alternative Approach

AA is based on the assumption that a good match between detections is a match with low discrepancy. Instead of matching the detections using M_IoU ∘ M_obj ∘ M_depth, as in MA, the detections are essentially matched using the discrepancy measure explained in Section 3.5.2. The Hungarian method is used to match detections such that the total discrepancy is minimized.

Let X_P = {X_{P,1}, ..., X_{P,n}} and X_S = {X_{S,1}, ..., X_{S,k}} be two sets of detections for sample X coming from M_P and M_S, respectively. Then

d_ij = ||w_matched ∘ d_ij||_2^2    (3.25)

is the discrepancy between detections X_{P,i} and X_{S,j}, where d_ij on the right-hand side is the vector of attribute and probability discrepancies between X_{P,i} and X_{S,j} according to (3.18). w_matched is defined as w from (3.22) without the last two elements, w_P and w_S. We then define the discrepancy matrix to be

M_disc = \begin{pmatrix} d_{11} & \cdots & d_{1k} \\ \vdots & \ddots & \vdots \\ d_{n1} & \cdots & d_{nk} \end{pmatrix}.    (3.26)

Figure 3.3: Example of a sample where the discrepancy is small (detections from M_P and M_F). There is only a small difference in depth between the matched detections, and the largest differences are for detections far away from the ego vehicle, meaning small discrepancy since the depth difference is relative.

Figure 3.4: Example of a sample where the discrepancy is large (detections from M_P and M_F). M_F is more accurate in position than M_P, as seen in the BEV point cloud to the right.

For solving the 2DA using the Hungarian method, we then construct M by

M = M_disc ⊘ (1 − M_IoU)    (3.27)

where M_IoU is the IoU matrix as described in Section 3.5.1. Since we want to minimize the total discrepancy and since we use 1 − M_IoU instead of M_IoU, the 2DA on M is a minimization problem. Similar to Section 3.5.1, if an element at position (i, j) in M_IoU or M_depth is smaller than τ_IoU or τ_depth, respectively, the corresponding element m_ij in M is changed. Since M now corresponds to a minimization problem, the values are not set to 0 as in Section 3.5.1 but instead to a very large value, e.g. 99999.
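The construction of the AA cost matrix in (3.25)-(3.27) can be sketched as below. It reuses the per-pair vectors from (3.18) and the M_IoU and M_depth matrices from Section 3.5.1; the threshold values and the element-wise division in (3.27), taken as the ⊘ operator defined earlier, are assumptions of this sketch rather than verified thesis settings.

```python
import numpy as np

BIG = 99999.0  # cost for infeasible pairs, as in the text

def aa_cost_matrix(d_vectors, m_iou, m_depth, w_matched,
                   tau_iou=0.1, tau_depth=0.5):
    """Cost matrix M for the Alternative Approach, following (3.25)-(3.27).

    d_vectors: (n, k, 8) array of per-pair discrepancy vectors from (3.18)
    m_iou, m_depth: (n, k) matrices from Section 3.5.1
    w_matched: length-8 weight vector (w without w_P and w_S)
    tau_iou, tau_depth: placeholder thresholds
    """
    m_disc = np.sum((d_vectors * w_matched) ** 2, axis=-1)        # (3.25), squared L2 norm
    m = m_disc / np.clip(1.0 - m_iou, 1e-6, None)                 # (3.27), avoid divide-by-zero
    m[(m_iou < tau_iou) | (m_depth < tau_depth)] = BIG            # infeasible pairs get a large cost
    return m
```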
Note that M is changed for elements where M_depth is smaller than τ_depth even though M_depth is not used to calculate M. Dummy rows and columns are added if necessary before using the Hungarian method to solve the 2DA on M. The returned pairs of indices from the Hungarian method are filtered in the same way as in Section 3.5.1. Let E_matched be the set containing the filtered and matched indices for the detections from X_P and X_S. That is, (i, j) ∈ E_matched means that the detection with index i in X_P is matched to the detection with index j in X_S. The discrepancy of matched detections, d_matched, is then

d_matched = \frac{1}{k'} \sum_{(i,j) \in E_matched} m_{ij}    (3.28)

where k' = |E_matched| and m_ij is the element at position (i, j) in M. The discrepancy of unmatched detections, d_unmatched, is

d_unmatched = ||(w_P, w_S)^T ∘ (µ_P, µ_S)^T||_2^2.    (3.29)

The total discrepancy for sample X is then the sum of d_matched and d_unmatched,

d = d_matched + d_unmatched.    (3.30)

3.7 Discrepancy Measure Development Process

Sections 3.5 and 3.6 describe two approaches to quantifying discrepancy. Both are built around the same central measure of discrepancy. The different components of the discrepancy measure, as well as their parameters, are chosen through a development process that relies heavily on subjective decisions. The process involves a small set of (∼400) randomly selected samples and, for each sample, two sets of cached detections coming from M_P and M_F, respectively. The samples and their detections are then all processed using MA or AA (depending on what is being developed) and each sample is assigned a discrepancy. By manually looking at the images (example images are shown in Figure 3.5) and subjectively relating their discrepancy to how valuable they are believed to be for M_P to train on, design choices are made and parameter values are set. The goal is for the discrepancy measure to assign a large discrepancy to samples that M_P is believed to benefit from training on.

Figure 3.5: Example images (detections from M_P and M_F) used when developing the discrepancy measure. Discrepancy values for each pair of matched detections are shown to the left. The dark boxes in the bottom left of the image show unweighted and weighted mean discrepancy values.

3.8 Selection Strategy

After the discrepancy for each sample in the non-annotated dataset has been calculated, the samples to send for annotation are selected. We compare two different selection strategies, described in the following subsections.

3.8.1 Standard Selection

First, N is sorted according to the discrepancy. The b samples from N with the largest discrepancy are selected for annotation, as in Section 2.2. We refer to this selection strategy as Standard Selection (SS). By only considering the largest discrepancy when selecting, the resulting set of selected samples is not necessarily diverse. For example, suppose we evaluate on a dataset consisting solely of front camera images. If the selected set of samples consists solely of back camera images, the DNN will most likely perform poorly. However, selecting only front camera images is not optimal either in this case, as there could still be valuable information in the other camera angles. A more diverse set of samples can be achieved by selecting samples from each camera angle, as described in the next subsection.

3.8.2 Camera Angle Selection

As in SS, Camera Angle Selection sorts N according to discrepancy. Still, b samples are selected for annotation, but with restrictions.
Only a fraction of b can be selected for each camera angle, and the fractions are fixed and determined before the AL run begins. For example, suppose b must contain 50% front camera images and 50% back camera images. The b samples will then consist of the b/2 front camera images with the largest discrepancy and the b/2 back camera images with the largest discrepancy. We refer to this selection strategy as Camera Angle Selection (CAS).

4 Experiment Settings

This chapter explains in detail how the experiments are set up. We select subsets of the whole dataset to speed up experiments; this is explained in Section 4.1. Furthermore, we describe how the method is evaluated and compared to other methods in Section 4.2. We give details on the settings of the AL algorithm in Section 4.3, and the parameter settings for the discrepancy measure are explained in Section 4.4.

4.1 Experiment Data Selection

Our experiments are done on D (see Section 3.1), which contains around 80 000 annotated samples. We use a 90/10 train/validation split when creating the training set D_T and the validation set D_V. The split is based on the date when the data was gathered, instead of splitting randomly, to avoid getting too similar images in training and validation. The training set D_T contains images taken from several camera angles. However, D_V contains images taken only from front cameras. The reason is that front cameras are more important than any other camera angle in a self-driving car.¹

All data in D have annotations. For the experiments, some of that data is treated as if it does not have annotations, as described in Section 2.2.2. In that case, the annotations are simply hidden, i.e. not used by the algorithm, and are revealed upon the AL algorithm's request. All experiments are initialized with N as a random subset of D_T of size 30000. The reduction in size speeds up the experiments and makes it easier to iterate over different methods. A random subset of N of size 3000 is then moved to A (and the annotations are revealed) as an initial training set for the AL algorithm. Furthermore, 1000 randomly selected samples from D_V are moved to a set V, which is used in the experiments as the validation set.

The random selections that produce the initial data for the non-annotated dataset N, the annotated dataset A, and the validation dataset V are provided with a random seed. This enables us to reproduce experiments with the same initial datasets by using identical random seeds. It also enables cross-evaluation of our methods with different initial datasets using different random seeds.

¹ Just consider the fraction of the time you are looking forward while driving.

4.2 Evaluation

We evaluate our method by measuring Key Performance Indicator (KPI) gains of M_P when evaluating on V. Our main KPIs are AP and F1 score. AP and F1 score both depend on IoU, and the IoU between 3D bounding boxes is often close to zero since the position is not always accurately predicted by M_P. Therefore, both KPIs are calculated using IoU between 2D bounding boxes. AP is calculated as in the VOC2007 challenge, i.e. (2.26), but with 101 recall points, meaning the interpolated precision is evaluated at r = 0, 0.01, 0.02, ..., 0.99, 1. The confidence is calculated as 1 − p_0, meaning any class that is not background is taken into account. The F1 score is calculated for 101 different confidence thresholds: 0, 0.01, 0.02, ..., 0.99, 1.
A detection is only counted if its confidence is above the threshold. The maximum F1 score over all 101 thresholds is presented.

Since AP and F1 score are insensitive to 3D position, it is of interest to see whether the performance in position improves. As mentioned in Section 2.3, predicting an object's position in a point cloud is less difficult than in an image. Therefore, M_S most likely predicts the position of a detection more accurately than M_P, and a large discrepancy in position could expose samples where M_P is inaccurate. We measure the Mean Absolute Error (MAE) in center point coordinates between detections and the annotated objects. As objects closer to the ego vehicle are more important to predict accurately, we calculate a relative position error. For a given detection X and an annotation Y with positions

X^pos = (x^X, y^X, z^X)^T,  Y^pos = (x^Y, y^Y, z^Y)^T    (4.1)

respectively, we have the relative position error

ε = \frac{||Y^pos − X^pos||_2}{||Y^pos||_2}    (4.2)

where the denominator is the distance from the ego vehicle to the annotated object, since the ego vehicle is the origin (see Section 3.1). We calculate the MAE in position as

MAE_pos = \frac{1}{N} \sum_{i=1}^{N} ε_i    (4.3)

where N is the total number of true positives.

4.2.1 Baselines

We compare our method to a main baseline based on selecting samples randomly. The main baseline is defined as the average of 3 independent runs (with the same initial data as described in Section 4.1). We also compare to uncertainty sampling with entropy (see Section 2.2). Each sample is scored according to the formula

− \sum_{i=0}^{n} \sum_{j=0}^{9} p_{j,P,i} \log(p_{j,P,i})    (4.4)

where p_{j,P,i} denotes the probability of class j for object i. The samples with the largest entropy are selected for annotation.

4.3 Active Learning Setting

The budget b is set to 3000 samples, with the motivation that a sufficient number of samples need to be added in order to make a difference to the model parameters. The experiment is run for 5 AL iterations, meaning M_P is first trained on the initial dataset A and then continues training on 6000, 9000, 12000, 15000 and lastly 18000 samples. Each AL iteration trains for 100 epochs, including the training on the initial training set. 100 epochs do not guarantee total convergence, but they keep the experiment time low and we believe they are enough to measure the performance of an AL iteration. Evaluation, i.e. KPI calculation, on V runs every 5 epochs. For each KPI, the value for an AL iteration is the mean value of the last 5 evaluations, i.e. the last 25 epochs.

4.4 Discrepancy Measure Parameter Setting

Weights are set to define the importance of each attribute. By looking at examples like Figure 3.5, as described in Section 3.7, weights can be set to determine whether the discrepancy for a specific sample should be small or large. Furthermore, by setting an attribute weight to zero, we can investigate whether the attribute has a positive or negative effect when selecting samples. The parameters are set to

β = (β_x, β_y, β_z, β_h, β_w, β_l, β_r)^T = (80, 10, 1, 5, 5, 50, π/2)^T,
w = (w_x, w_y, w_z, w_h, w_w, w_l, w_r, w_prob, w_P, w_S)^T = (1, 0.125, 0.6, 1, 1, 1, 0.25, 0.1, 0.005, 0.016)^T,
f(µ) = µ,  δ_z = 160.    (4.5)

An object in D captured in an image has its center point approximately located in a range of [−40, 40] meters in the x-coordinate and [−5, 5] meters in the y-coordinate.
Since Δz is relative and already ranges between 0 and 1, we set β_z = 1. The maximum difference values for the center point are therefore set to (β_x, β_y, β_z) = (80, 10, 1). For the maximum difference in dimensions, (β_h, β_w, β_l) = (5, 5, 50) is the theoretical maximum size of a large object in D, e.g. a train. The maximum difference in rotation is β_r = π/2, as explained in Section 3.5.2.

The choice of weights depends on the discrepancy between M_P and M_F detections. There is a larger difference in the z-coordinate, rotation r, and probability than in the other attributes. Therefore, we set w_z, w_r and w_prob smaller in order to put more focus on the other attributes. Almost all objects are located at the same height relative to the ego vehicle, which makes the y-coordinate quite easy to predict. Since the y-coordinate is not as important to improve as the other coordinates, we set w_y to be small. To have a linear growth for unmatched objects, we set f(µ) = µ. We set w_S > w_P since we value recall higher than precision. Furthermore, we set δ_z = 160, i.e. the maximum range of the LiDAR sensor, because M_F does not detect objects where z > 160.

We also conduct experiments using f(µ) = 1.1^µ to investigate how an exponential function affects the performance of the method. When the exponential function is tested, the weights w_P = 0.01 and w_S = 0.05 are used.

5 Results

This chapter presents our results, beginning with using annotations as outputs for M_S in Section 5.1 and continuing with using M_F in a real-case scenario in Section 5.2.

5.1 Annotation Discrepancy

As described in Section 3.3.1, the first experiments measure the discrepancy between the detections from M_P and M_A, i.e. the annotations, to verify the approach before using a real DNN as M_S. In Sections 5.1.1 to 5.1.4, the Main Approach (MA) is tested and different parameter settings are compared. Then, in Section 5.1.3, MA and AA are compared. Finally, the discrepancy method is compared to the entropy and random baselines in Section 5.1.5. The presented KPIs are AP, F1 score for pedestrians, bicyclists, and vehicles, as well as position MAE. All figures contain two plots, showing the results for two different random seeds for the initial data selection. Each plotted line is the average of 3 independent runs with equal settings. The lines are accompanied by limits indicating the maximum and minimum values observed for the 3 runs.

Figure 5.1: AP vs. number of annotated samples. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. (a) Random seed 1; (b) Random seed 2.

5.1.1 Selection Strategy

As described in Section 3.8, two different selection strategies are tested, Camera Angle Selection (CAS) and Standard Selection (SS). Looking at AP (Figure 5.1), it is quite clear that CAS contributes to better performance than SS. Position MAE (Figure 5.2) and F1 score for vehicles (Figure 5.3) also indicate better performance using CAS than SS. Furthermore, the F1 scores for pedestrians (Figure 5.4) and bicyclists (Figure 5.5) fluctuate too much to give a clear picture of the differences between CAS and SS.
We can, however, note that during the last 2 AL iterations, CAS is better than SS on F1 score for pedestrians.

5.1.2 Unmatched Detections

Besides comparing CAS to SS, we also compare using f(µ) = µ to using f(µ) = 1.1^µ, i.e. linear or exponential growth with the number of unmatched objects. AP (Figure 5.1) shows almost no difference between a linear and an exponential f. Position MAE (Figure 5.2) shows similar results. The average values of position MAE hint that a linear f using CAS is the best approach; however, that observation is considered to be within the margin of error. We consider the exponential function to be the better choice, with the note that the differences are minimal.

Figure 5.2: Position MAE vs. number of annotated samples. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. (a) Random seed 1; (b) Random seed 2.

Figure 5.3: F1 score for vehicles vs. number of annotated samples. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. (a) Random seed 1; (b) Random seed 2.

Figure 5.4: F1 score for pedestrians vs. number of annotated samples. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. (a) Random seed 1; (b) Random seed 2.

Figure 5.5: F1 score for bicyclists vs. number of annotated samples. Comparison between SS and CAS as well as linear and exponential growth for unmatched detections. (a) Random seed 1; (b) Random seed 2.

5.1.3 Alternative Approach

Experiments with exponential growth for unmatched detections and CAS were made using AA to compare its performance to MA. As seen in Figure 5.6, the AP when using AA is similar to that when using MA. The same holds for position MAE, according to Figure 5.7, even though the results are spread out. Figure 5.8 shows a slightly better performance on F1 score for vehicles using MA. The F1 score for pedestrians (Figure 5.9) shows better performance using MA towards the last AL iterations, while the results for bicyclists (Figure 5.10) are rather ambiguous.
Figure 5.6: AP vs. number of annotated samples. Comparison between MA and AA. (a) Random seed 1; (b) Random seed 2.

Figure 5.7: Position MAE vs. number of annotated samples. Comparison between MA and AA. (a) Random seed 1; (b) Random seed 2.

Figure 5.8: F1 score for vehicles vs. number of annotated samples. Comparison between MA and AA. (a) Random seed 1; (b) Random seed 2.

Figure 5.9: F1 score for pedestrians vs. number of annotated samples. Comparison between MA and AA. (a) Random seed 1; (b) Random seed 2.

Figure 5.10: F1 score for bicyclists vs. number of annotated samples. Comparison between MA and AA. (a) Random seed 1; (b) Random seed 2.

5.1.4 Position-only Discrepancy Measure

In Section 4.4 we mention that weights can be set to zero to evaluate their importance for the discrepancy measure. One such experiment has been conducted where all weights except w_x, w_y, w_z are set to 0. In this configuration, dimension and rotation are ignored, and neither probability discrepancy nor unmatched detections are measured. The experiment aims to investigate whether the performance of position estimation increases if we only consider samples where the position discrepancy is large. It also indicates the influence that the other components, such as dimension and unmatched detections, have on the performance gain. Figure 5.11 shows that the AP performance is reduced marginally when using the position-only discrepancy measure, compared to when using the full discrepancy measure, i.e. where no weights are set to 0. Figure 5.12 shows similar results for position MAE.

Figure 5.11: AP vs. number of annotated samples. Comparison between full discrepancy and position-only discrepancy. (a) Random seed 1; (b) Random seed 2.
Figure 5.12: Position MAE vs. number of annotated samples. Comparison between full discrepancy and position-only discrepancy. (a) Random seed 1; (b) Random seed 2.

Figures 5.11 and 5.12 show performance with respect to the number of annotated samples. However, if the total annotation cost is to be reduced, it is also relevant to look at performance with respect to the number of annotated objects. Annotating a sample that contains many objects takes more time than annotating a sample with few objects. Therefore, Figure 5.13 shows AP per annotated 2D object (which includes 3D objects as well) and Figure 5.14 shows position MAE per annotated 3D object. These plots show that the discrepancy measure based solely on position achieves better performance with fewer annotated objects, i.e. most likely at a lower cost, than the full discrepancy measure.