Drone Detection Using Deep Neural Networks and Semi-Supervised Learning

ALICE KARLSSON
GUSTAV ROSIN

Master's Thesis 2020
Department of Electrical Engineering
Chalmers University of Technology
Gothenburg, Sweden 2020

© ALICE KARLSSON, 2020. © GUSTAV ROSIN, 2020.

Supervisor: Lucas Brynte, Department of Electrical Engineering
Examiner: Fredrik Kahl, Department of Electrical Engineering

Department of Electrical Engineering, Chalmers University of Technology, SE-412 96 Gothenburg, Telephone +46 31 772 1000

Cover: Drone in a rural area. Typeset in LaTeX, template by David Frisk. Printed by Chalmers Reproservice, Gothenburg, Sweden 2020

Abstract

The usage of drones has increased in recent years for both civilian and military purposes. With their small size and tractability, weaponized drones pose a major threat and are difficult to detect and classify with modern equipment such as radar. Since drones share many features with other common objects in their operating space, such as birds, radar systems struggle to classify drones accurately. Another approach to detecting drones, as proposed in this thesis, is to utilize a camera-based deep-learning object detection algorithm to detect and classify them. A deep-learning algorithm requires extensive computational resources and a vast amount of annotated data, but the availability of both is often limited. This thesis optimizes and adapts a RetinaNet and implements temporal information using three different methods. The implementations of temporal information utilize a pre-trained backbone to minimize the demand for annotated data. Furthermore, a semi-supervised learning framework is developed to enable the use of unannotated data and background data. The framework generates annotations for unannotated data, thus expanding the amount of available data. The methods for integrating temporal information and the semi-supervised learning framework were evaluated against the same test data as other state-of-the-art algorithms. The results show that the proposed methods for integrating temporal information were not advantageous with regard to the AP-score. However, by incorporating the generated annotated data and background data, the performance of the algorithm improved considerably with regard to the F1-score. It could not outperform state-of-the-art methods; however, the resulting framework shows great promise as an annotation tool for unannotated data.

Keywords: Deep Learning, Object detection, RetinaNet, Temporal information, Detectron2, Semi-supervised learning.

Acknowledgements

We would like to express our sincere thanks to our supervisor Lucas Brynte, who has helped us during our thesis and provided invaluable ideas and insights. We would also like to greatly thank Angelo Coluccia and the team at SafeShore. Without the data provided by them, this master's thesis would not have been possible.
We would also like to thank our supervisors at Saab, Stefan Eriksson and Stefan Holmgren, for interesting discussions and help with shaping our project. Furthermore, we would also like to thank Per Johansson at Saab for helping us organize as well as collect data. Lastly, we would like to thank Fredrik Kahl for being our examiner.

Alice Karlsson, Gothenburg, June 2020
Gustav Rosin, Gothenburg, June 2020

Contents

1 Introduction
  1.1 Background
  1.2 Related Work
  1.3 Purpose
  1.4 Proposed Approach
    1.4.1 Integration of Temporal Information
    1.4.2 Use of Unannotated Data
  1.5 Scope and Limitations
  1.6 Tools and Equipment
    1.6.1 Google Colab
    1.6.2 PyTorch
    1.6.3 Detectron2
  1.7 Contribution
  1.8 Report Outline
2 Theory
  2.1 Neural Networks
    2.1.1 Activation Functions
    2.1.2 Loss Functions
      2.1.2.1 Cross-Entropy Loss
      2.1.2.2 Focal Loss
      2.1.2.3 Smooth L1-Loss
    2.1.3 Optimizers
    2.1.4 Convolutional Neural Networks
    2.1.5 Transfer Learning
  2.2 Object Detection
    2.2.1 Two-Stage Detection
    2.2.2 One-Stage Detection
    2.2.3 IoU
    2.2.4 Non-Maximum Suppression
  2.3 ResNet
    2.3.1 Vanishing and Exploding Gradient Problem
    2.3.2 Performance Degradation
    2.3.3 ResNet Architectures
  2.4 Feature Pyramid Network
    2.4.1 Bottom-up Pathway
    2.4.2 Top-down Pathway with Lateral Connections
  2.5 RetinaNet
    2.5.1 Classification Subnet
    2.5.2 Regression Subnet
  2.6 Evaluation
    2.6.1 Precision
    2.6.2 Recall
    2.6.3 F1-Score
    2.6.4 Average Precision
  2.7 Semi-Supervised Learning
3 Methods
  3.1 Visualization and Analysis of Dataset
    3.1.1 Visualization of Data
    3.1.2 Analysis of Data
  3.2 Implementation of RetinaNet
    3.2.1 Overview of Implementation
    3.2.2 Anchor Placement in Implementation
    3.2.3 Sub-networks in the Implementation
    3.2.4 Anchor Matching with Ground Truth
    3.2.5 Training with Predictions and Ground Truths
    3.2.6 Inference
  3.3 Modification of RetinaNet
    3.3.1 Defining a Baseline for the RetinaNet
    3.3.2 Defining a Baseline for Training
    3.3.3 Utilizing Transfer Learning
    3.3.4 Inclusion of P2
    3.3.5 Defining a New Baseline for the RetinaNet
    3.3.6 Anchor Placements
    3.3.7 Pruning of P6 and P7
    3.3.8 Relaxation of IoU Thresholds
    3.3.9 Tuning of Inference Parameters
    3.3.10 Final RetinaNet
  3.4 Temporal Information
    3.4.1 Concatenation of Feature Maps
    3.4.2 Siamese Networks with Addition Merge
    3.4.3 Siamese Networks with Concatenation Merge
  3.5 Implementation of Semi-supervised Learning Framework
    3.5.1 Semi-supervised Learning Framework
    3.5.2 Evaluation of Generated Annotations
4 Results
  4.1 Evaluation Metrics
    4.1.1 Evaluation Drone vs Bird Detection Challenge
    4.1.2 Evaluation of Semi-Supervised Learning Framework
  4.2 Results RetinaNet
    4.2.1 Evaluation of Baseline
    4.2.2 Utilizing Transfer Learning
    4.2.3 Inclusion of P2
    4.2.4 Definition of New Baseline
    4.2.5 Anchor Placement
    4.2.6 Pruning of P6 and P7
    4.2.7 Relaxation of IoU Thresholds
    4.2.8 Tuning of Inference Parameters
    4.2.9 Results Drone vs Bird Detection Challenge
  4.3 Results Temporal Information
    4.3.1 Concatenation of Feature Maps
    4.3.2 Siamese Networks with Addition Merge
    4.3.3 Siamese Networks with Concatenation Merge
  4.4 Results Semi-Supervised Learning Framework
    4.4.1 Visual Inspection of the Semi-Supervised Learning Framework
    4.4.2 Semi-Supervised Learning Framework with Regards to the Drone vs Bird Detection Challenge
5 Discussion
  5.1 Discussion of Results of RetinaNet
  5.2 Discussion of Result Drone vs Bird Detection Challenge
  5.3 Discussion of Results of Temporal Information
    5.3.1 Concatenation of Feature Maps
    5.3.2 Siamese Networks with Addition Merge
    5.3.3 Siamese Networks with Concatenation
    5.3.4 Utilizing Temporal Information
  5.4 Discussion of the Semi-supervised Learning Framework
    5.4.1 Visual Inspection
    5.4.2 Semi-Supervised Learning Framework with Regards to the Drone vs Bird Detection Challenge
  5.5 Future Work
6 Conclusion
Bibliography
A Appendix 1
1 Introduction

This chapter will present the problems investigated in this master's thesis, related work, and the contributions made. Lastly, an outline for the report is presented.

1.1 Background

In recent years, the availability of drones has increased. With their low price and ease of access, drones have spread beyond military powers to civilian users as well. Drones are difficult to detect even with fairly modern equipment due to their small size, and traditional methods such as radar struggle to separate drones from birds. One viable approach to overcome the problem of identification and detection is to use camera vision to detect and classify the drones.

In 2019, the 16th IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS) held its annual Drone vs Bird Detection Challenge [1]. The goal of this challenge was the creation of a deep learning algorithm able to detect and classify drones in a video sequence where birds and motion in the foreground and background may be present.

Training neural networks often requires a vast amount of data. Since collecting and annotating such data can be onerous, the amount of data is often limited. This thesis proposes a method of identifying drones using a deep learning algorithm that maximizes the use of the limited usable data by integrating temporal information. Additionally, a semi-supervised learning framework for the incorporation of unannotated data will be investigated.

1.2 Related Work

In the Drone vs Bird Detection Challenge, different teams have created varying solutions to the challenge of drone detection. M. Nalamati et al. present a solution to the challenge using a Faster R-CNN with a ResNet-101 base [2]. Another solution was to incorporate super-resolution techniques in order to increase the recall and thus increase the number of detected drones; this solution was presented by V. Magoulianitis et al. [3]. C. Craye et al. won the 2019 competition and used a U-Net with ResNet110v2 and divided the detection and recognition paths into two networks [4]. D.
Iglesia et al. presented a solution using a version of a RetinaNet in order to detect drones [5]. Outside of the Drone vs Bird Detection Challenge, further work has been made with the RetinaNet. C. Fu et al presented a modified version of a RetinaNet that ob- 1 1. Introduction tained a higher accuracy without increasing the computational cost. The presented approach adds mask predictions and a different loss function [6]. This master thesis utilizes a version of a RetinaNet. The original RetinaNet was proposed by T. Lin et al. The network consists of a backbone for feature extraction and two sub-networks for classification and regression. A loss function for the classi- fication known as the focal loss was also proposed. This loss function aims to tackle the problem of heavy class imbalance between the foreground and the background. This resulted in the detector having the same accuracy as a two-stage detector but with the speed of a one-stage detector [7]. In order to incorporate temporal information into a neural network, G Sistu et al. proposed multi-stream fully convolutional networks. In their work a two stream FCN, a three stream FCN, and a network with two streams combined with an additional LSTM architecture were presented [8]. D. Chahyati et al. implemented a Siamese network based on a RetinaNet and incorporated the Hungarian algorithm for tracking humans in moving images [9]. X. Wang investigated the impact of different time steps for temporal information in a RetinaNet, and which combination is preferred in order to utilize temporal information [10]. Labeling a large amount of data can be very expensive when training a neural network, several methods exist that aim to reduce the amount of work needed. E. Sangineto et al. suggested a self-paced training protocol for object detection using only image level annotations. The region proposals of a Fast-RCNN were utilized to acquire proposal boxes and the box with the highest confidence score was marked as a pseudo-label. "Easy" examples were used in the early network in order to prevent the training from diverging and to reliably expand their training set [11]. 1.3 Purpose The purpose of this thesis is to detect and classify drones while simultaneously not classifying birds as drones. To accomplish this, this master thesis focuses on the study, development, and implementation of an object detection algorithm. The thesis also endeavors to improve an object detection algorithm without using large amounts of annotated data. Additionally, this thesis investigates a semi-supervised learning framework for the incorporation of unannotated data. It also aims to investigate whether the framework can be utilized to iteratively and reliably expand the amount of available annotated training data while utilizing unannotated data. 1.4 Proposed Approach The project will start by modifying a standard RetinaNet to better fit the available data. Once complete, different strategies for incorporating temporal information and simultaneously utilize a pre-trained backbone will be developed and implemented. Finally, a semi-supervised learning framework for the incorporation of unannotated data will be created and evaluated. At the end of the thesis, the developed algo- rithm should be able to detect and classify drones in a setting occupied by birds. 2 1. 
Introduction The framework will be used to investigate whether the algorithm can be used to it- eratively and reliably expand the amount of available annotated training data while only utilizing unannotated data. This framework is expected to be able to generate annotations for easy scenarios and be able to create a few annotations for more difficult scenarios. 1.4.1 Integration of Temporal Information In object detection, the integration of temporal information has been proven to increase accuracy significantly. Temporal information refers to information from both space and time simultaneously. For example, in between consecutive frames of a video sequence, an object has moved in both space and time. The information in both of these frames is highly correlated and several methods exist that aim to utilize this information. A pre-trained backbone often improves the accuracy of a detection algorithm when only a small amount of training data is available. Since pre-trained backbones are not typically trained with the inclusion of temporal information, the parameters are only trained as feature extractors. Therefore, to utilize the pre-trained backbone, temporal information needs to be integrated in such a way that the pre-trained parameters are still used as feature extractors. For example in the work of [5], a frame difference channel was added to the three RGB- channels of the input image. Since the backbone is only trained with a three-channel input, this implementation could not utilize pre-trained parameters. This master thesis will study and evaluate different ways temporal information can be integrated, while also being able to utilize a pre-trained backbone. This will be accomplished by incorporating the difference between feature maps for consecutive timesteps into the neural network, both after and within the backbone. 1.4.2 Use of Unannotated Data Annotated data is usually hard to find as well as expensive since it requires extensive manual labor. Therefore different ways to incorporate unannotated data will be in- vestigated. This will be accomplished through the development of a semi-supervised learning framework. This framework will utilize the algorithm to generate annota- tions on unannotated examples. 1.5 Scope and Limitations Due to time constraints, this master thesis will implement and evaluate three dif- ferent methods of integrating temporal information. A RetinaNet will be used as a starting point and optimized to the best of our ability. The results of the op- timized RetinaNet will be specifically tailored to the problem at hand and should not be seen as a general purpose solution. Due to the limited amount of data avail- able, the methods used to improve the algorithm will utilize a pre-trained backbone. Limited access to GPU resources will also restrict how computationally expensive the suggested algorithm can be. The object detection algorithm will only be tuned for the classification of drones with other objects such as birds being regarded as 3 1. Introduction background. Only one version of the semi-supervised learning framework will be considered. The aim of this framework is to investigate whether the algorithm can be utilized as a simple annotation tool for drones. This investigation will be limited to a limited amount of unannotated collected data since the goal is to investigate the rough performance of the framework. 
The framework will further be limited to drones of roughly the same size as in the available annotated dataset from the Drone vs Bird Detection Challenge [1]. Furthermore, the evaluation of the framework will be performed through visual inspection of a sample of generated annotations to determine whether the annotations are comparable to hand-annotated data. Further manual evaluation with regard to the confidence score will be made. Additionally, these annotations will be evaluated against the Drone vs Bird Detection Challenge test dataset. No additional hand-annotated data will be utilized to investigate the performance of the framework.

1.6 Tools and Equipment

This section presents the tools and equipment used in this master's thesis. The section begins with an introduction of Google Colab and PyTorch, and lastly the object detection library Detectron2 is presented.

1.6.1 Google Colab

Google Colab is a service from Google that allows users to execute written Python code in a browser and provides free access to GPU resources. However, the amount of resources available varies from day to day and depends on the current service load and on how much the user has recently used the service. The maximum continuous runtime is 12 hours and commonly available GPUs are Nvidia K80s, T4s, P4s, and P100s; however, it is not possible to choose which GPU one is assigned. The memory of the virtual machine also varies between runtimes, but it does not vary during a runtime [12].

1.6.2 PyTorch

PyTorch is a Python-based open-source framework developed by Facebook's artificial-intelligence research group (FAIR). It can be utilized in conjunction with GPUs and is commonly used when developing deep learning projects. It has a simple and easy-to-use API, making it user friendly.

1.6.3 Detectron2

The code was written in Python and used the platform Detectron2. Detectron2 is an open-source, PyTorch-based, modular object detection library created by Facebook's artificial-intelligence research group. State-of-the-art object detection algorithms, such as versions of Faster R-CNN and RetinaNet, are implementable and easy to further modify thanks to Detectron2's modular design. Pre-trained model weights are easily obtainable and usable. Furthermore, the datasets COCO, LVIS, CityScapes, and PascalVOC are integrated in Detectron2.

1.7 Contribution

The contribution of this thesis is an investigation of different methods for integrating temporal information with limited GPU resources and data for drone detection. The suggested semi-supervised learning framework has, to the best of our knowledge, not been utilized for object detection before. Therefore, the development and investigation of this framework is also seen as a contribution.

1.8 Report Outline

Chapter 2 will cover the relevant theory in this thesis. The theory will cover the basics behind neural networks, object detection, the basic RetinaNet model, as well as the evaluation metrics and semi-supervised learning.

Chapter 3 will cover the methodology used in this thesis. It will present the available data, describe how the standard RetinaNet was implemented and modified, and how the three different methods of temporal information were integrated, as well as give a description of the semi-supervised learning framework.

Chapter 4 will cover the results of the sections presented in the methodology as well as a more detailed description of the methods.

Chapter 5 will discuss the results in more detail and present suggested future work.
Chapter 6 will contain conclusions drawn from the results and discussion.

2 Theory

This chapter presents the relevant theory for this master's thesis and aims to ease the understanding of the project. Section 2.1 covers the theoretical basics of artificial convolutional neural networks. Section 2.2 introduces object detection and IoU. ResNet is presented in Section 2.3 and FPN in Section 2.4. RetinaNet is introduced in Section 2.5. Finally, different evaluation methods are presented in Section 2.6 and semi-supervised learning is introduced in Section 2.7.

2.1 Neural Networks

Neural networks are algorithms modeled to recognize patterns from an input, with a design inspired by the neural network structure inside the human brain. A biological neuron consists of a cell body, an axon, and dendrites. Most neurons receive input signals via their dendrites and then produce an output signal through their axon. The information between two neurons is transferred via synapses, which enable the passing of an electrical or chemical signal to the other cell.

An artificial neuron operates in much the same way as a biological neuron, see Figure 2.1. The input signal of an artificial neuron can be represented as $x_i$ and the synapse is represented by the weight $W_i$. A neuron may have inputs from multiple sources; the $i$ in these expressions thus represents each input. The information transferred into the dendrites of the target neuron can then be represented by $W_i x_i + b$, where $b$ is the bias. In an artificial neural network, information is encoded by the frequency with which the neuron sends information. The inputs to the neuron are summed and put through an activation function, which defines whether the neuron should be considered to have sent information or not.

Artificial neural networks are able to learn from information because the parameters $W_i$ and $b$ are trainable. Thus, when receiving an input, $W_i$ and $b$ should be selected so that the information received is properly decoded. This process of selecting the proper values for $W_i$ and $b$ is referred to as training. In a neural network, multiple neurons are used simultaneously. A group of neurons forms a layer, and there are several layers in a complete network. The layers can be divided into the input layer, the hidden layers, and the output layer. The input layer receives the initial input to the network and sends it through the hidden layers. The output of the network is produced after the hidden layers by the output layer. A standard neural network has all the neurons in one layer connected to all the neurons in the next layer. This structure is known as fully connected layers [13].

Figure 2.1: Visualization of an artificial neuron and a biological neuron [13]. CC-BY

2.1.1 Activation Functions

The activation function is responsible for deciding whether a neuron has sent information or not. The activation function is required to be non-linear to enable the stacking of multiple layers and the usage of gradient-based optimization methods. With a linear activation function, the stacked layers would simply be a linear combination of each other and could thus be replaced by a single layer. Furthermore, with gradient-based optimization, the gradient with respect to the input would always be constant if a linear activation function were utilized. This leads to an optimization that is not dependent on the input [14]. The ReLU and the Sigmoid function are two of the most commonly used activation functions, see Figure 2.2.
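As a minimal illustration, and assuming PyTorch (the framework used later in this thesis), the two activations defined formally below can be applied element-wise to a tensor:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# ReLU clamps negative inputs to zero and passes positive inputs through.
print(torch.relu(x))     # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])

# Sigmoid squashes every input into the range (0, 1).
print(torch.sigmoid(x))  # tensor([0.1192, 0.3775, 0.5000, 0.6225, 0.8808])
```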
The ReLU function can be written as:

\[
\mathrm{ReLU}(x) = \max(0, x). \tag{2.1}
\]

The Sigmoid function can be written as:

\[
\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}. \tag{2.2}
\]

Figure 2.2: Visualization of different commonly used non-linear activation functions [15]. CC-BY

2.1.2 Loss Functions

Once an input has been given to a neural network and an output is received, one needs to calculate whether this output was correct or not. This is accomplished by the inclusion of a loss function. The goal of the loss function is to quantify the error between the predicted output and the expected output, so that this error can be minimized. The expected output is referred to as the ground truth. There are different kinds of loss functions adapted for solving specific problems, for example regression and classification problems.

2.1.2.1 Cross-Entropy Loss

The cross-entropy loss can be defined by either binary or multi-class cross entropy. Multi-class cross entropy can be split into multiple binary cross entropy functions, where each function corresponds to one class of interest. In this thesis, binary cross entropy has been considered since only the drone class is of interest. The binary cross entropy loss can be written as:

\[
\mathrm{CE}(p, y) =
\begin{cases}
-\log(p), & \text{if } y = 1 \\
-\log(1 - p), & \text{otherwise.}
\end{cases}
\]

In this equation, $y$ is the ground truth and $p$ is the estimated probability that the class has the label $y = 1$. The equation for binary cross entropy can be rewritten as:

\[
\mathrm{CE}(p_t) = -\log(p_t), \tag{2.3}
\]

where $p_t$ is defined as:

\[
p_t =
\begin{cases}
p, & \text{if } y = 1 \\
1 - p, & \text{otherwise.}
\end{cases}
\]

The binary cross entropy loss receives its inputs from the last layer of the underlying neural network. The prediction is the estimated probability that the input was either foreground or background. Here, the foreground refers to the particular class that the binary problem corresponds to, and the background refers to the other classes and to no class [16].

2.1.2.2 Focal Loss

The focal loss is based on the commonly used cross entropy loss, although it is designed to combat class imbalance, see Figure 2.3. Class imbalance, the imbalance between objects labeled as foreground and background, is a common issue. In a dataset there might be far more instances labeled as background than as foreground. Should this be the case, the neural network might become overconfident in predicting background, since the contribution of background to the loss is far larger than the contribution of foreground. This would result in poor performance when classifying a foreground example. The focal loss aims to improve upon this problem by introducing a modulating factor to the cross entropy loss. The focal loss can be written as:

\[
\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t). \tag{2.4}
\]

The modulating factor $(1 - p_t)^{\gamma}$ is responsible for suppressing losses that have a large probability, because if a prediction is very accurate, the impact it has on the loss should be considerably smaller compared to a prediction with a lot of uncertainty [7].

Figure 2.3: Comparison of focal and cross entropy loss. Here, γ is a parameter that is tuned by the user. The standard cross entropy loss is recovered when γ = 0. From [7]. CC-BY
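To make the effect of the modulating factor concrete, the sketch below implements the binary focal loss of Equation (2.4) in PyTorch. This is a minimal illustration only: the γ value is just an example, and the thesis itself relies on the focal loss implementation provided by Detectron2.

```python
import torch

def binary_focal_loss(logits, targets, gamma=2.0):
    """Binary focal loss following Eq. (2.4).

    logits:  raw network outputs (before the sigmoid), shape (N,)
    targets: ground-truth labels in {0, 1}, shape (N,)
    """
    p = torch.sigmoid(logits)
    # p_t is the probability assigned to the true class.
    p_t = torch.where(targets == 1, p, 1.0 - p)
    # The modulating factor (1 - p_t)^gamma down-weights easy examples.
    loss = -((1.0 - p_t) ** gamma) * torch.log(p_t.clamp(min=1e-8))
    return loss.mean()

# A confident correct prediction contributes far less than an uncertain one.
logits = torch.tensor([4.0, 0.1])
targets = torch.tensor([1, 1])
print(binary_focal_loss(logits, targets))
```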
2.1.2.3 Smooth L1-Loss

Smooth L1-loss is a variant of the standard L1-loss that combines the properties of the L1-loss when the loss is large with those of the L2-loss as the loss gets smaller. The L1-loss can be written as:

\[
S = \sum_{i=1}^{n} |y_i - f(x_i)|. \tag{2.5}
\]

The L2-loss can be written as:

\[
S = \sum_{i=1}^{n} (y_i - f(x_i))^2. \tag{2.6}
\]

In these equations, $y_i$ is the ground truth and $f(x_i)$ is the approximated value. The L1-loss takes the absolute value of the difference between the target and the prediction, while the L2-loss takes the square of the difference. The L1-loss is advantageous when the loss is large since it is robust to outliers and produces sparser solutions; however, it is not differentiable when the loss is zero, see Figure 2.4, so gradient-based optimization will be sub-optimal there. Due to its quadratic nature, the L2-loss is differentiable when the loss is zero. The L2-loss produces more accurate results compared to the L1-loss because it penalizes larger errors to a higher extent; however, the L2-loss is more sensitive to outliers than the L1-loss [17].

Figure 2.4: Illustration of L1-loss, L2-loss and smooth L1-loss. Note how the derivative is undefined for the L1-loss when the loss is 0. From [18]. CC-BY

The smooth L1-loss combines the L1- and L2-loss and uses the parameter β to distinguish between them. The smooth L1-loss can be written as:

\[
\mathrm{smooth}_{L1}(x) =
\begin{cases}
0.5x^2/\beta, & \text{if } |x| < \beta \\
|x| - 0.5\beta, & \text{otherwise.}
\end{cases} \tag{2.7}
\]

When the loss is above the threshold decided by β, the regression loss behaves like the L1-loss. When the loss falls below this threshold, it instead behaves like the L2-loss, see Figure 2.5.

Figure 2.5: Illustration of smooth L1-loss. When the loss falls below a certain threshold, it switches to L2-loss. From [19]. CC-BY

2.1.3 Optimizers

Once the neural network has processed an input and the losses for the predictions are calculated, the trainable parameters need to be tuned in order to minimize the loss. This is done through an optimizer and the back-propagation algorithm. It is common to use gradient-based optimization methods, the most common of which is known as gradient descent, see Figure 2.6. In a neural network the number of parameters depends on the number of layers; for notational simplicity, the weights and biases of all the layers are collected in the parameter θ. The update equation for gradient descent can be written as:

\[
\theta_{t+1} = \theta_t - \alpha \frac{\partial E(X, \theta_t)}{\partial \theta}. \tag{2.8}
\]

In this equation, α is the learning rate, a hyperparameter that indicates how much the weights are allowed to be updated during training. A too small α will result in an optimization that is very slow, since the weights are updated by only a small amount in each pass. A too large α might result in the parameters changing too much and jumping over the optimum. $E(X, \theta_t)$ is the expected value of the loss function with network parameters θ at time t and input-output pairs $(x_i, y_i) \in X$. This equation illustrates how the parameters in θ are updated by using the previous weights and the gradient of the loss function with respect to those parameters given an input-output pair [20].

Figure 2.6: Illustration of gradient descent with one parameter, w. From [21]. CC-BY

In gradient descent, the weight update only occurs when the entire dataset has been processed. Due to the sheer number of parameters in a neural network, the convergence of the optimizer to the global minimum may be very slow. Therefore, an alternative form of gradient descent is utilized in neural networks, known as stochastic gradient descent (SGD). The difference between gradient descent and stochastic gradient descent is that instead of processing the entire dataset before updating the weights, SGD takes a sample and updates the weights based on this sample.

Neural networks usually contain many layers, which all contain a set of parameters.
When using stochastic gradient descent to perform optimization, it is necessary to know how each of these parameters affects the final loss function. To calculate this, the back-propagation algorithm is used. Back-propagation consists of four parts: the forward pass, the loss function, the backward pass, and the weight update. The forward pass passes an input through the network to get a prediction. The loss function calculates the error of the prediction with respect to the expected output. The backward pass is then performed in order to relate how all the parameters in all layers contribute to the loss function. Once the relation between the parameters for the layers is known for the specific input, the weights are updated in such a way that the loss is reduced [22]. 12 2. Theory 2.1.4 Convolutional Neural Networks A convolutional neural network (CNN) includes neurons, activation functions, a loss function, and an optimizer but the layers differ from a basic neural network. A con- volutional neural network utilizes convolutional layers instead of fully connected. The number of layers varies but the network consists of an input layer, multiple hidden layers, and an output layer. Convolutional layers are beneficial when deal- ing with images compared to standard fully connected layers since they utilize the concept of filters. An image can be thought of as a matrix with a specific width and height. In this matrix, each pixel is given a value based on its color gradient. Similarly, each filter is a matrix that also contains values, however, these values are trainable. If an image were to be processed in a fully connected fashion, the image matrix would need to be flattened into an array and each element would be assigned to a separate neuron as input. This would not only greatly increase the required parameters for the network, but the highly correlated information between adjacent pixels would not be utilized. Therefore, a convolution layer is more suitable for pro- cessing images compared to a fully connected one. A fully convolutional network is a convolutional network without any fully connected layers. These types of networks are very common in image segmentation. Given an input image, the filters move as a sliding window across the pixels in the image. The amount of pixels that the filter is moved each iteration is specified by the stride. When the filter moves, the numerical values of the image are multiplied by the parameters inside the filter with element wise multiplication. These values are then summed up and the resulting value represents the information that was extracted by the filter at a specific location in the image. By applying filters to the inputs for the different layers, the network can differentiate and recognize the different object and features in images, see Figure 2.7. The filters range from being able to detect basic features such as brightness in an image to more detailed features and characteristics of an object. Each layer usually contains more than one filter since each filter will only be trained to recognize a specific feature. The number of filters in a convolutional layer corresponds to the number of channels. Once the filter has slid or convolved over the entire image, the output is an array of numbers. This array is referred to as a feature map. In a convolutional network maxpooling layers are added to reduce the spatial size to reduce the computational expense. 
This is done by dividing the output of a layer into regions and only keeping the highest value within each region, thus reducing the size. The intuition behind this is that the pixel of a feature map that has the highest value contains the most useful information, and therefore the other adjacent pixels can be discarded [23].

Figure 2.7: Visualization of how an image is processed and classified in a convolutional neural network [23]. CC-BY

2.1.5 Transfer Learning

Transfer learning is when a model, trained and adapted for a specific task, is reused as the base for another task. Training a convolutional network from scratch requires a large amount of data; it is therefore advantageous to utilize pre-trained models. Pre-trained models are often good at extracting common features because they are trained on large datasets with images of many different kinds of objects. Even if the specific image class is missing from the pre-training, the basic features, such as simple shapes, can be reused, and the network can then be fine-tuned for the specific task. This can be done by retraining only the deepest layers on the specific data [24].

2.2 Object Detection

Object detection can be divided into two parts: object localization and object classification. Localization is used to point out where objects are located in the image and classification is used to decide what types of objects are present in the image. There are two different types of object detection techniques: two-stage object detection and one-stage object detection. The output of an object detection algorithm is usually a box that covers the object as well as the predicted class label of that box. These boxes are referred to as bounding boxes. This section will cover two-stage detectors and one-stage detectors, and explain IoU and non-maximum suppression and how they are used in training and evaluation.

2.2.1 Two-Stage Detection

Two-stage object detectors perform localization and classification of an object in two stages; R-CNN, Fast R-CNN, and Faster R-CNN are examples of two-stage detectors. The first stage is to decide regions of the image in which objects of interest may be present, so-called regions of interest. These regions can be generated in different ways, for example by utilizing the selective search algorithm or a neural network known as a region proposal network. Once these regions are decided, the second stage is performed by passing them into a neural network that performs object classification. In addition, regression is performed to fit the proposed regions closer to the actual object. Since the model needs to perform two passes over the image, this method is not very fast [25].

2.2.2 One-Stage Detection

YOLO, SSD, and RetinaNet are one-stage object detectors that combine localization and classification into one step, and therefore only require a single stage. The regions of interest in one-stage detectors are acquired through the use of anchor boxes. One can think of an anchor box as an initial guess of where an object might be present. These anchors are evenly distributed over the input image, and for each of the anchors a prediction is made as to whether it contains an object of interest or not. If an anchor is deemed to contain an object with the help of a classifier, the placement of the anchor is fine-tuned by a regressor to fit the object more accurately. The placements and sizes of these anchors are decided beforehand by the user, and thus no region proposal pass is required, since the regions to investigate are fixed in advance. This often makes one-stage object detection faster but less accurate compared to two-stage object detection [26].
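To illustrate how such anchors are laid out, the sketch below generates a grid of anchor boxes over a feature map. This is a minimal sketch: the sizes, aspect ratios, and stride are example values, and the thesis itself uses the anchor generator built into Detectron2.

```python
import torch

def generate_anchors(feature_h, feature_w, stride,
                     sizes=(32, 64), aspect_ratios=(0.5, 1.0, 2.0)):
    """Return anchor boxes (x1, y1, x2, y2) centered on every feature-map cell."""
    anchors = []
    for i in range(feature_h):
        for j in range(feature_w):
            # Center of cell (i, j) expressed in input-image pixels.
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for size in sizes:
                for ratio in aspect_ratios:
                    w = size * ratio ** 0.5   # area stays size**2 for every ratio
                    h = size / ratio ** 0.5
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(anchors)

# A 4x4 feature map with stride 16 and 6 anchors per location -> 96 anchors.
print(generate_anchors(4, 4, stride=16).shape)  # torch.Size([96, 4])
```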
2.2.3 IoU

The intersection over union (IoU) is a measurement of how much one box overlaps with another box, see Figure 2.8. It can be written as:

\[
\mathrm{IoU}(\mathrm{box}_1, \mathrm{box}_2) = \frac{|\mathrm{box}_1 \cap \mathrm{box}_2|}{|\mathrm{box}_1 \cup \mathrm{box}_2|}. \tag{2.9}
\]

Figure 2.8: Visual representation of IoU. From [27]. CC-BY

The IoU measurement is used to decide whether a proposed anchor box during training, or a predicted bounding box during evaluation, is deemed to be correct. During training, the model has access to all the ground truth bounding boxes, and when the anchors are placed, the IoU is calculated for all pairs of anchors and ground truths. If an anchor box has an IoU larger than the defined threshold over a ground truth, the model learns that this anchor contained an object of a specific class and how far away this anchor was from the ground truth. If the anchor box did not have an IoU over the threshold, the model learns that this anchor box contained only background [28].

In the prediction step, anchor boxes are generated and a class is predicted for all anchor boxes. If an anchor box is deemed to contain an object, its position is
Even though it was possible to make the network converge from the start, an increase of layers surprisingly showed that both training and testing errors were increased with the addition of more layers. The purpose of ResNet was, therefore, to solve the vanishing and exploding gradient problem, while also taking care of the performance degradation problem by using skip connections to add the output from one layer to the other layers, skipping over some layers [33]. 2.3.1 Vanishing and Exploding Gradient Problem The problem with vanishing and exploding gradients occur during the backpropa- gation of the error function. During backpropagation, the gradient of the error is calculated backward through the network to tune the parameters of the network such that the loss function is minimized. Backpropagation is calculated by utilizing 16 2. Theory the chain rule to represent how each layer affects the final loss function [34]. Some- times, however, the gradients become very large or very small. Since the chain rule multiplies the gradients of all the layers, this might cause the final product to be close to zero or very large. With a too large or too small gradient, the network is unable to gather useful information of how each layer affect the loss function, and thus it is unable to learn [32]. 2.3.2 Performance Degradation The performance degradation problem was exposed when the previously utilized methods for combating the vanishing and exploding gradients revealed that the training and testing errors increased with deeper networks. The reason for this was that the currently available solvers simply could not find the desired mapping between the layers. The problem can be solved by utilizing so called residual blocks, see Figure 2.9. These blocks use a skip-connection between the non-linear layers and reformulate the problem slightly. Previously the goal was to find the desired underlaying mapping between the stacked layers. This mapping can be described as H(x). Since this mapping proved to be hard for solvers to find, the stacked layers were recast to fit another mapping, namley the residual mapping. This mapping can be described as F(x) := H(x) − x. This equation is then rewritten as H(x) = F(x)+x. It is easier for the solvers to find F(x) rather than finding the unreferenced H(x), because F(x) generally has a small response and thus the identity mapping, x provides a strong precondition [33]. Figure 2.9: A residual block. From [35]. CC-BY 2.3.3 ResNet Architectures The architectures of ResNet are based on stacking residual blocks on top of each other. ResNet-50, ResNet-101 and ResNet-152 are different kinds of ResNet archi- tectures, where different numbers of layers are utilized. For example, ResNet-50 consists of 50 convolutional layers and ResNet-101 has 101, excluding the final fully connected layers. 17 2. Theory Figure 2.10: Table of the layers in different ResNet architectures. From [36]. CC-BY In a common ResNet each residual block consists of either two 3x3 layers or two 1x1 layers with a 3x3 layer in between, see Figure 2.10. The former of these two structures is called a Bottleneck block, however, the two structures work in the same way as a residual block. Each layer in Figure 2.10 is referenced as conv2_x up to conv5_x, however, for the remainder of this report, these layers will be referenced to as Res2 up to Res5. 
2.4 Feature Pyramid Network The concept of feature pyramids is used in many object detection algorithms, such as SSD, that utilizes a deep convolutional network in order to compute a feature hierarchy, see Figure 2.11. The deep convolutional network sub-samples layers which in turn create feature maps of varying spatial resolution. This inherent pyramidal feature hierarchy is used by the SSD in the form of auxiliary layers that are added on top of its backbone. These auxiliary layers then perform predictions on their own in conjunction with the output from the backbone. This enables the SSD to utilize multiple scales of spatial resolution when performing detections. However, a major downside comes with this approach, by adding these auxiliary layers to the output of the network the architecture does not utilize the higher resolution feature maps. The higher resolution feature maps do not have rich enough semantics since they have not been processed entirely by the network. Therefore it would not be beneficial to add them to the detection output [37]. By omitting these feature maps, the SSD struggles with the detection of small objects. Figure 2.11: The SSD-architecture. The auxilary layers are added at the output of the backbone. From[38]. CC-BY 18 2. Theory The feature pyramid network (FPN) is an improvement of the shortcomings of the SSD-architecture. The FPN architecture enables the use of high resolution feature maps that is simultaneous semantically strong. To achieve this, the architecture consists of a bottom-up pathway for feature extraction and a top-way pathway for up-sampling low resolution feature maps, see Figure 2.12. These feature maps are then combined through the use of lateral connections. Figure 2.12: Example of FPN structure. From[39]. CC-BY 2.4.1 Bottom-up Pathway The bottom-up pathway consists of a backbone that is used as a feature extractor. This backbone can, for example, consist of the feed forward computations of a ResNet, see Figure 2.13. At each level the feature map is sub-sampled with a factor of two and thus its dimensions are halved. This is represented by the different strides of the feature levels. The strides at different feature levels are {4, 8, 16, 32} at {Res2, Res3, Res4, Res5} respectively. The receptive field specifies the region of an image that is visible at each feature level, once the stride increases the receptive field increases. Since the receptive field increases with higher levels, these levels are more suitable for detecting larger objects. At these levels, the feature map is semantically strong due to its many convolutions, however, this comes with the cost that the spatial resolution is low. On the lower levels, the feature map has a higher spatial resolution and a smaller receptive field and is therefore more suitable for detecting smaller objects. 2.4.2 Top-down Pathway with Lateral Connections In the top-down pathway, feature maps are merged in order to create feature maps that are both semantically strong and have high resolution. This is done by up- sampling the feature maps by a factor of 2 for each of the levels in the FPN. Nearest neighbor up-sampling is utilized and the up-sampled feature maps are then enhanced with the feature map of the corresponding size from the bottom-up pathway. The feature map from the bottom-up pathway lacks the strong semantics compared to the up-sampled ones from the top-down pathway, however they have higher resolution. 
Through the use of lateral connections between the two pathways, the semantically strong and the high resolution feature maps are merged. Each lateral connection is a convolutional layer with a kernel size of 1 × 1, used to set the channel depth to 256 for all feature maps from the bottom-up pathway. The feature maps are merged using element-wise addition, and to reduce aliasing from the up-sampling, each merged feature map is put through a final convolutional layer with a kernel size of 3 × 3. After this process, each feature map has the strongest semantics possible, since they all originate from the deepest layer, while their resolution has been enhanced with the corresponding feature map from the bottom-up pathway [40].

Figure 2.13: Example of FPN structure with ResNet. From [39]. CC-BY

2.5 RetinaNet
RetinaNet is an algorithm heavily based on the FPN architecture; it uses the FPN as its backbone but makes slight modifications to it, see Figure 2.14. The original FPN backbone has feature levels {p2, p3, p4, p5} that are related to the corresponding feed-forward computations from the ResNet backbone {Res2, Res3, Res4, Res5}. RetinaNet, however, elects not to use p2 as it is rather computationally expensive. Furthermore, it includes two additional feature levels, making RetinaNet consist of {p3, p4, p5, p6, p7}. The last two feature levels are based on the output from Res5: p6 is obtained by performing a convolution on Res5 with a kernel size of 3 × 3 and a stride of 2, and the last level, p7, is obtained by applying a convolution on p6 with a kernel size of 3 × 3 and a stride of 2. By including these two additional levels, the detection of large objects is improved [7]. Similarly to the FPN architecture, RetinaNet attaches sub-networks to each feature level in order to make predictions. These sub-networks are designed for classification as well as regression, and one of each is attached in parallel to each feature level.

Figure 2.14: An example of RetinaNet. From [41]. CC-BY

2.5.1 Classification Subnet
Identical classification sub-networks are placed on each feature level; all of them share parameters and consist of five fully convolutional layers with a kernel size of 3 × 3. The channel depth of the first four layers is 256, corresponding to the channel depth of the FPN output. The fifth layer has a channel depth of A × K, where K is the number of classes to predict and A is the number of anchors for each spatial location. Each of the first four layers is followed by a ReLU activation function, and the output of the final layer is passed through a sigmoid activation function to obtain the binary predictions. The size of the last layer can be written as (N, A × K, Hi, Wi), where N represents the batch size and Hi and Wi represent the height and width of the input feature map at FPN level i. Since the number of channels in the last layer corresponds to A × K, A anchors are defined for each spatial location. Each of these anchors is responsible for detecting any of the K available classes, and each output channel thus represents the probability of an anchor containing one of the classes. In the classification part of RetinaNet, the focal loss function is utilized.
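As a complement to the description above, the following is a minimal sketch of the sigmoid focal loss applied to the classification subnet outputs, assuming PyTorch; it follows the standard formulation from Section 2.1.2.2 rather than Detectron2's exact implementation, the normalization by the number of foreground anchors is omitted, and the tensors in the example are random placeholders.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    logits  -- raw sigmoid inputs from the classification subnet, any shape
    targets -- binary tensor of the same shape (1 = foreground, 0 = background)
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()

# Example: ten anchor predictions, one of which is foreground.
logits = torch.randn(10)
targets = torch.zeros(10)
targets[0] = 1.0
print(focal_loss(logits, targets))
```

The modulating factor (1 - p_t)**gamma down-weights easy, well-classified anchors so that the many easy background anchors do not dominate the loss.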
2.5.2 Regression Subnet
The classification sub-network is responsible for evaluating each anchor in order to decide whether it contains an object or not, whereas the regression sub-network is responsible for deciding how much the anchor box should be shifted to more accurately fit the object. The regression sub-network is almost identical to the classification sub-network, with five fully convolutional layers, however, the final layer differs slightly. The regression sub-network has a channel depth of A × 4, four values that need to be predicted to fit the edges of the anchor box closer to the object, and the size of the output layer is (N, A × 4, Hi, Wi) [7]. The L1-loss function is used for the bounding box regression in the original implementation of RetinaNet, however, in this work a smooth L1-loss is used instead. The smooth L1-loss combines the positive properties of both the L1 and the L2 loss and thus contributes to a better approximation of the bounding box regression.

2.6 Evaluation
This section covers two different ways to evaluate the performance of an object detector. This thesis will use the F1-score and AP to evaluate the results.

2.6.1 Precision
Precision is a measurement of how well the model is able to detect only objects of interest. For example, if a model for detecting diseases has low precision, it will yield a high number of positive detections even if only a few subjects have the disease. The precision of an object detector can be calculated as:

Precision = TP / (TP + FP). (2.10)

TP is the number of true positives, the detections that were deemed correct, and FP stands for false positives. IoU is utilized to determine the nature of a detection. If the predicted bounding box has an IoU with a ground truth that is larger than the threshold, the detection is deemed to be related to this ground truth. If the predicted class of the bounding box is the same as the class of the ground truth, the prediction is deemed to be correct. Otherwise, the detection is labeled as a false positive. Multiple predictions of the same object are also deemed to be false positives [42].

2.6.2 Recall
Recall is a measurement of how well the model is able to detect all objects of interest. A model for detecting disease in humans with a low recall will result in a lot of missed diagnoses, where people are told they are healthy even though they are sick. The recall can be written as:

Recall = TP / (TP + FN). (2.11)

FN stands for false negatives, the number of ground truths undetected by the predictor. If a predictor outputs a set of bounding boxes and none of the bounding boxes has an IoU with the ground truth that is larger than the threshold, the model was not able to detect this object, and it is therefore referred to as a false negative [42].

2.6.3 F1-Score
The F1-score combines both of these metrics to evaluate how well the model is able to prevent false detections without missing actual detections [43]. The F1-score is the harmonic mean of precision and recall, thus both values are taken into consideration. The F1-score is defined as:

F1 = 2 × (Precision × Recall) / (Precision + Recall). (2.12)

2.6.4 Average Precision
In object detection problems, calculating the mean average precision is a common way to evaluate the performance of an algorithm. Average precision is the area under a precision-recall curve,

AP = ∫₀¹ p(r) dr. (2.13)

This curve plots recall on its x-axis and precision on its y-axis, see Figure 2.15.
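To make Equations (2.10)-(2.12) concrete, the following small sketch combines illustrative detection counts into the three metrics; the counts are invented for the example and are not results from this work.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Suppose a detector finds 80 drones correctly, raises 20 false alarms and misses 40 drones.
print(precision_recall_f1(tp=80, fp=20, fn=40))  # (0.8, 0.667, 0.727), approximately
```

How the precision and recall pairs that trace out the curve in Figure 2.15 are obtained is described next.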
The precision and recall pairs are calculated for a confidence score threshold that is gradually decreased. This threshold is taken from the confidence of the predictions: highly confident predictions are investigated first, and as the threshold decreases, more of the less confident predictions are considered.

Figure 2.15: An example of a PR-curve. From [44]. CC-BY

As the confidence score threshold is decreased, lower quality predictions are allowed. This results in an increase in recall, since more predictions are made and thus the number of false negatives is decreased. However, the precision is reduced when lower quality matches are allowed, since more false positives occur at lower confidence scores. Since precision is reduced as recall increases, a high AP-score requires the detector to maintain high precision over a broad range of recall values. Mean average precision is simply the average of the average precision over all available classes. In this work, only one drone class has been considered, and therefore average precision and mean average precision are the same [42].

2.7 Semi-Supervised Learning
Supervised learning methods refer to methods that utilize fully labeled datasets and aim to approximate a mapping function that, given an input, produces the desired output. This process utilizes the labeled data as a reference during training to tell the model whether a prediction was correct or not. Unsupervised learning does not have access to a labeled dataset and thus cannot utilize such a reference during training. The aim of an unsupervised learning method is instead to find a relationship describing the structure in the data. An example of this is k-means clustering, which assigns each data point to one of K groups, partitioning similar data points into the same category [45]. Semi-supervised learning is a combination of supervised and unsupervised learning. It is a methodology for utilizing a small amount of annotated data together with a large amount of unannotated data. One way to implement semi-supervised learning is by utilizing pseudo-labeling. First the network is trained in a supervised fashion, using annotated data. Thereafter the model is used on unannotated data and generates predictions on this data. The prediction with the highest confidence score is considered to represent the true label and is thereafter utilized as annotated data [46].

3 Methods
This chapter will present the methods utilized to fit the existing RetinaNet architecture to the problem of small drone detection. Furthermore, the work done with temporal information and the semi-supervised learning framework will be presented. Section 3.1 will describe the analysis of the datasets from the Drone vs Bird Detection Challenge and provide a visualization of the training data, with emphasis on the challenges to overcome. Section 3.2 presents the implementation of the RetinaNet. Section 3.3 will describe the modification of the RetinaNet. Section 3.4 describes how temporal information was incorporated, and Section 3.5 presents the semi-supervised learning framework.

3.1 Visualization and Analysis of Dataset
The dataset used for both training and evaluation of the algorithm comes from the competition Drone vs Bird Detection Challenge [1]. It contains eleven short video sequences for training and three short videos for testing. The images were taken with a static camera and contain drones, birds, and background. However, only annotations for drones are available.
The training data consists of 7245 images and the test data consists of 2079 images. The resolution of the extracted images was 1080 × 1920.

3.1.1 Visualization of Data
The data provided by the Drone vs Bird Detection Challenge depicts drones flying in a wide variety of environments and contains challenging scenarios such as occlusion, moving objects in the background, flying within close proximity of birds, and combinations of these. In addition to these challenges, the small size of the drones presents an additional difficulty. The figures presented below illustrate a couple of samples of the data that were used for training the algorithm.

Drones are featured in different environments, see Figure 3.1 and Figure 3.2. In these figures the environment changes with regard to the lighting of the background. One drone has clear LED lights on and another has no lights on, thus the appearance of the drone itself changes. The environment can include rural and urban features, see Figure 3.3 and Figure 3.4. The size of the drones was small, and thus the algorithm needs to take into account both the change in environment and the reduced size.

Figure 3.1: Drone flying during the night with LED lights and a dark background.
Figure 3.2: Drone flying during the day with no lights on and a light background.
Figure 3.3: A drone flying in a park far away.
Figure 3.4: A drone flying in a field far away.

Another challenge was the inclusion of birds flying next to the drones, see Figure 3.5 and Figure 3.6. The drone is far away and a bird is flying next to it. In this example, the algorithm needs to handle the small distant drone and simultaneously not misclassify the bird.

Figure 3.5: Original image containing one drone far away flying close to a bird.
Figure 3.6: Zooming in on the drone, one can see the contours of the bird flying close.

3.1.2 Analysis of Data
The majority of the annotations for the training data from the Drone vs Bird Detection Challenge have a size of 3-32 pixels, see Figure 3.7. The aspect ratios of the annotations were defined as the height of the annotation over the width, see Figure 3.8. It was most common for the width and height of the annotations to be equal; otherwise, the width was generally larger than the height.

Figure 3.7: Histogram illustrating the distribution of ground truth sizes.
Figure 3.8: Histogram illustrating the distribution of aspect ratios in the ground truth.

3.2 Implementation of RetinaNet
The implementation of the RetinaNet as defined in [7] was found in the Detectron2 library. The following section aims to explain the standard implementation in detail.

3.2.1 Overview of Implementation
The standard RetinaNet in Detectron2 is designed to detect objects in a wide range of shapes and sizes. It is able to detect objects from eighty different classes and with sizes varying between 32 and 813 pixels [7]. The RetinaNet used in this thesis, before any modifications, consists of an FPN backbone with a bottom-up architecture of ResNet-101. The outputs of the bottom-up ResNet-101 consisted of [Res3, Res4, Res5] and the corresponding outputs of the FPN were [p3, p4, p5, p6, p7]. The number of channels from the FPN was 256, and the features were merged with element-wise addition in the FPN. The pre-trained weights for ResNet-101 come from the ImageNet-pretrained MSRA R-101 model.

3.2.2 Anchor Placement in Implementation
Anchors were placed on each feature level independently.
In Detectron2, the anchors defined for each feature level were centered around each pixel of the corresponding feature map. The sizes of the feature maps were determined by the strides of each level. The strides for the outputs of the FPN were [8, 16, 32, 64, 128], corresponding to the levels [p3, p4, p5, p6, p7]. These values indicate how much the input image was down-sampled at each level. The pixels on a feature map will henceforth be referred to as feature pixels, in order to avoid confusion with the pixels of the input image, referred to as input pixels. The anchors were centered around the feature pixels on each feature map. Since ResNet is fully convolutional, the spatial positions on the feature maps can be related to spatial positions on the input, see Figure 3.9. Depending on the stride of the feature level, each feature pixel corresponds to an area of Stride × Stride input pixels. Once the feature pixels were related to the corresponding area on the input image, the anchor coordinates were defined with respect to their corresponding coordinates on the input image.

Figure 3.9: Illustration of how feature pixels relate to the input pixels. The low level feature map has a stride of 2 and the high level feature map has a stride of 4 [47]. CC-BY

3.2.3 Sub-networks in the Implementation
When the anchors were defined, the feature maps from the FPN backbone were fed into the classification and regression sub-networks. In Detectron2, the output layer of the classification sub-network has a number of channels corresponding to the number of anchors around each feature pixel times the number of classes to be predicted. This is represented as a tensor of size (N, A × K, Hi, Wi), where the index i denotes the feature level. The output of the regression sub-network is almost identical to the classifier output, but the number of channels is the number of anchors around each feature pixel times four. The four values represent the edges of the anchor box, and the values in these channels represent how much the edges should regress in order to fit the ground truth. This is represented by a tensor of size (N, A × 4, Hi, Wi). Given that an anchor was predicted to contain an object of interest, the goal of the regression sub-network was to decide how much this anchor box should be regressed in order to fit the detected object properly.

3.2.4 Anchor Matching with Ground Truth
In order for the network to utilize a ground truth annotation, at least one of the placed anchors needs to overlap with the ground truth. Whether an anchor was considered a match with a ground truth was based on an IoU threshold. For a RetinaNet in Detectron2, anchor matching was performed individually for each anchor from all feature levels, for each of the ground truths in the image. If an anchor was labeled as foreground, it was assigned the correct class label in the range [0, K-1], where K was the number of classes to predict. If it was labeled as background, it was assigned the label K. If the anchor was to be discarded, it was given the label -1. If an anchor was labeled as foreground, the offsets between the four edges of the anchor and the ground truth box were used as the ground truth for the regression sub-network.

3.2.5 Training with Predictions and Ground Truths
By matching the predictions from the sub-networks with the corresponding ground truth elements, the algorithm was trained.
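As an illustration of the matching rule in Section 3.2.4, the following is a minimal sketch of how anchors can be labelled from a matrix of IoU values; it is a simplified stand-in for Detectron2's matcher, with the thresholds 0.4 and 0.5 taken from the configuration used in this work.

```python
import torch

def label_anchors(iou_matrix, gt_classes, bg_thresh=0.4, fg_thresh=0.5, num_classes=1):
    """Assign a label to every anchor given an IoU matrix of shape (num_gt, num_anchors).

    Returns labels in [0, num_classes-1] for foreground anchors, num_classes for
    background anchors and -1 for anchors that are ignored during training.
    """
    max_iou, matched_gt = iou_matrix.max(dim=0)   # best ground truth for each anchor
    labels = gt_classes[matched_gt].clone()       # tentatively label every anchor as foreground
    labels[max_iou < fg_thresh] = -1              # overlap in between the thresholds -> ignore
    labels[max_iou < bg_thresh] = num_classes     # low overlap -> background
    return labels

# Example: two ground-truth drones (class 0) and four anchors.
iou = torch.tensor([[0.70, 0.45, 0.10, 0.00],
                    [0.05, 0.20, 0.55, 0.30]])
print(label_anchors(iou, gt_classes=torch.zeros(2, dtype=torch.long)))
# tensor([ 0, -1,  0,  1])  -> foreground, ignored, foreground, background
```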
The predictions from the sub-networks were made on each feature level of the image separately, and thus needed to be interpolated and concatenated. Anchors with label -1 were discarded. The classification loss was calculated with the focal loss over the predictions for background and foreground. The focal loss parameter γ was set to 2.0 and α to 0.25 to focus learning on hard negative examples. These values were taken from the work by T. Lin et al., where the best combination was investigated [7]. The regression loss was calculated with the smooth L1 loss for anchors labeled as foreground. The β parameter in the smooth L1 loss was set to 0.1: when the regression error was within (-0.1, 0.1), the quadratic L2-like part of the loss was used, and outside this range the L1 part was used. Other values of the loss parameters were not investigated in this project.

3.2.6 Inference
During inference, the model utilized the predictions from the sub-networks and the pre-defined anchors. In Detectron2, inference was performed on each feature level independently, and the predictions from all feature levels were concatenated and then put through non-maximum suppression. A sigmoid function was used to obtain the probability that an anchor contained an object of interest. A filter was applied to the probabilities to keep only a certain number of the top-scoring predictions. Another filter was applied to the remaining predictions, discarding predictions with a confidence score below a certain threshold. The anchor boxes remaining at this stage were fine-tuned by their corresponding predictions from the regression sub-network. The predictions were concatenated and put through non-maximum suppression. The detections left after non-maximum suppression were the final output of the model, containing the bounding box coordinates of each prediction, the predicted class, and the confidence score of the bounding box.

3.3 Modification of RetinaNet
This section will cover the different modifications of the RetinaNet to fit the model to the task of small object detection. At the end of this section the final RetinaNet is introduced.

3.3.1 Defining a Baseline for the RetinaNet
The anchor sizes of the standard RetinaNet implemented by FAIR in Detectron2 needed to be reduced to enable detection of smaller objects, since the vast majority of the training data from the Drone vs Bird Detection Challenge was below 32 pixels. The original implementation by FAIR placed anchors of different sizes on different feature levels; in this work, however, the base RetinaNet has anchors of the same sizes placed on each feature map. The input images were resized from the native 1080 × 1920 resolution down to 800 × 1333 to enable faster training.

3.3.2 Defining a Baseline for Training
The training for the experiments was conducted with 30,000 iterations, due to limited GPU resources and based on the work by M. Nalamati et al. [2]. That team utilized the dataset from the Drone vs Bird Detection Challenge and the same ResNet-101 backbone as the RetinaNet in this thesis, and found that the optimal training range was between 30,000 and 70,000 iterations. A linear warm-up over the first 1000 iterations was utilized. By linearly increasing the learning rate, the influence of the early training images was reduced, thus reducing over-fitting in the early stages of training. However, different warm-up parameters were not tested and evaluated in this project.
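The sketch below illustrates how a linear warm-up of this kind scales the learning rate over the first iterations; the base learning rate of 0.01 is an arbitrary example value and not a setting reported in this work, while the warm-up length and factor follow the values used here.

```python
def warmup_lr(iteration, base_lr, warmup_iters=1000, warmup_factor=0.001):
    """Linear warm-up: scale the learning rate from base_lr * warmup_factor up to
    base_lr over the first warmup_iters iterations, then keep it constant
    (any later decay steps are omitted in this sketch)."""
    if iteration >= warmup_iters:
        return base_lr
    alpha = iteration / warmup_iters
    return base_lr * (warmup_factor * (1 - alpha) + alpha)

# Example: with base_lr = 0.01 the rate starts at 1e-5 and reaches 0.01 at iteration 1000.
for it in (0, 500, 1000):
    print(it, warmup_lr(it, base_lr=0.01))
```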
The warm-up factor was set to 0.001. A random horizontal flip of the training images was utilized as data augmentation; the probability of this flip occurring was 0.5.

Due to the heavy class imbalance, a prior probability of 0.01 was used for the foreground class, i.e. the drone class, in the classifier. The reason for this is that the background class otherwise tends to dominate the loss function and cause unstable training early on for datasets with heavy class imbalance [7]. Only one GPU was available, thus the batch size was set to 2.

3.3.3 Utilizing Transfer Learning
Due to the small amount of available data from the Drone vs Bird Detection Challenge, transfer learning was utilized. A study investigating the impact of freezing the backbone at different layers was performed. Four versions of the baseline RetinaNet were trained; for these networks, the backbone was frozen until Res1, Res2, Res3, and Res4 respectively.

3.3.4 Inclusion of P2
An additional feature level, p2, was added to the network, enabling additional predictions on a feature map with higher resolution. Higher resolution feature maps are more suitable for detecting smaller objects; an experiment was therefore conducted to investigate whether the inclusion of an additional high resolution feature level would improve the accuracy.

3.3.5 Defining a New Baseline for the RetinaNet
A new baseline was defined where the input images were not resized. Furthermore, the standard aspect ratios were changed to better fit the available data.

3.3.6 Anchor Placements
Anchors of different sizes are often placed on different feature levels for feature pyramids such as the FPN. However, this strategy calls for a thorough study of the receptive field of each level. Due to time constraints, this study was omitted; instead, experiments with anchor placements were performed, investigating different sizes, numbers, and placements of anchors.

3.3.7 Pruning of P6 and P7
In the feature pyramid architecture, predictions are made on each feature level individually. The top levels p6 and p7 are semantically strong, however, the resolution of these feature maps is low, and lower resolution feature maps tend to struggle with accurate detection of smaller objects [5]. An experiment was conducted to investigate whether these levels could be removed to reduce the computational expense without reducing accuracy.

3.3.8 Relaxation of IoU Thresholds
In the standard RetinaNet implemented by FAIR, an anchor box is considered a match with an annotation if the IoU is 0.5 or higher. For small objects, this threshold may be too strict, which would result in fewer or no anchors matching the ground truth and thus further increase the class imbalance. An experiment was conducted to investigate how a more relaxed threshold would affect performance; the IoU threshold was reduced from 0.5 to 0.4.

3.3.9 Tuning of Inference Parameters
During inference, three parameters can be tuned to increase the performance of the algorithm: TopK, the non-maximum suppression threshold, and a score threshold. The TopK parameter specifies the maximum number of predictions allowed by the network; three TopK values were evaluated. By default, the network keeps only the top 1000 scoring results. Non-maximum suppression was used to filter out multiple predictions of the same object, and four different values of its IoU threshold were evaluated.
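As an illustration of how these inference-time parameters are exposed, the sketch below adjusts them through the Detectron2 configuration system; the configuration keys reflect the Detectron2 releases available at the time of writing and may differ in other versions, and the values shown are the defaults mentioned in the text rather than the tuned ones.

```python
from detectron2.config import get_cfg
from detectron2 import model_zoo

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_101_FPN_3x.yaml"))

# The three inference parameters discussed in Section 3.3.9 (default values shown).
cfg.MODEL.RETINANET.TOPK_CANDIDATES_TEST = 1000  # maximum number of top-scoring candidates kept before NMS
cfg.MODEL.RETINANET.NMS_THRESH_TEST = 0.5        # IoU threshold used by non-maximum suppression
cfg.MODEL.RETINANET.SCORE_THRESH_TEST = 0.05     # discard predictions below this confidence score
```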
By default, the network has an NMS IoU threshold of 0.5. The score threshold removes predictions with a confidence score smaller than some threshold; four different values of this threshold were investigated. By default, the lowest allowed confidence score is 0.05. Due to time constraints, these parameters were only experimented with for a subset of the trained networks: the new baseline, the network with reduced IoU threshold, and the network with p6 and p7 removed. The reason for experimenting with the network with reduced IoU threshold is that more anchors will be matched with the ground truth. Since anchors of the same sizes are placed on each feature level, this might cause an increase in false positives, since the same object may be matched across different levels. Should this be the case, the NMS threshold during inference might be able to filter out these double matches. It is also interesting to investigate the network with p6 and p7 removed. When removing p6 and p7, the performance is expected to decrease somewhat; however, this makes the network lighter and faster, and by experimenting with the inference parameters it might be possible to reduce this decrease in performance.

3.3.10 Final RetinaNet
Once the optimal parameters for anchor placements, pruning of p6 and p7, relaxation of IoU thresholds, and tuning of inference parameters were decided, they were implemented in the final RetinaNet. The final RetinaNet was used for the implementation of temporal information and the semi-supervised learning framework.

3.4 Temporal Information
Once the final RetinaNet was chosen, methods for integrating temporal information were investigated. Due to the limited available data from the Drone vs Bird Detection Challenge, the suggested methods were required to be compatible with the pre-trained backbone. The parameters of the pre-trained backbone have been trained as feature extractors of single images. Therefore, to benefit from the pre-training, the methods needed to be integrated such that the pre-trained parameters could be utilized as feature extractors in roughly the same way as they were trained. Three methods of integrating temporal information were implemented and evaluated. Each method was based on integrating information from one frame at time t with the information from a previous time step. Two time steps were investigated, t − 1 and t − 4.

3.4.1 Concatenation of Feature Maps
This method utilized two identical copies of the FPN backbone. One backbone was given an input image of the frame at time t, and the other one was given an input image of a previous frame. The backbones computed the feature maps for the two images, and the feature maps from time t were concatenated with the difference between the feature maps at time t and at the previous time step. The concatenation was performed along the channels of the feature maps. The approach was inspired by the work of C. Craye et al. [4], who concatenated grayscale images of different time steps at the input to the network. Instead of performing the concatenation at the input, the concatenation here was performed after the backbone. The additional channels contributed additional information to the classification and regression heads regarding the change of the feature map between frames. Let P denote the set of feature maps from the backbone.
Here P can be described by:

P = {p3, p4, p5, p6, p7} (3.1)

The concatenation was done for two sets of feature maps: the set from time step t, and the set of differences between the feature maps at time step t and at time step t − i, where i ∈ {1, 4}. The set of differences between feature maps is denoted by:

P_sub = P_t − P_(t−i) (3.2)

The concatenation was performed on the feature maps in P and P_sub and can be written as:

P ‖ P_sub = {p3 ‖ p3_sub, p4 ‖ p4_sub, ..., p7 ‖ p7_sub} (3.3)

3.4.2 Siamese Networks with Addition Merge
An implementation of a Siamese network was made. The goal was to utilize the frozen layers as shared parameters between two consecutive frames, which were merged before the trainable, gradient-enabled layers. This method aimed to be a more lightweight approach to integrating temporal information compared to the method of concatenating the feature maps. With this method, only one set of model parameters needed to be stored in memory. Furthermore, instead of having to save two full sets of feature maps, only one full set and one additional Res3 feature map were required, and backpropagation only needed to be performed on one set of trainable layers. Let R represent the set of feature maps from the backbone. This set can be divided into two subsets, one representing the feature maps from the frozen layers and the other representing the feature maps from the trainable layers. These sets, denoted R_Frozen and R_Trainable respectively, are described as:

R_Frozen = {Res1, Res2, Res3} (3.4)
R_Trainable = {Res4, Res5} (3.5)

The proposed methodology was to feed two consecutive frames through the layers in R_Frozen and then merge the feature maps before they were fed into the layers in R_Trainable. The merge was performed by adding the feature maps and taking the average, i.e. dividing by two. The succeeding layers were pre-trained and expected an input in the form of a feature map from a single image; by receiving the average of the two feature maps, the input to the pre-trained parameters was similar to the expected input.

3.4.3 Siamese Networks with Concatenation Merge
The final proposed implementation of temporal information was a combination of the previous methods. In the same way as in the previous Siamese network, the set of feature maps R was divided into R_Frozen and R_Trainable, and two consecutive frames were fed through the frozen layers. The merge was done by concatenating the feature map at time t with the difference between the feature map at time t and the feature map at time t − i. The choice of merge was based on the common practice of merging through concatenation rather than addition; the difference between these two types of merge was thus investigated. The concatenation merge contributed additional information regarding the change of the feature map between frames. An auxiliary convolutional layer with kernel size 1 × 1 was utilized to reduce the number of channels to the number expected by the following layer. For this method, in addition to merging at Res3, an experiment with merging at Res4 was performed, in order to investigate how the location of the merge in the network affected performance.

3.5 Implementation of Semi-supervised Learning Framework
The goal of the semi-supervised learning framework suggested in this thesis was to enable the use of unannotated data in a modern object detection algorithm. Since gathering data is considered rather easy at Saab, the difficult part would be to
manually annotate a vast amount of data. Therefore, it was investigated whether the algorithm could reliably be utilized as an annotation tool for unlabeled data. The data collected for this thesis contains scenes that differ largely from the data found in the Drone vs Bird Detection Challenge datasets. This challenges the generality of the algorithm, since the collected data can be considered as belonging to another domain. Furthermore, the data collected and annotated by the framework was incorporated into the Drone vs Bird Detection Challenge datasets, to investigate whether data from a domain other than the Drone vs Bird Detection Challenge could be utilized to further improve the performance of the algorithm.

3.5.1 Semi-supervised Learning Framework
The semi-supervised learning framework draws inspiration from the work of E. Sangineto et al. [11]. That team utilized a Fast R-CNN to generate bounding box annotations on images, using the region proposal network in the first iteration and later utilizing the predicted bounding boxes as annotations. The annotations were generated for easily classified samples in the first iterations, and more difficult examples were successively added between iterations. By utilizing easily classified samples in the first iterations, the training was kept from diverging in the early stages due to noisy annotations. The framework in this thesis utilized easily classified samples of drones flying in an area that differs from the provided training data from the Drone vs Bird Detection Challenge. Predictions were made on the easy samples using the algorithm, and if these samples proved to be comparable to hand-annotated data, they were utilized as training data for the next iteration.

The collected data consisted of three parts: background, training, and test data. The background dataset consisted of a scene where birds may be present but no drones were included; the scene was therefore annotated as background. The training data consisted of a video with a rather easy case, where the drone was flying towards an area occupied by traditional Swedish houses as well as pine trees. The test dataset was similar to the training dataset, however, with more emphasis on difficult cases, such as the drone flying around treetops and multiple cases of the drone flying in front of buildings.

3.5.2 Evaluation of Generated Annotations
The evaluation of the network trained with the generated annotations was made through visual inspection, investigation of confidence scores and misclassifications, and evaluation against the Drone vs Bird Detection Challenge test data. The visual inspection was performed by sampling a subset of the proposed annotations and investigating whether the annotations were competitive with work done by human annotators. The quality of the annotations was investigated between iterations of the framework, by investigating the confidence scores of the predictions made in each iteration and how accurately the annotations were placed on the drone. Misclassified objects were also examined between iterations, to investigate whether these misclassifications were reduced from one iteration to the next. Furthermore, the number of frames that were deemed usable between training iterations was a factor when evaluating the performance of the framework.
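As a schematic summary of the procedure described in Section 3.5.1, the sketch below outlines one possible pseudo-labelling loop; the training and prediction callables are placeholders to be supplied by the caller (for example thin wrappers around Detectron2 training and inference), and the confidence threshold of 0.7 is illustrative rather than a value used in this work.

```python
def pseudo_label_loop(annotated, unannotated_frames, train_fn, predict_fn,
                      iterations=3, conf_thresh=0.7):
    """Iterative pseudo-labelling over a pool of unannotated frames.

    train_fn(dataset) -> model and predict_fn(model, frame) -> list of
    {"box": ..., "score": ...} dictionaries are supplied by the caller.
    """
    dataset = list(annotated)
    generated = []
    for _ in range(iterations):
        model = train_fn(dataset)                         # supervised training on the current annotations
        generated = []
        for frame in unannotated_frames:
            confident = [d for d in predict_fn(model, frame) if d["score"] >= conf_thresh]
            if confident:                                 # keep only frames with confident detections
                generated.append((frame, confident))
        dataset = list(annotated) + generated             # expand the training set for the next iteration
    return model, generated
```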
Since only a sample of the generated annotations was evaluated through visual inspection, further evaluation of the quality of the annotations was needed. The performance of the network trained with the generated annotated dataset was therefore evaluated on the Drone vs Bird Detection Challenge test data, to investigate whether the generated annotations could contribute to an increase in performance on any of the test videos from the Drone vs Bird Detection Challenge. The generated annotations might contain noisy labels that were not discovered in the visual inspection, and this evaluation was therefore used to determine whether the benefits of the generated annotations outweigh the shortcomings. The F1-score was used as the evaluation metric.

4 Results
This chapter will give an in-depth explanation of the experiments conducted in the methodology and present the results of these experiments. Section 4.1 will present the evaluation metrics. In Section 4.2, the results regarding the modifications of the RetinaNet will be presented. Section 4.3 will present the results of the implementation of temporal information. In Section 4.4, the results from the semi-supervised learning framework will be covered. In this thesis, five different datasets have been used. Sections 4.2 and 4.3 only utilize the training and test data from the Drone vs Bird Detection Challenge. In Section 4.4, three new datasets will be introduced. These datasets will be utilized by the semi-supervised learning framework in conjunction with the training and test datasets from the Drone vs Bird Detection Challenge.

4.1 Evaluation Metrics
Two main evaluation metrics have been utilized to evaluate the performance: the COCO AP metric and the F1-score. The results regarding the tuning of the RetinaNet and the temporal information were evaluated with the implemented COCO AP function. The AP-score calculated for each evaluation was based on an average over multiple IoU thresholds. These thresholds ranged from 0.50 to 0.95 with an increment of 0.05 between each threshold. AP-scores for small, medium, and large objects were also obtained for each evaluation. An object was considered small if the area of its ground truth bounding box was smaller than 32² pixels, medium if it was between 32² and 96² pixels, and large if it was larger than 96² pixels [48].

4.1.1 Evaluation Drone vs Bird Detection Challenge
In the Drone vs Bird Detection Challenge, another metric was utilized to evaluate the final algorithms: the F1-score. To compare the final algorithm with state-of-the-art algorithms, the F1-score of the final RetinaNet was evaluated using this metric.

4.1.2 Evaluation of Semi-Supervised Learning Framework
To evaluate the semi-supervised learning framework, visual inspection was utilized. Furthermore, the semi-supervised learning framework was evaluated with the F1-score, to compare how the use of this framework improved the results in the Drone vs Bird Detection Challenge.

4.2 Results RetinaNet
This section presents the results of the originally implemented baseline, the utilization of transfer learning, and the inclusion of p2. Thereafter a new baseline is introduced, followed by the results of the anchor placement experiments and the pruning of p6 and p7. Subsequently, the results of the relaxation of the IoU thresholds and the tuning of the inference parameters will be presented. Finally, the results on the Drone vs Bird Detection Challenge will be presented.
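Before these results are presented, the following small sketch recalls how the COCO AP score described in Section 4.1 is formed by averaging AP values computed at each IoU threshold; the per-threshold values in the example are placeholders and not results from this work.

```python
import numpy as np

# IoU thresholds used by the COCO AP metric: 0.50, 0.55, ..., 0.95.
iou_thresholds = np.arange(0.50, 1.00, 0.05)

def coco_ap(ap_per_threshold):
    """Average the AP values computed at each IoU threshold into a single score."""
    return float(np.mean(ap_per_threshold))

# Illustrative placeholder values only: AP typically drops as the IoU threshold tightens.
example_ap = np.linspace(0.45, 0.05, num=len(iou_thresholds))
print(len(iou_thresholds), coco_ap(example_ap))  # 10 thresholds, mean AP of about 0.25
```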
The datasets used in Section 4.2 are the training and test data from the Drone vs Bird Detection Challenge.

4.2.1 Evaluation of Baseline
The baseline parameters of the RetinaNet are the default settings from the RetinaNet implemented by FAIR, see Table 4.1. The anchor sizes were selected based on the sizes of the annotations in the Drone vs Bird Detection Challenge dataset. The input was initially down-sampled from the native resolution of 1080 × 1920 to the size 800 × 1333. The AP-scores for this baseline were calculated, see Table 4.2.

Table 4.1: The standard RetinaNet parameters used in this work.
  Frozen until:   Res4
  IoU:            [0.4, 0.5]
  Feature levels: p3-p7
  Aspect ratios:  [0.5, 1.0, 2.0]
  Anchors:        [4, 8, 16, 32, 64, 128, 256]
  Input size:     800 x 1333

Table 4.2: Performance of the baseline.
  AP      AP(50)  AP-s   AP-m    AP-l
  11.659  30.705  6.954  38.517  36.583

4.2.2 Utilizing Transfer Learning
In these experiments, all parameters were kept constant according to the baseline, and only the layer until which the backbone was frozen was changed, see Table 4.3. These experiments indicated that to achieve the highest overall AP-score, the backbone should be frozen until Res3, see Table 4.4.

Table 4.3: Parameters for experimenting with which layer to freeze until. All experiments share the baseline settings IoU [0.4, 0.5], feature levels p3-p7, aspect ratios [0.5, 1.0, 2.0], anchors [4, 8, 16, 32, 64, 128, 256], and input size 800 x 1333.
  Experiment 1: frozen until Res1
  Experiment 2: frozen until Res2
  Experiment 3: frozen until Res3
  Experiment 4: frozen until Res4

Table 4.4: Results of freezing the baseline at different levels.
                AP      AP(50)  AP-s    AP-m    AP-l
  Experiment 1  13.678  38.248  10.566  32.029  30.898
  Experiment 2  13.221  31.392  6.231   43.556  38.677
  Experiment 3  14.583  38.792  9.575   37.673  43.811
  Experiment 4  11.659  30.705  6.954   38.517  36.583

4.2.3 Inclusion of P2
When including p2 it became apparent that this feature level was computationally expensive. To reduce the training time, the top feature levels p6 and p7 were re-m