Drone Detection Using Deep Neural Networks and Semi-Supervised Learning

ALICE KARLSSON
GUSTAV ROSIN

Master's Thesis 2020
Department of Electrical Engineering
Chalmers University of Technology
Gothenburg, Sweden 2020

© ALICE KARLSSON, 2020. © GUSTAV ROSIN, 2020.

Supervisor: Lucas Brynte, Department of Electrical Engineering
Examiner: Fredrik Kahl, Department of Electrical Engineering

Department of Electrical Engineering, Chalmers University of Technology, SE-412 96 Gothenburg, Telephone +46 31 772 1000

Cover: Drone in a rural area. Typeset in LaTeX, template by David Frisk. Printed by Chalmers Reproservice, Gothenburg, Sweden 2020

Abstract

The usage of drones has increased in recent years for both civilian and military purposes. With their small size and tractability, weaponized drones pose a major threat and are difficult to detect and classify with modern equipment such as radar. Since drones share many features with other common objects in their operating space, such as birds, radar systems struggle to classify drones accurately. Another approach to detecting drones, as proposed in this thesis, is to utilize a camera-based deep-learning object detection algorithm to detect and classify them. A deep-learning algorithm requires extensive computational resources and a vast amount of annotated data, but the availability of both is often limited. This thesis optimizes and adapts a RetinaNet and implements temporal information using three different methods. The implementations of temporal information utilize a pre-trained backbone to minimize the demand for annotated data. Furthermore, a semi-supervised learning framework is developed to enable the use of unannotated data and background data. The framework generates annotations for unannotated data, thus expanding the amount of available data. The methods for integrating temporal information and the semi-supervised learning framework were evaluated against the same test data as other state-of-the-art algorithms. The results show that the proposed methods for integrating temporal information were not advantageous with regard to the AP-score. However, by incorporating the generated annotated data and background data, the performance of the algorithm improved considerably with regard to the F1-score. It could not outperform state-of-the-art methods; however, the resulting framework shows great promise as an annotation tool for unannotated data.

Keywords: Deep Learning, Object detection, RetinaNet, Temporal information, Detectron2, Semi-supervised learning.

Acknowledgements

We would like to express our sincere thanks to our supervisor Lucas Brynte, who has helped us during our thesis and provided invaluable ideas and insights. We would also like to greatly thank Angelo Coluccia and the team at SafeShore. Without the data provided by them, this master's thesis would not have been possible.
We would also like to thank our supervisors at Saab, Stefan Eriksson and Stefan Holmgren, for interesting discussions and help with shaping our project. Furthermore, we would also like to thank Per Johansson at Saab for helping us organize as well as collect data. Lastly, we would like to thank Fredrik Kahl for being our examiner.

Alice Karlsson, Gothenburg, June 2020
Gustav Rosin, Gothenburg, June 2020

Contents

1 Introduction
  1.1 Background
  1.2 Related Work
  1.3 Purpose
  1.4 Proposed Approach
    1.4.1 Integration of Temporal Information
    1.4.2 Use of Unannotated Data
  1.5 Scope and Limitations
  1.6 Tools and Equipment
    1.6.1 Google Colab
    1.6.2 PyTorch
    1.6.3 Detectron2
  1.7 Contribution
  1.8 Report Outline
2 Theory
  2.1 Neural Networks
    2.1.1 Activation Functions
    2.1.2 Loss Functions
      2.1.2.1 Cross-Entropy Loss
      2.1.2.2 Focal Loss
      2.1.2.3 Smooth L1-Loss
    2.1.3 Optimizers
    2.1.4 Convolutional Neural Networks
    2.1.5 Transfer Learning
  2.2 Object Detection
    2.2.1 Two-Stage Detection
    2.2.2 One-Stage Detection
    2.2.3 IoU
    2.2.4 Non-Maximum Suppression
  2.3 ResNet
    2.3.1 Vanishing and Exploding Gradient Problem
    2.3.2 Performance Degradation
    2.3.3 ResNet Architectures
  2.4 Feature Pyramid Network
    2.4.1 Bottom-up Pathway
    2.4.2 Top-down Pathway with Lateral Connections
  2.5 RetinaNet
    2.5.1 Classification Subnet
    2.5.2 Regression Subnet
  2.6 Evaluation
    2.6.1 Precision
    2.6.2 Recall
    2.6.3 F1-Score
    2.6.4 Average Precision
  2.7 Semi-Supervised Learning
3 Methods
  3.1 Visualization and Analysis of Dataset
    3.1.1 Visualization of Data
    3.1.2 Analysis of Data
  3.2 Implementation of RetinaNet
    3.2.1 Overview of Implementation
    3.2.2 Anchor Placement in Implementation
    3.2.3 Sub-networks in the Implementation
    3.2.4 Anchor Matching with Ground Truth
    3.2.5 Training with Predictions and Ground Truths
    3.2.6 Inference
  3.3 Modification of RetinaNet
    3.3.1 Defining a Baseline for the RetinaNet
    3.3.2 Defining a Baseline for Training
    3.3.3 Utilizing Transfer Learning
    3.3.4 Inclusion of P2
    3.3.5 Defining a New Baseline for the RetinaNet
    3.3.6 Anchor Placements
    3.3.7 Pruning of P6 and P7
    3.3.8 Relaxation of IoU Thresholds
    3.3.9 Tuning of Inference Parameters
    3.3.10 Final RetinaNet
  3.4 Temporal Information
    3.4.1 Concatenation of Feature Maps
    3.4.2 Siamese Networks with Addition Merge
    3.4.3 Siamese Networks with Concatenation Merge
  3.5 Implementation of Semi-supervised Learning Framework
    3.5.1 Semi-supervised Learning Framework
    3.5.2 Evaluation of Generated Annotations
4 Results
  4.1 Evaluation Metrics
    4.1.1 Evaluation Drone vs Bird Detection Challenge
    4.1.2 Evaluation of Semi-Supervised Learning Framework
  4.2 Results RetinaNet
    4.2.1 Evaluation of Baseline
    4.2.2 Utilizing Transfer Learning
    4.2.3 Inclusion of P2
    4.2.4 Definition of New Baseline
    4.2.5 Anchor Placement
    4.2.6 Pruning of P6 and P7
    4.2.7 Relaxation of IoU Thresholds
    4.2.8 Tuning of Inference Parameters
    4.2.9 Results Drone vs Bird Detection Challenge
  4.3 Results Temporal Information
    4.3.1 Concatenation of Feature Maps
    4.3.2 Siamese Networks with Addition Merge
    4.3.3 Siamese Networks with Concatenation Merge
  4.4 Results Semi-Supervised Learning Framework
    4.4.1 Visual Inspection of the Semi-Supervised Learning Framework
    4.4.2 Semi-Supervised Learning Framework with Regards to the Drone vs Bird Detection Challenge
5 Discussion
  5.1 Discussion of Results of RetinaNet
  5.2 Discussion of Result Drone vs Bird Detection Challenge
  5.3 Discussion of Results of Temporal Information
    5.3.1 Concatenation of Feature Maps
    5.3.2 Siamese Networks with Addition Merge
    5.3.3 Siamese Networks with Concatenation
    5.3.4 Utilizing Temporal Information
  5.4 Discussion of the Semi-supervised Learning Framework
    5.4.1 Visual Inspection
    5.4.2 Semi-Supervised Learning Framework with Regards to the Drone vs Bird Detection Challenge
  5.5 Future Work
6 Conclusion
Bibliography
A Appendix 1
1 Introduction

This chapter will present the problems investigated in this master's thesis, related work, and the contributions made. Lastly, an outline for the report is presented.

1.1 Background

In recent years, the availability of drones has increased. With their low price and ease of access, drones have spread beyond military powers to civilian users as well. Drones are difficult to detect even with fairly modern equipment due to their small size, and traditional methods such as radar struggle to separate drones from birds. One viable approach to overcome the problem of identification and detection is to use camera vision to detect and classify the drones.

In 2019, the 16th IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS) held its annual Drone vs Bird Detection Challenge [1]. The goal of this challenge was the creation of a deep learning algorithm able to detect and classify drones in a video sequence where birds and motion in the foreground and background may be present.

Training neural networks often requires a vast amount of data. Since collecting and annotating such data can be onerous, the amount of data is often limited. This thesis proposes a method of identifying drones using a deep learning algorithm that maximizes the use of the limited usable data by integrating temporal information. Additionally, a semi-supervised learning framework for the incorporation of unannotated data will be investigated.

1.2 Related Work

In the Drone vs Bird Detection Challenge, different teams have created varying solutions to the challenge of drone detection. M. Nalamati et al. present a solution to the challenge using a Faster R-CNN with a ResNet-101 base [2]. Another solution was to incorporate super-resolution techniques in order to increase the recall and thus increase the number of detected drones; this solution was presented by V. Magoulianitis et al. [3]. C. Craye et al. won the 2019 competition and used a U-Net with ResNet110v2 and divided the detection and recognition paths into two networks [4]. D.
Iglesia et al. presented a solution using a version of a RetinaNet in order to detect drones [5]. Outside of the Drone vs Bird Detection Challenge, further work has been made with the RetinaNet. C. Fu et al presented a modified version of a RetinaNet that ob- 1 1. Introduction tained a higher accuracy without increasing the computational cost. The presented approach adds mask predictions and a different loss function [6]. This master thesis utilizes a version of a RetinaNet. The original RetinaNet was proposed by T. Lin et al. The network consists of a backbone for feature extraction and two sub-networks for classification and regression. A loss function for the classi- fication known as the focal loss was also proposed. This loss function aims to tackle the problem of heavy class imbalance between the foreground and the background. This resulted in the detector having the same accuracy as a two-stage detector but with the speed of a one-stage detector [7]. In order to incorporate temporal information into a neural network, G Sistu et al. proposed multi-stream fully convolutional networks. In their work a two stream FCN, a three stream FCN, and a network with two streams combined with an additional LSTM architecture were presented [8]. D. Chahyati et al. implemented a Siamese network based on a RetinaNet and incorporated the Hungarian algorithm for tracking humans in moving images [9]. X. Wang investigated the impact of different time steps for temporal information in a RetinaNet, and which combination is preferred in order to utilize temporal information [10]. Labeling a large amount of data can be very expensive when training a neural network, several methods exist that aim to reduce the amount of work needed. E. Sangineto et al. suggested a self-paced training protocol for object detection using only image level annotations. The region proposals of a Fast-RCNN were utilized to acquire proposal boxes and the box with the highest confidence score was marked as a pseudo-label. "Easy" examples were used in the early network in order to prevent the training from diverging and to reliably expand their training set [11]. 1.3 Purpose The purpose of this thesis is to detect and classify drones while simultaneously not classifying birds as drones. To accomplish this, this master thesis focuses on the study, development, and implementation of an object detection algorithm. The thesis also endeavors to improve an object detection algorithm without using large amounts of annotated data. Additionally, this thesis investigates a semi-supervised learning framework for the incorporation of unannotated data. It also aims to investigate whether the framework can be utilized to iteratively and reliably expand the amount of available annotated training data while utilizing unannotated data. 1.4 Proposed Approach The project will start by modifying a standard RetinaNet to better fit the available data. Once complete, different strategies for incorporating temporal information and simultaneously utilize a pre-trained backbone will be developed and implemented. Finally, a semi-supervised learning framework for the incorporation of unannotated data will be created and evaluated. At the end of the thesis, the developed algo- rithm should be able to detect and classify drones in a setting occupied by birds. 2 1. 
Introduction The framework will be used to investigate whether the algorithm can be used to it- eratively and reliably expand the amount of available annotated training data while only utilizing unannotated data. This framework is expected to be able to generate annotations for easy scenarios and be able to create a few annotations for more difficult scenarios. 1.4.1 Integration of Temporal Information In object detection, the integration of temporal information has been proven to increase accuracy significantly. Temporal information refers to information from both space and time simultaneously. For example, in between consecutive frames of a video sequence, an object has moved in both space and time. The information in both of these frames is highly correlated and several methods exist that aim to utilize this information. A pre-trained backbone often improves the accuracy of a detection algorithm when only a small amount of training data is available. Since pre-trained backbones are not typically trained with the inclusion of temporal information, the parameters are only trained as feature extractors. Therefore, to utilize the pre-trained backbone, temporal information needs to be integrated in such a way that the pre-trained parameters are still used as feature extractors. For example in the work of [5], a frame difference channel was added to the three RGB- channels of the input image. Since the backbone is only trained with a three-channel input, this implementation could not utilize pre-trained parameters. This master thesis will study and evaluate different ways temporal information can be integrated, while also being able to utilize a pre-trained backbone. This will be accomplished by incorporating the difference between feature maps for consecutive timesteps into the neural network, both after and within the backbone. 1.4.2 Use of Unannotated Data Annotated data is usually hard to find as well as expensive since it requires extensive manual labor. Therefore different ways to incorporate unannotated data will be in- vestigated. This will be accomplished through the development of a semi-supervised learning framework. This framework will utilize the algorithm to generate annota- tions on unannotated examples. 1.5 Scope and Limitations Due to time constraints, this master thesis will implement and evaluate three dif- ferent methods of integrating temporal information. A RetinaNet will be used as a starting point and optimized to the best of our ability. The results of the op- timized RetinaNet will be specifically tailored to the problem at hand and should not be seen as a general purpose solution. Due to the limited amount of data avail- able, the methods used to improve the algorithm will utilize a pre-trained backbone. Limited access to GPU resources will also restrict how computationally expensive the suggested algorithm can be. The object detection algorithm will only be tuned for the classification of drones with other objects such as birds being regarded as 3 1. Introduction background. Only one version of the semi-supervised learning framework will be considered. The aim of this framework is to investigate whether the algorithm can be utilized as a simple annotation tool for drones. This investigation will be limited to a limited amount of unannotated collected data since the goal is to investigate the rough performance of the framework. 
The framework will further be limited to drones of roughly the same size as in the available annotated dataset from the Drone vs Bird Detection Challenge [1]. Furthermore, the evaluation of the framework will be performed through visual inspection of a sample of generated annotations to determine whether the annotations are comparable to hand-annotated data. Further manual evaluation with regard to the confidence score will be made. Additionally, these annotations will be evaluated against the Drone vs Bird Detection Challenge test dataset. No additional hand-annotated data will be utilized to investigate the performance of the framework.

1.6 Tools and Equipment

This section presents the tools and equipment used in this master's thesis. The section begins with an introduction of Google Colab and PyTorch, and lastly the object detection library Detectron2 is presented.

1.6.1 Google Colab

Google Colab is a service from Google that allows users to execute written Python code in a browser and provides free access to GPU resources. However, the amount of resources available varies from day to day and depends on the current service load and on how much the user has recently used the service. The maximum continuous runtime is 12 hours and commonly available GPUs are Nvidia K80s, T4s, P4s, and P100s; however, it is not possible to choose which GPU one is assigned. The memory of the virtual machine also varies between runtimes, but it does not vary during a runtime [12].

1.6.2 PyTorch

PyTorch is a Python-based open-source framework developed by Facebook's artificial-intelligence research group (FAIR). It can be utilized in conjunction with GPUs and is commonly used when developing deep learning projects. It has a simple and easy-to-use API, making it user friendly.

1.6.3 Detectron2

The code was written in Python and used the platform Detectron2. Detectron2 is an open-source, PyTorch-based, modular object detection library created by Facebook's artificial-intelligence research group. State-of-the-art object detection algorithms, such as versions of Faster R-CNN and RetinaNet, are implementable and easy to further modify thanks to Detectron2's modular design. Pre-trained model weights are easily obtainable and usable. Furthermore, the datasets COCO, LVIS, CityScapes, and PascalVOC are integrated in Detectron2.

1.7 Contribution

The contribution of this thesis is an investigation of different methods for integrating temporal information with limited GPU resources and data for drone detection. The suggested semi-supervised learning framework has, to the best of our knowledge, not been utilized for object detection before. Therefore, the development and investigation of this framework is also seen as a contribution.

1.8 Report Outline

Chapter 2 will cover the relevant theory in this thesis. The theory will cover the basics behind neural networks, object detection, the basic RetinaNet model, as well as the evaluation metrics and semi-supervised learning.

Chapter 3 will cover the methodology used in this thesis. It will present the available data, describe how the standard RetinaNet was implemented and modified, and how the three different methods of temporal information were integrated, as well as give a description of the semi-supervised learning framework.

Chapter 4 will cover the results of the sections presented in the methodology as well as a more detailed description of the methods.

Chapter 5 will discuss the results in more detail and present suggested future work.
Chapter 6 will contain conclusions drawn from the results and discussion.

2 Theory

This chapter presents the relevant theory for this master's thesis and aims to ease the understanding of the project. Section 2.1 covers the theoretical basics of artificial convolutional neural networks. Section 2.2 introduces object detection and IoU. ResNet is presented in Section 2.3 and FPN in Section 2.4. RetinaNet is introduced in Section 2.5. Finally, different evaluation methods are presented in Section 2.6 and semi-supervised learning is introduced in Section 2.7.

2.1 Neural Networks

Neural networks are algorithms modeled to recognize patterns from an input, with a design inspired by the neural network structure inside the human brain. A biological neuron consists of a cell body, an axon, and dendrites. Most neurons receive input signals via their dendrites and then produce an output signal through their axon. The information between two neurons is transferred via synapses, which enable the passing of an electrical or chemical signal to the other cell.

An artificial neuron operates in much the same way as a biological neuron, see Figure 2.1. The input signal of an artificial neuron can be represented as $x_i$ and the synapse is represented by the weight $W_i$. A neuron may have inputs from multiple sources; the $i$ in these expressions thus represents each input. The information transferred into the dendrites of the target neuron can then be represented by $W_i x_i + b$, where $b$ is the bias. In an artificial neural network, information is encoded by the frequency with which the neuron sends information. The inputs to the neuron are summed and put through an activation function, which defines whether the neuron should be considered to have sent information or not.

Artificial neural networks are able to learn from information because the parameters $W_i$ and $b$ are trainable. Thus, when receiving an input, $W_i$ and $b$ should be selected so that the information received is properly decoded. This process of selecting the proper values for $W_i$ and $b$ is referred to as training. In a neural network, multiple neurons are used simultaneously. A group of neurons forms a layer, and there are several layers in a complete network. The layers can be divided into the input layer, the hidden layers, and the output layer. The input layer receives the initial input to the network and sends it through the hidden layers. The output of the network is produced after the hidden layers by the output layer. A standard neural network has all the neurons in one layer connected to all the neurons in the next layer. This structure is known as fully connected layers [13].

Figure 2.1: Visualization of an artificial neuron and a biological neuron [13]. CC-BY

2.1.1 Activation Functions

The activation function is responsible for deciding whether a neuron has sent information or not. The activation function is required to be non-linear to enable the stacking of multiple layers and the usage of gradient-based optimization methods. With a linear activation function, the stacked layers would simply be a linear combination of each other and could thus be replaced by a single layer. Furthermore, with gradient-based optimization, the gradient with respect to the input would always be constant if a linear activation function were utilized. This leads to an optimization that is not dependent on the input [14]. The ReLU and the Sigmoid function are two of the most commonly used activation functions, see Figure 2.2.
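As a minimal illustration, and assuming PyTorch (the framework used later in this thesis), the two activations defined formally below can be applied element-wise to a tensor:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# ReLU clamps negative inputs to zero and passes positive inputs through.
print(torch.relu(x))     # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])

# Sigmoid squashes every input into the range (0, 1).
print(torch.sigmoid(x))  # tensor([0.1192, 0.3775, 0.5000, 0.6225, 0.8808])
```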
The ReLU function can be written as:

\[
\mathrm{ReLU}(x) = \max(0, x). \tag{2.1}
\]

The Sigmoid function can be written as:

\[
\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}. \tag{2.2}
\]

Figure 2.2: Visualization of different commonly used non-linear activation functions [15]. CC-BY

2.1.2 Loss Functions

Once an input has been given to a neural network and an output is received, one needs to calculate whether this output was correct or not. This is accomplished by the inclusion of a loss function. The goal of the loss function is to quantify the error between the predicted output and the expected output, so that this error can be minimized. The expected output is referred to as the ground truth. There are different kinds of loss functions adapted for solving specific problems, for example regression and classification problems.

2.1.2.1 Cross-Entropy Loss

The cross-entropy loss can be defined by either binary or multi-class cross entropy. Multi-class cross entropy can be split into multiple binary cross entropy functions, where each function corresponds to one class of interest. In this thesis, binary cross entropy has been considered since only the drone class is of interest. The binary cross entropy loss can be written as:

\[
\mathrm{CE}(p, y) =
\begin{cases}
-\log(p), & \text{if } y = 1 \\
-\log(1 - p), & \text{otherwise.}
\end{cases}
\]

In this equation, $y$ is the ground truth and $p$ is the estimated probability that the class has the label $y = 1$. The equation for binary cross entropy can be rewritten as:

\[
\mathrm{CE}(p_t) = -\log(p_t), \tag{2.3}
\]

where $p_t$ is defined as:

\[
p_t =
\begin{cases}
p, & \text{if } y = 1 \\
1 - p, & \text{otherwise.}
\end{cases}
\]

The binary cross entropy loss receives its inputs from the last layer of the underlying neural network. The prediction is the estimated probability that the input was either foreground or background. Here, the foreground refers to the particular class that the binary problem corresponds to, and the background refers to the other classes and to no class [16].

2.1.2.2 Focal Loss

The focal loss is based on the commonly used cross entropy loss, although it is designed to combat class imbalance, see Figure 2.3. Class imbalance, the imbalance between objects labeled as foreground and background, is a common issue. In a dataset there might be far more instances labeled as background than as foreground. Should this be the case, the neural network might become overconfident in predicting background, since the contribution of background to the loss is far larger than the contribution of foreground. This would result in poor performance when classifying a foreground example. The focal loss aims to improve upon this problem by introducing a modulating factor to the cross entropy loss. The focal loss can be written as:

\[
\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t). \tag{2.4}
\]

The modulating factor $(1 - p_t)^{\gamma}$ is responsible for suppressing losses that have a large probability, because if a prediction is very accurate, the impact it has on the loss should be considerably smaller compared to a prediction with a lot of uncertainty [7].

Figure 2.3: Comparison of focal and cross entropy loss. Here, γ is a parameter that is tuned by the user. The standard cross entropy loss is recovered when γ = 0. From [7]. CC-BY
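To make the effect of the modulating factor concrete, the sketch below implements the binary focal loss of Equation (2.4) in PyTorch. This is a minimal illustration only: the γ value is just an example, and the thesis itself relies on the focal loss implementation provided by Detectron2.

```python
import torch

def binary_focal_loss(logits, targets, gamma=2.0):
    """Binary focal loss following Eq. (2.4).

    logits:  raw network outputs (before the sigmoid), shape (N,)
    targets: ground-truth labels in {0, 1}, shape (N,)
    """
    p = torch.sigmoid(logits)
    # p_t is the probability assigned to the true class.
    p_t = torch.where(targets == 1, p, 1.0 - p)
    # The modulating factor (1 - p_t)^gamma down-weights easy examples.
    loss = -((1.0 - p_t) ** gamma) * torch.log(p_t.clamp(min=1e-8))
    return loss.mean()

# A confident correct prediction contributes far less than an uncertain one.
logits = torch.tensor([4.0, 0.1])
targets = torch.tensor([1, 1])
print(binary_focal_loss(logits, targets))
```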
2.1.2.3 Smooth L1-Loss

Smooth L1-loss is a variant of the standard L1-loss that combines the properties of the L1-loss when the loss is large with those of the L2-loss as the loss gets smaller. The L1-loss can be written as:

\[
S = \sum_{i=1}^{n} |y_i - f(x_i)|. \tag{2.5}
\]

The L2-loss can be written as:

\[
S = \sum_{i=1}^{n} (y_i - f(x_i))^2. \tag{2.6}
\]

In these equations, $y_i$ is the ground truth and $f(x_i)$ is the approximated value. The L1-loss takes the absolute value of the difference between the target and the prediction, while the L2-loss takes the square of the difference. The L1-loss is advantageous when the loss is large since it is robust to outliers and produces sparser solutions; however, it is not differentiable when the loss is zero, see Figure 2.4, so gradient-based optimization will be sub-optimal there. Due to its quadratic nature, the L2-loss is differentiable when the loss is zero. The L2-loss produces more accurate results compared to the L1-loss because it penalizes larger errors to a higher extent; however, the L2-loss is more sensitive to outliers than the L1-loss [17].

Figure 2.4: Illustration of L1-loss, L2-loss and smooth L1-loss. Note how the derivative is undefined for the L1-loss when the loss is 0. From [18]. CC-BY

The smooth L1-loss combines the L1- and L2-loss and uses the parameter β to distinguish between them. The smooth L1-loss can be written as:

\[
\mathrm{smooth}_{L1}(x) =
\begin{cases}
0.5x^2/\beta, & \text{if } |x| < \beta \\
|x| - 0.5\beta, & \text{otherwise.}
\end{cases} \tag{2.7}
\]

When the loss is above the threshold decided by β, the regression loss behaves like the L1-loss. When the loss falls below this threshold, it instead behaves like the L2-loss, see Figure 2.5.

Figure 2.5: Illustration of smooth L1-loss. When the loss falls below a certain threshold, it switches to L2-loss. From [19]. CC-BY

2.1.3 Optimizers

Once the neural network has processed an input and the losses for the predictions are calculated, the trainable parameters need to be tuned in order to minimize the loss. This is done through an optimizer and the back-propagation algorithm. It is common to use gradient-based optimization methods, the most common of which is known as gradient descent, see Figure 2.6. In a neural network the number of parameters depends on the number of layers; for notational simplicity, the weights and biases of all the layers are collected in the parameter θ. The update equation for gradient descent can be written as:

\[
\theta_{t+1} = \theta_t - \alpha \frac{\partial E(X, \theta_t)}{\partial \theta}. \tag{2.8}
\]

In this equation, α is the learning rate, a hyperparameter that indicates how much the weights are allowed to be updated during training. A too small α will result in an optimization that is very slow, since the weights are updated by only a small amount in each pass. A too large α might result in the parameters changing too much and jumping over the optimum. $E(X, \theta_t)$ is the expected value of the loss function with network parameters θ at time t and input-output pairs $(x_i, y_i) \in X$. This equation illustrates how the parameters in θ are updated by using the previous weights and the gradient of the loss function with respect to those parameters given an input-output pair [20].

Figure 2.6: Illustration of gradient descent with one parameter, w. From [21]. CC-BY

In gradient descent, the weight update only occurs when the entire dataset has been processed. Due to the sheer number of parameters in a neural network, the convergence of the optimizer to the global minimum may be very slow. Therefore, an alternative form of gradient descent is utilized in neural networks, known as stochastic gradient descent (SGD). The difference between gradient descent and stochastic gradient descent is that instead of processing the entire dataset before updating the weights, SGD takes a sample and updates the weights based on this sample.

Neural networks usually contain many layers, which all contain a set of parameters.
When using stochastic gradient descent to perform optimization, it is necessary to know how each of these parameters affects the final loss function. To calculate this, the back-propagation algorithm is used. Back-propagation consists of four parts: the forward pass, the loss function, the backward pass, and the weight update. The forward pass passes an input through the network to get a prediction. The loss function calculates the error of the prediction with respect to the expected output. The backward pass is then performed in order to relate how all the parameters in all layers contribute to the loss function. Once the relation between the parameters for the layers is known for the specific input, the weights are updated in such a way that the loss is reduced [22]. 12 2. Theory 2.1.4 Convolutional Neural Networks A convolutional neural network (CNN) includes neurons, activation functions, a loss function, and an optimizer but the layers differ from a basic neural network. A con- volutional neural network utilizes convolutional layers instead of fully connected. The number of layers varies but the network consists of an input layer, multiple hidden layers, and an output layer. Convolutional layers are beneficial when deal- ing with images compared to standard fully connected layers since they utilize the concept of filters. An image can be thought of as a matrix with a specific width and height. In this matrix, each pixel is given a value based on its color gradient. Similarly, each filter is a matrix that also contains values, however, these values are trainable. If an image were to be processed in a fully connected fashion, the image matrix would need to be flattened into an array and each element would be assigned to a separate neuron as input. This would not only greatly increase the required parameters for the network, but the highly correlated information between adjacent pixels would not be utilized. Therefore, a convolution layer is more suitable for pro- cessing images compared to a fully connected one. A fully convolutional network is a convolutional network without any fully connected layers. These types of networks are very common in image segmentation. Given an input image, the filters move as a sliding window across the pixels in the image. The amount of pixels that the filter is moved each iteration is specified by the stride. When the filter moves, the numerical values of the image are multiplied by the parameters inside the filter with element wise multiplication. These values are then summed up and the resulting value represents the information that was extracted by the filter at a specific location in the image. By applying filters to the inputs for the different layers, the network can differentiate and recognize the different object and features in images, see Figure 2.7. The filters range from being able to detect basic features such as brightness in an image to more detailed features and characteristics of an object. Each layer usually contains more than one filter since each filter will only be trained to recognize a specific feature. The number of filters in a convolutional layer corresponds to the number of channels. Once the filter has slid or convolved over the entire image, the output is an array of numbers. This array is referred to as a feature map. In a convolutional network maxpooling layers are added to reduce the spatial size to reduce the computational expense. 
This is done by dividing the output of a layer into regions and only keeping the highest value within each region, thus reducing the size. The intuition behind this is that the pixel of a feature map that has the highest value contains the most useful information, and therefore the other adjacent pixels can be discarded [23].

Figure 2.7: Visualization of how an image is processed and classified in a convolutional neural network [23]. CC-BY

2.1.5 Transfer Learning

Transfer learning is when a model, trained and adapted for a specific task, is reused as the base for another task. Training a convolutional network from scratch requires a large amount of data; it is therefore advantageous to utilize pre-trained models. Pre-trained models are often good at extracting common features because they are trained on large datasets with images of many different kinds of objects. Even if the specific image class is missing from the pre-training, the basic features, such as simple shapes, can be reused, and the network can then be fine-tuned for the specific task. This can be done by retraining only the deepest layers on the specific data [24].

2.2 Object Detection

Object detection can be divided into two parts: object localization and object classification. Localization is used to point out where objects are located in the image and classification is used to decide what types of objects are present in the image. There are two different types of object detection techniques: two-stage object detection and one-stage object detection. The output of an object detection algorithm is usually a box that covers the object as well as the predicted class label of that box. These boxes are referred to as bounding boxes. This section will cover two-stage detectors and one-stage detectors, and explain IoU and non-maximum suppression and how they are used in training and evaluation.

2.2.1 Two-Stage Detection

Two-stage object detectors perform localization and classification of an object in two stages; R-CNN, Fast R-CNN, and Faster R-CNN are examples of two-stage detectors. The first stage is to decide regions of the image in which objects of interest may be present, so-called regions of interest. These regions can be generated in different ways, for example by utilizing the selective search algorithm or a neural network known as a region proposal network. Once these regions are decided, the second stage is performed by passing them into a neural network that performs object classification. In addition, regression is performed to fit the proposed regions closer to the actual object. Since the model needs to perform two passes over the image, this method is not very fast [25].

2.2.2 One-Stage Detection

YOLO, SSD, and RetinaNet are one-stage object detectors that combine localization and classification into one step, and therefore only require a single stage. The regions of interest in one-stage detectors are acquired through the use of anchor boxes. One can think of an anchor box as an initial guess of where an object might be present. These anchors are evenly distributed over the input image, and for each of the anchors a prediction is made as to whether it contains an object of interest or not. If an anchor is deemed to contain an object with the help of a classifier, the placement of the anchor is fine-tuned by a regressor to fit the object more accurately. The placements and sizes of these anchors are decided beforehand by the user, and thus no region proposal pass is required, since the regions to investigate are fixed in advance. This often makes one-stage object detection faster but less accurate compared to two-stage object detection [26].
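To illustrate how such anchors are laid out, the sketch below generates a grid of anchor boxes over a feature map. This is a minimal sketch: the sizes, aspect ratios, and stride are example values, and the thesis itself uses the anchor generator built into Detectron2.

```python
import torch

def generate_anchors(feature_h, feature_w, stride,
                     sizes=(32, 64), aspect_ratios=(0.5, 1.0, 2.0)):
    """Return anchor boxes (x1, y1, x2, y2) centered on every feature-map cell."""
    anchors = []
    for i in range(feature_h):
        for j in range(feature_w):
            # Center of cell (i, j) expressed in input-image pixels.
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for size in sizes:
                for ratio in aspect_ratios:
                    w = size * ratio ** 0.5   # area stays size**2 for every ratio
                    h = size / ratio ** 0.5
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(anchors)

# A 4x4 feature map with stride 16 and 6 anchors per location -> 96 anchors.
print(generate_anchors(4, 4, stride=16).shape)  # torch.Size([96, 4])
```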
2.2.3 IoU

The intersection over union (IoU) is a measurement of how much one box overlaps with another box, see Figure 2.8. It can be written as:

\[
\mathrm{IoU}(\mathrm{box}_1, \mathrm{box}_2) = \frac{|\mathrm{box}_1 \cap \mathrm{box}_2|}{|\mathrm{box}_1 \cup \mathrm{box}_2|}. \tag{2.9}
\]

Figure 2.8: Visual representation of IoU. From [27]. CC-BY

The IoU measurement is used to decide whether a proposed anchor box during training, or a predicted bounding box during evaluation, is deemed to be correct. During training, the model has access to all the ground truth bounding boxes, and when the anchors are placed, the IoU is calculated for all pairs of anchors and ground truths. If an anchor box has an IoU larger than the defined threshold over a ground truth, the model learns that this anchor contained an object of a specific class and how far away this anchor was from the ground truth. If the anchor box did not have an IoU over the threshold, the model learns that this anchor box contained only background [28].

In the prediction step, anchor boxes are generated and a class is predicted for all anchor boxes. If an anchor box is deemed to contain an object, its position is
Even though it was possible to make the network converge from the start, an increase of layers surprisingly showed that both training and testing errors were increased with the addition of more layers. The purpose of ResNet was, therefore, to solve the vanishing and exploding gradient problem, while also taking care of the performance degradation problem by using skip connections to add the output from one layer to the other layers, skipping over some layers [33]. 2.3.1 Vanishing and Exploding Gradient Problem The problem with vanishing and exploding gradients occur during the backpropa- gation of the error function. During backpropagation, the gradient of the error is calculated backward through the network to tune the parameters of the network such that the loss function is minimized. Backpropagation is calculated by utilizing 16 2. Theory the chain rule to represent how each layer affects the final loss function [34]. Some- times, however, the gradients become very large or very small. Since the chain rule multiplies the gradients of all the layers, this might cause the final product to be close to zero or very large. With a too large or too small gradient, the network is unable to gather useful information of how each layer affect the loss function, and thus it is unable to learn [32]. 2.3.2 Performance Degradation The performance degradation problem was exposed when the previously utilized methods for combating the vanishing and exploding gradients revealed that the training and testing errors increased with deeper networks. The reason for this was that the currently available solvers simply could not find the desired mapping between the layers. The problem can be solved by utilizing so called residual blocks, see Figure 2.9. These blocks use a skip-connection between the non-linear layers and reformulate the problem slightly. Previously the goal was to find the desired underlaying mapping between the stacked layers. This mapping can be described as H(x). Since this mapping proved to be hard for solvers to find, the stacked layers were recast to fit another mapping, namley the residual mapping. This mapping can be described as F(x) := H(x) − x. This equation is then rewritten as H(x) = F(x)+x. It is easier for the solvers to find F(x) rather than finding the unreferenced H(x), because F(x) generally has a small response and thus the identity mapping, x provides a strong precondition [33]. Figure 2.9: A residual block. From [35]. CC-BY 2.3.3 ResNet Architectures The architectures of ResNet are based on stacking residual blocks on top of each other. ResNet-50, ResNet-101 and ResNet-152 are different kinds of ResNet archi- tectures, where different numbers of layers are utilized. For example, ResNet-50 consists of 50 convolutional layers and ResNet-101 has 101, excluding the final fully connected layers. 17 2. Theory Figure 2.10: Table of the layers in different ResNet architectures. From [36]. CC-BY In a common ResNet each residual block consists of either two 3x3 layers or two 1x1 layers with a 3x3 layer in between, see Figure 2.10. The former of these two structures is called a Bottleneck block, however, the two structures work in the same way as a residual block. Each layer in Figure 2.10 is referenced as conv2_x up to conv5_x, however, for the remainder of this report, these layers will be referenced to as Res2 up to Res5. 
2.4 Feature Pyramid Network The concept of feature pyramids is used in many object detection algorithms, such as SSD, that utilizes a deep convolutional network in order to compute a feature hierarchy, see Figure 2.11. The deep convolutional network sub-samples layers which in turn create feature maps of varying spatial resolution. This inherent pyramidal feature hierarchy is used by the SSD in the form of auxiliary layers that are added on top of its backbone. These auxiliary layers then perform predictions on their own in conjunction with the output from the backbone. This enables the SSD to utilize multiple scales of spatial resolution when performing detections. However, a major downside comes with this approach, by adding these auxiliary layers to the output of the network the architecture does not utilize the higher resolution feature maps. The higher resolution feature maps do not have rich enough semantics since they have not been processed entirely by the network. Therefore it would not be beneficial to add them to the detection output [37]. By omitting these feature maps, the SSD struggles with the detection of small objects. Figure 2.11: The SSD-architecture. The auxilary layers are added at the output of the backbone. From[38]. CC-BY 18 2. Theory The feature pyramid network (FPN) is an improvement of the shortcomings of the SSD-architecture. The FPN architecture enables the use of high resolution feature maps that is simultaneous semantically strong. To achieve this, the architecture consists of a bottom-up pathway for feature extraction and a top-way pathway for up-sampling low resolution feature maps, see Figure 2.12. These feature maps are then combined through the use of lateral connections. Figure 2.12: Example of FPN structure. From[39]. CC-BY 2.4.1 Bottom-up Pathway The bottom-up pathway consists of a backbone that is used as a feature extractor. This backbone can, for example, consist of the feed forward computations of a ResNet, see Figure 2.13. At each level the feature map is sub-sampled with a factor of two and thus its dimensions are halved. This is represented by the different strides of the feature levels. The strides at different feature levels are {4, 8, 16, 32} at {Res2, Res3, Res4, Res5} respectively. The receptive field specifies the region of an image that is visible at each feature level, once the stride increases the receptive field increases. Since the receptive field increases with higher levels, these levels are more suitable for detecting larger objects. At these levels, the feature map is semantically strong due to its many convolutions, however, this comes with the cost that the spatial resolution is low. On the lower levels, the feature map has a higher spatial resolution and a smaller receptive field and is therefore more suitable for detecting smaller objects. 2.4.2 Top-down Pathway with Lateral Connections In the top-down pathway, feature maps are merged in order to create feature maps that are both semantically strong and have high resolution. This is done by up- sampling the feature maps by a factor of 2 for each of the levels in the FPN. Nearest neighbor up-sampling is utilized and the up-sampled feature maps are then enhanced with the feature map of the corresponding size from the bottom-up pathway. The feature map from the bottom-up pathway lacks the strong semantics compared to the up-sampled ones from the top-down pathway, however they have higher resolution. 
Through the use of lateral connections between the two pathways, the semantically strong and the high resolution feature maps are merged. Each lateral connection is a convolutional layer with a kernel size of 1 × 1, used to set the channel depth to 256 for all feature maps from the bottom-up pathway. The feature maps are merged using element-wise addition, and to reduce aliasing from the up-sampling, each merged feature map is put through a final convolutional layer with a kernel size of 3 × 3. After this process, each feature map has the strongest semantics possible, since they all originate from the deepest layer, while their resolution has been enhanced with the corresponding feature map from the bottom-up pathway [40].

Figure 2.13: Example of FPN structure with ResNet. From [39]. CC-BY

2.5 RetinaNet
RetinaNet is an algorithm heavily based on the FPN architecture; it uses the FPN as its backbone but makes slight modifications to it, see Figure 2.14. The original FPN backbone has feature levels {p2, p3, p4, p5} that are related to the corresponding feed-forward computations from the ResNet backbone {Res2, Res3, Res4, Res5}. RetinaNet, however, elects not to use p2 as it is rather computationally expensive. Furthermore, it includes two additional feature levels, making RetinaNet consist of {p3, p4, p5, p6, p7}. The last two feature levels are based on the output from Res5: p6 is obtained by performing a convolution on Res5 with a kernel size of 3 × 3 and a stride of 2, and the last level, p7, is obtained by applying a convolution on p6 with a kernel size of 3 × 3 and a stride of 2. By including these two additional levels, the detection of large objects is improved [7]. Similarly to the FPN architecture, RetinaNet attaches sub-networks to each feature level in order to make predictions. These sub-networks are designed for classification as well as regression, and one of each is attached in parallel to each feature level.

Figure 2.14: An example of RetinaNet. From [41]. CC-BY

2.5.1 Classification Subnet
Identical classification sub-networks are placed on each feature level; all of them share parameters and consist of five fully convolutional layers with a kernel size of 3 × 3. The channel depth of the first four layers is 256, corresponding to the channel depth of the FPN output. The fifth layer has a channel depth of A × K, where K is the number of classes to predict and A is the number of anchors for each spatial location. Each of the first four layers is followed by a ReLU activation function, and the output of the final layer is passed through a sigmoid activation function to obtain the binary predictions. The size of the last layer can be written as (N, A × K, Hi, Wi), where N represents the batch size and Hi and Wi represent the height and width of the input feature map at FPN level i. Since the number of channels in the last layer corresponds to A × K, A anchors are defined for each spatial location. Each of these anchors is responsible for detecting any of the K available classes, and each output channel thus represents the probability of an anchor containing one of the classes. In the classification part of RetinaNet, the focal loss function is utilized.
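As a complement to the description above, the following is a minimal sketch of the sigmoid focal loss applied to the classification subnet outputs, assuming PyTorch; it follows the standard formulation from Section 2.1.2.2 rather than Detectron2's exact implementation, the normalization by the number of foreground anchors is omitted, and the tensors in the example are random placeholders.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    logits  -- raw sigmoid inputs from the classification subnet, any shape
    targets -- binary tensor of the same shape (1 = foreground, 0 = background)
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()

# Example: ten anchor predictions, one of which is foreground.
logits = torch.randn(10)
targets = torch.zeros(10)
targets[0] = 1.0
print(focal_loss(logits, targets))
```

The modulating factor (1 - p_t)**gamma down-weights easy, well-classified anchors so that the many easy background anchors do not dominate the loss.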
2.5.2 Regression Subnet
The classification sub-network is responsible for evaluating each anchor in order to decide whether it contains an object or not, whereas the regression sub-network is responsible for deciding how much the anchor box should be shifted to more accurately fit the object. The regression sub-network is almost identical to the classification sub-network, with five fully convolutional layers, however, the final layer differs slightly. The regression sub-network has a channel depth of A × 4, four values that need to be predicted to fit the edges of the anchor box closer to the object, and the size of the output layer is (N, A × 4, Hi, Wi) [7]. The L1-loss function is used for the bounding box regression in the original implementation of RetinaNet, however, in this work a smooth L1-loss is used instead. The smooth L1-loss combines the positive properties of both the L1 and the L2 loss and thus contributes to a better approximation of the bounding box regression.

2.6 Evaluation
This section covers two different ways to evaluate the performance of an object detector. This thesis will use the F1-score and AP to evaluate the results.

2.6.1 Precision
Precision is a measurement of how well the model is able to detect only objects of interest. For example, if a model for detecting diseases has low precision, it will yield a high number of positive detections even if only a few subjects have the disease. The precision of an object detector can be calculated as:

Precision = TP / (TP + FP). (2.10)

TP is the number of true positives, the detections that were deemed correct, and FP stands for false positives. IoU is utilized to determine the nature of a detection. If the predicted bounding box has an IoU with a ground truth that is larger than the threshold, the detection is deemed to be related to this ground truth. If the predicted class of the bounding box is the same as the class of the ground truth, the prediction is deemed to be correct. Otherwise, the detection is labeled as a false positive. Multiple predictions of the same object are also deemed to be false positives [42].

2.6.2 Recall
Recall is a measurement of how well the model is able to detect all objects of interest. A model for detecting disease in humans with a low recall will result in a lot of missed diagnoses, where people are told they are healthy even though they are sick. The recall can be written as:

Recall = TP / (TP + FN). (2.11)

FN stands for false negatives, the number of ground truths undetected by the predictor. If a predictor outputs a set of bounding boxes and none of the bounding boxes has an IoU with the ground truth that is larger than the threshold, the model was not able to detect this object, and it is therefore referred to as a false negative [42].

2.6.3 F1-Score
The F1-score combines both of these metrics to evaluate how well the model is able to prevent false detections without missing actual detections [43]. The F1-score is the harmonic mean of precision and recall, thus both values are taken into consideration. The F1-score is defined as:

F1 = 2 × (Precision × Recall) / (Precision + Recall). (2.12)

2.6.4 Average Precision
In object detection problems, calculating the mean average precision is a common way to evaluate the performance of an algorithm. Average precision is the area under a precision-recall curve,

AP = ∫₀¹ p(r) dr. (2.13)

This curve plots recall on its x-axis and precision on its y-axis, see Figure 2.15.
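To make Equations (2.10)-(2.12) concrete, the following small sketch combines illustrative detection counts into the three metrics; the counts are invented for the example and are not results from this work.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Suppose a detector finds 80 drones correctly, raises 20 false alarms and misses 40 drones.
print(precision_recall_f1(tp=80, fp=20, fn=40))  # (0.8, 0.667, 0.727), approximately
```

How the precision and recall pairs that trace out the curve in Figure 2.15 are obtained is described next.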
The precision and recall pairs are calculated for a confidence score threshold that is gradually decreased. This threshold is taken from the confidence of the predictions: highly confident predictions are investigated first, and as the threshold decreases, more of the less confident predictions are considered.

Figure 2.15: An example of a PR-curve. From [44]. CC-BY

As the confidence score threshold is decreased, lower quality predictions are allowed. This results in an increase in recall, since more predictions are made and thus the number of false negatives is decreased. However, the precision is reduced when lower quality matches are allowed, since more false positives occur at lower confidence scores. Since precision is reduced as recall increases, a high AP-score requires the detector to maintain high precision over a broad range of recall values. Mean average precision is simply the average of the average precision over all available classes. In this work, only one drone class has been considered, and therefore average precision and mean average precision are the same [42].

2.7 Semi-Supervised Learning
Supervised learning methods refer to methods that utilize fully labeled datasets and aim to approximate a mapping function that, given an input, produces the desired output. This process utilizes the labeled data as a reference during training to tell the model whether a prediction was correct or not. Unsupervised learning does not have access to a labeled dataset and thus cannot utilize such a reference during training. The aim of an unsupervised learning method is instead to find a relationship describing the structure in the data. An example of this is k-means clustering, which assigns each data point to one of K groups, partitioning similar data points into the same category [45]. Semi-supervised learning is a combination of supervised and unsupervised learning. It is a methodology for utilizing a small amount of annotated data together with a large amount of unannotated data. One way to implement semi-supervised learning is by utilizing pseudo-labeling. First the network is trained in a supervised fashion, using annotated data. Thereafter the model is used on unannotated data and generates predictions on this data. The prediction with the highest confidence score is considered to represent the true label and is thereafter utilized as annotated data [46].

3 Methods
This chapter will present the methods utilized to fit the existing RetinaNet architecture to the problem of small drone detection. Furthermore, the work done with temporal information and the semi-supervised learning framework will be presented. Section 3.1 will describe the analysis of the datasets from the Drone vs Bird Detection Challenge and provide a visualization of the training data, with emphasis on the challenges to overcome. Section 3.2 presents the implementation of the RetinaNet. Section 3.3 will describe the modification of the RetinaNet. Section 3.4 describes how temporal information was incorporated, and Section 3.5 presents the semi-supervised learning framework.

3.1 Visualization and Analysis of Dataset
The dataset used for both training and evaluation of the algorithm comes from the competition Drone vs Bird Detection Challenge [1]. It contains eleven short video sequences for training and three short videos for testing. The images were taken with a static camera and contain drones, birds, and background. However, only annotations for drones are available.
The training data consists of 7245 images and the test data consists of 2079 images. The resolution of the extracted images was 1080 × 1920.

3.1.1 Visualization of Data
The data provided by the Drone vs Bird Detection Challenge depicts drones flying in a wide variety of environments and contains challenging scenarios such as occlusion, moving objects in the background, flying within close proximity of birds, and combinations of these. In addition to these challenges, the small size of the drones presents an additional difficulty. The figures presented below illustrate a couple of samples of the data that were used for training the algorithm.

Drones are featured in different environments, see Figure 3.1 and Figure 3.2. In these figures the environment changes with regard to the lighting of the background. One drone has clear LED lights on and another has no lights on, thus the appearance of the drone itself changes. The environment can include rural and urban features, see Figure 3.3 and Figure 3.4. The size of the drones was small, and thus the algorithm needs to take into account both the change in environment and the reduced size.

Figure 3.1: Drone flying during the night with LED lights and a dark background.
Figure 3.2: Drone flying during the day with no lights on and a light background.
Figure 3.3: A drone flying in a park far away.
Figure 3.4: A drone flying in a field far away.

Another challenge was the inclusion of birds flying next to the drones, see Figure 3.5 and Figure 3.6. The drone is far away and a bird is flying next to it. In this example, the algorithm needs to handle the small distant drone and simultaneously not misclassify the bird.

Figure 3.5: Original image containing one drone far away flying close to a bird.
Figure 3.6: Zooming in on the drone, one can see the contours of the bird flying close.

3.1.2 Analysis of Data
The majority of the annotations for the training data from the Drone vs Bird Detection Challenge have a size of 3-32 pixels, see Figure 3.7. The aspect ratios of the annotations were defined as the height of the annotation over the width, see Figure 3.8. It was most common for the width and height of the annotations to be equal; otherwise, the width was generally larger than the height.

Figure 3.7: Histogram illustrating the distribution of ground truth sizes.
Figure 3.8: Histogram illustrating the distribution of aspect ratios in the ground truth.

3.2 Implementation of RetinaNet
The implementation of the RetinaNet as defined in [7] was found in the Detectron2 library. The following section aims to explain the standard implementation in detail.

3.2.1 Overview of Implementation
The standard RetinaNet in Detectron2 is designed to detect objects in a wide range of shapes and sizes. It is able to detect objects from eighty different classes and with sizes varying between 32 and 813 pixels [7]. The RetinaNet used in this thesis, before any modifications, consists of an FPN backbone with a bottom-up architecture of ResNet-101. The outputs of the bottom-up ResNet-101 consisted of [Res3, Res4, Res5] and the corresponding outputs of the FPN were [p3, p4, p5, p6, p7]. The number of channels from the FPN was 256, and the features were merged with element-wise addition in the FPN. The pre-trained weights for ResNet-101 come from the ImageNet-pretrained MSRA R-101 model.

3.2.2 Anchor Placement in Implementation
Anchors were placed on each feature level independently.
In Detectron2, the anchors defined for each feature level were centered around each pixel of the corresponding feature map. The sizes of the feature maps were determined by the strides of each level. The strides for the outputs of the FPN were [8, 16, 32, 64, 128], corresponding to the levels [p3, p4, p5, p6, p7]. These values indicate how much the input image was down-sampled at each level. The pixels on a feature map will henceforth be referred to as feature pixels, in order to avoid confusion with the pixels of the input image, referred to as input pixels. The anchors were centered around the feature pixels on each feature map. Since ResNet is fully convolutional, the spatial positions on the feature maps can be related to spatial positions on the input, see Figure 3.9. Depending on the stride of the feature level, each feature pixel corresponds to an area of Stride × Stride input pixels. Once the feature pixels were related to the corresponding area on the input image, the anchor coordinates were defined with respect to their corresponding coordinates on the input image.

Figure 3.9: Illustration of how feature pixels relate to the input pixels. The low level feature map has a stride of 2 and the high level feature map has a stride of 4 [47]. CC-BY

3.2.3 Sub-networks in the Implementation
When the anchors were defined, the feature maps from the FPN backbone were fed into the classification and regression sub-networks. In Detectron2, the output layer of the classification sub-network has a number of channels corresponding to the number of anchors around each feature pixel times the number of classes to be predicted. This is represented as a tensor of size (N, A × K, Hi, Wi), where the index i denotes the feature level. The output of the regression sub-network is almost identical to the classifier output, but the number of channels is the number of anchors around each feature pixel times four. The four values represent the edges of the anchor box, and the values in these channels represent how much the edges should regress in order to fit the ground truth. This is represented by a tensor of size (N, A × 4, Hi, Wi). Given that an anchor was predicted to contain an object of interest, the goal of the regression sub-network was to decide how much this anchor box should be regressed in order to fit the detected object properly.

3.2.4 Anchor Matching with Ground Truth
In order for the network to utilize a ground truth annotation, at least one of the placed anchors needs to overlap with the ground truth. Whether an anchor was considered a match with a ground truth was based on an IoU threshold. For a RetinaNet in Detectron2, anchor matching was performed individually for each anchor from all feature levels, for each of the ground truths in the image. If an anchor was labeled as foreground, it was assigned the correct class label in the range [0, K-1], where K was the number of classes to predict. If it was labeled as background, it was assigned the label K. If the anchor was to be discarded, it was given the label -1. If an anchor was labeled as foreground, the offsets between the four edges of the anchor and the ground truth box were used as the ground truth for the regression sub-network.

3.2.5 Training with Predictions and Ground Truths
By matching the predictions from the sub-networks with the corresponding ground truth elements, the algorithm was trained.
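As an illustration of the matching rule in Section 3.2.4, the following is a minimal sketch of how anchors can be labelled from a matrix of IoU values; it is a simplified stand-in for Detectron2's matcher, with the thresholds 0.4 and 0.5 taken from the configuration used in this work.

```python
import torch

def label_anchors(iou_matrix, gt_classes, bg_thresh=0.4, fg_thresh=0.5, num_classes=1):
    """Assign a label to every anchor given an IoU matrix of shape (num_gt, num_anchors).

    Returns labels in [0, num_classes-1] for foreground anchors, num_classes for
    background anchors and -1 for anchors that are ignored during training.
    """
    max_iou, matched_gt = iou_matrix.max(dim=0)   # best ground truth for each anchor
    labels = gt_classes[matched_gt].clone()       # tentatively label every anchor as foreground
    labels[max_iou < fg_thresh] = -1              # overlap in between the thresholds -> ignore
    labels[max_iou < bg_thresh] = num_classes     # low overlap -> background
    return labels

# Example: two ground-truth drones (class 0) and four anchors.
iou = torch.tensor([[0.70, 0.45, 0.10, 0.00],
                    [0.05, 0.20, 0.55, 0.30]])
print(label_anchors(iou, gt_classes=torch.zeros(2, dtype=torch.long)))
# tensor([ 0, -1,  0,  1])  -> foreground, ignored, foreground, background
```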
The predictions from the sub-networks were made on each feature level of the image separately, and thus needed to be interpolated and concatenated. Anchors with label -1 were discarded. The classification loss was calculated with the focal loss over the predictions for background and foreground. The focal loss parameter γ was set to 2.0 and α to 0.25 to focus learning on hard negative examples. These values were taken from the work by T. Lin et al., where the best combination was investigated [7]. The regression loss was calculated with the smooth L1 loss for anchors labeled as foreground. The β parameter in the smooth L1 loss was set to 0.1: when the regression error was within (-0.1, 0.1), the quadratic L2-like part of the loss was used, and outside this range the L1 part was used. Other values of the loss parameters were not investigated in this project.

3.2.6 Inference
During inference, the model utilized the predictions from the sub-networks and the pre-defined anchors. In Detectron2, inference was performed on each feature level independently, and the predictions from all feature levels were concatenated and then put through non-maximum suppression. A sigmoid function was used to obtain the probability that an anchor contained an object of interest. A filter was applied to the probabilities to keep only a certain number of the top-scoring predictions. Another filter was applied to the remaining predictions, discarding predictions with a confidence score below a certain threshold. The anchor boxes remaining at this stage were fine-tuned by their corresponding predictions from the regression sub-network. The predictions were concatenated and put through non-maximum suppression. The detections left after non-maximum suppression were the final output of the model, containing the bounding box coordinates of each prediction, the predicted class, and the confidence score of the bounding box.

3.3 Modification of RetinaNet
This section will cover the different modifications of the RetinaNet to fit the model to the task of small object detection. At the end of this section the final RetinaNet is introduced.

3.3.1 Defining a Baseline for the RetinaNet
The anchor sizes of the standard RetinaNet implemented by FAIR in Detectron2 needed to be reduced to enable detection of smaller objects, since the vast majority of the training data from the Drone vs Bird Detection Challenge was below 32 pixels. The original implementation by FAIR placed anchors of different sizes on different feature levels; in this work, however, the base RetinaNet has anchors of the same sizes placed on each feature map. The input images were resized from the native 1080 × 1920 resolution down to 800 × 1333 to enable faster training.

3.3.2 Defining a Baseline for Training
The training for the experiments was conducted with 30,000 iterations, due to limited GPU resources and based on the work by M. Nalamati et al. [2]. That team utilized the dataset from the Drone vs Bird Detection Challenge and the same ResNet-101 backbone as the RetinaNet in this thesis, and found that the optimal training range was between 30,000 and 70,000 iterations. A linear warm-up over the first 1000 iterations was utilized. By linearly increasing the learning rate, the influence of the early training images was reduced, thus reducing over-fitting in the early stages of training. However, different warm-up parameters were not tested and evaluated in this project.
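The sketch below illustrates how a linear warm-up of this kind scales the learning rate over the first iterations; the base learning rate of 0.01 is an arbitrary example value and not a setting reported in this work, while the warm-up length and factor follow the values used here.

```python
def warmup_lr(iteration, base_lr, warmup_iters=1000, warmup_factor=0.001):
    """Linear warm-up: scale the learning rate from base_lr * warmup_factor up to
    base_lr over the first warmup_iters iterations, then keep it constant
    (any later decay steps are omitted in this sketch)."""
    if iteration >= warmup_iters:
        return base_lr
    alpha = iteration / warmup_iters
    return base_lr * (warmup_factor * (1 - alpha) + alpha)

# Example: with base_lr = 0.01 the rate starts at 1e-5 and reaches 0.01 at iteration 1000.
for it in (0, 500, 1000):
    print(it, warmup_lr(it, base_lr=0.01))
```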
The warm-up factor was set to 0.001. A random horizontal flip of the training images was utilized as data augmentation; the probability of this flip occurring was 0.5.

Due to the heavy class imbalance, a prior probability of 0.01 was used for the foreground class, i.e. the drone class, in the classifier. The reason for this is that the background class otherwise tends to dominate the loss function and cause unstable training early on for datasets with heavy class imbalance [7]. Only one GPU was available, thus the batch size was set to 2.

3.3.3 Utilizing Transfer Learning
Due to the small amount of available data from the Drone vs Bird Detection Challenge, transfer learning was utilized. A study investigating the impact of freezing the backbone at different layers was performed. Four versions of the baseline RetinaNet were trained; for these networks, the backbone was frozen until Res1, Res2, Res3, and Res4 respectively.

3.3.4 Inclusion of P2
An additional feature level, p2, was added to the network, enabling additional predictions on a feature map with higher resolution. Higher resolution feature maps are more suitable for detecting smaller objects; an experiment was therefore conducted to investigate whether the inclusion of an additional high resolution feature level would improve the accuracy.

3.3.5 Defining a New Baseline for the RetinaNet
A new baseline was defined where the input images were not resized. Furthermore, the standard aspect ratios were changed to better fit the available data.

3.3.6 Anchor Placements
Anchors of different sizes are often placed on different feature levels for feature pyramids such as the FPN. However, this strategy calls for a thorough study of the receptive field of each level. Due to time constraints, this study was omitted; instead, experiments with anchor placements were performed, investigating different sizes, numbers, and placements of anchors.

3.3.7 Pruning of P6 and P7
In the feature pyramid architecture, predictions are made on each feature level individually. The top levels p6 and p7 are semantically strong, however, the resolution of these feature maps is low, and lower resolution feature maps tend to struggle with accurate detection of smaller objects [5]. An experiment was conducted to investigate whether these levels could be removed to reduce the computational expense without reducing accuracy.

3.3.8 Relaxation of IoU Thresholds
In the standard RetinaNet implemented by FAIR, an anchor box is considered a match with an annotation if the IoU is 0.5 or higher. For small objects, this threshold may be too strict, which would result in fewer or no anchors matching the ground truth and thus further increase the class imbalance. An experiment was conducted to investigate how a more relaxed threshold would affect performance; the IoU threshold was reduced from 0.5 to 0.4.

3.3.9 Tuning of Inference Parameters
During inference, three parameters can be tuned to increase the performance of the algorithm: TopK, the non-maximum suppression threshold, and a score threshold. The TopK parameter specifies the maximum number of predictions allowed by the network; three TopK values were evaluated. By default, the network keeps only the top 1000 scoring results. Non-maximum suppression was used to filter out multiple predictions of the same object, and four different values of its IoU threshold were evaluated.
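As an illustration of how these inference-time parameters are exposed, the sketch below adjusts them through the Detectron2 configuration system; the configuration keys reflect the Detectron2 releases available at the time of writing and may differ in other versions, and the values shown are the defaults mentioned in the text rather than the tuned ones.

```python
from detectron2.config import get_cfg
from detectron2 import model_zoo

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_101_FPN_3x.yaml"))

# The three inference parameters discussed in Section 3.3.9 (default values shown).
cfg.MODEL.RETINANET.TOPK_CANDIDATES_TEST = 1000  # maximum number of top-scoring candidates kept before NMS
cfg.MODEL.RETINANET.NMS_THRESH_TEST = 0.5        # IoU threshold used by non-maximum suppression
cfg.MODEL.RETINANET.SCORE_THRESH_TEST = 0.05     # discard predictions below this confidence score
```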
By default, the network has an NMS IoU threshold of 0.5. The score threshold removes predictions with a confidence score smaller than some threshold; four different values of this threshold were investigated. By default, the lowest allowed confidence score is 0.05. Due to time constraints, these parameters were only experimented with for a subset of the trained networks: the new baseline, the network with reduced IoU threshold, and the network with p6 and p7 removed. The reason for experimenting with the network with reduced IoU threshold is that more anchors will be matched with the ground truth. Since anchors of the same sizes are placed on each feature level, this might cause an increase in false positives, since the same object may be matched across different levels. Should this be the case, the NMS threshold during inference might be able to filter out these double matches. It is also interesting to investigate the network with p6 and p7 removed. When removing p6 and p7, the performance is expected to decrease somewhat; however, this makes the network lighter and faster, and by experimenting with the inference parameters it might be possible to reduce this decrease in performance.

3.3.10 Final RetinaNet
Once the optimal parameters for anchor placements, pruning of p6 and p7, relaxation of IoU thresholds, and tuning of inference parameters were decided, they were implemented in the final RetinaNet. The final RetinaNet was used for the implementation of temporal information and the semi-supervised learning framework.

3.4 Temporal Information
Once the final RetinaNet was chosen, methods for integrating temporal information were investigated. Due to the limited available data from the Drone vs Bird Detection Challenge, the suggested methods were required to be compatible with the pre-trained backbone. The parameters of the pre-trained backbone have been trained as feature extractors of single images. Therefore, to benefit from the pre-training, the methods needed to be integrated such that the pre-trained parameters could be utilized as feature extractors in roughly the same way as they were trained. Three methods of integrating temporal information were implemented and evaluated. Each method was based on integrating information from one frame at time t with the information from a previous time step. Two time steps were investigated, t − 1 and t − 4.

3.4.1 Concatenation of Feature Maps
This method utilized two identical copies of the FPN backbone. One backbone was given an input image of the frame at time t, and the other one was given an input image of a previous frame. The backbones computed the feature maps for the two images, and the feature maps from time t were concatenated with the difference between the feature maps at time t and at the previous time step. The concatenation was performed along the channels of the feature maps. The approach was inspired by the work of C. Craye et al. [4], who concatenated grayscale images of different time steps at the input to the network. Instead of performing the concatenation at the input, the concatenation here was performed after the backbone. The additional channels contributed additional information to the classification and regression heads regarding the change of the feature map between frames. Let P denote the set of feature maps from the backbone.
Here P can be described by:

P = {p3, p4, p5, p6, p7} (3.1)

The concatenation was done for two sets of feature maps: the set from time step t, and the set of differences between the feature maps at time step t and at time step t − i, where i ∈ {1, 4}. The set of differences between feature maps is denoted by:

P_sub = P_t − P_(t−i) (3.2)

The concatenation was performed on the feature maps in P and P_sub and can be written as:

P ‖ P_sub = {p3 ‖ p3_sub, p4 ‖ p4_sub, ..., p7 ‖ p7_sub} (3.3)

3.4.2 Siamese Networks with Addition Merge
An implementation of a Siamese network was made. The goal was to utilize the frozen layers as shared parameters between two consecutive frames, which were merged before the trainable, gradient-enabled layers. This method aimed to be a more lightweight approach to integrating temporal information compared to the method of concatenating the feature maps. With this method, only one set of model parameters needed to be stored in memory. Furthermore, instead of having to save two full sets of feature maps, only one full set and one additional Res3 feature map were required, and backpropagation only needed to be performed on one set of trainable layers. Let R represent the set of feature maps from the backbone. This set can be divided into two subsets, one representing the feature maps from the frozen layers and the other representing the feature maps from the trainable layers. These sets, denoted R_Frozen and R_Trainable respectively, are described as:

R_Frozen = {Res1, Res2, Res3} (3.4)
R_Trainable = {Res4, Res5} (3.5)

The proposed methodology was to feed two consecutive frames through the layers in R_Frozen and then merge the feature maps before they were fed into the layers in R_Trainable. The merge was performed by adding the feature maps and taking the average, i.e. dividing by two. The succeeding layers were pre-trained and expected an input in the form of a feature map from a single image; by receiving the average of the two feature maps, the input to the pre-trained parameters was similar to the expected input.

3.4.3 Siamese Networks with Concatenation Merge
The final proposed implementation of temporal information was a combination of the previous methods. In the same way as in the previous Siamese network, the set of feature maps R was divided into R_Frozen and R_Trainable, and two consecutive frames were fed through the frozen layers. The merge was done by concatenating the feature map at time t with the difference between the feature map at time t and the feature map at time t − i. The choice of merge was based on the common practice of merging through concatenation rather than addition; the difference between these two types of merge was thus investigated. The concatenation merge contributed additional information regarding the change of the feature map between frames. An auxiliary convolutional layer with kernel size 1 × 1 was utilized to reduce the number of channels to the number expected by the following layer. For this method, in addition to merging at Res3, an experiment with merging at Res4 was performed, in order to investigate how the location of the merge in the network affected performance.

3.5 Implementation of Semi-supervised Learning Framework
The goal of the semi-supervised learning framework suggested in this thesis was to enable the use of unannotated data in a modern object detection algorithm. Since gathering data is considered rather easy at Saab, the difficult part would be to
manually annotate a vast amount of data. Therefore, it was investigated whether the algorithm could reliably be utilized as an annotation tool for unlabeled data. The data collected for this thesis contains scenes that differ largely from the data found in the Drone vs Bird Detection Challenge datasets. This challenges the generality of the algorithm, since the collected data can be considered as belonging to another domain. Furthermore, the data collected and annotated by the framework was incorporated into the Drone vs Bird Detection Challenge datasets, to investigate whether data from a domain other than the Drone vs Bird Detection Challenge could be utilized to further improve the performance of the algorithm.

3.5.1 Semi-supervised Learning Framework
The semi-supervised learning framework draws inspiration from the work of E. Sangineto et al. [11]. That team utilized a Fast R-CNN to generate bounding box annotations on images, using the region proposal network in the first iteration and later utilizing the predicted bounding boxes as annotations. The annotations were generated for easily classified samples in the first iterations, and more difficult examples were successively added between iterations. By utilizing easily classified samples in the first iterations, the training was kept from diverging in the early stages due to noisy annotations. The framework in this thesis utilized easily classified samples of drones flying in an area that differs from the provided training data from the Drone vs Bird Detection Challenge. Predictions were made on the easy samples using the algorithm, and if these samples proved to be comparable to hand-annotated data, they were utilized as training data for the next iteration.

The collected data consisted of three parts: background, training, and test data. The background dataset consisted of a scene where birds may be present but no drones were included; the scene was therefore annotated as background. The training data consisted of a video with a rather easy case, where the drone was flying towards an area occupied by traditional Swedish houses as well as pine trees. The test dataset was similar to the training dataset, however, with more emphasis on difficult cases, such as the drone flying around treetops and multiple cases of the drone flying in front of buildings.

3.5.2 Evaluation of Generated Annotations
The evaluation of the network trained with the generated annotations was made through visual inspection, investigation of confidence scores and misclassifications, and evaluation against the Drone vs Bird Detection Challenge test data. The visual inspection was performed by sampling a subset of the proposed annotations and investigating whether the annotations were competitive with work done by human annotators. The quality of the annotations was investigated between iterations of the framework, by investigating the confidence scores of the predictions made in each iteration and how accurately the annotations were placed on the drone. Misclassified objects were also examined between iterations, to investigate whether these misclassifications were reduced from one iteration to the next. Furthermore, the number of frames that were deemed usable between training iterations was a factor when evaluating the performance of the framework.
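As a schematic summary of the procedure described in Section 3.5.1, the sketch below outlines one possible pseudo-labelling loop; the training and prediction callables are placeholders to be supplied by the caller (for example thin wrappers around Detectron2 training and inference), and the confidence threshold of 0.7 is illustrative rather than a value used in this work.

```python
def pseudo_label_loop(annotated, unannotated_frames, train_fn, predict_fn,
                      iterations=3, conf_thresh=0.7):
    """Iterative pseudo-labelling over a pool of unannotated frames.

    train_fn(dataset) -> model and predict_fn(model, frame) -> list of
    {"box": ..., "score": ...} dictionaries are supplied by the caller.
    """
    dataset = list(annotated)
    generated = []
    for _ in range(iterations):
        model = train_fn(dataset)                         # supervised training on the current annotations
        generated = []
        for frame in unannotated_frames:
            confident = [d for d in predict_fn(model, frame) if d["score"] >= conf_thresh]
            if confident:                                 # keep only frames with confident detections
                generated.append((frame, confident))
        dataset = list(annotated) + generated             # expand the training set for the next iteration
    return model, generated
```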
Since only a sample of the generated annotations was evaluated through visual inspection, further evaluation of the quality of the annotations was needed. The performance of the network trained with the generated annotated dataset was therefore evaluated on the Drone vs Bird Detection Challenge test data, to investigate whether the generated annotations could contribute to an increase in performance on any of the test videos from the Drone vs Bird Detection Challenge. The generated annotations might contain noisy labels that were not discovered in the visual inspection, and this evaluation was therefore used to determine whether the benefits of the generated annotations outweigh the shortcomings. The F1-score was used as the evaluation metric.

4 Results
This chapter will give an in-depth explanation of the experiments conducted in the methodology and present the results of these experiments. Section 4.1 will present the evaluation metrics. In Section 4.2, the results regarding the modifications of the RetinaNet will be presented. Section 4.3 will present the results of the implementation of temporal information. In Section 4.4, the results from the semi-supervised learning framework will be covered. In this thesis, five different datasets have been used. Sections 4.2 and 4.3 only utilize the training and test data from the Drone vs Bird Detection Challenge. In Section 4.4, three new datasets will be introduced. These datasets will be utilized by the semi-supervised learning framework in conjunction with the training and test datasets from the Drone vs Bird Detection Challenge.

4.1 Evaluation Metrics
Two main evaluation metrics have been utilized to evaluate the performance: the COCO AP metric and the F1-score. The results regarding the tuning of the RetinaNet and the temporal information were evaluated with the implemented COCO AP function. The AP-score calculated for each evaluation was based on an average over multiple IoU thresholds. These thresholds ranged from 0.50 to 0.95 with an increment of 0.05 between each threshold. AP-scores for small, medium, and large objects were also obtained for each evaluation. An object was considered small if the area of its ground truth bounding box was smaller than 32² pixels, medium if it was between 32² and 96² pixels, and large if it was larger than 96² pixels [48].

4.1.1 Evaluation Drone vs Bird Detection Challenge
In the Drone vs Bird Detection Challenge, another metric was utilized to evaluate the final algorithms: the F1-score. To compare the final algorithm with state-of-the-art algorithms, the F1-score of the final RetinaNet was evaluated using this metric.

4.1.2 Evaluation of Semi-Supervised Learning Framework
To evaluate the semi-supervised learning framework, visual inspection was utilized. Furthermore, the semi-supervised learning framework was evaluated with the F1-score, to compare how the use of this framework improved the results in the Drone vs Bird Detection Challenge.

4.2 Results RetinaNet
This section presents the results of the originally implemented baseline, the utilization of transfer learning, and the inclusion of p2. Thereafter a new baseline is introduced, followed by the results of the anchor placement experiments and the pruning of p6 and p7. Subsequently, the results of the relaxation of the IoU thresholds and the tuning of the inference parameters will be presented. Finally, the results on the Drone vs Bird Detection Challenge will be presented.
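Before these results are presented, the following small sketch recalls how the COCO AP score described in Section 4.1 is formed by averaging AP values computed at each IoU threshold; the per-threshold values in the example are placeholders and not results from this work.

```python
import numpy as np

# IoU thresholds used by the COCO AP metric: 0.50, 0.55, ..., 0.95.
iou_thresholds = np.arange(0.50, 1.00, 0.05)

def coco_ap(ap_per_threshold):
    """Average the AP values computed at each IoU threshold into a single score."""
    return float(np.mean(ap_per_threshold))

# Illustrative placeholder values only: AP typically drops as the IoU threshold tightens.
example_ap = np.linspace(0.45, 0.05, num=len(iou_thresholds))
print(len(iou_thresholds), coco_ap(example_ap))  # 10 thresholds, mean AP of about 0.25
```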
The datasets used in Section 4.2 are the training and test data from the Drone vs Bird Detection Challenge.

4.2.1 Evaluation of Baseline
The baseline parameters of the RetinaNet are the default settings from the RetinaNet implemented by FAIR, see Table 4.1. The anchor sizes were selected based on the sizes of the annotations in the Drone vs Bird Detection Challenge dataset. The input was initially down-sampled from the native resolution of 1080 × 1920 to the size 800 × 1333. The AP-scores for this baseline were calculated, see Table 4.2.

Table 4.1: The standard RetinaNet parameters used in this work.
  Frozen until:   Res4
  IoU:            [0.4, 0.5]
  Feature levels: p3-p7
  Aspect ratios:  [0.5, 1.0, 2.0]
  Anchors:        [4, 8, 16, 32, 64, 128, 256]
  Input size:     800 x 1333

Table 4.2: Performance of the baseline.
  AP      AP(50)  AP-s   AP-m    AP-l
  11.659  30.705  6.954  38.517  36.583

4.2.2 Utilizing Transfer Learning
In these experiments, all parameters were kept constant according to the baseline, and only the layer until which the backbone was frozen was changed, see Table 4.3. These experiments indicated that to achieve the highest overall AP-score, the backbone should be frozen until Res3, see Table 4.4.

Table 4.3: Parameters for experimenting with which layer to freeze until. All experiments share the baseline settings IoU [0.4, 0.5], feature levels p3-p7, aspect ratios [0.5, 1.0, 2.0], anchors [4, 8, 16, 32, 64, 128, 256], and input size 800 x 1333.
  Experiment 1: frozen until Res1
  Experiment 2: frozen until Res2
  Experiment 3: frozen until Res3
  Experiment 4: frozen until Res4

Table 4.4: Results of freezing the baseline at different levels.
                AP      AP(50)  AP-s    AP-m    AP-l
  Experiment 1  13.678  38.248  10.566  32.029  30.898
  Experiment 2  13.221  31.392  6.231   43.556  38.677
  Experiment 3  14.583  38.792  9.575   37.673  43.811
  Experiment 4  11.659  30.705  6.954   38.517  36.583

4.2.3 Inclusion of P2
When including p2 it became apparent that this feature level was computationally expensive. To reduce the training time, the top feature levels p6 and p7 were re-m