Radar Based Classification of Vulnerable Road Users
A comparison between two networks based on the ResNet and PointNet architectures and an evaluation of using time aggregated radar data for learned classifiers of vulnerable road users
Master's thesis in Systems, Control and Mechatronics
CHRISTIAN GARCIA
MÅNS LERJEFORS
Department of Electrical Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2019

Master's thesis 2019
Radar Based Classification of Vulnerable Road Users
A comparison between two networks based on the ResNet and PointNet architectures and an evaluation of using time aggregated radar data for learned classifiers of vulnerable road users
CHRISTIAN GARCIA
MÅNS LERJEFORS
Department of Electrical Engineering
Division of Signal processing and Biomedical engineering
Chalmers University of Technology
Gothenburg, Sweden 2019

Radar Based Classification of Vulnerable Road Users
A comparison between two networks based on the ResNet and PointNet architectures and an evaluation of using time aggregated radar data for learned classifiers of vulnerable road users
CHRISTIAN GARCIA, MÅNS LERJEFORS
© CHRISTIAN GARCIA, MÅNS LERJEFORS, 2019.
Supervisors: Christopher Zach, Chalmers University of Technology; Jianan Liu, Jeanette Warnborg, Alexander Lyckell, Aptiv Contract Services AB
Examiner: Christopher Zach, Electrical Engineering
Master's Thesis 2019
Department of Electrical Engineering
Division of Signal processing and Biomedical engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Cover: Visualisation of how the findings of this thesis are intended to be used in traffic.
Typeset in LaTeX
Gothenburg, Sweden 2019
iv

Radar Based Classification of Vulnerable Road Users
A comparison between two networks based on the ResNet and PointNet architectures and an evaluation of using time aggregated radar data for learned classifiers of vulnerable road users
CHRISTIAN GARCIA, MÅNS LERJEFORS
Department of Electrical Engineering
Chalmers University of Technology

Abstract
As an increasing number of automated features are integrated into vehicles today, there is a demand for a reliable system for detecting vulnerable road users. This thesis investigates the possibilities of classifying vulnerable road users based solely on radar data. It also explores the effect of using time aggregated data for different time spans. The investigation is done by comparing the performance of two different network architectures. One of the networks is inspired by the convolutional neural network ResNet and the other one by a neural network called PointNet, whose main application is to classify spatial point clouds. Range-Doppler images and radar point clouds are used as input. The best performance is achieved by the ResNet-inspired architecture with a time span ranging over three discrete data points, which achieves an accuracy of 92.59%. Time aggregation of the data is shown to have little to no effect on the performance of either of the networks.
Keywords: deep neural networks, machine learning, radar, vulnerable road user classification, active safety.
v

Acknowledgements
We would like to thank the people that helped us during this thesis and made it possible. Thank you Mats Björnerbäck, first and foremost for giving us the opportunity to do this thesis at Aptiv and also for taking the time to discuss what type of thesis would be of use for Aptiv.
Thank you Jonathan Jansson, Erik Larsson, Henric Eriksson and Jonas Lundberg for exchanging ideas and giving us feedback on our work. Thank you Jianan Liu for giving us an extensive introduction to the findings in machine learning that you found the most important for this thesis and for giving us a fundamental understanding of the work previously done at Aptiv. Thank you Jeanette for giving us valuable ideas when questions arose, for giving us feedback on our work and for helping us with technical issues. Thank you Alexander Lyckell, for providing the data and for explaining how Aptiv’s radars work. And lastly, thank you Christopher Zach for taking the time and responsibility to be the examiner of this thesis and for providing us with ideas and feedback on how the work could be executed. CHRISTIAN GARCIA, MÅNS LERJEFORS, Gothenburg, June 2019 vii Contents List of Figures xi List of Tables xvii List of Abbreviations and Nomenclature xix Nomenclature xix 1 Introduction 1 1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4 Scientific contribution . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Outline of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Background 5 2.1 Radar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Frequency modulation . . . . . . . . . . . . . . . . . . . . . . 6 2.1.2 Azimuth angle . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.3 Constant false alarm rate . . . . . . . . . . . . . . . . . . . . . 8 2.1.4 Micro-Doppler . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1.5 Time integrated range-Doppler . . . . . . . . . . . . . . . . . 9 2.2 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.1 Activation function . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.2 Learning, backpropagation and loss . . . . . . . . . . . . . . . 11 2.2.3 Optimisation algorithm . . . . . . . . . . . . . . . . . . . . . . 13 2.2.4 Convolutional layer . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2.5 Overfitting and dropout . . . . . . . . . . . . . . . . . . . . . 15 2.2.6 Batch normalisation . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.7 Resblock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.8 PointNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3 Classification problem and evaluation metrics . . . . . . . . . . . . . 18 2.3.1 Binary relevance problem . . . . . . . . . . . . . . . . . . . . 18 2.3.2 Metrics of networks . . . . . . . . . . . . . . . . . . . . . . . . 18 2.3.3 k-fold cross-validation . . . . . . . . . . . . . . . . . . . . . . 20 2.4 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 ix Contents 3 Method 23 3.1 Datasets and their characteristics . . . . . . . . . . . . . . . . . . . . 23 3.2 Retrieval and preparation of data . . . . . . . . . . . . . . . . . . . . 25 3.3 Preprocessing of data . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.4 Reshuffling the data . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.5 ResNet mini . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.6 PointNet mini . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.7 Training and performance evaluation . . . . . . . . . . . . . . . . . . 
35 3.8 Computer hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4 Results 37 4.1 Five-fold cross-validation of T1,s . . . . . . . . . . . . . . . . . . . . . 37 4.1.1 Precision-recall curves and AUC-scores . . . . . . . . . . . . . 41 4.1.2 Training convergence rate of T1,s . . . . . . . . . . . . . . . . . 42 4.1.3 PointNet sample size effect . . . . . . . . . . . . . . . . . . . . 43 4.2 T1,s as train set and T2,s as validation set . . . . . . . . . . . . . . . . 44 4.3 Five fold cross-validation of T2,s . . . . . . . . . . . . . . . . . . . . . 46 5 Discussion 49 5.1 Network comparison and performance . . . . . . . . . . . . . . . . . . 49 5.2 Effect of time aggregation . . . . . . . . . . . . . . . . . . . . . . . . 50 5.3 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.4 Filtering the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.5 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 6 Conclusion 55 Bibliography 57 A Appendix 1 I A.1 Dataset cardinality and densitiy . . . . . . . . . . . . . . . . . . . . . I A.2 Dataset T1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II A.3 Dataset T2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VII x List of Figures 2.1 Illustration of a host vehicle with a radar mounted in the front. The radar yields three detections, where two detections belongs to a tar- get, which in this case is a pedestrian. The detections of interest are orange. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Illustration of the linear frequency modulation continuous wave tech- nique with three chirps. . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Micro-Doppler map. The measurement comes from a car driving in a circle in front of a radar for approximately 30 seconds. . . . . . . . 9 2.4 Integrated range-Doppler map. The measurement comes from a man riding a bicycle in a circle in front of a radar approximately 30 seconds. 9 2.5 A conventional fully connected neural network with three layers, three inputs, four neurons per layer and one output. . . . . . . . . . . . . 10 2.6 The figure illustrates the sigmoid function and the ReLu function explained in equations (2.4a) and (2.4b) respectively. . . . . . . . . . 11 2.7 The computational flow of a neuron, with three inputs and a bias term. 12 2.8 A convolutional filter acting on an input image. In the figure the convolutional filter acts on three image patches per row and three image patches per column. Thereof the output is a three-by-three matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.9 Illustration of four 3×3×1 activation maps yielded by four 2×2×1 filters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.10 Visualisation of a resblock. . . . . . . . . . . . . . . . . . . . . . . . . 16 2.11 Architecture of the point order invariance module with n number of points.The multiple rows of MLPs to the left illustrates the MLP is shared, i.e. it is the same MLP used for all points. . . . . . . . . . . 17 2.12 Architecture of the T-net module. The multiple rows of MLPs to the left illustrates that the it is the same MLP used for all points. . . . . 17 2.13 An illustration of how the test and training data is chosen between k iterations in k-fold cross-validation. . . . . . . . . . . . . . . . . . . . 21 3.1 An illustration over time aggregation of data points before being fed to a network. 
In this figure the the segment length s = 3 is used as an example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 Two driving scenarios. Figure 3.2a illustrates a driving scenario from T1,s and Figure 3.2b illustrates a driving scenario from T2,s. . . . . . . 25 xi List of Figures 3.3 An illustration of the radar set up. The dotted lines illustrates the field of views of the radars. Each radar is represented by a specific colour. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.4 Two time integrated range-Doppler maps both with segment length s = 1. The image to the left is created with filtered data. The filter used was a CFAR filter as explained in section 2.1.3. The image to the right is created without filtering the data. . . . . . . . . . . . . . 28 3.5 Two time integrated range-Doppler maps both with segment length s = 5. The image to the left is created with filtered data. The filter used was a CFAR filter as explained in section 2.1.3. The image to the right is created without filtering the data. . . . . . . . . . . . . . 29 3.6 Two time integrated range-Doppler maps both with segment length s = 10. The image to the left is created with filtered data. The filter used was a CFAR filter as explained in section 2.1.3. The image to the right is created without filtering the data. . . . . . . . . . . . . . 29 3.7 A visualisation of two point clouds in 3D space where the x- and y-axis are spacial coordinates in meter and the z-axis is the Doppler shift in radar bins. The figure to the right depicts a point cloud obtained where no filtering is applied and the figure the left depicts the same point cloud but where CFAR-filtering is conducted. The point clouds are obtained with segment length s = 1. . . . . . . . . . . . . . . . . 30 3.8 A visualisation of two point clouds in 3D space where the x- and y-axis are spacial coordinates in meter and the z-axis is the Doppler shift in radar bins. The figure to the right depicts a point cloud obtained where no filtering is applied and the figure the left depicts the same point cloud but where CFAR-filtering is conducted. The point clouds are obtained with segment length s = 5. . . . . . . . . . . . . . . . . 31 3.9 A visualisation of two point clouds in 3D space where the x- and y-axis are spacial coordinates in meter and the z-axis is the Doppler shift in radar bins. The figure to the right depicts a point cloud obtained where no filtering is applied and the figure the left depicts the same point cloud but where CFAR-filtering is conducted. The point clouds are obtained with segment length s = 10. . . . . . . . . . . . . . . . . 31 3.10 The ResNet mini architecture. The numbers denotes the size of the filters used in the layer followed by the number of filters used. In the cases where another stride than 1 is implemented it is stated at the end. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.11 The PointNet mini architecture. The multiple stacked MLPs after each T-net module illustrates that the same MLP is used for all points. The numbers to the right represent the size of the layers in each MLP. 34 4.1 The figures illustrates the change in accuracy A, as defined in sec- tion 2.3, over the aggregation time of the data points. The value is the average achieved value from the five-fold cross-validation. The aggregated points corresponds to a ∆t of 0.05, 0.15, 0.25, 0.35, and 0.50 seconds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
38 xii List of Figures 4.2 Graphs of the change in exact match ratio MR, as defined in sec- tion 2.3, over the aggregation time of the data points. The value is the average achieved value from the five-fold cross-validation. The aggregated points corresponds to a ∆t of 0.05, 0.15, 0.25, 0.35, and 0.50 seconds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.3 The change in F1,µ-score, as defined in section 2.3, over the aggrega- tion time of the data points. The value is the average achieved value from the five-fold cross-validation. The aggregated points corresponds to a ∆t of 0.05, 0.15, 0.25, 0.35, and 0.50 seconds. . . . . . . . . . . 39 4.4 All accuracies obtained when doing a five-fold cross-validation on the ResNet mini for segment lengths 1, 3, 5, 7, and 10. The image to the left shows the accuracies obtained when feeding the network CFAR- filtered data and the image to right when feeding the network unfil- tered data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.5 All accuracies obtained when doing a five-fold cross-validation on the PointNet mini for segment lengths 1, 3, 5, 7, and 10. The image to the left shows the accuracies obtained when feeding the network CFAR- filtered data and the image to right when feeding the network unfil- tered data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.6 Precision-recall curves for the two evaluated networks, with a dataset using segment length s = 3 for ResNet mini and segment length s = 10 for PointNet mini, for the two classes of VRUs, pedestrian and bicyclist. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.7 The figures illustrates the training convergence for ResNet mini on the CFAR-filtered and unfiltered datasets. The change in accuracy, A, is shown over trained epochs for the datasets consisting of 1, 3, 5, 7 and 10 aggregated data points. . . . . . . . . . . . . . . . . . . . . 42 4.8 The figures illustrates the training convergence for PointNet mini on the CFAR-filtered and unfiltered datasets. The change in accuracy, A, is shown over trained epochs for the datasets consisting 1, 3, 5, 7 and 10 aggregated data points. . . . . . . . . . . . . . . . . . . . . . 43 4.9 The figures illustrates the impact that sample size have on the per- formance of PointNet mini. The change in accuracy A, exact match ratio MR and F1,µ is shown over number of sampled points. . . . . . 44 4.10 Confusion matrices for the classes bicyclist and pedestrian. The two matrices on the left are the results of ResNet mini and the two ma- trices on the right are the results of PointNet mini. The values in the confusion matrices correspond to the fraction of the total number of executed classifications. . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.11 Precision-recall curves for the two classes of VRUs, pedestrian and bicyclist. The precision-recall curve is done for both evaluated net- works, with a dataset using a segment length s = 3 for ResNet mini and a segment length s = 10 for PointNet mini. . . . . . . . . . . . . 45 xiii List of Figures 4.12 Confusion matrices for the classes bicylist and pedestrian. The two matrices on the left are the results of ResNet mini and the two ma- trices on the right are the result of PointNet mini. The values in the confusion matrices correspond to the fraction of the total number of executed classifications during the five fold cross-validation. . . . . . 
46 4.13 Precision-recall curves for the two classes of VRUs, pedestrian and bicyclist. The precision-recall curve is done for both evaluated net- works, with a dataset using a segment length s = 3 for ResNet mini and a segment length s = 10 for PointNet mini. . . . . . . . . . . . . 47 A.6 The target object accelerates to required speed while the host ve- hicle remains stationary. Driving scenario 10 is divided into two sub-scenarios for each target, as is illustrated. The car accelerates to 40 kph, the bicycle to 30kph and the pedestrian to 5kph. In the sub-scenarios A.6a,A.6b and A.6c the target keeps a distance of 5m throughout the logging. These sub-scenarios are done both clock-wise and counter clock-wise. The sub-scenarios in A.6d,A.6e and A.6f are done both from right to left and vice versa. . . . . . . . . . . . . . . . V A.7 The target object accelerates to 5kph while the host vehicle remains stationary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V A.8 The target object and host vehicle accelerates to required speed. Host vehicle drives in reverse gear in 10kph. The speed of the car is 50kph, the speed of the bicyclist 30kph and the speed of the pedestrian 5kph. VI A.9 The host vehicle accelerates to 40kph while the target is stationary in front of the host. When the host has driven past the target, the target accelerates to required speed. The speed of the car is 30kph, the speed of the bicyclist 30kph and the speed of the pedestrian 5kph. VI A.10 Driving scenario 10. Both host vehicle and bicycle start by standing still next to each other. Both host and bicyclist then accelerates to 20kph. Scenario executed on both sides and in both directions. . . . VII A.11 The target object and host vehicle accelerates to required speed. The target then turns in front of the target as illustrated. The speed of the host varies between 10kph, 15kph, and 20kph. The bicyclist speed and the pedestrians spe is 50kph, the speed of the bicyclist is 10kph and the pedestrians speed is 5kph. The scenario is executed with turns in both directions. . . . . . . . . . . . . . . . . . . . . . . . . . VII A.12 Host vehicle accelerates to 10kph and target to 20 kph (bicyclist) or 5 kph (pedestrian). Host vehicle then turn into pedestrian crossing or bicycle lane as is illustrated A.12a and A.12b. The driving scenario is executed with host driving in both directions. . . . . . . . . . . . . VIII A.13 Host vehicle accelerates to 30 kph and target to 20 kph. The driving scenario covers both when the target bicyclist is travelling in the adjacent lane to the host vehicle and when having a lane between the host vehicle and the bicyclist, as is shown in A.13a and A.13b. . . . . VIII xiv List of Figures A.15 Driving scenario 14. Host vehicle accelerates to 5kph and the target bicyclist to 20kph. In order to make the illustrated left turn the host vehicle makes a slight turn into the adjacent bicycle lane. . . . . . . IX A.16 Driving scenario 15. The host vehicle accelerates to 5kph and makes a tight turn which leads to the trailer cutting the sidewalk. The target is standing still on the sidewalk. . . . . . . . . . . . . . . . . . . . . X xv List of Figures xvi List of Tables 3.1 The number of examples in each dataset T1,s and T2,s with each seg- ment length s. Dataset T2,s is only made in with two different segment lengths since it is only tested for with the best performing segment lengths on T1,s. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
24 3.2 The table contains the percentage of each class in the datasets T1,s and T2,s. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3 Sample size for different integration lengths . . . . . . . . . . . . . . 32 4.1 The table gives an indication of the three main results. It is viable to divide either datasets T1,s or T2,s and then train on one part of the divided dataset and test on the other. But the driving scenarios in the two datasets are too different to be able to train on T1,s and test on T2,s and get good performance. . . . . . . . . . . . . . . . . . . . . 37 4.2 The performance of the two networks with the datasets yielding the highest scores, T1,3 and T1,10 respectively. The results are obtained by doing a five-fold cross-validation. . . . . . . . . . . . . . . . . . . . 38 4.3 The maximum and minimum spreads of accuracies when doing a five- fold cross-validation. The spread is given in pp. . . . . . . . . . . . . 40 4.4 AUC-scores for the ResNet mini and PointNet mini for the classes pedestrian and bicyclist. . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.5 The performance of the two networks using the datasets yielding the highest scores, T1,3 and T1,10 to train on, and validating on the corre- sponding datasets T2,3 and T2,10. . . . . . . . . . . . . . . . . . . . . . 44 4.6 AUC-scores for ResNet mini and PointNet mini for the classes pedes- trian and bicyclist. The scores are achieved after the networks have been trained on T1,s and validated on T2,s. . . . . . . . . . . . . . . . 46 4.7 The performance of the two networks with the datasets yielding the highest scores, T2,3 and T2,10 respectively. The results are obtained by doing a five-fold cross-validation. . . . . . . . . . . . . . . . . . . . 46 4.8 AUC-scores for ResNet mini and PointNet mini for the classes pedes- trian and bicyclist. The scores are achieved after training and valida- tion on T2,s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 A.1 The cardinality C and density D of the two datasets T1,s and T2,s is presented. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I xvii List of Tables xviii List of Abbreviations and Nomenclature Adam Adaptive moment estimation CFAR Constant False Alarm Rate CNN Convolutional Neural Network Euro NCAP European New Car Assessment Programme GPU Graphics Processing Unit LFMCW Linear Frequency Modulated Continuous Wave MLP Multilayer Perceptron MSE Mean Squared Error pp percentage points ReLu Rectified Linear unit resblock Residual building block RMSprop Root Mean Square propagation SISO Single-Input Single-Output VRU Vulnerable Road User Nomenclature a Distance between two antennas A Accuracy AUC Area Under the Curve b Any integer B Bandwidth c Speed of light in air EPV Events Per Variable f Received radar frequency F Mapping of stacked nonlinear layers in a resblock f0 Emitted radar frequency F1,µ Micro average of F1-score FN False negative FP False positive FPR False positive rate f ∗(x) True model xix 0. 
List of Abbreviations h(x) Classifier H Underlying mapping of a resblock i Index for examples I Time integrated range-Doppler map j Index for classes J Cost function k Amount of iterations and splits in a k-fold cross-validation K All time integrated range-Doppler maps from one logging stacked intro a struct L Labelset m Amount of points in a point cloud M Logging with all range-Doppler maps m0 1st moment vector in the Adam optimisation algorithm MR Exact Match Ratio n Amount of data points in a dataset nB Batch size nD Doppler resolution of the radar nr Range resolution of the radar p Amount of dimensions of one detection in a point cloud P Precision pdropout Dropout parameter pnetwork Amount of parameters in a network q Amount of classes r Range R Recall RL(x) ReLu function s Segment length t Time T Dataset TN True negative TP True positive v0 2nd moment vector in the Adam optimisation algorithm vr Velocity of the receiver vt Velocity of the target w Learnable weight x Point cloud, time integrated range-Doppler map, or a function input X Batch x̄ Normalised input x x Coordinate in x-dimension of a time integrated range-Doppler y Function output Y Labels belonging to example x y Coordinate in y-dimension of a time integrated range-Doppler Z Output from classification function α Stepsize β Learnable bias term z Coordinate in z-dimension ∆f Doppler shift xx 0. List of Abbreviations ∆φ Difference in phase between antennas ∆t Integration time for a time aggregated data point Γ Learnable parameters γ1 Learnable scaling factor used in batch normalisation γ2 Learnable shift parameter used in batch normalisation λ Wavelength of emitted frequency µB Mean of a batch σ2 B Standard deviation of a batch σ(x) Sigmoid function θ Azimuth angle from the boresight of the host to the target ξ Activation function ζ Decay rate xxi 0. List of Abbreviations xxii 1 Introduction With an automotive industry that is moving towards autonomous driving, more and more automated features are integrated into cars. Adaptive cruise control, intelligent speed adaptation and emergency brake assist are examples of features of this kind that are already common in modern cars. The aim of these features is to enhance car safety and thereby decrease the amount of road related accidents, as well as to reduce energy consumption and increase comfort. Pedestrians and bicyclists, also known as Vulnerable Road Users, VRUs, are common elements in traffic, especially in the landscape of bigger cities. Car accidents with VRUs are one of the most common accidents happening due to driver distraction or misjudgement [1]. Hence, it is an area where automated safety features have a large impact. In Sweden for example, approximately 2000 pedestrians are injured in traffic related accidents every year [2]. One way to prevent these kind of accidents from happening would be for cars to have a reliant classification system for pedestrians implemented. Today VRU classification is done mainly with computer vision. The drawback of this approach is that it is sensitive to harsh weather conditions and disturbances such as dirt on the lens. Classification has been proven possible with radar, but not to the same extent as image classification. The radar approach suffers for example from its low resolution which complicates the classification process. Machine learning has been found to be applicable for a large set of problems with outstanding results in recent years. 
Fields such as image recognition, spam detection, medical diagnosis, financial analysis and predictive maintenance are just a few areas where machine learning has excelled.

1.1 Purpose
The purpose of this thesis is to investigate to what extent VRUs can be classified using radar data. A successful radar based classification system could be a good complement to the vision based systems that are mainly used today. With two types of sensors the system could be more robust against poor vision conditions.
In recent years automated safety features have been included as requirements when the vehicle safety organisation Euro NCAP rates the safety of a car. In 2020 an auto emergency braking system that specifically reacts to VRUs will be incorporated in the test procedure [1]. Reliable classification of VRUs is one way to make this feature possible.

1.2 Objective
The thesis investigates the radar based classification possibilities by comparing two network architectures. One of the networks is a convolutional neural network, CNN, based on the architecture of ResNet [3]. The other network is based on Multi Layer Perceptrons, MLPs, and is inspired by PointNet [4]. The two networks are developed to classify whether the radar data contains a car, a pedestrian, and/or a bicyclist. The thesis is done at the company Aptiv Contract Services AB, at their office in Gothenburg, Sweden.
Two datasets are created, where one of them contains driving scenarios inspired by driving scenarios defined by Euro NCAP. The driving scenarios used for the two datasets are defined in Appendices A.2 and A.3. Each network is trained and tested on both datasets separately. The networks' capability to generalise is studied by using one of the datasets as training set and the other dataset as testing set.
Due to the difference in architecture, the two networks require radar data preprocessed in different ways. The CNN-based architecture is fed radar data in the form of 2D images, or maps, which display the received radar signal amplitude over the dimensions range and Doppler shift. The MLP-based network, inspired by PointNet [4], is fed radar data in the form of a point cloud. The data is preprocessed and time aggregated in order to evaluate whether this can facilitate the classification done by the networks. Hence the evaluation is a comparison between the two network architectures and an investigation of the feasibility of the two radar preprocessing methods. The radar data is also filtered to evaluate to what extent this affects the networks' performance.

1.3 Scope
This thesis is limited to only performing classifications based on the total radar input. Neither of the two networks is able to give any information regarding where the detected object is located; they only determine whether the radar data contains a car, pedestrian, and/or bicyclist or not. This is to avoid the extensive amount of manual labelling it would otherwise require.
The evaluation is based on data provided by the company Aptiv and is taken from logging sessions done to test their products. The scenarios in these logging sessions are partly inspired by the Euro NCAP scenarios for VRU detection. The evaluation is hence a proof of concept and not an evaluation of the feasibility of a direct implementation of the results from this study in real world scenarios.

1.4 Scientific contribution
This thesis aims to make a contribution in the area of radar based classification using machine learning.
Its main contributions are:
• A comparison between the CNN-based architecture and the PointNet-inspired architecture and their respective radar data input.
• An evaluation of how time aggregating radar data affects the networks' performance.
• An investigation of how filtering the radar data affects the networks' performance.

1.5 Outline of thesis
Apart from the brief introduction to the problem given in Chapter 1, this thesis consists of five additional chapters. Chapter 2 serves as a background chapter providing the reader with the basic knowledge within the field. It partly consists of an explanatory section describing the key concepts of radar technology and an overview of how radar data is commonly illustrated. It also contains a section on the fundamentals behind neural networks, an introduction to the two network architectures used in this thesis and a description of commonly used evaluation metrics for this kind of problem. The chapter ends with an overview of related work in the research area. Chapter 3 covers the methods used in this thesis. It explains the datasets used to train and validate the networks and how these datasets are retrieved and preprocessed. It also thoroughly describes the networks used in this thesis. Chapter 4 consists of the results gathered by comparing the two networks on the two datasets. These results are then discussed in Chapter 5 and the final conclusion of the thesis can be read in Chapter 6.

2 Background
This chapter goes through the necessary theory in order to get an understanding of the key concepts covered in this thesis. It begins by explaining the basics of a radar and the specific algorithms and concepts behind the radar used in this thesis. It continues with an introduction to the theory behind neural networks and an explanation of the network architectures used. The chapter also brings up the metrics by which the networks are evaluated, and how the training is done to make sure that the performance is portrayed fairly. Lastly, related work in the scientific area is discussed.

2.1 Radar
The primary usage of a radar is to determine the characteristics of the surrounding environment based on how a transmitted electromagnetic wave is reflected back. A basic radar setup is composed of two components, a transmitter and a receiver. A signal that is reflected back to the receiver is denoted as a detection. One transmitted signal can cause many detections. These detections can be caused by both the ground and surrounding objects. In radar terminology an object of interest is often denoted as a target. A target can yield several radar detections, as is illustrated in Figure 2.1.
A radar can determine the distance to a detection, and hence also to a target, by using the arrival time of the transmitted signal. The range, r, to a detection can then be computed by the fairly simple equation

r = \frac{c t}{2},    (2.1)

where t corresponds to the time it takes for the signal to echo back to the radar and c corresponds to the velocity of waves in the medium, which in this case is the speed of light in air.

Figure 2.1: Illustration of a host vehicle with a radar mounted in the front. The radar yields three detections, where two detections belong to a target, which in this case is a pedestrian. The detections of interest are orange.

Besides range, a radar also has the possibility to measure velocity by making use of the Doppler effect.
This phenomenon is described by the equation for the Doppler shift, ∆f, which expresses the difference in frequency between the emitted and received signal as

\Delta f = \frac{\Delta v}{c} f_0 = \frac{\Delta v}{\lambda},    (2.2)

where ∆f = f − f_0, ∆v = v_r − v_t, v_r is the velocity of the receiver, v_t is the velocity of the target, f is the received frequency, f_0 is the emitted frequency, and λ is the wavelength of the emitted frequency [5]. Hence, the measured speed will be the relative radial velocity with respect to the radar. This means that an object travelling in a circle around a radar will have a measured relative radial velocity of zero [6].

2.1.1 Frequency modulation
In order to compute the velocity of a target, a radar transmitting a continuous wave with a fixed frequency could be used. As stated above, the target velocity can then be computed with equation (2.2). However, with this type of radar it is not possible to compute the range to the target [6]. Due to this, several frequency modulation techniques have been developed in order to gain information about the distance to a target. One of the most common techniques within the automotive industry is called linear frequency modulated continuous wave, LFMCW [6]. This is also the modulation technique used by the radars in this thesis. The principles of LFMCW are depicted in Figure 2.2.

Figure 2.2: Illustration of the linear frequency modulated continuous wave technique with three chirps.

Instead of using a fixed frequency, as would have been done in a simple continuous wave radar, an LFMCW radar lets the frequency vary from a minimum frequency f_0 to a frequency f_0 + B, where B corresponds to the bandwidth. A frequency sweep like this is referred to as a chirp. During a single measurement an LFMCW radar transmits multiple chirps. By measuring the difference in frequency, ∆f, between the transmitted frequency and the received frequency, the range can be computed. This is possible due to the range being proportional to the linear frequency change. If the target is moving, adjustments for the target-induced frequency shift also have to be made.
The radar used to gather data for this report acquires data at a rate of 20 Hz. In this thesis one single measurement instance will be referred to as a data point. One data point will hence be all detections gathered from the same measurement instance.

2.1.2 Azimuth angle
When a radar is equipped with multiple receiving antennas, the angle at which the object is located can be computed. This is done by measuring the difference in phase, ∆φ, between the receiving antennas. For automotive purposes the angle of interest is the angle in the horizontal plane. This angle is called the azimuth angle, θ, and can be computed by

\theta = \sin^{-1}\left( \frac{\lambda}{2\pi a}\left(\Delta\phi + 2\pi b\right) \right),    (2.3)

where λ is the wavelength, a is the distance between the two antennas used for the calculation, and b can be set to any integer to solve the equation since the sine function is periodic. With the azimuth angle it is possible to estimate not only at which distance the target is located, but also in which direction the target can be found. The complete algorithm to find the azimuth angle can be found in [7].
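To make the relations in equations (2.1)-(2.3) concrete, the following is a minimal Python sketch of the three computations. The 77 GHz carrier frequency, the antenna spacing and the example measurement values are illustrative assumptions only and do not describe Aptiv's radars or processing chain.

```python
import numpy as np

C = 3.0e8            # propagation speed in air [m/s]
F0 = 77e9            # assumed carrier frequency [Hz]; automotive radars often operate around 76-81 GHz
WAVELENGTH = C / F0  # lambda in equations (2.2) and (2.3)

def detection_range(echo_delay):
    """Range from round-trip time, equation (2.1): r = c*t/2."""
    return C * echo_delay / 2.0

def radial_velocity(doppler_shift):
    """Relative radial velocity from the Doppler shift, equation (2.2): dv = df * lambda."""
    return doppler_shift * WAVELENGTH

def azimuth_angle(phase_diff, antenna_spacing, b=0):
    """Azimuth angle from the phase difference between two antennas, equation (2.3)."""
    return np.arcsin(WAVELENGTH / (2 * np.pi * antenna_spacing) * (phase_diff + 2 * np.pi * b))

# Example: a detection with 0.33 us echo delay, 1 kHz Doppler shift and 0.4 rad phase difference
print(detection_range(0.33e-6))                        # ~49.5 m
print(radial_velocity(1e3))                            # ~3.9 m/s radial velocity
print(np.degrees(azimuth_angle(0.4, WAVELENGTH / 2)))  # ~7.3 degrees off boresight
```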
2.1.3 Constant false alarm rate
The signal detected by a radar receiver will consist of both noise caused by the internal components of the receiver and noise caused by the surroundings. If the noise is low enough, a simple solution to this problem would be to define a certain threshold for the signal strength and filter out all signals below the defined threshold. A low threshold would yield many false alarms but simultaneously not bear the risk of missing real targets, while a high threshold would yield few false alarms but would have a higher risk of filtering out real detections.
The setup of having a fixed threshold could work fairly well in a fixed environment with a stationary radar, but when the surroundings change it is hard to set a decent fixed threshold value. The purpose of the Constant False Alarm Rate, CFAR, algorithm is to let the threshold value vary and hence make it adaptable to new environments where, for instance, the background noise is higher. There are several different CFAR algorithms that estimate the varying threshold value in different ways [8]. The specific CFAR algorithm used for this study is confidential.

2.1.4 Micro-Doppler
A commonly used method to visualise radar data is through micro-Doppler images. These images depict the micro-Doppler effect, which is a phenomenon occurring when an object has multiple detection points with different speeds relative to the radar and thus reflects back different Doppler frequencies. For instance, a walking human would yield a range of different Doppler speeds. The signals reflected off the torso would correspond to the speed at which the person is heading, while the arms and legs travel at other relative speeds. Over time, this yields characteristic patterns called micro-Doppler signatures, which vary depending on the studied object. For example, a car would not have the pattern that a pair of swinging arms causes in its micro-Doppler signature, since all parts of the car travel at the same speed. A car passing at close distance to the radar would, however, have both positive and negative speeds while being directly in front of the radar, since one part of the car is travelling away from the radar while the other part is travelling towards it. The micro-Doppler map from a car can be seen in Figure 2.3. A micro-Doppler map normally has frequency on the y-axis and time on the x-axis. The intensity for each x- and y-value is then normally plotted as a heat map on the surface spanned by x and y.

Figure 2.3: Micro-Doppler map. The measurement comes from a car driving in a circle in front of a radar for approximately 30 seconds.

2.1.5 Time integrated range-Doppler
Range-Doppler maps are another common way to display radar data. In a range-Doppler map, the computed range is depicted against the computed Doppler velocity, also known as Doppler shift. The magnitude of the reflected signal is illustrated with colours, ranging from red to dark blue, where red represents the largest reflected values and blue the lowest.

Figure 2.4: Integrated range-Doppler map. The measurement comes from a man riding a bicycle in a circle in front of a radar for approximately 30 seconds.

In [9] a way to combine the spectrogram-like features of a micro-Doppler map with the range data in range-Doppler maps is proposed. The proposed approach is time integrated range-Doppler maps. By taking the maximum pixel value over a time span, ∆t, an object's time-correlated features can be visualised and extracted. An example of a time integrated range-Doppler map can be seen in Figure 2.4.
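Since the time integration amounts to a pixel-wise maximum over the range-Doppler frames collected within ∆t, it can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the per-frame maps are already available as equally sized 2D arrays; the bin counts and the random placeholder data are arbitrary.

```python
import numpy as np

def time_integrated_range_doppler(frames):
    """Aggregate a sequence of range-Doppler maps into one map by taking,
    for every range/Doppler bin, the maximum value over the time span."""
    stack = np.stack(frames, axis=0)   # shape: (n_frames, n_range_bins, n_doppler_bins)
    return stack.max(axis=0)           # pixel-wise maximum over time

# Example: aggregate s = 3 consecutive frames (0.15 s at the radar's 20 Hz update rate)
rng = np.random.default_rng(0)
frames = [rng.random((128, 64)) for _ in range(3)]   # placeholder range-Doppler maps
integrated = time_integrated_range_doppler(frames)
print(integrated.shape)   # (128, 64)
```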
2.2 Neural networks
Deep neural networks are a modern subcategory of machine learning. The naming is derived from the networks being loosely inspired by the neurons in the human brain. These types of networks are of high importance in many applications today, such as computer vision and natural language processing. The fundamental architecture of deep neural networks is based on the fully connected layer. This layer contains neurons, where each neuron is connected to all neurons in the adjacent layers and not connected to any neurons in the same layer. A neuron in a neural network is simply a unit that takes several inputs and computes an activation value to pass forward to neurons in the next layer. The overall goal of the network is to approximate the true model f*(x) by the network model h(x), based on the input x. An illustration of a simple fully connected network can be seen in Figure 2.5. An architecture based on these layers is generally known as a multilayer perceptron, or MLP.

Figure 2.5: A conventional fully connected neural network with three layers, three inputs, four neurons per layer and one output.

2.2.1 Activation function
The activation function is mainly used to map the output values from a layer to suitable values that will serve as input to the neurons in the next layer. The activation function introduces nonlinear properties to the neural network. Two commonly used activation functions are the sigmoid function, σ(x), and the rectified linear unit, ReLu, function, RL(x). The functions are defined by

\sigma(x) = \frac{1}{1 + e^{-x}},    (2.4a)
RL(x) = \max(0, x),    (2.4b)

respectively, and are presented visually in Figure 2.6.

Figure 2.6: The figure illustrates the sigmoid function and the ReLu function explained in equations (2.4a) and (2.4b) respectively.

When dealing with classifiers it is preferable to have an activation function in the final layer of the network that yields a probabilistic output. The sigmoid function does exactly this. However, the univariate sigmoid function is only applicable in binary classification cases, since it gives the probability of a statement being either true or false. This is because the sigmoid function maps the final output of a neural network to a single probabilistic value ranging from 0 to 1, in other words the probability that the input belongs to a class or not. In cases with multiple classes there are a few different approaches to solve the classification problem. One approach is to use multi-class classification, and let each input only be classified as belonging to one class. Another approach is to use multiple binary classifiers, which instead defines the problem as a multi-label classification problem. In this case, multiple sigmoid functions, one per label, can be used as the final layer.

2.2.2 Learning, backpropagation and loss
In order to properly estimate the true model, f*(x), the network has learnable parameters. Each neuron has learnable weights, w_i, and a learnable bias term, β. Each input, x_i, is multiplied by a corresponding weight, w_i; the products are summed together and the bias, β, is added. The resulting value of the summation is put through an activation function, ξ, which then constitutes the output y of the neuron. Hence, the output can be expressed as y = ξ(w^T x + β). This output then serves as input to the neurons in the next layer. The complete computational flow of a single neuron is illustrated in Figure 2.7.

Figure 2.7: The computational flow of a neuron, with three inputs and a bias term.
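As a concrete counterpart to Figure 2.7 and equations (2.4a)-(2.4b), the following is a minimal NumPy sketch of a single neuron with three inputs; the weight, bias and input values are arbitrary illustrations, not learned parameters.

```python
import numpy as np

def sigmoid(x):
    """Equation (2.4a)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Equation (2.4b)."""
    return np.maximum(0.0, x)

def neuron(x, w, beta, activation=relu):
    """One neuron: weighted sum of the inputs plus a bias, passed through an activation,
    i.e. y = xi(w^T x + beta)."""
    return activation(w @ x + beta)

x = np.array([0.5, -1.2, 2.0])    # three inputs, as in Figure 2.7
w = np.array([0.1, 0.4, -0.3])    # learnable weights (arbitrary values here)
beta = 0.05                       # learnable bias term

print(neuron(x, w, beta, relu))     # 0.0, since the weighted sum is negative
print(neuron(x, w, beta, sigmoid))  # ~0.27
```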
When the network is trained, an input x is fed to the network, which, by letting the data flow through all its layers, produces an output. This process is called forward propagation. For each network a loss function is defined, which gives an estimate of how well the network performs. An example of a loss function is the mean squared error, MSE, where the loss, J, is defined as

J = \frac{1}{n} \sum_{i=1}^{n} (Y_i - Z_i)^2,    (2.5)

where Y_i is the target variable, Z_i is the output predicted by the network and n is the number of samples being predicted. The loss can easily be computed after forward propagation. The choice of loss function depends on which type of application the network is designed for. MSE is one of the most commonly used loss functions for regression problems. One commonly used loss function for dealing with multiple binary classification problems is the Multi Label Soft Margin Loss [10], which is formulated as

J = -\frac{1}{q} \sum_{i=1}^{n} Y_i \log\!\left(\frac{e^{Z_i}}{1 + e^{Z_i}}\right) + (1 - Y_i) \log\!\left(\frac{1}{1 + e^{Z_i}}\right),    (2.6)

where q corresponds to the number of labels.
To update the values of the learnable parameters, Γ, backpropagation is done. Backpropagation refers to the process of computing the gradient of the loss with respect to the parameters, ∇_Γ J(Γ). Hence, the weights and biases will be updated in a manner that produces a lower loss in the next forward propagation. The computations of the gradients in every layer are done with the chain rule.
In most deep learning applications the complete dataset is divided into batches. Large batch sizes are computationally faster, while small batch sizes have the advantage of bringing better generalisation performance. Both [11] and [12] conclude that a batch size of 32 is a good compromise. New parameter values are computed by doing backpropagation for every batch. An epoch refers to when all batches of a dataset have been used to update the parameter values. The training of a network usually consists of several epochs of parameter updating and backpropagation [13].
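Equation (2.6) corresponds to the multi-label soft margin loss available in common deep learning libraries. Below is a minimal PyTorch sketch of the two losses above, assuming a batch of two examples and the three labels used in this thesis (car, pedestrian, bicyclist); the logit and target values are illustrative only.

```python
import torch
import torch.nn as nn

# Raw network outputs (logits) for a batch of n = 2 examples and q = 3 labels.
logits = torch.tensor([[ 2.1, -0.7,  0.3],
                       [-1.5,  3.2, -0.2]])
targets = torch.tensor([[1., 0., 0.],
                        [0., 1., 0.]])

mse = nn.MSELoss()                          # equation (2.5), typically used for regression
multilabel = nn.MultiLabelSoftMarginLoss()  # equation (2.6), one sigmoid per label

print(mse(torch.sigmoid(logits), targets))  # MSE between probabilities and true labels
print(multilabel(logits, targets))          # multi-label soft margin loss on the raw logits
```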
2.2.3 Optimisation algorithm
There are several different optimisation algorithms. The optimisation algorithm is used to calculate how the parameters of the solution, Γ, are going to be updated. This is done by taking a step of length α in the direction of the gradient calculated by the algorithm. The step size is often referred to as the learning rate. One of the most popular algorithms is Adam, which combines the benefits of momentum and root mean square propagation, RMSprop [14]. Momentum pushes the solution in the direction of the previous gradients, thus creating "momentum", while RMSprop makes the method take smaller steps in steep directions and bigger steps in less steep directions [14]. The algorithm is described by Algorithm 1.

Algorithm 1: The stochastic optimisation algorithm Adam. It is best initialised with the stepsize α = 0.001, ε = 10^{-8}, ζ_1 = 0.9, and ζ_2 = 0.999 [14]. All vector operations are applied element-wise. ζ_1 and ζ_2 to the power of t are denoted ζ_1^t and ζ_2^t.

Require: α: Stepsize
Require: ζ_1, ζ_2 ∈ [0, 1): Exponential decay rates for the moment estimates
Require: f(Γ): Stochastic objective function with parameters Γ
Require: Γ_0: Initial parameter vector
  m_0 ← 0 (Initialise 1st moment vector)
  v_0 ← 0 (Initialise 2nd moment vector)
  t ← 0 (Initialise timestep)
  while Γ_t not converged do
    t ← t + 1
    g_t ← ∇_Γ f_t(Γ_{t−1}) (Get gradients w.r.t. stochastic objective at timestep t)
    m_t ← ζ_1 · m_{t−1} + (1 − ζ_1) · g_t (Update biased first moment estimate)
    v_t ← ζ_2 · v_{t−1} + (1 − ζ_2) · g_t^2 (Update biased second raw moment estimate)
    m̂_t ← m_t / (1 − ζ_1^t) (Compute bias-corrected first moment estimate)
    v̂_t ← v_t / (1 − ζ_2^t) (Compute bias-corrected second raw moment estimate)
    Γ_t ← Γ_{t−1} − α · m̂_t / (√(v̂_t) + ε) (Update parameters)
  end while
  return Γ_t (Resulting parameters)

The benefits of Adam are that it is computationally efficient, requires little memory and is suitable for large datasets [14].

2.2.4 Convolutional layer
A convolutional neural network, CNN, is a specific kind of neural network. The architecture has proved to be extremely efficient in image recognition related tasks. A key component of CNNs is the convolutional filter.

Figure 2.8: A convolutional filter acting on an input image. In the figure the convolutional filter acts on three image patches per row and three image patches per column. The output is therefore a three-by-three matrix.

The convolutional filter is an, often square, matrix of learnable weights. The dot product is performed between the weights in the filter and an equally sized patch in the input image. This product then becomes the value of the element in the corresponding place of the output. The stride of a convolutional filter is the number of steps, in pixels, the filter is "moved" before acting on the next input patch. In Figure 2.8 a stride of one is used. The filter is applied from left to right and from top to bottom. The output of a convolutional filter is called an activation map.

Figure 2.9: Illustration of four 3×3×1 activation maps yielded by four 2×2×1 filters.

A convolutional layer usually consists of several convolutional filters, resulting in several activation maps. This yields an output with a depth corresponding to the number of filters used in the convolutional layer. An illustration of a convolutional layer, consisting of the same input image and filter size as used in Figure 2.8, can be seen in Figure 2.9.

2.2.5 Overfitting and dropout
For a neural network it is important to perform well on previously unseen data. This ability is called generalisation in machine learning vocabulary. If the network is overfitting it does not generalise well, which means that there is a large difference in network performance when it is tested on training data and when it is tested on new data. There are several reasons why overfitting occurs. One often mentioned reason is having more network parameters than training samples in the dataset [15]. There are, however, many ways to reduce the risk of overfitting.
Dropout is one commonly used method to do exactly this. The concept behind dropout is fairly simple. For each step in the training phase a random fraction of neurons, p_dropout, is dropped out, i.e. ignored. This means that in a case where the dropout rate is set to p_dropout = 0.5, half of the neurons will be randomly ignored throughout the training process. This is done in order to avoid a network that is very dependent on a few neurons for making a proper classification. With dropout, all neurons are forced to learn something about the data. This significantly reduces the risk of overfitting [16].
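The following is a minimal PyTorch sketch tying together the two preceding sections: a convolutional layer with four 2×2 filters applied to a 4×4 single-channel input, mirroring the sizes in Figures 2.8 and 2.9, followed by dropout with p_dropout = 0.5. The layer sizes are taken from the figures; everything else is an illustrative assumption and not the ResNet mini configuration used later in the thesis.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=2, stride=1)  # four 2x2x1 filters
drop = nn.Dropout(p=0.5)  # p_dropout = 0.5: half of the activations are randomly zeroed during training

image = torch.randn(1, 1, 4, 4)            # one 4x4 single-channel input, as in Figure 2.8
activation_maps = torch.relu(conv(image))  # apply the filters and the ReLu activation
print(activation_maps.shape)               # torch.Size([1, 4, 3, 3]): four 3x3 activation maps (Figure 2.9)

drop.train()                               # dropout is only active in training mode
print(drop(activation_maps))               # roughly half the values zeroed, the rest scaled by 1/(1 - p)
```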
2.2.6 Batch normalisation
Every time the weights are updated, the distribution of a hidden layer's input is changed. This requires the network to have a low learning rate, which slows down the learning [17]. Batch normalisation refers to the procedure of normalising the input to a succeeding hidden layer in order to solve this problem. The normalisation is done for every batch, X = {x_1, ..., x_{n_B}}, where n_B denotes the batch size. The normalisation scheme is described below,

\bar{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},    (2.7a)
y_i = \gamma_1 \bar{x}_i + \gamma_2,    (2.7b)

where µ_B and σ_B^2 correspond to the mean and variance of the batch being considered, γ_1 is a learnable scaling factor and γ_2 is a learnable shift parameter. ε is a small number introduced to avoid division by zero. Hence, x̄_i is the normalised input x_i and y_i is the output from the batch normalisation. This process is done in every neuron. Batch normalisation is shown to not only speed up the learning process, but also to reduce the risk of overfitting [17].

2.2.7 Resblock
Even though a network's ability to generalise increases with the depth of the network, beyond a certain depth adding layers can lead to the accuracy stagnating or even degrading [18]. This is partially due to the vanishing gradient problem [3][19][20]. A widely used approach to combat this issue is the use of residual building blocks, or resblocks, from [3], where the ResNet architecture is explained. The idea behind the ResNet is not to assume that the stacked layers directly fit an underlying mapping, but instead to let the layers explicitly fit a residual mapping. To do this the underlying mapping is defined as H(x) and the stacked nonlinear layers fit the mapping F(x) := H(x) − x. The idea is that if an identity mapping is optimal, or at least a close enough mapping, then it is easier to get the residual to zero than to find an identity mapping with nonlinear layers.

Figure 2.10: Visualisation of a resblock.

The identity mapping is realised by shortcut connections, as illustrated in Figure 2.10. Using resblocks as building blocks for a network helps avoid the vanishing and exploding gradient problems [3].

2.2.8 PointNet
PointNet is a network designed to consume point cloud data and perform object classification and part segmentation on the dataset. This is desirable since point clouds resemble the way raw sensor data is received. In PointNet each point is processed independently. In the basic architecture a point is represented by 3D coordinates (x, y, z). Additional dimensions, e.g. colour and normal, can be added [4].
In order to successfully classify point clouds, two main challenges are solved by PointNet. The first one is the problem of being invariant to the order in which the points are fed to the network. The solution proposed in PointNet is a structure with a shared MLP for all points, followed by a max pooling and an MLP. The max pooling acts as a symmetric function and hence makes PointNet invariant to permutations. An illustration of this implementation can be seen in Figure 2.11. The second problem solved by PointNet is the problem of being invariant to point cloud rotations. By letting a small version of the network, called T-net, predict an affine transformation matrix, PointNet is able to align the input points. This module is illustrated in Figure 2.12.
The PointNet architecture has been proven successful at performing part segmentation and classification on radar point clouds in [21]. This approach uses two spatial coordinates, (x, y), and two additional dimensions which can be found in [21].

Figure 2.11: Architecture of the point order invariance module with n points. The multiple rows of MLPs to the left illustrate that the MLP is shared, i.e. it is the same MLP used for all points.

Figure 2.12: Architecture of the T-net module. The multiple rows of MLPs to the left illustrate that it is the same MLP used for all points.
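Below is a minimal PyTorch sketch of the point order invariance idea in Figure 2.11: a shared per-point MLP, a max pooling over the points as the symmetric function, and a final MLP with one sigmoid output per label. The layer widths, the three point dimensions and the label count are illustrative assumptions, not the PointNet mini architecture described later in the thesis; the T-net alignment module is omitted.

```python
import torch
import torch.nn as nn

class SharedMLPClassifier(nn.Module):
    """Shared per-point MLP + max pooling (symmetric function) + classification MLP."""
    def __init__(self, point_dims=3, num_labels=3):
        super().__init__()
        # The same weights are applied to every point, which is what "shared MLP" means.
        self.shared_mlp = nn.Sequential(nn.Linear(point_dims, 64), nn.ReLU(),
                                        nn.Linear(64, 128), nn.ReLU())
        self.classifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                                        nn.Linear(64, num_labels))

    def forward(self, points):                        # points: (batch, m points, point_dims)
        features = self.shared_mlp(points)            # per-point features: (batch, m, 128)
        global_feature = features.max(dim=1).values   # max over the points -> order invariant
        return torch.sigmoid(self.classifier(global_feature))  # one probability per label

net = SharedMLPClassifier()
cloud = torch.randn(1, 50, 3)                  # 50 points with e.g. (x, y, Doppler) - illustrative
print(net(cloud))                              # same output regardless of point order...
print(net(cloud[:, torch.randperm(50), :]))    # ...as this permuted copy confirms
```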
2.3 Classification problem and evaluation metrics
There are several ways to measure the performance of a network. Different metrics measure different aspects of the network's performance. Certain metrics are better suited for some classification problems than others. This section will explain how the problem is defined for the networks and the metrics used to evaluate them.

2.3.1 Binary relevance problem
The classification problem in this thesis is defined as a binary relevance problem [22]. This approach trains one binary classifier for each label. The model independently predicts each label in one example. To do this a dataset needs to be defined. A dataset, T, is defined by its n examples (x_i, Y_i), 1 ≤ i ≤ n. The examples are defined by (x_i ∈ X, Y_i ∈ Y = {0, 1}^q), where x_i is the input to be classified and Y_i contains the binary true labels associated with x_i. The datasets include a labelset L, where the labels l_j ∈ L, 1 ≤ j ≤ q, and |L| = q. A classifier, h, classifies an example x_i by h(x_i). Each classification outputs q predicted labels, that is h(x_i) = Z_i = (z_1, ..., z_q). Ideally Z_i = Y_i, ∀i. This translates to the problem "Does label l_j belong to x_i?".
The general disadvantage with the binary relevance problem is that it does not model label dependency. This should not be a disadvantage for this particular classification problem, since label dependency is not desirable. A case where label dependency would be of interest is for example when classifying movies. A movie is likely to be correctly labelled family friendly and comedy at the same time, but not horror and family friendly. For this particular classification problem, the probability of the presence of a pedestrian should, for example, not be dependent on the probability of the presence of a car.

2.3.2 Metrics of networks
A common way to measure network performance is by computing the accuracy, A, which in this case, where the problem is defined as a binary relevance problem, is defined by

A = \frac{\sum_{j=1}^{q} (TP_j + TN_j)}{\sum_{j=1}^{q} (TP_j + FP_j + TN_j + FN_j)},    (2.8)

where T, F, P and N in TP, TN, FP and FN stand for true, false, positive and negative, and q is the number of labels. A true positive, TP, is an example that has been classified as label l_j and does belong to label l_j. TN, FP and FN are defined in analogy with TP. Hence A is a measurement of how well a network is classifying overall, without giving any importance to a particular label.
If a network, however, is able to classify an example as containing multiple labels at once, accuracy does not paint the whole picture. The exact match ratio, MR, is a stricter version of accuracy where all predicted labels of an input must be correctly classified to contribute to the score. This metric gives the ratio between the number of examples that are completely correctly classified and the total number of examples classified,

MR = \frac{1}{n} \sum_{i=1}^{n} I(Y_i = Z_i),    (2.9)

where I is the indicator function, Z_i the predicted labels, Y_i the true labels and n the number of examples being evaluated. MR does not take partially correct classifications into consideration; partially correct classifications are counted as incorrect classifications.
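Equations (2.8) and (2.9) can be written directly as array operations. Below is a minimal NumPy sketch for a small multi-label example; the label matrices are illustrative only.

```python
import numpy as np

# Rows are examples, columns are the q = 3 labels (car, pedestrian, bicyclist).
Y = np.array([[1, 0, 0],     # true labels
              [0, 1, 0],
              [0, 1, 1]])
Z = np.array([[1, 0, 0],     # predicted labels
              [0, 1, 1],
              [0, 1, 1]])

def accuracy(Y, Z):
    """Equation (2.8): fraction of correct label decisions over all labels and examples."""
    return (Y == Z).mean()

def exact_match_ratio(Y, Z):
    """Equation (2.9): fraction of examples whose full label vector is predicted correctly."""
    return (Y == Z).all(axis=1).mean()

print(accuracy(Y, Z))           # 8/9 ~ 0.89
print(exact_match_ratio(Y, Z))  # 2/3 ~ 0.67
```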
In binary classification, precision and recall are two commonly used measures. The precision, P, is defined by

P = \frac{TP}{TP + FP}.   (2.10)

P is a measure of how precise a network is each time it classifies an object as containing a particular label. A high P value suggests that the network, when classifying an object as containing a specific label, is most often correct. Recall, R, is defined by

R = \frac{TP}{TP + FN}.   (2.11)

R instead gives a high value for a specific label if a large share of the examples that actually contain that label are also classified as containing it. The downside of this measure is that a network that tends to over-classify objects as a specific class obtains a high value of R. The harmonic mean of R and P is called F1 and is defined by

F_1 = \frac{2 \cdot R \cdot P}{R + P}.   (2.12)

F1 thus measures the balance between P and R. In datasets where there is a relatively large imbalance between labels, it is better to use the micro average of F1 than the macro average. The micro average, F1,µ, is calculated by

F_{1,\mu} = \frac{2 \sum_{j=1}^{q} TP_j}{\sum_{j=1}^{q} (2TP_j + FP_j + FN_j)},   (2.13)

where TP_j and FP_j are the number of true positives and false positives for label l_j respectively, and q is the number of labels.

The false positive rate, FPR, is a measure of how often a classifier wrongly classifies an example as a positive label when it actually is negative, per total number of negative examples. It is defined by

FPR = \frac{FP}{FP + TN}.   (2.14)

Another performance measure is the Precision-Recall curve in combination with its area under the curve, AUC. This curve is a plot of the precision on the vertical axis against the recall on the horizontal axis, for different thresholds in the last step of the classifier. As explained in Section 2.2.1, each binary classifier outputs a probabilistic output between 0 and 1. The threshold values in question are the values above which a prediction is considered true. Hence, a Precision-Recall curve displays how a binary classifier is affected by different choices of threshold value. AUC is a metric of how good the Precision-Recall curve is and is simply calculated as the area below the curve, with a maximum of 1.

A metric for evaluating the number of parameters in a network compared to the number of examples in the dataset is the events per variable, EPV, as suggested in [23] and [24] for regression models. The metric is defined by

EPV = \frac{n}{p_{network}},   (2.15)

where n is the number of examples in a given dataset and p_network is the number of parameters in the network.

2.3.3 k-fold cross-validation

Cross-validation is used to estimate the expected performance. It is also used to select the best fitting model and to ensure that the model is not overfitting. The k-fold cross-validation method is implemented by splitting up the dataset into test and training data k times. Each time, the size of the test dataset is 1/k of the full dataset. The k different test datasets are chosen so that no test set has overlapping data with another test set. The remaining data is the training data. The concept is visualised in Figure 2.13.

Figure 2.13: An illustration of how the test and training data are chosen between the k iterations in k-fold cross-validation.
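The fold construction just described can be written compactly as below. This is a generic sketch of the splitting scheme, under the assumption that examples can be assigned to folds freely; the function name and seed are illustrative only.

    import numpy as np

    def k_fold_indices(n_examples, k=5, seed=0):
        """Split example indices into k disjoint test folds; the training
        set for fold i is everything outside test fold i."""
        rng = np.random.default_rng(seed)
        indices = rng.permutation(n_examples)
        test_folds = np.array_split(indices, k)   # disjoint, ~1/k of the data each
        for i in range(k):
            test_idx = test_folds[i]
            train_idx = np.concatenate([test_folds[j] for j in range(k) if j != i])
            yield train_idx, test_idx

    # Example: five-fold cross-validation over 100 examples.
    for fold, (train_idx, test_idx) in enumerate(k_fold_indices(100, k=5)):
        print(fold, len(train_idx), len(test_idx))   # 80 training, 20 test per fold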
Which k to choose is a trade-off between choosing a large k, and thus not perturbing the data enough, and a small k, which leads to a small training set relative to the full dataset. Choosing k = 5 is considered a good compromise between the two [25].

2.4 Related work

The main body of active safety related detection and classification research has been conducted within vision based systems. In [26] a computer vision based system for real time vehicle tracking is proposed. The proposed system is shown to be robust against harsh conditions such as occlusion, varying lighting conditions, and vibrations. A system that performs vision based pedestrian detection on board a moving vehicle is presented in [27]. In [28] pedestrian classification based on a single frame is investigated, with the conclusion that some features need to be measured over time in order to obtain reliable classifications.

With regard to image recognition alone, extensive research has been done on developing highly effective network architectures in order to enhance classification performance. The ResNet is presented in [3]. This network uses identity mappings to overcome the vanishing or exploding gradient problem. The ResNet allows a deep neural network architecture, which will be used in this work but with a smaller number of parameters. The densely connected convolutional network, DenseNet, presented in [29], connects all layers to each other and manages to substantially reduce the number of parameters and alleviate the vanishing gradient problem even further. Batch normalisation gives the benefit of achieving the same accuracy with substantially fewer training steps [17]. In [30] it is shown that under certain conditions and assumptions all bad local minima can be removed by adding a neuron. In [31] it is shown that this can be done for any neural network, for multi-class classification, for binary classification, and for regression with an arbitrary loss function.

Classification of VRUs based on radar data has not been studied to the same extent as the vision based research, but there is still plenty of research on the topic. In [32] pedestrian recognition without machine learning has been studied. It is shown that, under optimal conditions, over 95% of pedestrians can be classified correctly with a 77 GHz radar, primarily by analysing the variance of the radial velocity of the object being classified. Under worse conditions, however, the classification rate can drop to 29.4%. Laterally moving pedestrians are the main contributing factor to this drop in accuracy.

The most common approach when using deep learning methods for radar based classification is to visualise the radar data in either a range-Doppler map or a micro-Doppler map and then feed this image to a CNN. This is partly done in [33], where a 25 GHz FMCW Single-Input Single-Output, SISO, radar is used in real time for human-robot identification. The CNN approach with range-Doppler maps as input is compared to conventional classical learning approaches with extracted features. In [33] only single frame range-Doppler maps are used and hence no aggregation is done. The difference between laterally moving vehicles and pedestrians in terms of feature extraction and classification is studied in [34]. In [35] the characteristic micro-Doppler signature of pedestrians is studied with a state of the art radar sensor. Pedestrian micro-Doppler signatures are also studied in [36], together with micro-Doppler signatures of bicyclists. In [37] micro-Doppler signatures are used as inputs to a CNN in order to classify seven different human activities, with a success rate of 90.9%.
In [21] semantic segmentation and classification on radar point clouds is demonstrated. The authors of [38] implement a neural network based on the MLP architecture to classify pedestrians and vehicles. This network is trained using radar outputs as input to the network.

The authors of [39] have analysed the effect of time aggregation on estimates of the elasticities of output with respect to employment and to average hours of work. They find that low frequency data generate better estimates of the output-employment elasticity, while high frequency data generate better predictions of the output-average hours elasticity. This is a clear indicator that lower frequency data do not always generate better estimates or predictions, and that the hypothesis of increasing accuracy with a higher number of time aggregated data points might be wrong. The authors of [40] prove both theoretically and experimentally that their proposed algorithm for the retrieval of temporal aggregates of data from sensors in infrastructures can be used to reduce time cost and storage space consumption. The findings in [41] show that the application of aggregation algorithms, which generalise the weighted majority algorithm, performs very well in comparison to the auto-regressive moving average algorithm. Time aggregation is mainly used in the field of economics and is not as commonly applied in the field of radar based VRU detection and classification.

3 Method

In this section the methods implemented in this thesis are explained. This includes the structure of the different datasets and how they are fed into the networks. How the data is obtained and preprocessed is explained in detail, as well as the classification problem definition itself. The section ends with an explanation of how the networks are defined and trained, and which metrics are used to evaluate them.

3.1 Datasets and their characteristics

Two different types of datasets have been created, T1,s and T2,s. Dataset T2,s contains driving scenarios inspired by the driving scenarios defined by the Euro NCAP conditions for a five-star rating in year 2020 [1]. The driving scenarios for T2,s are defined in A.3 and dataset T1,s contains the driving scenarios defined in A.2. Hence, what distinguishes the datasets is the data points they contain.

A five-fold cross-validation was done on both datasets T1,s and T2,s separately. A study of the effect of time aggregating data points was done on dataset T1,s. Thus, T1,s was made in five different versions, one for each segment length, s, that has been tested, where the variable s defines how many time frames are aggregated. The versions contain the exact same data points, but the data points are aggregated over different time periods and thus have different segment lengths s. A test of the networks' ability to generalise has also been made. It was done by training on dataset T1,s and testing on dataset T2,s.

Figure 3.1: An illustration of time aggregation of data points before being fed to a network. In this figure the segment length s = 3 is used as an example.

The time aggregation of data points was done with different methods for the two networks. Time aggregation of data points for the CNN was implemented by making time integrated range-Doppler maps, as will be further explained in Section 3.3, and concatenation was used for the point clouds. The integration time has been set to 0.05, 0.15, 0.25, 0.35, and 0.5 seconds, which corresponds to the time aggregation of 1, 3, 5, 7, and 10 data points.
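As a minimal illustration of this grouping of consecutive data points into segments, the sketch below assumes a fixed frame period of 0.05 s per data point (the value implied by the integration times above); the function and variable names are illustrative only.

    def make_segments(data_points, s):
        """Group consecutive data points (radar frames) into non-overlapping
        segments of length s; each segment becomes one example x_i."""
        n_segments = len(data_points) // s
        return [data_points[i * s:(i + 1) * s] for i in range(n_segments)]

    FRAME_PERIOD = 0.05  # seconds per data point (assumed)

    frames = list(range(100))            # stand-in for 100 consecutive radar data points
    for s in (1, 3, 5, 7, 10):
        segments = make_segments(frames, s)
        print(s, len(segments), s * FRAME_PERIOD)  # segment length, number of examples, integration time

This also makes it clear why the datasets contain fewer examples for larger segment lengths: the same data points are regrouped into fewer, longer segments.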
The time aggregation concept is illustrated with an example using segment length s = 3 in Figure 3.1. This means that all datasets T1,s contain the exact same data points when s varies, but a different number of examples. This is also true for T2,s. Note that T1,s and T2,s do not have any data points in common.

As explained in Section 2.3.1, each example in a dataset is denoted x_i. In this study x_i is a time integrated range-Doppler map or a point cloud, depending on the network being evaluated. The number of labels, q, for dataset T1,s is q = 3, and for T2,s it is q = 2. The labels are bicyclist, car, and pedestrian for T1,s, and bicyclist and pedestrian for T2,s. There are 108 889 data points in dataset T1,s and 42 758 data points in dataset T2,s. The number of examples n in the datasets for varying segment length is given in Table 3.1. T2,s was only made with the segment lengths that yield the best results for each network.

Table 3.1: The number of examples in the datasets T1,s and T2,s for each segment length s. Dataset T2,s is only made with two different segment lengths, since it is only tested with the best performing segment lengths on T1,s.

  s     |       1 |      3 |      5 |      7 |     10
  n1,s  | 108 889 | 37 932 | 22 534 | 16 650 | 11 022
  n2,s  |       - | 14 297 |      - |      - |  4 102

The distribution of classes in the two datasets is given in Table 3.2.

Table 3.2: The percentage of each class in the datasets T1,s and T2,s.

        | bicyclist |  car  | empty | pedestrian
  T1,s  |   21.8%   | 19.1% | 32.7% |   26.4%
  T2,s  |   38.5%   |  0%   | 33.3% |   28.2%

3.2 Retrieval and preparation of data

The two datasets, T1,s and T2,s, consist of data collected from 16 different driving scenarios. T1,s includes 10 of these scenarios and T2,s the remaining 6 scenarios. The driving scenarios in T1,s consist partly of data collected while the host vehicle is stationary and partly of data collected while the host is moving. In all 16 scenarios there is only one target present. This means that the scenarios are relatively simple and can at most be considered to be simulations of traffic scenarios with a very low number of surrounding targets. For every driving scenario multiple logging sessions are made. These logging sessions contain variations in distance and relative velocity between the host vehicle and the target object. Figure 3.2 illustrates one example driving scenario from each dataset. The full details of these driving scenarios can be studied further in Appendix A.2.

(a) Driving scenario 1, bicyclist. (b) Driving scenario 12, pedestrian.

Figure 3.2: Two driving scenarios. Figure 3.2a illustrates a driving scenario from T1,s and Figure 3.2b illustrates a driving scenario from T2,s.

T2,s consists of data gathered from 6 different driving scenarios inspired by the Euro NCAP tests for VRU detection, as stated in Section 3.1. These driving scenarios are therefore more relevant for VRU protection than the scenarios in T1,s. All targets in T2,s are either of the class bicyclist or pedestrian. The details about these driving scenarios can be seen in Appendix A.2 and A.3. All data from both datasets are collected at an empty airfield. This is to reduce the amount of radar reflections from the surroundings as much as possible. Figure 3.3 shows how the radars are situated on the host vehicle. There are two radars mounted on each side of the truck, making the total number of radars four. Each radar has a 150° field of view [42]. Since the target objects are not visible to the radars at all times, manual labelling of all recorded scenarios has been conducted.
In order to be labelled as one of the classes, the target object needs to be at a distance of at most 30 m from the radar in question. When the target object exceeds the 30 m range it is no longer labelled as the specific class, and these parts of the recording are pruned.

Figure 3.3: An illustration of the radar set-up. The dotted lines illustrate the fields of view of the radars. Each radar is represented by a specific colour.

3.3 Preprocessing of data

Each radar detection contains information about range, azimuth angle, relative velocity between the radar and the detection, and the amplitude of the received signal. This data is processed to create time integrated range-Doppler maps and point clouds.

The radar used in this thesis has a Doppler resolution of 512 and a range resolution of 128. This means that it can maximally detect 512 variations of Doppler velocity and 128 variations of range. The maximum detectable range depends on which scan type the radar is using. This particular radar has four different scan types: two mid range scans, which can detect targets up to 80 m, and two short range scans, which have a maximum range of 40 m. The scan type shifts for every data point, and hence the range of the radar shifts for every data point. This is why the 30 m limit is used when labelling the data: it ensures that all scan types have the target within range.

When creating time integrated range-Doppler maps, the range and Doppler resolutions are used. The Doppler resolution is nD = 512 and the range resolution is nr = 128, making the images 512×128 pixels in size. A time integrated range-Doppler map is denoted I, and a logging sequence containing several range-Doppler maps is denoted M. Each detection is mapped to a bin which has a corresponding pixel in the range-Doppler image. The amplitude of the detected signal is used to decide the intensity of that pixel. If a detection in the next integration step maps to the same bin, the intensity of the pixel is set to the higher of the two values. The computations are further explained by Algorithm 2, which outputs all the time integrated range-Doppler maps in a variable K.

Algorithm 2 Integrated range-Doppler image generation. A function returning an object with all created time integrated range-Doppler maps from one logging session. The algorithm essentially adds the range-Doppler maps being aggregated together, except that if more than one map has a value larger than 0 at the same pixel, the largest value is kept. The number of consecutive data points being aggregated together is represented by s, for segment length; here s = 5 is used as an example. M is a logging sequence with all range-Doppler maps from that logging, nD and nr are the Doppler and range resolutions of the radar respectively, I is a time integrated range-Doppler map, and K is the output of the function, i.e. all the time integrated range-Doppler maps from one logging.

  s = 5
  M = logging with all RD-maps
  nD = 512
  nr = 128
  for i = 0 : |M| do
      if (i mod s) == 0 then
          I = M(i)
      else
          for j = 0 : nD do
              for k = 0 : nr do
                  if I(j, k) < M(i, j, k) then
                      I(j, k) = M(i, j, k)
                  end
              end
          end
          if ((i − 1 + s) mod s) == 0 then
              K.append(I)
              I = zeros
          end
      end
  end
  return K
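A compact way to realise this max-hold integration is sketched below in Python/NumPy. It is not the thesis' own implementation; it simply groups every s consecutive range-Doppler maps and keeps the element-wise maximum, which for sparse maps coincides with the addition-with-max-on-overlap described in the caption of Algorithm 2.

    import numpy as np

    def integrate_rd_maps(M, s):
        """M: array of shape (n_frames, nD, nr) with one range-Doppler map per
        data point. Returns the time integrated maps, one per s consecutive frames."""
        K = []
        for start in range(0, len(M) - s + 1, s):
            # Element-wise maximum over the s maps in the segment (max-hold).
            I = M[start:start + s].max(axis=0)
            K.append(I)
        return np.stack(K) if K else np.empty((0,) + M.shape[1:])

    # Example: 12 frames of 512x128 maps, segment length s = 5 -> 2 integrated maps.
    M = np.random.rand(12, 512, 128) * (np.random.rand(12, 512, 128) > 0.99)
    K = integrate_rd_maps(M, s=5)
    print(K.shape)  # (2, 512, 128)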
Examples of different time integrated range-Doppler maps are shown in Figures 3.4a, 3.4b, 3.5a, 3.5b, 3.6a and 3.6b. The segment lengths used are s = 1, s = 5, and s = 10. The figures to the left are produced with CFAR-filtered data and the figures to the right with unfiltered data. The figures illustrate the increased amount of information in the images when the segment length is increased.

(a) Filtered range-Doppler map. (b) Unfiltered range-Doppler map.

Figure 3.4: Two time integrated range-Doppler maps, both with segment length s = 1. The image to the left is created with filtered data. The filter used was a CFAR filter, as explained in Section 2.1.3. The image to the right is created without filtering the data.

(a) Filtered range-Doppler map. (b) Unfiltered range-Doppler map.

Figure 3.5: Two time integrated range-Doppler maps, both with segment length s = 5. The image to the left is created with filtered data. The filter used was a CFAR filter, as explained in Section 2.1.3. The image to the right is created without filtering the data.

(a) Filtered range-Doppler map. (b) Unfiltered range-Doppler map.

Figure 3.6: Two time integrated range-Doppler maps, both with segment length s = 10. The image to the left is created with filtered data. The filter used was a CFAR filter, as explained in Section 2.1.3. The image to the right is created without filtering the data.

The point clouds were generated by concatenating the x- and y-positions with the Doppler velocity, for each detection in one time frame from the radar logging. This sums up to a dimension size p = 3. The x- and y-positions are obtained by using the range and the azimuth angle. The azimuth angle, θ, was computed by equation (2.3). To aggregate the point clouds over time, a simple concatenation is made.

In Figures 3.7a, 3.7b, 3.8a, 3.8b, 3.9a and 3.9b each detection from one data point is plotted in 3D space. This space is defined by the x- and y-axes as spatial axes in metres. The z-axis is proportional to the radial velocity of the detections. Datasets both with data filtered by the CFAR method and with no filter have been created. The figures illustrate the difference between the point clouds when filtering was used and when it was not. The figures to the left depict point clouds with CFAR-filtering implemented and the figures to the right depict point clouds where no filter is used. Figures 3.7 to 3.9 also illustrate how the data aggregation changes the form and information content of a point cloud. It is clear that curves and lines are more accentuated when the segment length s is increased.
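The conversion from raw detections to such points, and the concatenation over a segment, can be sketched as below. The frame layout, field names and axis convention (x forward, y lateral) are illustrative assumptions; the conversion simply applies the usual polar-to-Cartesian relation to the range and the azimuth angle obtained from equation (2.3).

    import numpy as np

    def detections_to_points(rng, azimuth, doppler):
        """Convert one frame of detections (range [m], azimuth [rad], Doppler)
        into an (n_detections, 3) point cloud: (x, y, Doppler velocity)."""
        x = rng * np.cos(azimuth)
        y = rng * np.sin(azimuth)
        return np.stack([x, y, doppler], axis=1)

    def aggregate_point_cloud(frames, s):
        """Concatenate the point clouds of s consecutive frames into one example."""
        return np.concatenate([detections_to_points(*f) for f in frames[:s]], axis=0)

    # Example: three frames with a varying number of detections each.
    frames = [(np.random.uniform(1, 30, n),        # range
               np.random.uniform(-1.3, 1.3, n),    # azimuth
               np.random.uniform(-5, 5, n))        # Doppler
              for n in (40, 55, 48)]
    cloud = aggregate_point_cloud(frames, s=3)
    print(cloud.shape)  # (143, 3)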
(a) Filtered point cloud. (b) Unfiltered point cloud.

Figure 3.7: A visualisation of two point clouds in 3D space, where the x- and y-axes are spatial coordinates in metres and the z-axis is the Doppler shift in radar bins. The figure to the right depicts a point cloud obtained with no filtering applied and the figure to the left depicts the same point cloud after CFAR-filtering. The point clouds are obtained with segment length s = 1.

(a) Filtered point cloud. (b) Unfiltered point cloud.

Figure 3.8: A visualisation of two point clouds in 3D space, where the x- and y-axes are spatial coordinates in metres and the z-axis is the Doppler shift in radar bins. The figure to the right depicts a point cloud obtained with no filtering applied and the figure to the left depicts the same point cloud after CFAR-filtering. The point clouds are obtained with segment length s = 5.

(a) Filtered point cloud. (b) Unfiltered point cloud.

Figure 3.9: A visualisation of two point clouds in 3D space, where the x- and y-axes are spatial coordinates in metres and the z-axis is the Doppler shift in radar bins. The figure to the right depicts a point cloud obtained with no filtering applied and the figure to the left depicts the same point cloud after CFAR-filtering. The point clouds are obtained with segment length s = 10.

Since the full sized point clouds are too computationally demanding, a predefined number of points is randomly sampled for each integration length. The sample sizes were chosen to be larger for increased segment lengths, while still not ending up with too large point clouds for segment length s = 10. Table 3.3 lists the sample size corresponding to each integration length, together with the number of points the average point cloud consists of for each segment length.

Table 3.3: Sample size for different integration lengths.

  Time aggregation length         |   1 |    3 |    5 |    7 |   10
  Nr. sampled points              | 128 |  154 |  256 |  360 |  512
  Mean nr. points CFAR cloud      | 186 |  557 |  885 | 1341 | 1911
  Mean nr. points NoFilter cloud  | 434 | 1256 | 2238 | 3045 | 4149

3.4 Reshuffling the data

The data points in each dataset form a time series, and consecutive data points might therefore be similar since they are close to each other in time. When creating a training and validation set it is undesirable to have data points that are neighbours in time in both the validation and the training datasets, since this might yield an unfairly high accuracy. Therefore the data points have been shuffled by logging session, so that data points that are consecutive in time can never be distributed over both the training set and the validation set.
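A minimal sketch of such a session-level split is given below; the session identifiers and the 80/20 proportion are illustrative assumptions, not the exact procedure used in this thesis.

    import numpy as np

    def split_by_logging_session(session_ids, val_fraction=0.2, seed=0):
        """Assign whole logging sessions to either training or validation, so that
        data points that are consecutive in time never end up on both sides."""
        rng = np.random.default_rng(seed)
        sessions = rng.permutation(np.unique(session_ids))
        n_val = max(1, int(len(sessions) * val_fraction))
        val_sessions = set(sessions[:n_val])
        val_mask = np.isin(session_ids, list(val_sessions))
        return ~val_mask, val_mask          # boolean masks for training / validation

    # Example: 10 data points from three logging sessions.
    session_ids = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
    train_mask, val_mask = split_by_logging_session(session_ids)
    print(session_ids[train_mask], session_ids[val_mask])

Splitting at the session level, rather than at the data point level, is what prevents near-duplicate neighbouring frames from appearing on both sides of the split.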
3.5 ResNet mini

The chosen CNN architecture is based on the ResNet in [3], which is a deep convolutional network based on identity mappings, as explained in Section 2.2.7. ResNet mini is 14 layers deep and has 67 267 parameters when the input size is 512-by-128. It is implemented in PyTorch [43]. The reason why a mini-version of ResNet was chosen is partly to reduce the risk of overfitting, as described in Section 2.2.5, partly to reduce the time it takes to train the network, and partly to make it manageable for the hardware to execute. The network architecture consists of a normal block followed by six residual blocks and a fully connected layer at the end. The architecture is illustrated in Figure 3.10.

Figure 3.10: The ResNet mini architecture. The numbers denote the size of the filters used in the layer, followed by the number of filters used. In the cases where a stride other than 1 is used, it is stated at the end.

The normal block begins with a convolutional filter with kernel size 7-by-7, stride 2 and a padding of 3 pixels, followed by a batch normalisation layer and a ReLU, and lastly a max pooling layer with kernel size 3-by-3, stride 2 and a padding of 1 pixel. The first three resblocks can be summarised as taking 16 channels as input from the normal block and outputting eight activation maps from the last of the three resblocks. In the fourth resblock the number of filters is doubled to 16 and the input is downsampled by using a stride of 2. In order to match the dimensions for the residual mapping in this block, a linear projection with a convolutional filter with kernel size 1-by-1 and stride 1 is made. This linear projection is marked by a dashed line in Figure 3.10. The second and third resblocks are identical; both take 16 channels as input and output 16 channels. The last resblock is connected to a fully connected layer. All the layers in the resblocks consist of convolutional filters with a stride of 1, followed by a batch normalisation layer and a ReLU, except the first layer in the fourth resblock, which has a stride of 2. The parameters of the batch normalisation are defined in [3]. The code was greatly inspired by the code found in [44].

3.6 PointNet mini

The PointNet mini architecture is based on the architecture described in [4] and was implemented in PyTorch [43]. PointNet mini is a smaller network than its predecessor. It consists of 171 386 parameters, but still keeps the main architecture of the original PointNet. PointNet mini only performs point cloud classification of complete point clouds; no segmentation is done. The reasoning behind choosing a mini-version of PointNet is the same as for ResNet: both to reduce the risk of overfitting and to shorten computation times.

Figure 3.11: The PointNet mini architecture. The multiple stacked MLPs after each T-net module illustrate that the same MLP is used for all points. The numbers to the right represent the sizes of the layers in each MLP.

The architecture of PointNet mini is illustrated in Figure 3.11. The input point cloud consists of m points with dimension p. In the first T-net module an affine transformation matrix of size p × p is constructed from the point cloud. In this thesis p = 3, since the point clouds consist of three dimensions, as explained in Section 3.3. Each point is multiplied with the transformation matrix and used as input to the shared MLP with two layers. The layer output size is 16 for both layers, which is denoted by the numbers on the right side in Figure 3.11. The second instance of the T-net module serves the purpose of aligning the features. The network then classifies q binary classes.

3.7 Training and performance evaluation

The evaluation of both ResNet mini and PointNet mini has been done in several steps. The first step was a five-fold cross-validation of both networks on the T1,s dataset. A, MR and F1,µ are computed for all segment lengths in order to get a good estimate of the effect that the time aggregation of data has. Precision-Recall curves have then been produced for the best performing datasets for ResNet mini and PointNet mini respectively. This gives a good indication of how well the networks detect and classify VRUs on T1,s for all segment lengths. The reason why Precision-Recall curves are used instead of the more commonly used ROC curve is that the ROC curve can present an overly optimistic measure of a network's performance if there is a moderate to large class imbalance in the datasets [45]. The main reason behind this is the use of the false positive rate in the ROC curve, since a change in the proportion between positive and negative instances does not affect the ROC curve [46]. The parameter that is varied to obtain the curves is the threshold value of the binary classifiers.

In the second step of the evaluation, T1,s was used as training set and T2,s as validation set. By doing this, a good prediction of how well ResNet mini and PointNet mini generalise to new unseen scenarios is obtained. This evaluation was only done for the best performing segment lengths obtained when doing the five-fold cross-validation on T1,s. In the third and final step of the evaluation, a five-fold cross-validation was done on T2,s. Precision-Recall curves have been produced for both set-ups involving T2,s as well. The metrics in these two steps have only been computed with pedestrian and bicyclist as classes. This is to get as fair results as possible, since the T2,s dataset does not contain any cars.

Both networks were trained for 30 epochs each with a batch size of 32. Both networks also used Adam as optimiser. Adam was initialised with the standard settings described in 2.2.3. The classification threshold, which is the threshold value for when a binary classifier considers a probability to be true, is set to 0.5. The networks were evaluated with data filtered with CFAR and with unfiltered data. The reason for evaluating this input aspect of the problem was to ensure that important features were not lost in the filtering process. Feeding the networks with unfiltered data would eliminate one processing step and make the learning process more end-to-end compatible. In [47], benefits of end-to-end learning for self-driving cars are described.
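The training set-up described above can be summarised in a short PyTorch sketch. The model, dataset and exact loss formulation are placeholders (a per-label binary cross-entropy loss is an assumption, not stated in this section), while the hyperparameters follow Section 3.7: 30 epochs, batch size 32, Adam with its default settings and a 0.5 decision threshold.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader

    def train(model, dataset, num_epochs=30, batch_size=32, threshold=0.5):
        """Train one of the networks as q independent binary classifiers."""
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
        optimiser = torch.optim.Adam(model.parameters())   # default Adam settings
        criterion = nn.BCEWithLogitsLoss()                  # one sigmoid/BCE per label (assumed)

        for epoch in range(num_epochs):
            model.train()
            for x, y in loader:                             # y: (batch, q) binary labels
                optimiser.zero_grad()
                logits = model(x)
                loss = criterion(logits, y.float())
                loss.backward()
                optimiser.step()

        model.eval()
        # Thresholded predictions used when computing A, MR and F1,mu.
        return lambda x: (torch.sigmoid(model(x)) > threshold)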
3.8 Computer hardware

Certain hardware is preferable when training a network. To run a multi-layered CNN such as ResNet mini, a Graphics Processing Unit, GPU, is preferable. For this thesis, two different types of computers have been used: one stationary computer with a GeForce GTX 1060 with 6 GB GDDR5, and two Dell e6420 laptops with the specifications given in [48].

4 Results

In the following chapter the results are presented. The performance of ResNet mini and PointNet mini is evaluated on two sets of data, T1,s and T2,s. On the T1,s dataset, an investigation of how the performance is affected by using time aggregated data points is done. The segment lengths that yield the best performance, one per network architecture, are also evaluated by five-fold cross-validation on dataset T2,s. Lastly, the results from training on T1,s and using T2,s as validation set are presented.

In summary, it is shown that five-fold cross-validation on T1,s and on T2,s both yield accuracies A > 90%. Training on T1,s and testing on T2,s, on the other hand, does not yield results much better than a classifier that only makes random guesses. These results are visualised in Table 4.1.

Table 4.1: The table gives an indication of the three main results. It is viable to divide either of the datasets T1,s or T2,s and then train on one part of the divided dataset and test on the other. But the driving scenarios in the two datasets are too different to be able to train on T1,s, test on T2,s and get good performance.

  Results
  T1,s          ✓
  T2,s          ✓
  T1,s → T2,s   ✗

4.1 Five-fold cross-validation of T1,s

It can be seen in Figures 4.1a and 4.1b that the general performance in accuracy is higher for ResNet mini than for PointNet mini. The mean accuracy for the filtered and unfiltered data is 91.59% and 91.25% for ResNet mini, and 89.13% and 86.81% for PointNet mini. This means that ResNet mini, using filtered data and averaged over all segment lengths, performs 2.46 percentage points, pp, better than PointNet mini, and 4.44 pp better for the unfiltered data.

(a) (b)

Figure 4.1: The figures illustrate the change in accuracy A, as defined in Section 2.3, over the aggregation time of the data points. The values are the averages achieved in the five-fold cross-validation. The aggregated points correspond to a ∆t of 0.05, 0.15, 0.25, 0.35, and 0.50 seconds.
Figures 4.1 to 4.3 show that segment length s = 3 yields the highest scores for all metrics for ResNet mini, using both filtered and non-filtered data, and for PointNet mini when using non-filtered data. The best overall score for PointNet mini, however, is obtained with segment length s = 10 and filtered data. The best performing network of all is ResNet mini using three aggregated data points and filtered data. Note that the difference between using filtered and non-filtered data is marginal for ResNet mini, but substantial for PointNet mini. The scores for A, MR, and F1,µ are 92.59%, 81.14%, and 0.830 respectively for ResNet mini using s = 3 with filtered data, and 90.05%, 79.00%, and 0.771 for PointNet mini using s = 10 with filtered data. The results for the best performing configurations are summarised in Table 4.2.

Table 4.2: The performance of the two networks with the datasets yielding the highest scores, T1,3 and T1,10 respectively. The results are obtained by doing a five-fold cross-validation.

                 |   A   |  MR   | F1,µ
  ResNet mini    | 92.59 | 81.14 | 0.830
  PointNet mini  | 90.05 | 79.00 | 0.771

(a) (b)

Figure 4.2: Graphs of the change in exact match ratio MR, as defined in Section 2.3, over the aggregation time of the data points. The values are the averages achieved in the five-fold cross-validation. The aggregated points correspond to a ∆t of 0.05, 0.15, 0.25, 0.35, and 0.50 seconds.

(a) (b)

Figure 4.3: The change in F1,µ-score, as defined in Section 2.3, over the aggregation time of the data points. The values are the averages achieved in the five-fold cross-validation. The aggregated points correspond to a ∆t of 0.05, 0.15, 0.25, 0.35, and 0.50 seconds.

When doing a five-fold cross-validation, a certain spread of the different metrics is obtained. The spread of the accuracy is shown in Figures 4.4 and 4.5. The spread for a given segment length is defined by the maximum and minimum value of the accuracies obtained for that segment length; the largest and smallest spreads are shown in Table 4.3 for both ResNet mini and PointNet mini, with and without filtering of the data.

Table 4.3: The maximum and minimum spreads of accuracies when doing a five-fold cross-validation. The spread is given in pp.

             | ResNet mini   | PointNet mini
             |  max  |  min  |  max  |  min
  CFAR       |  6.38 |  2.06 |  7.50 |  1.49
  No Filter  |  6.44 |  2.73 |  7.30 |  3.64

In Table 4.3 the maximum and minimum values for ResNet mini with filtered data are obtained using s = 7 and s = 1 respectively. Without filtering the data, they are obtained with s = 10 and s = 7. The corresponding results for PointNet mini, in the same order, are obtained with s = 7, 1, 1 and 10.

(a) (b)

Figure 4.4: All accuracies obtained when doing a five-fold cross-validation on ResNet mini for segment lengths 1, 3, 5, 7, and 10. The image to the left shows the accuracies obtained when feeding the network CFAR-filtered data and the image to the right when feeding the network unfiltered data.

(a) (b)

Figure 4.5: All accuracies obtained when doing a five-fold cross-validation on PointNet mini for segment lengths 1, 3, 5, 7, and 10. The image to the left shows the accuracies obtained when feeding the network CFAR-filtered data and the image to the right when feeding the network unfiltered data.

4.1.1 Precision-recall curves and AUC-scores

Figure 4.6: Precision-recall curves for the two evaluated networks, with a dataset using segment length s = 3 for ResNet mini and segment length s = 10 for PointNet mini, for the two classes of VRUs, pedestrian and bicyclist.
Figure 4.6 shows the precision-recall curves of ResNet mini and PointNet mini for the segment lengths yielding the best performance. It is shown that ResNet mini performs better than PointNet mini for almost all thresholds, for both pedestrian and bicyclist. PointNet mini is only slightly better for very low threshold values for both classes. The AUC-scores are presented in Table 4.4, where it is clear that ResNet mini performs better than PointNet mini for the class pedestrian and slightly better for the class bicyclist.

Table 4.4: AUC-scores for ResNet mini and PointNet mini for the classes pedestrian and bicyclist.

                 | Pedestrian | Bicyclist
  ResNet mini    |   0.968    |   0.905
  PointNet mini  |   0.903    |   0.856

4.1.2 Training convergence rate of T1,s

Figures 4.7a and 4.7b illustrate the rate at which ResNet mini's performance converges. The accuracy values for each dataset are the average values over all five fold iterations. Both figures show that ResNet mini converges after approximately 5 epochs. After 5 epochs there is no substantial increase in accuracy for any of the datasets.

(a) ResNet mini with CFAR-filtered data. (b) ResNet mini with unfiltered data.

Figure 4.7: The figures illustrate the training convergence for ResNet mini on the CFAR-filtered and unfiltered datasets. The change in accuracy, A, is shown over trained epochs for the datasets consisting of 1, 3, 5, 7 and 10 aggregated data points.

In Figures 4.8a and 4.8b the corresponding convergence rates are shown for PointNet mini. PointNet mini is shown to require more epochs of training than ResNet mini before its performance stagnates. At which epoch PointNet mini converges varies depending on which dataset the network is trained on. For example, PointNet mini's performance on the CFAR dataset consisting of 10 aggregated data points still shows a slight increase up to epoch 30, while the unfiltered dataset with 10 aggregated data points does not show a similar development. For the unfiltered data there is no significant rise in performance after epoch 10.

(a) PointNet mini with CFAR-filtered data. (b) PointNet mini with unfiltered data.

Figure 4.8: The figures illustrate the training convergence for PointNet mini on the CFAR-filtered and unfiltered datasets. The change in accuracy, A, is shown over trained epochs for the datasets consisting of 1, 3, 5, 7 and 10 aggregated data points.

4.1.3 PointNet sample size effect

Due to the restrictions set by the hardware, it was not possible to use the complete point clouds when training PointNet mini. Instead, a sample of each point cloud was used. Figures 4.9a, 4.9b and 4.9c illustrate the effect that the sample size has on the performance of PointNet mini. The metrics seen in the figures are the average values from a five-fold cross-validation on the CFAR dataset with five time aggregated data points. As described in Table 3.3, the average point cloud of this configuration consists of 885 points. All three figures show that an increased sample size enhances the performance of the network. However, when the sample size exceeds the average point cloud size, the performance seems to stagnate.

(a) (b) (c)

Figure 4.9: The figures illustrate the impact that the sample size has on the performance of PointNet mini. The change in accuracy A, exact match ratio MR and F1,µ is shown over the number of sampled points.
4.2 T1,s as train set and T2,s as validation set

The results obtained when T2,s is used as validation set and T1,s as training set can be seen in Table 4.5 and in Figures 4.10a, 4.10b, 4.10c and 4.10d. The two networks are trained and validated with the time aggregation length that resulted in the best performance in the five-fold cross-validation of T1,s. Hence, PointNet mini is trained and validated with time aggregation length 10 and ResNet mini with time aggregation length 3. The results from the five-fold cross-validation of T1,s are described and visualised in Section 4.1.

Table 4.5: The performance of the two networks using the datasets yielding the highest scores, T1,3 and T1,10, to train on, and validating on the corresponding datasets T2,3 and T2,10.

                 |   A   |  MR   | F1,µ
  ResNet mini    | 65.04 | 37.64 | 0.264
  PointNet mini  | 64.63 | 37.69 | 0.370

The confusion matrices in Figure 4.10 all show the same behaviour of predicting the negative class for a majority of the examples. When predicting the positive class, however, all four classifiers are more likely to be wrong than correct. In total, ResNet mini makes 70% correct predictions for the bicyclist classifier and 61% for the pedestrian classifier. PointNet mini makes 60% correct classifications for bicyclist and 69% correct classifications for pedestrian.

(a) ResNet mini confusion matrix heatmap for bicyclist. (b) ResNet mini confusion matrix heatmap for pedestrian. (c) PointNet mini confusion matrix heatmap for bicyclist. (d) PointNet mini confusion matrix heatmap for pedestrian.

Figure 4.10: Confusion matrices for the classes bicyclist and pedestrian. The two matrices on the left are the results of ResNet mini and the two matrices on the right are the results of PointNet mini. The values in the confusion matrices correspond to the fraction of the total number of executed classifications.

The precision-recall curves in Figure 4.11 show that both PointNet mini and ResNet mini have difficulties in detecting pedestrians for all threshold values. PointNet mini is shown to be slightly better, with a better precision for the pedestrian class. Both networks perform somewhat better on bicyclist classifications, with a slight advantage for ResNet mini. In Table