On Classification of Road Types for
Automotive Applications
Master’s thesis in Complex Adaptive Systems

JEANETTE WARNBORG

Department of Electrical Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2018


Master’s thesis EX005/2018

On Classification of Road Types for
Automotive Applications

JEANETTE WARNBORG

Department of Electrical Engineering
Division of Systems and Control

Chalmers University of Technology
Gothenburg, Sweden 2018


On Classification of Road Types for Automotive Applications
JEANETTE WARNBORG

© JEANETTE WARNBORG, 2018.

Supervisor: Kenny Karlsson, Aptiv
Examiner: Jonas Fredriksson, Department of Electrical Engineering

Master’s Thesis EX005/2018
Department of Electrical Engineering
Division of Systems and Control
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Image captured from one of Aptiv’s test vehicles.

Gothenburg, Sweden 2018

iv


On Classification of Road Types for Automotive Applications
JEANETTE WARNBORG
Department of Electrical Engineering
Chalmers University of Technology

Abstract
One of the challenges within autonomous driving is for the vehicle to determine what
kind of surrounding environment it is operating in. This information could assist
the vehicle in its decision making. In this project two methods, based on neural
networks and support vector machines, for determining road types using radar and
image data have been compared. The road types were divided in to three classes,
highway, major road and city. The best result was achieved by a neural network
using image data. Using radar data gave the worst results and that had a negative
effect on the classification using radar and image data in combination.

Keywords: Autonomous driving, Neural networks, Support vector machines

v


Acknowledgements
I would like to thank Kenny Karlsson for his support throughout all parts of the
project. Furthermore I would also like to thank the examiner Jonas Fredriksson for
his help and feedback.

Jeanette Warnborg, Gothenburg, 2018

vii


Contents

List of Figures xi

List of Tables xiii

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Specification of issue under investigation . . . . . . . . . . . . . . . . 3
1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Supervised learning 5
2.1 Neural network (NN) . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.1 Activation function . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.3 Multilayer perceptron (MLP) . . . . . . . . . . . . . . . . . . 7
2.1.4 Convolutional neural network (CNN) . . . . . . . . . . . . . . 7

2.1.4.1 Pooling layer . . . . . . . . . . . . . . . . . . . . . . 8
2.1.5 Tensorflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Implementations and extensions . . . . . . . . . . . . . . . . . 10

2.3 Histogram of Oriented Gradients (HOG) . . . . . . . . . . . . . . . . 10

3 Data 13
3.1 Data gathering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Radar data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.5 Extracting features from vision data for SVM . . . . . . . . . . . . . 14

4 Algorithms 15
4.1 Neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.1.1 Image classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1.2 Radar classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.3 Combined network . . . . . . . . . . . . . . . . . . . . . . . . 17

ix


Contents

4.2 Support vector machine . . . . . . . . . . . . . . . . . . . . . . . . . 17

5 Results 19
5.1 NN – Image classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.2 NN – Radar classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.3 NN – Combined classifier . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.4 SVM – Image classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.5 SVM – Radar classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.6 SVM – Combined classifier . . . . . . . . . . . . . . . . . . . . . . . . 22

6 Discussion 25
6.1 Neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

6.1.1 Image classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.1.2 Radar classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.1.3 Combined classifier . . . . . . . . . . . . . . . . . . . . . . . . 26

6.2 MSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

7 Conclusion 29

x


List of Figures

1.1 Sample images from the different environments. . . . . . . . . . . . . 2
1.2 High level sketch of the classification pipeline. Radar data, images

and the two combined will serve as input. The classification method
is either a neural network or a support vector machine and the output
is the assigned class of the corresponding input. . . . . . . . . . . . . 3

2.1 A simple description of a neural network with two layers. The data is
fed through the layers, producing a probability vector of the different
classes from which the output class can be found by taking the argmax
of the output vector. . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 Example of structure for a simple MLP. . . . . . . . . . . . . . . . . . 8
2.3 Three cases of classification using support vector machines. Blue

triangles and orange dots represent samples from two classes. The
solid line is the decision boundary that separates the classes. The
distance between the solid line and the dotted lines is the margin.
The dotted lines are parallel to the solid line and intersect the support
vector(s). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4 To the left: A picture divided in patches reduced to the gradient
vector of each pixel. To the right: The histogram of oriented gradients
for a single patch. Image courtesy of Gil (2013). . . . . . . . . . . . . 12

4.1 A schematic view of the image classification network. . . . . . . . . . 16
4.2 A schematic view of the radar classification network. . . . . . . . . . 16
4.3 A schematic view of the combined classification network. . . . . . . . 17

5.1 Combined confusion matrix heatmaps of all the runs for the different
NN classifiers. In the confusion matrix the element at index i, j cor-
respond to the number of samples from class j classified as class i.
The confusion matrix for a perfect classifier would have 0 at the off
diagonal elements. The sum of row i corresponds to the total amount
of samples that was classified as class i and the sum of column j
corresponds to the total number of samples of class j. . . . . . . . . . 21

xi


List of Figures

5.2 Confusion matrix heatmaps for the different SVM classifiers. In the
confusion matrix the element at index i, j correspond to the number
of samples from class j classified as class i. The confusion matrix for
a perfect classifier would have 0 at the off diagonal elements. The
sum of row i corresponds to the total amount of samples that was
classified as class i and the sum of column j corresponds to the total
number of samples of class j. . . . . . . . . . . . . . . . . . . . . . . . 23

xii


List of Tables

3.1 Table of tracklet measurements. . . . . . . . . . . . . . . . . . . . . . 14

5.1 Highest accuracy reached for different methods and inputs. Since the
NNs were evaluated many times the mean and standard deviation is
presented. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

5.2 The accuracy for the runs using the CNN on the image data with the
mean and standard deviation. For both the test set and the entire
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

5.3 The accuracy for the runs using the CNN on the radar data with the
mean and standard deviation. For both the test set and the entire
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

5.4 The accuracy for the runs using the CNN on the combined data with
the mean and standard deviation. For both the test set and the entire
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

xiii


List of Tables

xiv


1
Introduction

In recent years the area of autonomous driving have gained a lot of attention in
both the public and scientific communities. For a car to be able to drive itself safely
and efficiently, it need to be able to make at least as good decisions as a human
driver. This problem is not solved trivially and one crucial part of this problem
is how the car should determine in what kind of environment it is driving. The
aim of this thesis is to further investigate how such a classification system could be
implemented.

1.1 Background

The idea of fully autonomous vehicles are getting more real rapidly. This means
that the vehicles has to make smart and safe decisions.

Aptiv’s Active Safety department is developing sensors used by vehicle manu-
facturers worldwide. The sensor data is used as input to numerous advanced active
safety functions such as Adaptive Cruise Control, Automatic Emergency Braking,
Queue Assist and are also being used in the development of autonomous driving
systems. An autonomous driving system needs to be able to identify and adapt
to its surroundings, e.g. drive slowly when pedestrians are nearby or when driving
on bad roads. To enable the car to make these decisions it need to have enough
information about its surrounding environment. A study comparing how well dif-
ferent classification methods perform in different environments and in determining
whether the car is driving on- or off-road showed that it is possible to classify the
surrounding with a high level of accuracy (Tang and Breckon, 2011). As input they
use images from which they extract regions of interest. For this thesis the input will
not only contain images but also radar tracking data.

1.2 Contributions

The main contributions of this project are the comparison of two different classifi-
cation methods (neural networks and support vector machines) and three different
data types (radar, images and radar and images combined). Also included is an in
depth discussion regarding future work and how to further improve the results.

1


1. Introduction

1.3 Aim

This master thesis will serve to investigate how various sensors, such as radars and
cameras, can be used to assess the situation around the vehicle. More specifically
we will classify sensor data from a car in to one of three different environments:
Highway have at least two driving lanes separated with a central barrier, with no
sharp turns, high speed and normally no pedestrians nearby. Major road is similar
to the highway but can have a single lane, has a slightly lower speed, more turns and
not necessarily a central barrier. City environments can look quite different with
a mixture of pedestrians, cyclists, cars and public transportation vehicles. Samples
from these three classes are presented in Figure 1.1. The classification will be done
in order to be able to determine e.g. maximum appropriate speed in real-time
depending on what type of road the vehicle is being driven. Further on this could
be used to limit the maximum speed for a vehicle in a specific environment. A high
level sketch of the classification pipeline is shown in Figure 1.2.

(a) Sample image from the highway
data set.

(b) Sample image from the major data
set.

(c) Sample image from the city data
set.

Figure 1.1: Sample images from the different environments.

2


1. Introduction

   Input    OutputClassification 
method

Figure 1.2: High level sketch of the classification pipeline. Radar data, images
and the two combined will serve as input. The classification method is either a

neural network or a support vector machine and the output is the assigned class of
the corresponding input.

1.4 Limitations
In this thesis we will only consider three previously mentioned environments. The
data used to train the algorithms will only represent these three situations. We also
assume favorable conditions, both to simplify the gathering of the data but also the
classification. This corresponds to data gathered during day time in nice weather
without rainfall.

1.5 Specification of issue under investigation
In this project the main focus will be on the following questions:

• Using tracker data from radars together with images from the front camera,
is it possible to determine if the car is driving on a highway, major road or in
an urban environment?

• With what accuracy can the road type be classified?
• Which classification methods gives the highest accuracy?

1.6 Outline
This thesis is organized as follows: In chapter 2 the theory behind neural networks
and support vector machines is presented together with a short section on image
preprocessing. The setup of the data collection and the data used for this project is
described in chapter 3. In chapter 4 the classification algorithms that were evaluated
are described in detail. The results that were found for the different methods and
data types is presented in chapter 5. Finally, in chapters 6 and 7 a discussion of the
results and methods is found together with the conclusion.

3


1. Introduction

4


2
Supervised learning

In this chapter we introduce the techniques and concepts used in this project. More
specifically we show how to classify different road types using both neural networks
and support vector machines. Firstly, in sections 2.1, 2.1.1 and 2.1.2 we give a basic
introduction to neural networks and in sections 2.1.3 and 2.1.4 we describe the two
different types of neural networks that will be used for this project. In section 2.1.5
we give a short introduction to TensorFlow, the framework used to implement the
neural networks. Secondly, in section 2.2 we give a basic introduction to support
vector machines and in sections 2.2.1 and 2.2.2 we describe the kernels, types and
extensions that will be used for this project. In section 2.3 we describe a method to
extract features from images to be used as input to the support sector machine.

2.1 Neural network (NN)
A neural network could be described as a computational graph where each node
performs some kind of computation. A node in such a graph is often referred to as
a layer and each layer consists of one or many neurons. In Figure 2.1 a visualization
of a simple network can be seen. Typically the graphs are directed and acyclic.
Neural networks can be used for both unsupervised and supervised learning and we
will from now on limit this introduction to explain supervised learning only. This
requires labelled data for the training, which often needs to be manually labeled.
The layers can have different layouts and be advantageous for different tasks. Also by
combing different types of layers it is possible to build networks that are optimized
for different purposes. This introduction will, however, focus on the architecture
suitable for classification of radar and vision data, both separately and combined.
The output from a node in one layer multiplied with the specific weight connecting
that node to a node in the next layer added to a bias will serve as input to the node
in the next layer:

Ii+1 = f(Iiwi + bi) ,

where Ii+1 is the output from layer i and the input to layer i + 1. wi is the weight
matrix for layer i, bi is the biases for layer i and f is the activation function.

In the case of classification the output from the final layer is often the class
affiliation probability distribution. This is compared to the actual labels and the
error is computed, often with the squared error

Etotal = 1
2
∑

(target− output)2 . (2.1)

5


2. Supervised learning

Input

First layer/node

Neurons

Second layer/node

Class probabilities

argmax

Output

Figure 2.1: A simple description of a neural network with two layers. The data is
fed through the layers, producing a probability vector of the different classes from
which the output class can be found by taking the argmax of the output vector.

Since we want the network to produce the same output as the target, (2.1) can be
viewed as the objective function to a minimization problem. To accomplish this
an algorithm that iteratively updates the weights and biases, also known as back-
propagation, described in section 2.1.2, is most often used. A more comprehensive
introduction to neural networks is found in Shanmuganathan and Samarasinghe
(2016).

2.1.1 Activation function
The activation functions of a neural network serve two purposes. In the output layer
the activation function is used to produce an output in the same form as the target.
For the other layers the activation function corresponds to a nonlinear transform
of the data that is passed between the layers. One of the most common activation
functions is the sigmoid function

h(x) = 1
1 + e−x

with the codomain equal to the interval (0, 1). Another common activation function
that often is used in the output layer of a classification network is the softmax
function. It will normalize a vector x with arbitrary (real) values to a vector σ with
values between 0 and 1 and unit length,

σj(x) = exj∑K
i=1 e

xi
, j = 1, . . . , K ,

whereK is the number of classes. Finally the rectified linear unit activation function
(ReLu)

R(x) = max(0, x)
has increased in popularity in recent years. Results have shown that the time needed
to train networks is reduced significantly when using this function in favor of e.g.
sigmoid or tanh function (Krizhevsky et al., 2012).

6


2. Supervised learning

input: target t and data x
init weights w and biases b repeat

prediction p← forwardpropagation(x)
for all layers in reverse order do

compute error δ(p, t)
compute ∆w(δ)
compute ∆b(δ)
w ← w + ∆w
b← b+ ∆b

end
until stopping criteria is reached;
Algorithm 1: Pseudo code for a backpropagation algorithm. How to compute ∆w
and ∆b depends on which optimization algorithm used.

2.1.2 Optimization
By minimizing (2.1) the performance of the network will improve. One way of
doing this is by using gradient based methods, such as stochastic gradient decent or
Adam (Kingma and Ba, 2014), applying them from the output of the network and
backwards. By doing this iteratively while passing new data through the network, it
will gradually adapt the weights and biases to the data. The algorithm is commonly
know as Backpropagation and is described in Algorithm 1.

One of the crucial steps in the algorithm is to compute the gradients between
the layers. If the absolute value of the input to the sigmoid activation function is
large, the gradient will go towards zero. When this happens the weights and biases
of those neurons of the network will not be updated which can eventually lead to
’dead’ neurons in the network. This is why it is important to initialize the weights
and biases carefully and also to normalize the input data so that the input to the
activation function does not grow too large.

It is common to pass the entire training set through the backpropagation al-
gorithm many times and each pass of the complete training set is called an epoch.
Often, the time it takes to train a network is defined by the number of epochs needed
to train the network.

2.1.3 Multilayer perceptron (MLP)
In a MLP all layers are fully connected, that is all neurons in the previous layer
act as input to all neurons in the next layer. This is how layers in regular neural
networks are connected. A MLP consists of at least three layers, an input layer, one
or many hidden layers and an output layer. The hidden layers are used for nonlinear
fit to the data. In Figure 2.2 we can see an example of the architecture of a MLP.

2.1.4 Convolutional neural network (CNN)
A convolutional neural network is used for tasks such as image classification. The
convolutional layer of a network is based on the assumption that a layer can form
a general impression of the input even though a single neuron is only allowed to

7


2. Supervised learning

Input #1

Input #2

Input #N

Prob. class 1

Prob. class 2

Prob. class 3

Hidden
layer

Input
layer

Output
layer

Figure 2.2: Example of structure for a simple MLP.

observe a small part of the input rather than the entire input. A convolutional layer
consists of multiple filters that realize the assumption above. Eventually these filters
will learn to detect different defining features of the input, such as edges in an image
(Ciresan et al., 2012).

2.1.4.1 Pooling layer

After a convolutional layer a pooling layer can be added to downsample the size of
the output. It serves two purposes, to reduce the computational complexity and to
control overfitting. This is done by applying a aggregation function on a moving
window over the data and combining the results. For images, it is common to use
pooling layers with a window size of 2×2 that will output the maximum value from
each window. In addition to the window size and aggregation function there is one
more parameter, the stride, which controls the number of steps the window is moved
in each direction.

2.1.5 Tensorflow
The framework to implement neural networks used for this project is called Ten-
sorFlow (Abadi et al., 2015). It is a framework developed by Google which is a
very powerful and configurable framework written in C++ and Python. It supports
distributed and GPU accelerated training for efficient training on large datasets.

2.2 Support Vector Machine (SVM)
A SVM is a binary classification algorithm. The algorithm uses supervised learning
to create a hyperplane that separates the data in the two classes with the largest
possible margin. The standard implementation used today was first proposed by
Cortes and Vapnik (1995).

Given a linearly separable dataset:

(x̄1, y1), (x̄2, y2), . . . , (x̄n, yn)

8


2. Supervised learning

where x̄i is a p-dimensional data point and yi ∈ {−1, 1} indicates its corresponding
class, the SVM algorithm finds a (p− 1)-dimensional hyperplane that separates the
two classes. This plane is characterized by the parameters b and w̄ and the classifier
can be defined as

ynew = sgn(w̄ · x̄new − b) .
To achieve a robust classifier, the margin is maximized by solving the optimization
problem

minimize
w̄

||w̄||

subject to yi(w̄ · x̄i − b) ≥ 1, i = 1, . . . , n .
(2.2)

The resulting hyperplane is completely determined by the x̄i nearest to it which are
also called support vectors. This formulation of the SVM is also know as a hard
margin SVM. An example of a hard margin SVM is illustrated in Figure 2.3a, in
which the data is linearly separable.

For data that is not linearly separable the optimization problem to minimize is
instead  1

n

n∑
i=1

max(0, 1− yi(w̄ · x̄i − b))︸ ︷︷ ︸
=0 if x̄i lies on the correct

side of the margin

+ λ||w̄||2

where λ is a regularization parameter that determines the trade off between the size
of the margin and ensuring that x̄i lies on the correct side of the margin (Wu and
Liu, 2007; Cortes and Vapnik, 1995). The parameter λ is often found by the use
of cross validation. This formulation of the SVM is know as a soft margin SVM.
An example of a soft margin SVM is illustrated in Figure 2.3b, in which the data
is linearly separable with some data points that is located on the wrong side of the
margin.

2.2.1 Kernels
For data that is not linearly separable and clustered in a way that makes separation
by a straight line meaningless the kernel-trick can be used. The idea is to have a
function that maps the feature vectors to a high dimensional space in which the data
is linearly separable. If this function is non-linear the resulting classifier will also
be non-linear in the original feature space (Theodoridis and Koutroumbas, 2008).
More specifically, the kernel function

K : Rm × Rm → R

is a similarity measure between two feature vectors and defined as the dot product
of the high dimensional mapping functions, ϕ:

K(x̄i, x̄j) = ϕ(x̄i) · ϕ(x̄j) .

A classifier can then be defined as

ynew = sgn
([

l∑
i=1

ciyiK(x̄i, x̄new)
]
− b

)

9


2. Supervised learning

where ci and b are coefficients found by optimization and l is the number of support
vectors. Note that only the support vectors are needed to define the classifier.

Some of the most common kernels are polynomial and radial basis functions
(RBF). The polynomial kernel is defined as

K(x̄i, x̄j) =
(
x̄>i x̄j + c

)d

where c and d are parameters of the kernel. The RBF kernel is defined as

K(x̄i, x̄j) = exp
(
−||x̄i − x̄j||2

2σ2

)

where σ is a parameter of the RBF kernel. An example of a SVM with a RBF kernel
can be seen in Figure 2.3c.

2.2.2 Implementations and extensions
One of the most common types of the SVM algorithm is the C-SVM (Schölkopf et al.,
2000). It has a parameter C that controls the regularization. Another common type
is the ν-SVM that very similar to C-SVM, with the only difference that C ∈ [0,∞)
and ν ∈ [0, 1]. Both of these are implemented in the LIBSVM library (Chang and
Lin, 2011).

A common extension is the Multi-class SVM (MSVM) that allows for classifica-
tion of more than two classes. Two of the approaches are one-versus-one classification
and one-versus-all classification (Duan and Keerthi, 2005; Hsu and Lin, 2002). In
one-versus-one classification a single classifier for each pair of classes is trained and
the data points are evaluated in every classifier. The labels for these data points
are then decided by a majority vote by the classifiers. In one-versus-all classification
one classifier for each class is trained. A classifier for class i is trained with class i
as positive label and all other classes with negative label.

2.3 Histogram of Oriented Gradients (HOG)
One way to accentuate important features in an image and at the same time reducing
the size of it is to calculated the HOG of the image (Dalal and Triggs, 2005). This
method is used on gray-scale images and done as follows. First, the images is divided
in patches. For each pixel in a patch the gradient vector x̄i = (x1,i, x2,i), is defined
as (

x1,i

x2,i

)
=
(
xleft,i − xright,i
xabove,i − xbelow,i

)
where xleft,i, xright,i, xabove,i and xbelow,i corresponds to the values of the pixels to
the left, right, above and below pixel i. Secondly, the magnitudes of the gradients
is aggregated in a histogram over the gradient directions. An illustration of this
procedure is given in Figure 2.4. Finally, the histograms from all patches are con-
catenated into a single feature vector that can be used as input to a classification
algorithm, such as a SVM.

10


2. Supervised learning

(a) Hard margin SVM with linearly
separable data. All data points are

classified correctly.

(b) Soft margin SVM that allows for
wrongfully classified data points in
order to maximize the margin.

(c) Linearly inseparable data classified
using RBF kernel.

Figure 2.3: Three cases of classification using support vector machines. Blue
triangles and orange dots represent samples from two classes. The solid line is the
decision boundary that separates the classes. The distance between the solid line
and the dotted lines is the margin. The dotted lines are parallel to the solid line

and intersect the support vector(s).

11


2. Supervised learning

Figure 2.4: To the left: A picture divided in patches reduced to the gradient
vector of each pixel. To the right: The histogram of oriented gradients for a single

patch. Image courtesy of Gil (2013).

12


3
Data

In this chapter a detailed description of the data used in this project is presented.
The method for gathering the data is introduced in section 3.1, this is followed by
more detailed descriptions of the method in sections 3.2 and 3.3. In sections 3.4 and
3.5 the radar and image data preprocessing is outlined.

3.1 Data gathering
The data for this project was gathered using a car equipped with a front camera
and a front radar. A route containing the three environments (highway, major road
and city) selected for classification was chosen. The data was gathered on a single
day and is evenly distributed between the environments. The data from these logs
where used to construct the training, validation and test data sets.

3.2 Setup
The data were collected using a car equipped with a RACam forward-looking cam-
era, capable of detecting vehicles, pedestrians, bicyclists, road edges, lane markings
and traffic signs and a RACam radar, which is an electronically scanned 76 GHz
automotive radar mounted behind the windshield.

A typical video log is about one minute long and has approximately 1448 frames,
this means that a lot of the frames look really similar. From these frames every 50th
frame is extracted and used for the training and validation with the corresponding
radar data. The frames from the video files were extracted as gray scale images of
size 640 × 480. The result was images in the same format as the sample images in
Figure 1.1.

The radar data used in this project is not raw unprocessed signals from the
radar. With software from Aptiv the detections from the radar are grouped to-
gether and tracked as possible objects. These objects are given statuses and real
valued measurements for different properties and is stored in a struct. Four of these
measurements are listed in Table 3.1.

3.3 Scope
Since all data was gathered at the same time during a single day, the outer conditions,
like weather and time of day, is relatively invariant across the data set. This is

13


3. Data

Table 3.1: Table of tracklet measurements.

Name Description
radar_cross_section A measure how detectable an object is by radar
vcs_long_posn Longitudinal position of object w.r.t. host
vcs_lat_posn Latitudinal position of object w.r.t. host
vcs_long_vel Longitudinal velocity of object w.r.t. host

important to make sure that the classifier is not biased towards e.g. certain rain,
sun or daylight conditions. From the generated data, some of the radar data and
the video logs were extracted.

3.4 Radar data preprocessing
There are 64 slots for objects in both long and mid range which means that a
maximum of 128 objects can be stored and used. Not all of these properties contains
relevant information for this project. First, four different properties are selected and
extracted from the initial struct. For each frame in the video log the mid and long
range data is found and stacked together, resulting in a feature vector with length
128. Each feature was normalized to zero mean and unit variance.

3.5 Extracting features from vision data for SVM
Since SVMs does not work well with large amounts of data it is important to extract
important features from the images for the training. This was done by computing
the HOG for each image. Additionally the images were randomly skewed to reduce
the negative effects of any unwanted rotation of the camera.

14


4
Algorithms

In this chapter the algorithms used for classifying the different road types are de-
scribed. First, a description of the shared configuration between the neural networks
is given in section 4.1. This is followed by more detailed descriptions of the neu-
ral networks specific configurations are presented in sections 4.1.1, 4.1.2 and 4.1.3.
Finally, the setup for the MSVM is presented in section 4.2.

4.1 Neural network
Three NNs were implemented. For classification of the radar data a MLP was
used. For the vision and combined data two different CNNs were implemented. For
the convolutional and fully connected layers the weights were initialized as random
numbers drawn from a truncated normal distribution with mean zero and standard
deviation 0.05. In the truncated normal distribution values are re-sampled if their
magnitude is more than 2 standard deviations from the mean. The biases were
initialized to a constant value close to zero, 0.05. In all networks the Adam optimizer
was used with learning rate λ = 10−4 and other parameters β1 = 0.9, β2 = 0.999,
and ε = 10−8 as defined by Kingma and Ba (2014).

The data for the different networks was divided in three sets, a training, a val-
idation and a test set. The size of the training set contained 3/5 of the complete
set and the test and validation sets contained 1/5 each. The validation set was
used during training to find the best network parameters and avoid overfitting. For
each epoch the validation set was evaluated and when a new minimum value for the
validation loss was found the current network was saved.

The test set was used after the training was done to evaluate the accuracy of the
network on never before seen data.

Because of the randomized initialization of the network parameters each network
was trained and evaluated using different seeds to the random number generators.
In total, for each data type four separate networks were trained and evaluated. This
was done to be able to give a more robust estimate of the accuracy.

4.1.1 Image classifier
For the image classification a CNN was implemented. The input consisted of images
as described in section 3.2, but resized to 128× 128.

The network structure consists of three convolutional layers. The first two layers
had 32 filters of size 3 × 3. The third layer had 64 filters of size 3 × 3. Each

15


4. Algorithms

conv 
+relu

softmaxpool

image 
input output

flattening fully 
connected

+relu

{  x3

x2

Figure 4.1: A schematic view of the image classification network.

convolutional layer was followed by a max pooling layer and a ReLu activation
function. The pooling layer had a window size of 2× 2 and a sliding window stride
of 2× 2. The convolutional layers were followed by a flattening layer that reshapes
the data to a one dimensional vector and two fully connected layers with 256 neurons
in each layer. Finally, to produce an output with the probability distribution over
the classes the softmax function was used. The network is visualized in Figure 4.1.

4.1.2 Radar classifier

To classify the radar data a simple MLP was used. Out of the available signals from
Table 3.1, radar_cross_section and vcs_long_posn were selected as input to the
network. We choose to use only two signals for computational simplicity. Empirical
tests showed that these two signals gave acceptable results. The network structure

hidden hidden

radar 
input

softmax

output

Figure 4.2: A schematic view of the radar classification network.

consisted of an input layer for the 256 features, two hidden layers with 200 and 256
nodes respectively and an output layer for the class probabilities. The activation
function used for the hidden layers was the sigmoid function. The activation function
used for the output layer was the softmax function. The network is visualized in
Figure 4.2.

16


4. Algorithms

conv 
+relu

softmax
pool

image 
input output

flattening
fully 

connected
+relu

{  x3
x2

radar 
input

Figure 4.3: A schematic view of the combined classification network.

4.1.3 Combined network
The first part of the combined network is the same as the first part of the image
classifier. Three convolutional layers with max pooling and ReLu activation func-
tions followed by a flattening layer. The input to these layers consisted only of
the images. The output from the flattening layer was stacked together with the
same radar data that were used in the radar classifier. This served as input to the
next, fully connected layer. This was followed by another fully connected layer with
three outputs that were fed through a softmax function to produce the probability
distribution. The network is visualized in Figure 4.3.

4.2 Support vector machine
Three one-versus-one MSVM classifiers were implemented, one for classifying image
data, one for radar data and one for the combined data. The classifiers were built
using the scikit-learn library (Pedregosa et al., 2011).

In order to get an unbiased result, the entire data set was split in to a test
and training set. The test set was used to evaluate the classifiers. The training
set was used to first find the best value of c and then used for training of the
classifiers. The selection of c was done using 10-fold cross validation as follows: For
each c ∈ {10−7, 10−6, . . . , 107} one classifier was trained and evaluated for each fold.
The c-value for which the classifiers had the highest average validation accuracy over
the 10 folds was chosen for the final classifier.

The radar data for the MSVM consisted of the same signals as for the MLP, as
described in section 4.1.2. The input to the image classifier consisted of their cor-
responding HOGs, as described in section 3.5. The input to the combined classifier
consisted of the stacked of the input to the radar and image classifiers. Additionally
the data was scaled before training and validation according to

xi ←
xi − xmin

xmax − xmin

where xi corresponds the value of the ith feature, xmin and xmax corresponds to the

17


4. Algorithms

minimum and maximum values for the ith feature across the entire data set.
A RBF kernel was used for the image and combined classifiers and a polynomial

for the radar classifier.

18


5
Results

In this chapter the results from the different classifiers and data sources are pre-
sented. The results for the NN and SVM classifiers are presented in the following
sections.

In Table 5.1 a summary of the test set accuracy of the classifiers is presented. In
total the best result is achieved using image data only with the CNN. We can also
note that the accuracy using the radar data is low compared to the image data.

Table 5.1: Highest accuracy reached for different methods and inputs. Since the
NNs were evaluated many times the mean and standard deviation is presented.

Radar data Image data Combined data
NN 49.0± 1.2% 96.9± 1.0% 94.8± 1.2%
SVM 59.2% 95.2% 92.1%

5.1 NN – Image classifier

In Table 5.2 the results from the runs using the CNN with image data is presented.
For the test data the average accuracy and standard deviation was 96.9±1.0 % and
for the entire data set 99.0± 0.4 %.

The confusion matrix for this classifier is presented in Figure 5.1a. It shows that
the classifier does not have any significant bias towards any of the classes. The most
common misclassification is major or highway classified as city.

Table 5.2: The accuracy for the runs using the CNN on the image data with the
mean and standard deviation. For both the test set and the entire data.

Test data All data
Run 1 98.1% 99.4%
Run 2 97.3% 99.2%
Run 3 95.8% 98.6%
Run 4 96.3% 98.6%
Mean 96.9% 99.0%
Std 1.0% 0.4%

19


5. Results

5.2 NN – Radar classifier
In Table 5.3 the results from the runs using the MLP with radar data is presented.
For the test data the average accuracy and standard deviation was 49.0±1.2 % and
for the entire data set 56.2± 2.4 %. Compared to the other two networks the time
per epoch was much shorter. The number of epochs needed to reach 100% training
accuracy and to show signs of overfitting was also much lower for this network.

The confusion matrix for this classifier is presented in Figure 5.1b. It shows that
the classifier is heavily biased towards the highway class and that it is unable to
classify samples from the major class with any significant accuracy.

Table 5.3: The accuracy for the runs using the CNN on the radar data with the
mean and standard deviation. For both the test set and the entire data.

Test data All data
Run 1 47.2% 52.9%
Run 2 49.7% 56.0%
Run 3 49.3% 57.4%
Run 4 49.7% 58.5%
Mean 49.0% 56.2%
Std 1.2% 2.4%

5.3 NN – Combined classifier
In Table 5.4 the results from the runs using the CNN with image data is presented.
For the test data the average accuracy and standard deviation was 94.8±1.2 % and
for the entire data set 98.2± 0.5 %.

The confusion matrix for this classifier is presented in Figure 5.1c. Similar to the
confusion matrix for the image data, this classifier does not have any significant bias
towards any of the classes. The most common misclassification is major or highway
classified as city.

Table 5.4: The accuracy for the runs using the CNN on the combined data with
the mean and standard deviation. For both the test set and the entire data.

Test data All data
Run 1 95.2% 98.5%
Run 2 96.1% 98.7%
Run 3 94.6% 98.1%
Run 4 93.3% 97.6%
Mean 94.8% 98.2%
Std 1.2% 0.5%

20


5. Results

(a) Confusion matrix heatmap for the
image data.

(b) Confusion matrix heatmap for the
radar data.

(c) Confusion matrix heatmap for the
combined data.

Figure 5.1: Combined confusion matrix heatmaps of all the runs for the different
NN classifiers. In the confusion matrix the element at index i, j correspond to the
number of samples from class j classified as class i. The confusion matrix for a
perfect classifier would have 0 at the off diagonal elements. The sum of row i

corresponds to the total amount of samples that was classified as class i and the
sum of column j corresponds to the total number of samples of class j.

5.4 SVM – Image classifier
For the test data the classification accuracy was 95.2% with c = 105 and RBF-kernel.
For the entire data set the accuracy was 99.1%.

21


5. Results

The confusion matrix for this classifier is presented in Figure 5.2a. It shows that
the classifier does not have any significant bias towards any of the classes.

5.5 SVM – Radar classifier
For the test data the classification accuracy was 59.2% with c = 106 and polynomial
kernel with d = 3 and for the entire data set the accuracy was 92.2%.

The confusion matrix for this classifier is presented in Figure 5.2b. It shows that
the classifier is heavily biased towards the major class.

5.6 SVM – Combined classifier
For the test data the classification accuracy was 92.1% with c = 100 and RBF kernel
and for the entire data set the accuracy was 96.4%.

The confusion matrix for this classifier is presented in Figure 5.2c. It shows that
the classifier does not have any significant bias towards any of the classes.

22


5. Results

(a) Confusion matrix heatmap for the
image data.

(b) Confusion matrix heatmap for the
radar data.

(c) Confusion matrix heatmap for the
combined data.

Figure 5.2: Confusion matrix heatmaps for the different SVM classifiers. In the
confusion matrix the element at index i, j correspond to the number of samples

from class j classified as class i. The confusion matrix for a perfect classifier would
have 0 at the off diagonal elements. The sum of row i corresponds to the total

amount of samples that was classified as class i and the sum of column j
corresponds to the total number of samples of class j.

23


5. Results

24


6
Discussion

In this chapter we analyze the results of chapter 5 and give suggestions to possible
ways of further improving the results. The main focus of this chapter is in the
analysis of the neural network classifiers in section 6.1. The results from the SVM
classifiers are analyzed in section 6.2 and suggestions for future work are given in
section 6.3.

6.1 Neural network

In Table 5.1 we see that the image classifier has the highest classification accuracy
and that the radar classifier has significantly lower classification accuracy than the
other two classifiers.

In Figure 5.1 we see the confusion matrices for the NN classifiers. For the major
and city classes the combined classifier is most likely negatively affected by the radar
data since the radar classifier had the most problems in classifying those classes.

In the following subsections we will discuss and analyze the performance of the
different NN classifiers.

6.1.1 Image classifier

The image classifier had the best classification accuracy of all the classifiers. The
average classification accuracy from Table 5.2 was 96.9 ± 1.0%. This result can be
compared to the commonly used CIFAR-10 benchmark dataset for image classifica-
tion (Krizhevsky and Hinton, 2009) where the state of the art classification accuracy
is 96.5% (Graham, 2014). The CIFAR-10 dataset consists of 60000 32 × 32 colour
images divided in to ten classes, with 6000 images per class.

Even though these results are similar one has to consider the differences between
the datasets and the task at hand. Since CIFAR-10 has ten classes, the classification
task is more complex and difficult compared to the three classes in the dataset used
for this thesis. Additionally, the difference between the images within the classes
is larger in CIFAR-10 than in the images used for this thesis which makes the
classification task more complex in the CIFAR-10 case.

On the other hand, the difference of the images between the classes should be
greater in CIFAR-10, which contains classes of e.g. frogs and airplanes, compared
to the images used in this thesis. The CIFAR-10 set also contains more samples,
which is advantageous when training any network.

25


6. Discussion

Given the comparison above, it should be possible to achieve a higher classifica-
tion accuracy using additional data and further improvements to the model.

6.1.2 Radar classifier
The radar classifier had the worst classification accuracy of all the classifiers. The
average classification accuracy from Table 5.3 was 49.0 ± 1.2%. During training of
the network it became clear that the network suffered badly from overfitting as the
validation loss reached its minimum value after only a few epochs and thereafter
increase.

Overfitting in a neural network can be tackled in a variety of ways. Adding more
training data is always useful, as it will increase the variance of the training data,
resulting in a more generalized model. It is also possible to implement some kind of
regularization such as dropout (Srivastava et al., 2014).

In comparison to the SVM for radar classification there should be room for
improvement of the MLP given the large difference in classification accuracy. This
could be done by refining the architecture and parameters of the network.

Another way to improve the classification accuracy could be to include additional
measurements from the radar data. Given the very low classification accuracy in
the current state, it is likely that a combination of all of the suggestions above needs
to be implemented to reach a satisfactory classification accuracy.

6.1.3 Combined classifier
The average classification accuracy for the combined classifier was 94.8 ± 1.2%, as
seen in Table 5.4. The initial idea behind the combined classifier was that the
features extracted from the image data combined with the radar data would be
easier to classify than to classify them separately. This does not seem to be the
case however, as the image classifier has a higher classification accuracy. Instead,
it seems as though the radar data impairs the performance of the network. This
should be expected given the very low classification accuracy of the radar classifier.

To improve the combined classifier the suggestions mentioned in sections 6.1.1
and 6.1.2 should both be implemented. Most important should be to increase the
size and variation of the data sets and also to investigate which measurements to
extract from the radar data.

6.2 MSVM
In Table 5.1 we see that the image classifier has the highest classification accu-
racy of the MSVM classifiers and that the radar classifier has significantly lower
classification accuracy than the other two classifiers.

In Figure 5.2 we see the confusion matrices for the SVM classifiers. The over-
all bad performance of the radar classifier probably affects the combined classifier
negatively.

Except for the radar data classifier the SVM based classifiers performed worse
than the NN classifiers. While there potentially is a lot of room for improvements

26


6. Discussion

in the NN classifiers, it is not obviously the same for the SVM classifiers. It is likely
that improving the image preprocessing and feature extraction and experiment with
other combinations of the radar measurements would give the most gain. It is
also possible to tune the parameters and choice of kernel to further improve the
performance of the classifiers.

6.3 Future work
There are several different tracks to follow in order to achieve better results. Per-
haps the most important track is to acquire more data. This data should be more
varied within the classes to improve the training and verify the robustness of the
classification methods. Since the radar data was the hardest to classify, the radar
classifiers has probably the most room for improvement. This includes the choice of
measurements which could for instance contain the number of identified pedestri-
ans. If adding more measurements were to be done, then additional preprocessing
of those measurements could be needed as well.

The CNN used to classify the image data in this project is comparatively sim-
ple against many of the state of the art architectures. Using techniques from these
architectures the results should be able to be improved. The most important im-
provement for the neural networks in general is to implement regularization, such
as dropout (Srivastava et al., 2014) to reduce the overfitting that occurs.

Finally, there is always possibilities of achieving better results by choosing the
parameters of the classifiers more carefully. This applies for the neural networks as
well as the support vector machines.

27


6. Discussion

28


7
Conclusion

In this project two different methods for classifying road types have been evaluated
against each other. The road types considered were highway, major road and city.
This was done using radar and camera data collected from the area surrounding
Gothenburg. The first method was based on convolutional neural networks and multi
layer perceptrons and the second method was based on support vector machines.
Each of the methods have been evaluated with radar and image data separately as
well as the combination of the two.

The best result was achieved with a convolutional neural networks using image
data only. It was closely followed by the support vector machine also only using
image data as input. Both methods performed poorly using only the radar data
which had a negative effect on the results when using the combined data.

To further improve the results more data should be gathered to ensure the ro-
bustness of the classifiers. It should be possible to improve the accuracy of the
classifiers by further review of the architecture and parameters.

29


7. Conclusion

30


Bibliography

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig
Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghe-
mawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing
Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan
Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster,
Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vin-
cent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete War-
den, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Ten-
sorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL
https://www.tensorflow.org/. Software available from tensorflow.org.

Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector ma-
chines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):
1–27, 2011.

D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for
image classification. In Computer Vision and Pattern Recognition (CVPR), 2012
IEEE Conference on, pages 3642–3649. IEEE, 2012.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning,
20(3):273–297, Sep 1995. URL https://doi.org/10.1007/BF00994018.

Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection.
In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer
Society Conference on, volume 1, pages 886–893. IEEE, 2005.

Kai-Bo Duan and S. S. Keerthi. Which Is the Best Multiclass SVM Method? An
Empirical Study, volume 3541, pages 278–285. Springer Berlin Heidelberg, 2005.

Levi Gil. A short introduction to descriptors, 2013. URL https://gilscvblog.
wordpress.com/2013/08/18/a-short-introduction-to-descriptors/. Ac-
cessed: 2018-01-23.

Benjamin Graham. Fractional max-pooling. CoRR, abs/1412.6071, 2014. URL
http://arxiv.org/abs/1412.6071.

Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support
vector machines. IEEE transactions on Neural Networks, 13(2):415–425, 2002.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

31

https://www.tensorflow.org/
https://doi.org/10.1007/BF00994018
https://gilscvblog.wordpress.com/2013/08/18/a-short-introduction-to-descriptors/
https://gilscvblog.wordpress.com/2013/08/18/a-short-introduction-to-descriptors/
http://arxiv.org/abs/1412.6071
http://arxiv.org/abs/1412.6980


Bibliography

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny
images. 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. In Advances in neural information processing
systems, pages 1097–1105, 2012.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-
del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Courna-
peau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Bernhard Schölkopf, Alex J. Smola, Robert C. Williamson, and Peter L. Bartlett.
New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.

Subana Shanmuganathan and Sandhya Samarasinghe. Artificial Neural Network
Modelling, volume 628. Springer International Publishing, 1st 2016 edition, 2016.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov.
Dropout: A simple way to prevent neural networks from overfitting. Journal
of Machine Learning Research, 15:1929–1958, 2014.

I. Tang and T. P. Breckon. Automatic road environment classification. IEEE Trans-
actions on Intelligent Transportation Systems, 12(2):476–484, 2011.

Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition, 4th Edi-
tion. Academic Press, 4 edition, 2008.

Yichao Wu and Yufeng Liu. Robust truncated hinge loss support vector machines.
Journal of the American Statistical Association, 102(479):974–983, 2007.

32


	List of Figures
	List of Tables
	Introduction
	Background
	Contributions
	Aim
	Limitations
	Specification of issue under investigation
	Outline

	Supervised learning
	Neural network (NN)
	Activation function
	Optimization
	Multilayer perceptron (MLP)
	Convolutional neural network (CNN)
	Pooling layer

	Tensorflow

	Support Vector Machine (SVM)
	Kernels
	Implementations and extensions

	Histogram of Oriented Gradients (HOG)

	Data
	Data gathering
	Setup
	Scope
	Radar data preprocessing
	Extracting features from vision data for SVM

	Algorithms
	Neural network
	Image classifier
	Radar classifier
	Combined network

	Support vector machine

	Results
	NN – Image classifier
	NN – Radar classifier
	NN – Combined classifier
	SVM – Image classifier
	SVM – Radar classifier
	SVM – Combined classifier

	Discussion
	Neural network
	Image classifier
	Radar classifier
	Combined classifier

	MSVM
	Future work

	Conclusion