Autonomous vehicle control via deep
reinforcement learning
Master’s thesis in Systems, Control and Mechatronics

Simon Kardell
Mattias Kuosku

Department of Electrical Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2017


Master’s thesis 2017:093

Autonomous vehicle control via deep
reinforcement learning

Simon Kardell
Mattias Kuosku

Department of Electrical Engineering
Division of Signal Processing and Biomedical Engineering

Chalmers University of Technology
Gothenburg, Sweden 2017


Autonomous vehicle control via deep reinforcement learning
Simon Kardell
Mattias Kuosku

© Simon Kardell, Mattias Kuosku 2017.

Supervisor: Jonas Karlsson, Berge
Examiner: Lennart Svensson, Electrical Engineering

Master’s Thesis EX093/2017
Department of Electrical Engineering
Division of Signal Processing and Biomedical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover:
Image of the synthetic environment used in this master thesis.

iv


Autonomous driving using deep reinforcement learning

Simon Kardell
Mattias Kuosku

Department of Electrical Engineering
Chalmers University of Technology

Abstract
The automotive industry as well as academia are currently conducting a lot of
research related to autonomous driving. Autonomous driving is an interesting topic
that holds the potential to benefit society in many ways, such as reduce the number
of fatalities and reducing the environmental footprint of modern day traffic.
In this thesis, we investigate Machine Learning algorithms that can automatically
learn to control a vehicle based on its own experience of driving. More specifi-
cally we employ two Reinforcement Learning (RL) algorithms called Deterministic
Policy Gradient (DDPG), and Actor-Critic with Experience Replay (ACER). The
algorithms were trained and evaluated in a synthetic environment. The input to
both models are images captured by a front-facing camera and internal states of the
ego-vehicle, i.e., velocity, acceleration, and jerk. The results presented in this thesis
show that current RL-methods are capable of controlling the vehicle steering, using
only images to provide information regarding the position of the ego-vehicle. The
results also indicate that a driving policy obtained via RL is more robust towards
tricky driving scenarios than policies obtained via supervised learning techniques.
However, evaluation of data captured from the real world domain is still needed, to
verify the usability of models trained on synthetic data.

Keywords: ACER, DDPG, Reinforcement Learning, Deep Learning, Neural Net-
works, Machine Learning, Autonomous Driving, Unreal 4 engine

v


Acknowledgements
We would first of all thank Peter Karlsson and Berge, for making this thesis
possible. We would also like to thank the employees, for producing a cheerful at-
mosphere. A special thanks to our supervisor Jonas Karlsson and the Intelligent
Systems team for their support and guidance throughout this thesis and the Visual-
ization team for assistance regarding the Unreal 4 game engine. We would also like
to thank Camilla Kusibojovska for sharing her office-space. Lastly, we would also
like to thank our examiner at Chalmers, Lennart Svensson, for helpful insights
during our meetings.

Simon Kardell and Mattias Kuosku
Gothenburg, October 2017

vii


Contents

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Scope and Boundaries . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Theory 5
2.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.1 Feed Forward Neural Networks . . . . . . . . . . . . . . . . . 5
2.1.2 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.4 Training and Optimization Algorithms . . . . . . . . . . . . . 8
2.1.5 Convolutional Neural Networks . . . . . . . . . . . . . . . . . 9
2.1.6 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . 11

2.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Markov Decision Process . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Partially Observable Markov Decision Process . . . . . . . . . 14
2.2.3 Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.4 Discounted reward and objective function . . . . . . . . . . . 15
2.2.5 State Value Function . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.6 Action Value Function . . . . . . . . . . . . . . . . . . . . . . 15
2.2.7 Policy Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.8 On-policy vs off-policy . . . . . . . . . . . . . . . . . . . . . . 16
2.2.9 Replay buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.10 Exploration vs. Exploitation . . . . . . . . . . . . . . . . . . . 16
2.2.11 Actor-Critic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.12 Deterministic Policy Gradient . . . . . . . . . . . . . . . . . . 17
2.2.13 Deep Deterministic Policy Gradient . . . . . . . . . . . . . . . 18
2.2.14 Actor-Critic with Experience Replay . . . . . . . . . . . . . . 19

3 Methods 23
3.1 Training Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.1.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Imitation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

ix


Contents

3.2.2 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.3 Reward Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Sequential Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.5 Exploration Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Reinforcement Learning Models . . . . . . . . . . . . . . . . . . . . . 31

3.6.1 DDPG-Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.6.2 ACER-Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4 Results 35
4.1 Training Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.1.1 Imitation Learning . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.2 DDPG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.3 ACER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.2 Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.1 Driver Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.2 Comfort Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.3 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5 Discussion 47
5.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Training of Reinforcement Learning Models . . . . . . . . . . . . . . 48
5.3 Driving Behaviour and Robustness . . . . . . . . . . . . . . . . . . . 49

6 Conclusion 51

7 Future Work 53

Bibliography 55

A Pseudocode Adam-optimizer I

B Pseudocode Deep Deterministic Policy Gradient III

C Pseudocode Sample Efficient Actor-Critic with Experience Replay V

x


1
Introduction

To drive a car on the road is a simple task for a human being. In artificial intelligence,
the goal is to solve complex tasks consisting of high dimensional data as input.
Unfortunately, it has proven hard to construct sophisticated agents that are capable
of driving a car with human-like performance.

1.1 Background

Autonomous driving has been a large research area for the last couple of years, and
it will most likely continue to be so for the foreseeable future. The development of
Advanced Driving Assistance Systems (ADAS) has been an ongoing process since
the digitization of the car. From the year 2000, the development has accelerated and
today there exist systems for cruise control, automated parking, blind spot indica-
tor, collision avoidance systems, driver monitoring system, intersection assistance,
traffic sign recognition and much more.

The information these applications utilize to conduct its decisions normally consists
of data provided by sensors such as radar, lidar, and camera. These types of sen-
sors often generate data in large quantities. A popular approach to solve problems
with high dimensional data as input is a field in Machine Learning (ML) called
Deep Learning (DL). Deep Learning methods are capable of extracting relevant fea-
tures from high-dimensional data, by passing the high-dimensional data through
a function-approximator that utilizes a multi-layer computational graph called a
Neural Network. During the last few years, the adaptation of neural network has
increased tremendously, mostly because of recent development in two fields. Firstly
there has been a release of large labeled datasets such as the Large Scale Visual
Recognition Challenge [1], that can be used for training and verification of classi-
fiers. Secondly, there has been large progress in the development of the computa-
tional capacity of graphical processing units (GPU), which enables faster training
of Neural Networks.

In recent years deep learning methods have been utilized as function approximators
to Reinforcement Learning (RL) models, yielding deep Reinforcement Learning.
Unlike general DL-methods where the model is trained on a labeled dataset, rein-
forcement learning models are trained by interacting with an environment. The goal
of the RL-models is to find an optimal behavior, by exploring the environment in
an iterative manner. Primarily these deep reinforcement learning models have been

1


1. Introduction

deployed in games, such as old Atari games [2] and the ancient game of Go [3]. There
have also been some breakthroughs for continuous control Reinforcement Learning
methods [4][5][6]. These algorithms have primarily been evaluated with good results
on the MujoCo physics engine [7], which acts as a benchmark for reinforcement
learning algorithms.

One of the drawbacks of deep learning is that these networks need large amounts of
data. Many methods for data augmentations exist, but even with these methods,
it can be hard to generate sufficiently large datasets. Due to this, the use of syn-
thetic data for training machine learning algorithms have been an active research
field [8][9][10]. For the case of autonomous driving, acquiring training data is difficult
since the dataset will very likely be biased towards containing samples from ideal
driving conditions. A remedy to this problem is to train the models on synthetic
data since non-ideal driving scenarios can safely be simulated within a synthetic
environment. Training on synthetic data is particularly useful for Reinforcement
Learning (RL) since these algorithms require the agent to interact with the envi-
ronment in an iterative learning-by-doing scheme. Since the RL-models learn by
exploring the environment and different actions, training using a real vehicle to con-
duct the exploration is currently highly improbable, since as accidents will likely
occur in this setting.

1.2 Related Work
The topic of this thesis is to investigate recent methods in the field of Artificial
Intelligence (AI), for autonomous driving. There are numerous implementations in
both simulation environments and the physical world. In 2005 the DAVE project
developed an end to end solution to control a RC-car to navigate in outdoor environ-
ments [11]. They trained a shallow network on data collected by human interaction
and achieved results where the agent was capable of driving roughly 20 meters
without bumping into obstacles. The DAVE project has expanded in recent years
and upgraded their hardware for their robots. The new hardware has been used
in research of learning long-range vision for autonomous driving [12]. A group of
developers at NVIDIA were inspired by the work in DAVE and developed DAVE-
2 [13]. The project used modern hardware and created an autonomous agent that is
capable of driving a full-sized car in urban environments. What is most impressive is
that the Neural Network in DAVE-2 only takes inputs from camera images captured
by the camera of the car and the current steering wheel angle.

Similar to DAVE-2 Brody Huval et al. [14], have applied Neural Networks to images
captured in front of a car. The project’s focus was to implement a real-time system
evaluating Neural Networks performance in highway environment detecting lanes
and cars. Similar to DAVE and DAVE-2 they used supervised learning where they
trained on data recorded from human drivers. The training was conducted during
a 14 day period where they registered data for a few hours per day. It is evident
that data acquisition is a tedious process where many hours of human resources
are needed. When developing in the framework of Machine Learning, the bene-

2


1. Introduction

fits from simulation environments are tremendous since the data can be generated
instantly, instead of recording data in real life. Recently the project (CAD)2RL
showed promising results, where they trained collision avoidance using a discrete
RL-algorithm called Q-learning to a quadrotor. Remarkably they only used syn-
thetic data for training and deployed the agent in the real world without any fine
tuning on the parameters of the network with successful results [15].

One of the most popular simulation platforms for racing is Torcs [16], which has been
used to deploy numerous autonomous agents. Examples of algorithms utilized in
the environment are Monte Carlo tree search [17], evolutionary algorithms [18] and
Q-learning [19]. The Monte Carlo project used a Tree Search algorithm and forward
motion model to explore randomly in the action space to maximize an objective
function designed for racing. To use their forward motion model, they transformed
the sensor inputs to Euclidean space. Similar to the Monte Carlo project Loiacono
(et al.) [19], handcrafted the feature-vectors from the state-space and controlled the
car with high-level navigation inputs. Nevertheless, they train a neural network to
outperform the currently best AI-methods for overtaking in Torcs. An evolutionary
algorithm [18] has expanded the state-space with images and utilized the Fourier
transform to refine the images and train a Neural Network to drive autonomously.
Apart from Torcs, the CARMA project was inspired by DeepMind’s [2] progress with
the Deep Q Network (DQN) algorithm in the ATARI environment. They scaled up
the project in the Vdrift [20] environment and discretized the action space to fit the
DQN algorithm. Using a handcrafted and simplistic reward function together with
both sensory inputs and images they could obtain results where they outperformed
the handcrafted controller in three criteria, namely, average reward, average speed,
and top speed.

1.3 Purpose
The purpose of this project is to investigate Reinforcement Learning (RL) methods
for autonomous vehicle control. More specifically, the thesis will investigate two Re-
inforcement models called Deterministic Policy Gradient (DDPG), and Actor-Critic
with Experience Replay (ACER) using only image data and internal states of the
vehicle as input. The two RL-models will also be compared with a baseline obtained
via a Deep Learning technique called supervised learning or imitation learning. In-
vestigations of the different model’s robustness towards non-ideal traffic scenarios
will also be conducted. Training, simulation, and evaluation will be performed in a
synthetic environment that replicates the real world.

3


1. Introduction

1.4 Scope and Boundaries
The models evaluated in this thesis will use inputs to the network that are similar to
those utilized in the DAVE-2 project [13]. The inputs to the models will consist of
images captured from a camera directed in the forward facing direction of the ego-
vehicle, and internal states of the vehicle, i.e., velocity, acceleration, and jerk. The
agents and algorithms will be trained and evaluated in the Unreal 4 game engine [21]
in a synthetic world of the AstaZero proving grounds rural road [22]. The simulation
environment will also be greatly simplified since it only contains the ego-vehicle that
operates in the environment. Since the environment only will contain the ego-vehicle,
complex decisions regarding interactions with other road-users will be outside the
scope of this thesis.
The control signals will also be limited compared to the ones utilized during regular
driving and what is required for a fully autonomous vehicle. The control signals used
within this thesis will relate to acceleration and steering, signals such as indicating
and gear-selection will be excluded from the action-space.

1.5 Thesis outline
The remainder of this report is divided into six chapters, Theory, Method, Results,
Discussion, Conclusion and Future Work. The Theory chapter is subsequently di-
vided into two subcategories, Neural Networks, and Reinforcement Learning. The
theoretical chapter presents the reader to the theoretical aspects of the models and
algorithms used in the project. Furthermore, the Method starts with a small intro-
duction to the simulation environment used and moves onto introducing the imple-
mentation of the Imitation Learning process and the RL-algorithms. In the Results,
plots from the training are presented together with histograms of the performance
of the final driving policies, as well as some test of the general robustness for the
obtained driving-policies. The results are then interpreted in the Discussion, and
finally, the thesis is summarized with Conclusion and Future Work.

4


2
Theory

This master thesis mainly concerns two areas of science, namely Neural Networks
and Reinforcement Learning. Neural Networks are the backbone of the model and
acts as function approximator to the Reinforcement Learning algorithm. Reinforce-
ment Learning is a family of algorithms that iteratively tries to improve on its
current behavior, given some objective that characterizes the performance.

The following sections explain the essential concepts and frameworks. Firstly by
introducing the basic concept of designing and training Neural Networks, followed
by an introduction to Reinforcement Learning and the algorithms used within this
thesis.

2.1 Artificial Neural Networks
Neural Networks for artificial intelligence are mathematical models inspired by nat-
ural structures in the human brain and applied in modern computers. Typical
applications are to use the models as complex function approximators. In this chap-
ter brief explanations and motivations for these building-blocks in Neural Networks
are summarized. The mathematical models, algorithms, and concepts presented in
this section are relatively short. For a more thorough explanation the reader is
encouraged to take part of the theoretical material in [23].

2.1.1 Feed Forward Neural Networks
A Neural Network is simply a computational graph, with the objective to approxi-
mate some function f ∗(x). The Neural Network models the function through f(x, θ)
by adjusting its parameters θ. The input x, flows through layers of artificial neurons,
where an artificial neuron i, in layer j is defined as

aij = φ

(
N∑
k=0

wijkx
i
jk + bij

)
, wij, b

i
j ∈ θ. (2.1)

In the equation above φ is a nonlinear transformation, N is the number of input
nodes, b is a bias and w and x correspond to the weights and inputs respectively.
The forward pass for one node is illustrated in Figure 2.1 and Equation (2.1) can
be written in condensed form as

[a1
j , a

2
j , ...a

n
j ] = aj = φ

(
wT
j xj + bj

)
, (2.2)

5


2. Theory

where n is the number of output nodes. In a feed forward neural network all the
hidden activations, aj, from one layer are passed through the network as input to
the next layer.

Figure 2.1: A forward pass in layer j to node i where x is the input vector and
w is weight vector. The input values are multiplied with the corresponding weights
and summarized with the bias term. The summations are then passed through the
non-linear function φ to produce the output aji .

2.1.2 Activation Functions
For a feed-forward neural network to represent nonlinear mappings, non-linearities
called activation functions has to be inserted at different layers in the model. There
are many types of activation functions available, and many new ones are still being
developed. The functions presented here only cover the ones necessary for the reader
to understand the remaining context of the thesis.

Sigmoid Function

The sigmoid function was mostly used in neural networks prior to the introduction
of more modern activation functions such as the Rectified Linear Unit (ReLU). The
upside of the sigmoid function is that the output is bounded between 0 and 1, making
it suitable for modeling probabilities, e.g., logistic regression. The downside of the
logistic function is the large saturated regions, making it unsuitable for gradient-
based learning [24].
The sigmoid function can be described by the following equation

σ(x) = 1
1 + e−x

. (2.3)

Tanh Function

The hyperbolic tangent or tanh function was like the sigmoidal function mostly used
prior to the introduction of the ReLU, but due to the large saturated regions, it has
become less popular as an activation for the hidden layers. The tanh function can
be described as

tanh(x) = ex − e−x

ex + e−x
. (2.4)

6


2. Theory

Rectified Linear Unit

The ReLU is currently most commonly used activation function, due to that it does
not suffer from those large saturated regions as seen in the sigmoidal- and tanh-
function, making it more suitable for gradient-based learning. The ReLU is defined
by the following equation

ReLU(x) = max {0, x} . (2.5)
As can be seen from the equation it does not suffer from the large saturated regions
since it is active in half its domain.

Exponential Linear Unit

The Exponential Linear Unit (ELU) is an extension of the ReLU, but unlike the
ReLU it does not saturate for all negative values. Since the ELU has negative
values it pushes the mean activation towards zero, effectively reducing the need for
normalization of the input [25]. The ELU activation function can be described via
the following equation

ELU(x) =

x, if x > 0
α (ex − 1) , if x ≤ 0

, (2.6)

where the hyperparameter α governs the negative saturation limit of the ELU acti-
vation function. The most common value used for α is 1 and is used in this thesis.

2.1.3 Backpropagation
Backpropagation is an algorithm for calculating the gradient given a loss J(θ). Us-
ing forward propagation an output is generated by the neural network. The output
is then used for obtaining the scalar loss J(θ) which then propagates through the
network backward retracting the gradient with respect to the parameters in the
model ∇J(θ).

The backbone of backpropagation is the chain rule of calculus. The chain rule forms
derivative by multiplying known derivatives from known functions. Given function
y = g(x) and z = f(y) = f(g(x)) the chain rule yields the following

dz

dx
= dz

dy

dy

dx
. (2.7)

Generalizing the case to multidimensional scalars, where f(x), g(y), x ∈ Rm, y ∈
Rn, g maps from Rn to Rm and f maps from Rm to R

dz

∂xi
=
∑
j

∂z

∂yj

∂yj
∂xi

. (2.8)

In vector notation the equation can be reformulated as

∇xz =
(
∂y
∂x

)T
∇yz (2.9)

7


2. Theory

where
(
∂y
∂x

)T
is the Jacobian of g(x). The chain rule is a single step of the backprop-

agation algorithm. If the Neural Network is constructed as a feed-forward network,
the algorithm can calculate the gradients for each layer in the graph based on the
gradients from the nodes earlier in the graph. Naturally, the algorithm needs to use
the calculated gradients several times since the gradients from one layer are applied
to each node in the layer above it. An implementation of the backpropagation al-
gorithm then has to decide whether it needs to recalculate the gradients or store
them in memory. For complicated graphs, there can be an exponential number of
recalculations of the gradients which would be infeasible. The problem is solved by
trading computational time for memory, using dynamic programming.

2.1.4 Training and Optimization Algorithms
To learn or train a model is known as optimization. In its most naive implemen-
tation, the parameters are updated in the direction of the gradient of one sample
with a small step-size repeatedly until convergence. The method is called gradient
descent and is a slow method due to high variance and bias in the gradients. A
faster algorithm is the Mini-batch Stochastic Gradient Descent (SGD) that utilizes
the average gradient over a batch of randomly selected samples. Even if the SGD is
faster than the standard gradient descent, it can be considered slow and inaccurate
in some cases. The algorithm tends to be slow, particularly when facing narrow cliffs
in the parameter space. The gradients on each side of the cliff are pointing nearly
perpendicular to the optimal and causing the state to traverse back and forth of the
cliff more than downwards, see Figure 2.2. The algorithm also struggles with state
spaces that have local optimums where it gets stuck.

Figure 2.2: Standard gradient ascent searching for optimal solution through a
narrow cliff. The figure illustrate the struggle of the algorithm facing narrow cliffs
where it traverse back and forth instead of downwards.

Introducing the concept of momentum into the algorithm can increase the conver-
gence speed facing narrow cliffs and helps the algorithm to overcome shallow local
optima. The idea of momentum has its heritage in classical mechanics of motion.
The introduction of momentum introduces the variable v that governs the movement

8


2. Theory

of the particle/state in the algorithm. The variable v accumulates exponentially de-
caying gradient of the past. Thereby the particle/state will move in the direction
of aligned gradients in a sequence. The update step of the algorithm is formally
written as

v← αv− ε∇θ

(
1
m

m∑
i=1

L
(
f(x(i)|θ),y(i)

))
,

θ ← θ + v
(2.10)

where the hyperparameter α determines how fast the contribution of past gradients
will vanish, L is the loss function and ε is the learning rate

However, one of the most used optimization algorithms for neural networks is the
Adam optimization algorithm. Adam is short for Adaptive momentum, which is one
of the key features of the optimization algorithm. To prevent that the algorithm gets
stuck in a local optimum the algorithm adapts its learning rate ε by introducing first
and second order momentum to the gradients. Therefore the momentum enables
the algorithm to push through local optima and find better ones. As discussed
previously, the moment also allows faster learning because it adapts the learning
rate with momentum. Addition to that the Adam algorithm is also applying a bias
correction term. The biggest advantage of the Adam optimization algorithm is that
it is insensitive to the choice of hyperparameters. The pseudocode for the algorithm
is presented in Appendix A.

2.1.5 Convolutional Neural Networks
The most common operation between layers in a Neural Network is the matrix mul-
tiplication. Networks that have at least one layer that applies the mathematical
operation of convolution is called a convolutional neural network. Convolution oper-
ates on a time series of data and has linear properties which makes it applicable for
neural networks analyzing data-structures such as images, sensor readings and much
more. There are several implementations of convolution in Neural Network because
of digitization and dimension of layers, but they all derive from the mathematical
expression of convolution

s(t) =
∫
x(a)w(t− a)da or s(t) = (x ∗ w)(t) (2.11)

where the input data is from x(t) and the kernel w(t) is used to map x(t) to the
feature-map s(t). Due to the fact that computers can not handle real values the
formulation is generalizing into the 1 dimensional discretized form bounded by the
data

s(i) =
∑
m

x(m)w(i−m). (2.12)

Assuming that the kernel and the input data is stored in computer memory, it is
applied such that the functions are 0 outside of the valid scope. There are more
sophisticated methods for assigning values out of bounds that cope with several

9


2. Theory

issues regarding convolution. A portion of these methods will be explained further
down. Convolution has the property of being commutative, meaning we can stride
backward through the data instead of the kernel, i.e.

s(i) =
∑
m

x(i−m)w(m). (2.13)

The formulation above is more straightforward to implement since there are fewer
variations of valid values in the data than in the kernel. To implement the mod-
ified convolution in Equation (2.13) it is more convenient to implement the cross-
correlation which is the same as the convolution but without flipped kernel

s(i) =
∑
m

x(i+m)w(m). (2.14)

The flipped kernel is only useful for the commutative property, and that is not
important for neural networks. Finally, the 1-dimensional case can be extended for
several dimensions by adding more variables and summations. Given an image, the
equation would be formulated as

s(i, j) =
∑
m

∑
n

x(i+m, j + n)w(m,n). (2.15)

A visualization of the convolutional process is illustrated in Figure 2.3. Finally, in

Figure 2.3: Example of convolution with valid padding, input size of 3x4, kernel
size 2x2 and a stride of 1.

Neural Networks step size of convolution is called striding, a thorough explanation
of the concept is presented for the reader in the succeeding chapter.

Striding

One of the strengths about convolutional layers is that it can reduce the size of
parameters. In Section 2.1.5 the formulation of the convolution strides through the
image, step by step. However, when designing a convolutional neural network the
length of the stride can be chosen freely. In Figure 2.3 the stride is 1 and in Figure 2.4
the stride is 2. Notice that the output images are smaller in the later example
because the stride is larger and the steps between each analysis are larger. Applying

10


2. Theory

strides of larger magnitude enables the model parameters to shrink in size which
lowers the computational and statistical burden of the next layer. Mathematically
the equation (2.15) can be expanded with the striding parameter

s(i, j) =
∑
m

∑
n

x(i · st +m, j · st + n)w(m,n) (2.16)

where st is the striding parameter. The operation is commonly called down-sampling.

Padding

As mentioned in Section 2.1.5 values outside of the data-scope are filled up with zeros
to cope with cases when the kernel is investigating elements outside of the data. How
to assign values to these elements is called padding since we are recreating data to
extend the size of our output. Examples of padding are zero padding, valid padding,
and mirror padding. As mentioned in Chapter 2.1.5 zero-padding extend the input
with zeros where the kernel is sliding over the edge of the border. Valid-padding is
even simpler and does not pad at all. If a layer is using valid padding the kernel
will only stride over valid elements in the data, for illustrations see Figure 2.3 and
Figure 2.4. Lastly, mirror padding is when the data in the image-border are mirrored
to the padding and creates a pattern that is a copy of the data closest to the border.

Pooling

In one sentence, pooling layer gathers information from small areas in input data
and condense that information into a more dense representation. As in convolution
the pooling filter is convoluted over the data and summarizes the data in summa-
rized statistics. Common statistics to use is max-pooling, L2-norm, average and
the weighted average based on distance from the center pixel. An example of max-
pooling is illustrated in Figure 2.4. By analyzing a small area, pooling makes the
neural network robust against small translations of features. Therefore features do
not need to be located at their exact position to be detected by the Neural Network.

Figure 2.4: Example of max-pooling with valid padding, input size of 4x4, pooling
size 2x2 and stride=2.

2.1.6 Recurrent Neural Networks
One shortcoming of regular feed forward Neural Networks is that they can only take
inputs of a fixed size. The fixed input size makes problems with non-Markovian

11


2. Theory

samples hard to model with a regular feed forward neural network since the necessary
information is spread out over a sequence of samples.
Recurrent Neural Networks (RNN) are a family of Neural Networks that deals with
this issue by inserting a feedback loop in the network, effectively allowing information
to persist over a sequence of samples. See Figure 2.5 for an illustration of an RNN
unrolled over a sequence of samples.

Figure 2.5: Example of unrolled recurrent neural network.1

Long Short Term Memory Cell

The Long Short Term Memory cell (LSTM), is a type of RNN that does not suffer
from some of the common problems of recurrent neural networks such as vanishing
gradients and exploding gradients [26]. The LSTM consists of two feedback connec-
tions, the hidden state, h, and the cell state, C, and four layers that are commonly
called gates in the context of RNN. The cell state can be viewed as the memory, and
the hidden state corresponds to the output of the LSTM at the last time instance.

The first layers that interacts with the input to the LSTM is the forget gate, ft.
The forget gate combines the information from the hidden state and the input, xt,
to decide how which parts of the current cell-state should be kept for the next time
instance, i.e.

ft = σ
(
wf [ht−1, xt]T + bf

)
. (2.17)

The next interaction with the input xt is via the input gate. This gate consists of
two streams, first sigmoidal layers, it, that decides how much each element should
contribute to the new cell state via the following equation

it = σ
(
wi [ht−1, xt]T + bi

)
. (2.18)

The next step of the input gate is to propose new candidate values for the cell state,
C̃t via a tanh layer, i.e.

C̃t = tanh
(
wc [ht−1, xt]T + bc

)
. (2.19)

The cell state is then updated using the forget gate and the input gate, where C̃t is
scaled by it element-wise. The cell state update equation can be described as

1The unrolled figure is strongly influenced by Christopher Olah illustrations, http://colah.
github.io/posts/2015-08-Understanding-LSTMs/

12

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
http://colah.github.io/posts/2015-08-Understanding-LSTMs/


2. Theory

Ct = ft � Ct−1 + it � C̃t (2.20)

where, � denotes the Hadamard product or element-wise multiplication.
The last part of the LSTM-cell is the output gate, which determines which parts of
our current cell state we want to output. The output of the LSTM-cell will be a
filtered version of our current cell state, filtered with a sigmoid layer ot, i.e.

ot = σ
(
wo [ht−1, xt]T + bo

)
. (2.21)

For the final output the cell state, Ct, is mapped through a tanh function and
multiplied with the ot element-wise, i.e.

ht = ot � tanh (Ct) .

An illustration of a LSTM-cell can be viewed in Figure 2.6

Ct−1 Ct

xt

ht−1 ht

ht

σ σ tanh σ

× +

×

×

tanhft it
C̃t ot

Figure 2.6: An illustration of a LSTM-cell. All mathematical operations in the
figure are performed element-wise.2

2.2 Reinforcement Learning
Premises for RL are that there exists an environment and a controllable Actor.
The Actor is capable of changing the state of the environment with actions, e.g.,
move left or right. In a broad sense, the environment is rewarding the Actor based
on its current behavior and in reinforcement learning the objective is to find the
behavior/policy that maximizes the cumulative reward. This section will introcude
the reader to the basic theory in RL [27] and recent algorithms that are relevant for
the thesis [5][28][6].

2The illustration of the LSTM cell is influenced by the bloggpost http://colah.github.io/
posts/2015-08-Understanding-LSTMs/ written by Christopher Olah.

13

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
http://colah.github.io/posts/2015-08-Understanding-LSTMs/


2. Theory

2.2.1 Markov Decision Process
The environment for which an RL agent operates can formally be described as a
Markov Decision Process (MDP), where the environment is fully observable. More
formally this is known as the Markov property which states that the future is inde-
pendent of the past given the present, i.e.

P (st+1|st) = P (st+1|s1, s2, ..., st). (2.22)
An MDP is a tuple 〈S,A,P ,R, γ〉, of a set of states, actions, transitional probabil-
ities, rewards and a discount factor to keep the expectations finite in the case of an
MDP without terminal states.

• S is a finite set of states
• A is a finite set of actions
• P is the state transition probability matrix,
Pass′ = P (St+1 = s′|St = s, At = a)

• γ is a discount factor ∈ [0, 1]
Figure 2.7 illustrates a simple MDP with four different states.

Figure 2.7: Illustration of a MDP. The terminal state is s4 and the reward is -1 for
all actions and the probability for a transition is labeled near the transition arrows.

2.2.2 Partially Observable Markov Decision Process
In a Partially Observable MDP (POMDP), the agents state does no longer equal
the state of the environment, i.e., the Markov property does not hold for the state
representation. Consider an agent that navigates through a maze, where the state
is defined as a single first-person view frame. Clearly, the information provided
from that single frame is not sufficient to make an optimal decision, hence the
violation of the Markov property. In a POMDP the agent must construct its own
state representation to make optimal decisions, e.g., append the entire history to
the state.

2.2.3 Policy
The policy, π is what characterizes the agents behavior in the MDP, more formally
the policy is a distribution of actions given states,

π(a|s) = P (At = a|St = s). (2.23)

14


2. Theory

2.2.4 Discounted reward and objective function
For each action an Actor makes in the environment, it will earn a reward rt. The
cumulative reward can be written as a sum. Adding the discount factor makes
the future reward less important and makes the sum finite in, e.g., the continuous
problem without terminal states. The discounted reward can be defined as

Gt =
∞∑
k=0

γkrt+k+1. (2.24)

The goal of RL is to maximize the discounted expected return. Following a policy
π the objective function can then be formalized as

J = Eri,si∼E,ai∼π [G1] (2.25)

where state si is sampled from the environment E and actions are sampled from
the policy π. In RL the objective function can also be formulated as the value or
action-value function. These functions are described in the two following sections.

2.2.5 State Value Function
The state value function is defined as the expected sum of future rewards following
policy, π, i.e.

V π(s) = Eπ [Gt|St = s] . (2.26)

The value function can also be expressed recursively with the Bellman expectation
equation as

V π(s) = Eπ [rt + γV π(st+1)|St = s] . (2.27)

2.2.6 Action Value Function
The action value function can be decomposed similarly as for the state value func-
tion, starting with the return

Qπ(s, a) = Eπ [Gt|St = s, At = a] . (2.28)

to the recursive form obtained with the Bellman expectation equation

Qπ(s, a) = Eπ [rt + γQπ(St+1, At+1)|St = s, At = a] . (2.29)

2.2.7 Policy Gradient
Policy gradient methods perform gradient ascent on the policy objective function J ,
with respect to the parameters θ of policy π. The policy gradient can be defined as

∇J (θ) = Eπθ
[
∇θlogπθ(s, a)Qπθ(s, a)

]
. (2.30)

15


2. Theory

Monte-Carlo policy gradient method uses Gt as an unbiased sample of Qπθ(s, a).
Even though this gradient is unbiased, it still has large variance, which yields noisy
gradient estimates.
A common way to reduce the variance is to use estimates of Qπθ(s, a) and V πθ(s)
to calculate the Advantage function, i.e.

Aπθ(s, a) = Qπθ(s, a)− V πθ(s). (2.31)
A lower variance representation of the policy gradient can then be stated as

∇J (θ) = Eπθ
[
∇θlogπθ(s, a)Aπθ(s, a)

]
. (2.32)

Though this representation has lower variance, it does come with the cost of an
additional bias due to imperfections in the estimation of Qπθ(s, a) and V πθ(s).
Intuitively the choice of Aπθ(s, a) can be justified due to that it increases the proba-
bility of actions that are better than average and decrease the probability of actions
that are worse than average [29].

2.2.8 On-policy vs off-policy
RL algorithms that train its agents purely on experience retrieved from its current
policy is called on-policy algorithms. On-policy algorithms are sample inefficient
in the sense that if you train your agent, you will also change the agent’s policy
and behavior. This means that past experience is no longer valid to be utilized for
training. Off-policy algorithms are therefore preferable since they are capable of
using the experience that is obtained with other policies.

2.2.9 Replay buffer
Simulation time is computationally heavy, a common concept to lower the simulation-
time in RL is to learn from experience what the agent has been exposed to previ-
ously. To remember how the environment reacts to the agent’s actions the tuple
〈st, at, rt, st+1, tt〉 is stored in a data structure that can be accessed for a low com-
putational cost. For some algorithms, it is important to have chained data that
is decoupled as time sequences. The replay buffer is therefore extended to store
sequences of tuples 〈st+n, at+n, rt+n, tt〉 where n is the position of the tuple in the
time sequence.

2.2.10 Exploration vs. Exploitation
One fundamental decision in RL is the exploration vs. exploitation dilemma. While
exploitation means that the agent should conduct the best decision given the current
information available, it does not mean that the particular action is anywhere near
optimal. For the agent to make optimal or near optimal decisions, it has to explore
the action space to gather more information, to optimize the current policy. A policy
that has a high exploitation rate will have problems to converge towards an optimal
or near-optimal behavior. For a policy that has an exploration strategy that explores
too much, it will be hard to see any improvements, due to that the agent’s behavior

16


2. Theory

in large part appear random. A sensible guideline is to have an exploration strategy
that has a large initial exploration rate that decays over time to see improvements
in the policy and to visit parts of the state space that can be hard to detect with to
much exploration noise. A common strategy to use for continuous actions spaces is
to add zero-mean Gaussian noise for exploration.

2.2.11 Actor-Critic
The Actor-Critic architecture uses two structures to optimize the expected return.
The Actor and Critic operate together and is trained separately for their purpose
in the algorithms. The Actor defines the current policy and therefore is intended
to generate actions according to the current policy. The Critic’s task is to estimate
the value function to the problem. Learning is commonly on-policy, and the Critic
must learn what the expected action value that is conducted from the current policy
defined by the Actor. The Critic can then criticize action taken by the policy as a
Temporal Difference-Error (TD-Error). TD-Error is the temporal difference between
the value function estimates of the two different states. The evaluation is described
mathematically as

δt = rt+1 + γV (st+1)− V (st). (2.33)
The TD-error is then used for optimizing the parameters in the Actor and Critic
model. If δt > 0 the outcome of the current action, at, is better than expected and
it would therefore be desirable to increase the probability of π(at|st). The general
setup of the algorithm is illustrated in Figure 2.8.

Figure 2.8: Basic visualization of the Actor-Critic algorithm and setup.

2.2.12 Deterministic Policy Gradient
There is a lot of variants of Deterministic Policy Gradient(DPG). In this thesis, the
off-policy deterministic Actor-Critic will be considered, where ρ is the discounted

17


2. Theory

state distribution and β represents a distinct policy to the current policy π. For
more details see the paper [30]. Furthermore, the performance objective can be
stated from the value function or the action value function

Jβ(µθ) =
∫
S
ρβ(s)V µ(s)ds

=
∫
S
ρβ(s)Qµ

θ (s))ds.
(2.34)

The gradients for the parameters of the Actors model can then be approximated
with

∇θJβ(µ) ≈
∫
S
ρβ(s)∇θµθ(a|s)Qµ(s, a)ds

= Es∼ρβ
[
∇θµθ(s)∇aQ

µ(s, a)|a=µθ(s)
]
.

(2.35)

The true action value function is replaced with a general function approximator
Qw ≈ Qµ and is trained to minimize the true action-value function. Furthermore,
the basic steps of the algorithm are

TD-Error: δt =rt + γQw(st+1, µθ(st+1))−Qw(st, at)
Update Critic weight: wt+1 =wt + αwδt∇wQ

w(st, at)
Update Actor weight: θt+1 =θt + αθ∇θµθ(st)∇aQ

w(st, at)|a=µθ(s).

(2.36)

2.2.13 Deep Deterministic Policy Gradient
Deep Deterministic Policy Gradient (DDPG)[5] apply the DPG-algorithm using
Neural Networks as general function approximators. Applying Neural Networks in
RL with continuous actions spaces lead to three major issues: correlated data, in-
stability, and insufficient exploration. This chapter will describe three methods that
solve these issues.

Training Neural Networks require that the training data is independently and iden-
tically distributed [23], which is not the case when generating samples sequentially
in the environment. The DDPG algorithm uses the replay buffer to store previous
experience. The replay buffer can be utilized when a sufficient amount of data is
collected. The goal is to withstand the curse of correlated data. The loss function
for the Actor-Critic can then be formulated as the squared loss from samples

L(θQ) = Est∼ρβ ,at∼β,rt∼E
[
(Q(st, at|θQ)− yt)2

]
(2.37)

where
yt = r(st, at) + γQ(st+1, µ(st+1)|θQ). (2.38)

The Neural Network for the Critic is prone to diverge since yt is calculated with the
same network as we are optimizing. The solution for this problem is to make copies
of the networks and then update them with soft updates. It has been experienced
that it is most efficient to make copies of both the Actor and Critic to enable stability.
The copies are denoted as

Q′(s, a|θQ′)
µ′(s|θµ′)

(2.39)

18


2. Theory

and the soft updates can mathematically be formulated as

θQ
′ ← τθQ + (1− τ)θQ′

θµ
′ ← τθµ + (1− τ)θµ′ (2.40)

where τ � 1. Exploration of continuous action spaces is hard due to the infinite
amount of permutations that exist. In off-policy algorithms, the exploration can
be constructed independently from the learning algorithm. The simplest way of
constructing an exploratory Actor is to add an exploration noise to the Actors
action.

uexp(st) = µ(st|θµt ) +N (2.41)

where N can be chosen to fit with the environment. Pseudo code for the algorithm
can be found in Appendix B.

2.2.14 Actor-Critic with Experience Replay
The Actor-Critic with experience replay (ACER) [6], builds on the ideas presented
in Asynchronous Advantage Actor-Critic (A3C) [31].

Briefly summarized the A3C utilizes shared parameters between the Actor and the
Critic in the lower layers of the Neural Network and uses on-policy updates with an
advantage estimate to update the network. A3C is however like all other on-policy
algorithms sample inefficient by design. To make the collection of samples more
efficient, A3C utilizes multithreading and assigns a number of different workers to
explore the environment, effectively leading to less correlated data in the updates.
Another benefit of having a set of workers is that many different exploration strate-
gies can be deployed simultaneously.

Unlike A3C, ACER utilizes off-policy samples via an experience replay buffer, mak-
ing it much more sample efficient than its on-policy counterpart.

One of the main challenges when working with off-policy data is to get an accurate
estimate of the action value function for the current policy, π. To compensate for
trajectories generated from behavior policies, ACER uses the Retrace algorithm to
estimate Qπ(xt, at) [32]. Given a trajectory generated under a behavior policy µ,
Retrace can be expressed recursively via

Qret(xt, at) = rt + γmin
{
c,
π(at|xt)
µ(at|xt)

} [
Qret(xt+1, at+1)−Q(xt+1, at+1)

]
+ γV (xt+1)

(2.42)
where Q(xt, at) and V (xt) are the current estimates of Qπ(xt, at) and V π(xt) re-
spectively. As can be seen from the Equation (2.42), the Retrace estimator utilizes
the return for near on-policy behavior and otherwise bootstrap to the current state
value estimate. The parameter c, ensures that the variance of the estimator does
not become too large by preventing importance sampling ratios larger than c.

19


2. Theory

To separate the weights for the policy and the value estimates, the network is di-
vided into two computational streams with the earlier layers shared between the
two streams. To estimate Qπ(xt, at), ACER utilizes the dueling network architec-
ture [33], but adapts it to a stochastic version for continuous action spaces, i.e.

Q̃(xt, at) = V (xt) + A(xt, at)−
1
N

N∑
i=1

A(xt, a′i), a′i ∼ π(·|xt) (2.43)

where the subtraction of the sample average is to compensate for any potential bias
in the advantage estimate. The estimate of Qπ(xt, at) are updated using the mean
squared error loss, i.e.

LQ(x, a) = 1
N

N∑
i=i

(Qret(xi, ai)− Q̃(xi, ai))2 (2.44)

and to update the estimate of V π(xt, at), the gradients of the following loss are used
with respect to the current estimate of V π

LV (x, a) = 1
N

N∑
i=i

min
{

1, π(at|xt)
µ(at|xt)

}
(Qret(xi, ai)− Q̃(xi, ai))2. (2.45)

To calculate the policy gradient from off-policy trajectories, ACER utilizes truncated
importance ratios with a correction term to compensate for the bias introduced by
the truncation. The policy gradient for ACER can be expressed as

ĝacert =min
{
c,
π(at|xt)
µ(at|xt)

}
∇φ(xt)log π (at|(xt)

(
Qret (xt, at)− V (xt)

)

+ 1
N

N∑
i=1

[
ρt(a′i)− c
ρt(a′i)

]
+

(
Q̃(xt, a′i)− V (xt))

)
∇φ(xt)log π(a′i|xt).

(2.46)

where ρt(a′) = π(a′|xt)
µ(a′|xt) and [x]+ = max{0, x}.

20


2. Theory

Since the variance of policy updates often is larger than desirable, the stability of
policy gradient can be compromised, leading to so-called policy collapses. To get a
more stable behavior the per-step changes of the current policy has to be limited,
to prevent too large steps on the policy causing a collapse of the current policies
performance. Trust Region Policy Optimization (TRPO) provides a solution to
this problem [4], however, despite the effectiveness of TRPO it requires multiple
computations of Fisher-vector products for each update, which does not scale well
to larger models [6]. To prevent too large steps on the current policy ACER uses a
different version of TRPO that scales better to larger domains. The TRPO scheme
used in ACER keeps a running average over past policies by softly updating the
parameters θa for the average model towards the policy parameters, i.e.

θa ← τθ + (1− τ)θa (2.47)

The ACER TRPO-scheme restricts too large steps from the average policy by in-
troducing a linearized Kulback Liebler divergence constraint, i.e.

minimize
z

1
2 ||g

acer
t − z||22

subject to k < δ

where δ is a hyperparameter that affects the allowed deviation from the average
policy, i.e., the trust region, k corresponds to

k = ∇φµ(xt)

(∫ ∞
−∞

πθa(a|xt)log
πθa(a|xt)
πθ(a|xt)

da

)
. (2.48)

By inserting the constraint, it becomes a quadratic programming problem for which
the solution is:

z∗ = gacert −max
{

0, k
Tgacert − δ
||k||22

}
k.

Since the resulting gradient is only w.r.t. the output of the policy stream, i.e., the
mean µ of the policy distribution, the gradients w.r.t. θ remains to be calculated.
The remaining gradients can be calculated by using the properties of the chain rule
as

∂µθ(xt)
∂θ

z∗.

Since ACER works off-policy, it also uses an experience replay buffer, see Sec-
tion 2.2.9. Stored in the replay buffer is a sequence of tuples, with the first state,
action, next state, reward, terminal flag and the moments of the distribution, i.e.
〈st, at, rt, st+1, tt, µt, σt〉.

21


2. Theory

22


3
Methods

In this chapter, the simulation environment, and the simulation setup is described,
followed by how the simulation environment was used to train an imitation learning
agent using supervised learning. After that, a detailed explanation of the reward
function employed in the environment. Lastly, the exploration strategy and the
implementation details for the different models are presented.

3.1 Training Environment
The Training Environment is developed in the game engine Unreal 4. The goal
is to control a basic implementation of the standard car model provided from the
starter content in Unreal. However, the standard car did not provide measurements
for acceleration, jerk, heading relative to road curvature, lane position or frontal
images. Those features were therefore implemented into the car model. The states
that will be used as input to the RL models are summarized in Table 3.1.

Table 3.1: Parameter fetched from the simulation environment.

short explanation used as input
px global x position
py global y position
pz global z position
vx velocity x direction x
vy velocity y direction x
vz velocity z direction x
ax acceleration x direction x
ay acceleration y direction x
az acceleration z direction x
jx jerk x direction x
jy jerk y direction x
jz jerk z direction x
α lane position
ϕ relative heading angle

23


3. Methods

The measurement of the lane position is the distance to the center of the lane from
the center of the car and images are captured approximately in the windscreen of
the car and has a resolution of 80x240. Example images captured by the windscreen
camera are presented in Figure 3.1.

Figure 3.1: Examples of images captured by the camera located in the frontal
windscreen.

The lane position α is measured by calculating the horizontal distance to the road
surface edge and then approximating the center line as half of that distance. Because
the measurement is towards the edge of the road instead of the edge of the lane,
the lane center is approximated as 60% of the distance from the center lane to the
edge of the road. α is then calculated as the distance from the center of the car to
the center of the approximated lane-center. The normalized distance α is depicted
in Figure 3.2. Unfortunately, since the distance between the road edge and lane
marking is not constant the α value will deviate from the desired value since the
estimated lane center is imperfect.

Figure 3.2: Image demonstrating the α value in the reward function. If the car
is centered in the correct lane the value is 0. The value α is normalized to indicate
the outer boundaries of the correct lane with 1 and -1. If the car is out of the road
to the left side the value will be less then -2.

Furthermore, the surrounding visual is a model over the AztaZero proving ground for
autonomous vehicle testing. In this thesis, only the rural track will be considered.
The track is considered smooth with no sharp corners or steep hills but inherits
access-points from smaller roads in the form of T-junctions and slip-roads. The
shape of the rural road is illustrated in the Figure 3.3 without the T-crosses and
slip-roads marked.

24


3. Methods

-1500-1000-500050010001500

-600

-400

-200

0

200

400

600

800

Figure 3.3: Illustration of the rural road in simulation environment of AztaZero
proving ground without T-junctions and slip-roads marked.

3.1.1 Simulation Setup
The interface towards the training environment was designed to handle two cases of
control, one case for control of the steering angles, the brake and acceleration and
one for just the steering angles. The reference velocity was kept at a fixed value of
50 km/h during all simulations. In the case where the level of control includes both
acceleration and braking, the throttle and braking were modeled with a single tanh
function, where values larger than zero are viewed as a signal for the throttle, and
conversely, values smaller than zero are viewed as a control signal for the braking.
For the case where the agents level of control only includes the steering angle, the
throttle and brake are modeled with a simple PID-controller.

The control flow of the simulation environment is constructed such that the envi-
ronment executes the action decided by the agent for a fixed number of frames and
then pause the simulation while it waits for a new action from the agent. After the
agent has performed one step in the environment, the environment returns a new
observation, a reward and a boolean variable that states if the current state of the
environment corresponds to a terminal state or not. The environment only sets the
boolean corresponding to the terminal state to true if the agent is located outside of
the road or if it is traveling in the opposite direction. The nominal number of frames
per second in the simulation environment is 60 frames, however, this might vary over
time if the rendering of the current scene requires a significant computational load.
For all the simulations, the number of frames that the environment executes the
action was set to 4, yielding a frequency of roughly 15 Hz for decisions.

25


3. Methods

During the training, the initial position for the agents is sampled uniformly from
a set of 66 locations spread out equidistant and facing in both directions of the
road. All training was performed at the end of each episode to make the different
policies for the episodes more comparable. Since all the training was conducted
between different episodes, the episodic length had to be capped to allow training
to be performed on policies that otherwise would not reach a terminal state. Hence
a maximum episodic length was introduced, and the maximum episodic length used
for all simulation was set to 1000 steps.

For the training, each episodic reward was logged and compared to the 10 best
episodic rewards. If the current episodic reward is within the best 10 results, an
inference run was performed where the weights of the network with the current best
inference reward are stored. For the inference runs, the episodic length was extended
to 10 000 steps and start location was kept fixed to yield more comparable results.
For clarity, the training inference scheme is presented in Algorithm 1.

Algorithm 1 Inference training scheme
Initialize weights
i = 0
while i < max episodes do

Run 1 episode
if episode reward > min(top ten episode rewards) then

Run 1 inference episode
if inference reward > best inference reward then

Store weights for the model
Train model
i++

3.2 Imitation Learning
In this section, the motivation, method and model design of the agent obtained
via imitation learning will be described in detail. Firstly a short motivation of the
agent obtained from imitation learning is summarized, followed by the collection of
training data. Lastly, a description of the different parts of the imitation learning
network is presented.

3.2.1 Motivation
The purpose of training an agent via imitation learning for this thesis is twofold.
Firstly the agent can be used to collect data of good driving behavior and store
it in the experience replay buffer. The benefit of doing this is that there will al-
ways be access to reliable data when optimizing the policy, which at least intuitively
should speed up the learning process. The second benefit is that many of the evalu-
ated algorithms include recurrent LSTM units, which will lead to large demands on
memory if the convolutional layers have to be unfolded in time during optimization.
Therefore the convolutional layers obtained from the imitation learning process will
be stored and used to prepossess the images for the other algorithms.

26


3. Methods

3.2.2 Data Acquisition
The data collection for the supervised training was collected by sending continuous
control signals to the simulation environment with a Logitech gaming pad operated
by a human. Logging of the data aimed to be versatile and uniform for the situations
where the agent is plausible to be in. To be able to store data where the agent is
recovering from bad states a pause button was implemented into the system. The
pause button decides whether the system should store data or not and therefore
could operate the car to be in dangerous states, start logging and then operate the
car to a satisfied state. A session of 8 hours was performed and collected around
100.000 samples.

3.2.3 Models
In this section, the model used for the imitation learning will be described. The
architecture will be presented in two sections, one for the convolutional layers and
one for the fully connected layers. The separation of these sections is because the
convolutional layers will be reused in preceding models.

The Convolutional Layers

In order to decode images captured by the camera mounted on the windscreen, a
set of five convolutional layers was chosen. The input to the first convolutional layer
is an image captured of the current view for the camera. The image is cropped to
80 pixels in height and 240 in width with all three color channels, see Figure 3.1 for
a set of input images. To reduce the spatial dimensions of the data, the first four
convolutional layers are followed by a max pooling operation, with a stride of two
in both directions and a pooling region of 2 × 2. The last layer is followed by a
max pooling operation with a pooling region of 2 × 1 and a stride of two in the
height dimension and one in the width dimension. This was done deliberately since
intuitively it seems more probable that this part of the feature maps contains more
information regarding the lane position. For more details regarding the design of
the convolutional layers, see Figure 3.4.

Figure 3.4: Schematic overview of the convolutional network. All activation func-
tions are ELU in the convolutional layers. The input image is a normalized and
represented as 3 layers each representing the corresponding RGB value for that
pixel. The output is a 2880 units feature vector.

27


3. Methods

The Fully Connected Layers

The depth of the fully connected layers for the imitation learning agent was kept
relatively shallow. This was done deliberately to minimize the risk of vanishing
gradients and therefore increase the quality of the training of the convolutional
layers. The hidden activation’s used ELUs as activation function, and the activation
of the last layer was just a simple identity function. The architecture of the fully
connected layers is visualized in Figure 3.5.

Figure 3.5: A schematic overview for the fully connected network for the supervised
agent. The input is the feature vector from the convolutional stack and the output
is the steering angle.

3.2.4 Training
Before training the agent obtained via Imitation Learning, the dataset was split
into two different subsets, one for training and one for validation. The size of the
validation-set was chosen to be 4000 samples, and the remaining 996000 was used
for training. There was no regularization used during the training, except early-
stopping. Briefly, early stopping can be described as a technique to avoid overfitting
on the training data by storing the network that has the best result on the validation-
set, and stop the training when there is no improvement on the validation-set over
a fixed amount of steps.
The loss-function used during training was the mean squared error (MSE), i.e.

LMSE = 1
N

N∑
i=1

(yi − f(xi, θ))2 (3.1)

where f(xi, θ) is the output of the Neural Network given the image xi parametarized
by θ and yi is the label corresponding to image xi.
The optimization was conducted with the Adam-optimizer, with the learning rate
of 10−4 and a batch size of 256.

3.3 Reward Function
The reward function implicitly determines the optimization objective, and it is a
vital part since it is via the reward function the optimal behavior is described.
Designing a reward function that describes optimal driving behavior is notoriously

28


3. Methods

hard [34], since driving a car consists of many different types of scenarios that can
be hard to derive to an explicit equation. However since the scope of this thesis is
limited to have the ego-vehicle as the only road user and the environment consist of
a road that goes in a loop, many of the challenges of designing a reward function
for driving can be ignored.

To guide the agent towards the optimal behavior, the designed reward function was
chosen to be dense. The designed reward function consists of two parts, one that
assigns a reward based on the progress the agent is currently making and one that
is more focused on driver comfort metrics and therefore assigns a cost for large
accelerations and jerks. The reward for the agent’s progress was designed to be

Rprog =


cos(ϕ) v

vref
−
∣∣∣α v

vref

∣∣∣ , if v ≤ vref

cos(ϕ)
(

2v
vref
−
(

v
vref

)2
)
−
∣∣∣∣α( 2v

vref
−
(

v
vref

)2
)∣∣∣∣ , if v > vref

(3.2)

where vref is the reference speed, φ is the heading angle relative to the road, and α is
a normalized distance to the center of the right lane, see Figure 3.2. As can be seen
in the equation above, the reward promotes driving with small heading angles as
well as proper lane position. The reward is designed such that given both a negative
and positive offset with equal amplitude and the same steering policy, the negative
offset will yield a larger reward.

Studies show that humans are tolerant to acceleration and jerks up to a certain level
[35], at which the acceleration and jerks become much more intolerable. For this
reason, a cost function suggested in [36] that considers this, was used together with
the corresponding hyperparameters. The cost function used can be defined as

C(x) =


(
x
g

)2
, if x ≤ g(

5
6 + 1

6

(
x
g

)2
)6
, if x > g.

(3.3)

The cost function was used both for the acceleration and the jerk. The threshold pa-
rameter g, for which after the cost rapidly increases was set to 2 for the acceleration
and 1.5 for the jerk, based on the information provided in [35].
Apart from that, the cost function is based purely on acceleration and jerks, a cost
was also assigned to the angular velocity of the steering angle, i.e. ϕ̇. Combining the
reward for the progress with the different cost functions yields the following reward
function

R(at, xt) = Rprog − (βϕ̇ϕ̇+ βaC(a) + βjC(j)) (3.4)

where βi, i ∈ [ϕ̇, a, j] is a hyperparameter that decides much a influence the different
costs will have on the total reward. The hyperparameters used for the different costs
chosen to βa = 0.67, βj = 1.33 and βϕ̇ = 0.25.

29


3. Methods

3.4 Sequential Updates

Because the ACER-model utilizes sequential updates, a sequence of n observations
x0, x1, ..., xn were sampled from experience replay for each update. Even though
the LSTM-cell exploits the sequential structure of the data, naively calculating the
gradients for the entire sequence, might cause unwanted imperfections in the gradient
updates, due to that the first gradient estimates will be based on a non-existing
history. To mitigate this problem a number of skip steps, h, were used that allows
the internal states of the LSTM to build up useful information of the history. As a
result, the gradients for the sequence x0, x1, ..., xh, are not backpropagated through
the network to avoid any imperfections that a non-existing history of the LSTM-cell
might inflict.

3.5 Exploration Strategy

To explore the environment, all the evaluated models used a Gaussian distribution as
exploration noise and for the case where multiple actions were used the covariance
was modeled with a diagonal covariance matrix. The exploration strategy used
consists of decaying the covariance over time until a threshold value is reached,
where the covariance is kept constant to still allow some exploration. The decay
rate of the exploration noise is based on the number of steps the agent did take
in the environment for the current episode. By relating the amount of exploration
based on how much progress the agent is making in the environment rather than
just the number of episodes conducted. The agent will be more likely to carry
out lots of exploration while it is still learning the basics of the assigned task. The
larger decay rate yields smaller exploration noise when the general concept of task is
understood by the agent, but a smaller amplitude exploration noise is still necessary
to make corrections to better optimize the policy. The decay rate is controlled by
the parameter η. The exploration strategy is described in pseudocode format in
Algorithm 2.

Algorithm 2 Exploration Strategy
# Initialize σexp, σend, ∆σ

while i < max episodes do
Reset environment
j=0
while j < max steps ∧ ¬terminal do

take 1 step in environment
j++

# decay exploration
σexp = max {σexp − jη, σend}
i++

30


3. Methods

3.6 Reinforcement Learning Models
In this section, the necessary implementation details of the algorithms will be de-
scribed. Firstly the DDPG algorithms model and parameters will be described
followed by the complicated model of ACER along with its parameters. For all
test and models, the Adam optimizer was used with learning rate 10−4, first order
momentum 0.9 and second-order momentum 0.999.

3.6.1 DDPG-Agent
As explained in Section 3.1.1 the simulations are carried out in two cases, with one
and two actions respectively. Furthermore, the Actor and Critic are implemented
with three fully connected layers. The proposed action from the Actor is merged with
the output from the first layer in the Critic. The number of units used in each layer
and visual representation of the Actor and Critic network is illustrated in Figure 3.6.

Figure 3.6: Schematic overview of the neural networks for the Actor and Critic in
DDPG algorithm. The Actor acts upon the feature-vector from the pre-trained con-
volutional stack of the Imitation Learning agent, previous actions and the internal
states obtained from the simulation environment. The output is modeled with one
or two units depending on whether the model controls both steering and acceleration
or just steering angles. The Critic uses both the feature-vector and the action from
the Actor to estimate a Q-value. The action is concatenated with the output from
the first layer to avoid that the actions drown in the input feature-vector.

31


3. Methods

To avoid catastrophic forgetting and be able to learn from previous policies experi-
ence a replay buffer is implemented. The buffer is designed to hold 50.000 transitions
tuples. Before training, tuples were sampled from the buffer to form a batch of 256
transitions.
The exploration strategy follows the scheme presented in Section 3.5. The explo-
ration noise for the steering angles starts with a standard deviation of 0.25 and decay
for each episode until it stops at 0.025. Similarly, the exploration for the accelera-
tion is performed, the exploration starts at 0.2 and stops at 0.01. The soft update
parameter for the target network is set to 0.001. All parameters can be found in
condensed form in Table 3.2.

Table 3.2: Hyperparameters used for the DDPG-algorithm for both one and two
actions.

1-dim 2-dim
τ 0.001 0.001
σinitsteer 0.25 0.25
σendsteer 0.025 0.025
σinitacc NA 0.2
σendacc NA 0.01
ηsteer 3.75× 10−7 1.27× 10−6

ηacc NA 2.53× 10−7

Nbatch 256 256

3.6.2 ACER-Agent
To make the model capable of handling POMDPs, the original ACER architecture
was extended to include a LSTM-cell. By adding the LSTM-cell, the model is ca-
pable of exploiting the sequential structure of the data, with the drawback that it
will be harder to train.

The first layer after all the convolutions are performed is the LSTM-cell. The LSTM-
cell receives an observation filtered through the convolutional layers trained using
the supervised learning techniques. Worth noting is that the parameters for the
convolutional layers are not a part of the parameters, θ, of the model. After the
LSTM-cell the network is divided into two computational streams, one for the Critic
and one for the Actor, hence the parameters in the LSTM-cell are shared between the
Actor and the Critic. The computational stream for the Actor follows a conventional
feed-forward neural network structure. However the computational stream for the
Critic is divided into two different branches, one for the state value function V π(xi)
and one for the advantage function Aπ(xi, ai). The advantage function is calculated
via the stochastic dueling network technique as

Ã(xt, at) = wT
2 ELU

(
wT

1 [hin, at]T
)
− 1
N

N∑
i=1

wT
2 ELU

(
wT

1 [hin, ãi]T
)
, ãi ∼ π (·|xt)

(3.5)

32


3. Methods

Figure 3.7: A schematic overview of the ACER agents network. The input vector
from the image, previous action and measurements from the simulation is feed to
an LSTM-cell with 512 units. The stream of data is then split into two branches
of fully connected layers. The top stream proposes an action and lower stream
estimates the action-value function through the advantage module. Depending on
whether the acceleration is controlled through the model the action output units
are either 1 or 2. Where the later option both steering angles and acceleration are
controlled simultaneously. Equation (3.5) describes the mathematical operations in
the advantage module.

where hin are the hidden activations from the previous layers,
w1 is a [128× (dim(at) + dim(hin))] weight matrix and w2 is a [1× 128] weight
matrix.
The current estimate of Qπ(xt, at) is simply calculated by adding the two computa-
tional streams for Ã(xt, at) and V (xt), i.e

Q̃(xt, at) = Ã(xt, at) + V (xt). (3.6)

The policy, π, is modeled as a Gaussian distribution with a diagonal covariance ma-
trix. The covariance matrix was not modeled by the network. Instead, a constant
covariance matrix was used with the diagonal elements at a value of 0.32, as Z. Wang
et.al used in the paper [6]. For the agent to explore the environment and try out
different actions, the covariance matrix used for the sampling of actions followed the
exploration scheme presented in Section 3.5 and thereby deviated from the distri-
bution used for the gradients.

During the simulation episodes from the imitation, learning agent was inserted every
50th episode to provide high-quality data for the policy updates. However, the data
obtained from the imitation learning agent was strictly used for the policy updates
and not for the value-function updates For the case where only the steering-wheel
angle was considered, the hyperparameters in Table 3.3 was used.
There were no changes made to the setup for the simulation for the control of both
steering and acceleration except for the rather apparent extension of the action-
space. To speed up the training of the model, the bias of the output node was
initialized to the mean value for the acceleration obtained from the PID-controller.

33


3. Methods

The trust-region constraint δ, was chosen to be 0.01, to prioritize small steps on
the policy instead of large noisy steps to minimize the risk of policy collapses. The
value, τ , that controls the rate of the average policy updates was chosen analogously
with the one presented in the ACER paper [6], to 0.005. The sequence length used
for each update was chosen to be 80 transitions with a fairly large number of 20
skip steps to allow the internal states of the LSTM-cell to settle. The batch size
was chosen to be 30, yielding a number of 1800 transitions for each update, which
has to be considered as a very large batch, but empirically this showed to stabilize
the training. The experience replay stored last 100 episodes in memory and each
sequence was sampled uniformly from that set for each update. The initial standard
deviation for the exploration noise regarding the steering angle was set to 0.25 and
0.2 and decayed to values of 0.025 and 0.01 respectively. The rate of the decay of the
exploration noise is presented in Table 3.3, along with all the other hyperparameters
for the simulation.

Table 3.3: Hyperparameters used for the ACER-algorithm for both one and two
actions. In the table Nbatch denotes the number of sequences used for each batch,
Nadv denotes the number of samples used for Equation (3.5), l denotes the length
of the sequence and h denotes the skip count.

1-dim 2-dim
τ 0.005 0.005
δ 0.01 0.01
c 5 5
l 80 80
h 20 20
σinitsteer 0.25 0.25
σendsteer 0.025 0.025
σinitacc NA 0.2
σendacc NA 0.01
ηsteer 3.75× 10−7 1.27× 10−6

ηacc NA 2.53× 10−7

Nbatch 30 30
Nadv 100 100

34


4
Results

4.1 Training Results

4.1.1 Imitation Learning
The pre-training of the convolutional layers and consequently the training of the Im-
itation Learning agent was conducted analogously with the methodology described
in Section 3.2.4. The training and validation losses are illustrated in Figure 4.1. As
can be seen from the figure, the trends for both the training- and validation loss,
are continuously decaying throughout the training procedure. Due to this artifact
and time constraints, the early stopping never occurred during the training process.
Due to the time constraints, the parameters of the model with the best results on
the validation set after roughly 1.26 × 106 training steps were used. Even though
the smaller validation losses were obtained after that point, the parameters of the
convolutional layers, remained fixed to make the different results comparable.

0.5 1 1.5 2 2.5

10
6

10
-7

10
-6

10
-5

(a) Mean squared error loss on the
training-set during the training.

0.5 1 1.5 2 2.5

10
6

1

1.5

2

2.5

3

3.5

4

10
-5

(b) Mean squared error loss on the
validation-set during the training.

Figure 4.1: Mean squared error loss on both the training- and validation-set during
the training.

35


4. Results

4.1.2 DDPG
The training of the DDPG-agent was conducted analogously with the methodology
presented in Section 3.6.1. Since the DDPG-algorithm works off-policy, the Imi-
tation Learning agent was used to provide high-quality data for the replay buffer,
every 50th episode. Figure 4.2 summarizes the training results obtained for the
DDPG-agent. As can be seen from the Figure 4.2, both the average reward and the
accumulative reward remains relatively unchanged until approximately 400 episodes
after which the trend is increasing until roughly 1000 episodes. After approximately
1250 episodes the exploration noise has reached its minimum value, as can be seen
from the figure, this low level of explorations also yields a slightly decaying trend
on the model’s performance.

200 400 600 800 1000 1200 1400 1600 1800

-400

-200

0

200

400

600

800

1000

(a) Accumulative reward, R, per
episode.

200 400 600 800 1000 1200 1400 1600 1800

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

(b) Average reward per step for each
episode.

200 400 600 800 1000 1200 1400 1600 1800

-500

0

500

1000

1500

2000

2500

3000

(c) Accumulative reward, R, for each
inference run.

200 400 600 800 1000 1200 1400 1600 1800

0

0.05

0.1

0.15

0.2

0.25

(d) Standard deviation of exploration
noise for each episode.

Figure 4.2: Result and logs for the training of the DDPG-algorithm where only
the steering angle was considered for control. The dark-blue line in the uppermost
images is a smoothened version of the true data.

36


4. Results

4.1.3 ACER
The design and training of the ACER-agent were performed as described in Sec-
tion 3.6.2. The agent was trained for two levels of control, one for control only of
the steering and one for control of both the acceleration and steering. Firstly the
training results for control of only the steering will be presented followed by the
training results where the action-space is extended to include the acceleration.

The training of the agent for control of the steering angle is summarized in Fig-
ure 4.3. Similarly, as for the training of the DDPG-agent, the Imitation Learning
agent was deployed at every 50th episode, to fill up the replay buffer with high-
quality data. As can be seen from the figure, the per episode performance of the
agent starts to increase per episode almost immediately. The average reward per
step seems to almost increase linearly until roughly the point of 500 episodes, after
which the inference reward suggests that the increasing average reward is due to the
lower rate of exploration noise, as seen in Figure 4.3d.

0 100 200 300 400 500 600 700 800 900

-100

0

100

200

300

400

500

600

700

800

900

(a) Accumulative reward, R, per
episode.

100 200 300 400 500 600 700 800 900

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

(b) Average reward per step for each
episode.

0 100 200 300 400 500 600 700 800 900

-500

0

500

1000

1500

2000

2500

3000

3500

4000

4500

(c) Accumulative reward, R, for each
inference run.

0 100 200 300 400 500 600 700 800 900

0

0.05

0.1

0.15

0.2

0.25

(d) Standard deviation of exploration
noise for each episode.

Figure 4.3: Result for the training of the ACER-algorithm where only the steering
angle was considered for control.

37


4. Results

For the training of the agent that controls both the steering and acceleration of the
vehicle, the policy stream was initialized with the parameters from the policy with
the best inference performance, see Figure 4.3c. As can be seen from Figure 4.4c,
it performs relatively well on the inference test directly at the start, due to that
the part of the policy that controls the steering is initialized with the best policy
from the previous training. From Figure 4.4a and 4.4b, it can be seen that both the
accumulative episodic reward and the average reward is increasing at a satisfactory
rate. However, due to the good initial steering policy, the agent utilizes the full
episodic length which yields a rapid decay of the exploration noise. After roughly
300 episodes the exploration noise was already at a minimum amount, but the agent
still struggled with controlling the acceleration. Due to this fact, the exploration
strategy was changed to include a fixed standard deviation for the noise of both the
acceleration and steering after 300 episodes, see Figure 4.4d. After the exploration
strategy was changed, the reward and average reward increased a little while before
saturating at around 500 episodes. Figure 4.4a and 4.4b, can, however, be hard
to interpret since a larger level of exploration noise effectively will lower both the
average reward per step and the accumulative episodic reward. The inference reward,
on the other hand, yields a more fair comparison between the performance of the
agent under the two exploration strategies, since the agents actions during inference
correspond to the modes of the distributions. From Figure 4.3c it can be seen
that the accumulative reward obtained during inference increases under the new
exploration strategy and reaches its maximum value after roughly a total of 1300
episodes.

4.2 Model Comparison

4.2.1 Driver Metrics

To compare the performance of the different agents, the normalized lane position α
and heading angle ϕ, were logged for two validation laps around the track. During
the validation laps, the control signals used for the different agents corresponded to
the modes of the distributions. Histograms of the normalized lane position, α and
heading angle ϕ for all the validation laps are summarized in Figure 4.5 and Fig-
ure 4.6. As can be seen from Figure 4.5b, the histogram for the ACER-agent that
controls both the steering and the acceleration, has a considerable fewer number of
samples than the other models. This is due to that the agent failed to complete the
two laps as it drove outside of the road during the first lap.

Figure 4.6, summarizes the histograms of the heading angle, ϕ, obtained from the
validation laps for all the agents. As for the case of the histogram of the normalized
lane position, α, the histogram for the ACER-agent for control of both the acceler-
ation and steering has considerably fewer samples due to that it failed to complete
the two laps.

38


4. Results

200 400 600 800 1000 1200 1400

-400

-200

0

200

400

600

800

(a) Accumulative reward, R, per
episode.

200 400 600 800 1000 1200

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

(b) Average reward per step for each
episode.

200 400 600 800 1000 1200 1400

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

5500

(c) Accumulative reward, R, for each
inference run.

200 400 600 800 1000 1200 1400

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

0.22

0.24

0.26

(d) Standard deviation of exploration
noise for each episode.

Figure 4.4: Result for the training of the ACER-algorithm where both the steering
angle and acceleration were considered for control.

4.2.2 Comfort Metrics
To further analyze the performance of the agents from a human passenger perspec-
tive the comfort-metrics acceleration and jerk were logged during the two validation
laps of the agents. In Figure 4.7 and Figure 4.8 the red bins indicate levels of acceler-
ation and jerks where humans experience discomfort. As mentioned in Chapter 4.2.1
the histograms for the ACER agent with two actions contains fewer samples because
the agent drove off the road.

39


4. Results

(a) Histogram of normalized lane po-
sition, α, for ACER-agent with 1 ac-
tion.

(b) Histogram of normalized lane po-
sition, α, for ACER-agent with 2 ac-
tion.

(c) Histogram of normalized lane po-
sition, α, for the Imitation Learning
agent.

(d) Histogram of normalized lane po-
sition, α, for the DDPG-agent with 1
action.

Figure 4.5: Histograms of the normalized lane position α, for all the successful
implementations. The data was logged for two laps around the test-track.

40


4. Results

(a) Histogram of the relative heading
angle, ϕ, for ACER-agent with 1 ac-
tion.

(b) Histogram of the relative heading
angle, ϕ, for ACER-agent with 2 ac-
tion.

(c) Histogram of the relative heading
angle, ϕ, for the Imitation Learning
agent.

(d) Histogram of the relative heading
angle, ϕ, for the DDPG-agent with 1
action.

Figure 4.6: Histograms of the relative heading angle ϕ, for all the successful
implementations. The data was logged for two laps around the test-track. The unit
for the x-axis of all the plots are radians.

41


4. Results

(a) Histograms of accelerations
[m/s2] for the ACER-agent with 1
action.

(b) Histograms of accelerations
[m/s2] for the ACER-agent with 2
action.

(c) Histograms of accelerations [m/s2]
for the Imitation Learning agent.

(d) Histograms of accelerations
[m/s2] for the DDPG-agent with 1
action.

Figure 4.7: Histograms of accelerations [m/s2] , for all the successful implemen-
tations. The data was logged for two laps around the test-track. Any red bins
suggests that a human would be introduced to a large level of discomfort under
those conditions.

42


4. Results

(a) Histograms of jerks [m/s3] for the
ACER-agent with 1 action.

(b) Histograms of jerks [m/s3] for the
ACER-agent with 2 actions.

(c) Histograms of jerks [m/s3] for the
Imitation Learning agent.

(d) Histograms of jerks [m/s3] for the
DDPG-agent with 1 action.

Figure 4.8: Histograms of jerks [m/s3] , for all the successful implementations.
The data was logged for two laps around the test-track. Any red bins suggests that
a human would be introduced to a large level of discomfort under those conditions.

43


4. Results

4.2.3 Robustness
To evaluate the model’s performance in non-ideal scenarios, the models were eval-
uated on two different test-sequences. For the first test, the agent is exposed to an
error sequence that places the vehicle in the wrong lane of the road. For the second
test, the agent is exposed to an error sequence that forces the vehicle to make a
sharp left turn. The agent retains control of the vehicle at the point where the vehi-
cle crosses the line that separates the left lane from the right. The evaluated models
on the test-sequences were the two agents obtained via the ACER-algorithm, the
DDPG-agent for control of the vehicles steering and the agent obtained via Imita-
tion Learning.

The resulting trajectories for all the models exposed to the first test-sequence are
summarized in Figure 4.9. As can be seen from the figure all the models are capable
of recovering from the unwanted state where the agent is placed in the wrong lane.
The difference between the models are mainly the trajectories generated by the
different agents. As can be seen from the Figure 4.9, the DDPG- and ACER-agent
for control of the steering angle, recover from the unwanted state faster than the
other agents.
For the second test, all the agents did not perform as well as for the first one. As
seen in Figure 4.10, only the DDPG- and ACER-agent for steering-control managed
to recover from the unwanted state caused by the error-sequence. The seen in the
figure, both the ACER-agent for control of both the steering and acceleration as well
as the Imitation Learning agent barely applies any compensation on the steering,
and therefore fails to stay on the road.

44


4. Results

120140160180200220240260280300

540

545

550

555

560

565

570

575

580

585

590

Center of road

Center of lane

Agent path

Predefined path

Road edges

(a) Robustness test for ACER with 1
action.

120140160180200220240260280300

540

545

550

555

560

565

570

575

580

585

590

Center of road

Center of lane

Agent path

Predefined path

Road edges

(b) Robustness test for ACER with 2
action.

120140160180200220240260280300

540

545

550

555

560

565

570

575

580

585

590

Center of road

Center of lane

Agent path

Predefined path

Road edges

(c) Robustness test for the Imitation
Learning agent.

120140160180200220240260280300

540

545

550

555

560

565

570

575

580

585

590

Center of road

Center of lane

Agent path

Predefined path

Road edges

(d) Robustness test for the DDPG al-
gorithm with 1 action.

Figure 4.9: Robustness test for the different agents. Each agent is introduced to
a error sequence that places the vehicle in the wrong driving lane.

45


4. Results

120140160180200220240260280300

540

545

550

555

560

565

570

575

580

585

590

Center of road

Center of lane

Agent path

Predefined path

Road edges

(a) Robustness test for ACER with 1
action.

120140160180200220240260280300

540

545

550

555

560

565

570

575

580

585

590

Center of road

Center of lane

Agent path

Predefined path

Road edges

(b) Robustness test for ACER with 2
action.

120140160180200220240260280300

540

545

550

555

560

565

570

575

580

585

590

Center of road

Center of lane

Agent path

Predefined path

Road edges

(c) Robustness test for the Imitation
Learning agent.

120140160180200220240260280300

540

545

550

555

560

565

570

575

580

585

590

Center of road

Center of lane

Agent path

Predefined path

Road edges

(d) Robustness test for the DDPG al-
gorithm with 1 action.

Figure 4.10: Robustness test for the different agents. Each agent is introduced to
a error sequence that places the vehicle in the wrong driving lane.

46


5
Discussion

In this chapter, the results presented in Chapter 4 will be discussed, as well as the
overall performance of the models. The Chapter is divided into three sections, Su-
pervised Learning, Training Reinforcement Learning Models and Driving Behaviour
and Robustness. In the first section, a discussion is conducted regarding the train-
ing results of the convolutional network and the Imitation Learning agent. In the
next section, the training and learning from the Reinforcement Learning agent is
discussed, as well as a discussion regarding why some agents failed to learn the
task. Lastly the general driving behavior, comfort metrics are discussed in aspects
concerning the implementation errors.

5.1 Supervised Learning

As discussed in Section 4.1.1, the early-stopping strategy used for the supervised
training never materialized, since the validation error continued to decrease. The
trend of the validation error is most likely an artifact of large correlation of the sam-
ples. Although the data was divided into two subsets of individual samples, there is
still a correlation between the two sets, due to the fact that the data was collected on
the same track. A better strategy would probably have been to assign the samples
into different subsets based on where the data was collected on the track, effectively
leading to that the validation-set only containing parts of the track that are unseen
by the model during training.

Apart from the early-stopping strategy, there were no regularization techniques ap-
plied to the model during training. The absence of regularization is most likely
nothing that affected the performance of the models, on the data used in this thesis.
However, implementing regularization techniques such as dropout during training
could help the models to generalize on unseen data.

The Imitation Learning agent obtained via the supervised learning showed promis-
ing results on the test conducted within this thesis. The Imitation Learning agent
was never a part of the initial goals for the thesis, but more of an artifact of the
pre-training of the convolutional layers. Because of this, the model-design of the
Imitation Learning agent is in some sense naive, but based on the performance of
the obtained Imitation Learning agent it would be interesting to see how an imple-
mentation with recurrent units and better regularization would perform.

47


5. Discussion

As mentioned in Section 4.1.1, the supervised training was performed past the point
where the parameters of the model were stored. This was done since that there was
simply not enough time to wait any longer, combined with the assumptions about
the correlation of the data as mentioned previously in this section.

5.2 Training of Reinforcement Learning Models
Generally, the training of RL models is quite a lengthy process. Because of this
a lot of the time spent on this thesis has been dedicated to training the models
evaluated in this thesis. Since both the DDPG- and ACER-algorithm contains a
number of hyperparameters, there simply has not been enough time to perform any
hyperparameter tuning. To which extent the models could be improved with tuned
hyperparameters remains unknown, but the DDPG-algorithm is however known to
be sensitive to hyperparameter settings[37].

The DDPG-agent for control of one action initially showed a promising learning-
curve, but both the accumulative reward and the average reward did unfortunately
saturate after roughly 1100 episodes. The resulting agent did seem to understand
the basic concept of the task, since it managed to stay in the correct lane. From
Figure 4.5d it can also be seen that the agent kept a relatively good lane-position
throughout the two inference laps. However from Figure 4.6, it can be seen that it
failed to maintain a small value of the heading angle, which suggests an oscillating
behavior, which was also verified empirically by observing the agents behavior visu-
ally. The oscillating behavior could probably be reduced and thereby increasing the
agents performance, by applying a carefully selected filter on the output. However,
any further investigations on this subject were never conducted. Since the policy
gradient for the DDPG-algorithm is based on estimates of the action value function,
a biased action value function will also yield an imperfect policy gradient, which
we believe is a probable explanation of the oscillating behavior. A better approach
might be to combine gradients from on-policy returns with the deterministic policy
gradients, which is utilized with the Q-Prop[37] and Interpolated Policy Gradient
algorithm[38].

The DDPG-algorithm did however fail to produce an agent that was capable of
obtaining an accumulative reward larger than zero, for the case when the agent con-
trolled both the steering and acceleration. Attempts to train the agent were made
with and without data from the Imitation Learning agent included in the replay-
buffer. The agent showed promise during the training since it seemed to realize
that driving off the road will yield a negative reward, but it failed to find a good
combination of steering and acceleration to compensate for that behavior. Maybe
another exploration policy could have solved these problems. A popular approach
to model the exploration-noise for the DDPG-algorithm is to model the noise as an
Ornstein–Uhlenbeck process, which was suggested in the Continuous control with
deep reinforcement paper learning[5].

The agent that showed the most promising results among the different agents ob-

48


5. Discussion

tained via Reinforcement Learning methods was the ACER-agent with control of
the steering. As can be seen from Figure 4.3, the learning curve concerning the
accumulative reward and the average reward shows a positive trend more or less
throughout the whole simulation. From Figure 4.5a it can be seen that the agent
manages to maintain a relatively good lane position and Figure 4.6a shows that it
simultaneously manages to keep a small heading angle.

For the ACER-agent that controls both the acceleration and steering the increased
complexity of including the acceleration as a control-signal increased the training
time significantly. Due to the growing complexity and the number of required
training-steps required, the initial exploration strategy was not well suited to the
problem. From Figure 4.4d, it can be seen that the standard deviation of the ex-
ploration noise decayed too quickly and the simulation was therefore restarted and
restored with the weights of the agent with the best performance on the inference
tests at that point. What is seen in the figure is the episodic trace obtained from the
agent with the best inference results. The simulation was continued past the first
point of interruption, where the results indicated a decaying level of performance.
The final agent did not reach the same degree of performance as for the case when
only one action was considered. Even though the agent obtained the highest infer-
ence reward during training, it failed to replicate the result when the performance
metrics presented in Section 4.2.1 and 4.2.2 were evaluated since it crashed during
the data acquisition. The main reason for the higher inference reward is most likely
due to that control-system used on the acceleration for the other agents was not
properly tuned and therefore applied very large control-signal and thus also a signif-
icant penalty, due to large acceleration and jerks, as seen in Equation (3.4). Due to
the large control-signals obtained from the control system, data from the Imitation
Learning agent was not included in the replay buffer.

5.3 Driving Behaviour and Robustness
Due to an error in the implementation of the acceleration and jerk in the simulation
environment, the samples were not divided with the sampling period. The data used
for Figure 4.7 and Figure 4.8, were manipulated to compensate for that error. Based
on the results in Figure 4.7 and Figure 4.8, there seems to be a significant deviation
between the agents using a PID-controller for the acceleration and the ACER-agent
that handles this control-signal interna