Sensor Modelling with
Recurrent Conditional GANs
Recurrent Conditional Generative Adversarial Networks for Generating
Artificial Real-Valued Time Series
Master’s thesis in Complex Adaptive Systems

HENRIK ARNELID

Department of Physics
Chalmers University of Technology
Gothenburg, Sweden 2018


TIFX05, Master’s thesis 2018




© Henrik Arnelid, 2018.

Supervisors: Edvin Listo Zec and Nasser Mohammadiha, Zenuity
Examiner: Mats Granath, Department of Physics

Master’s Thesis 2018
Department of Physics
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Graphical visualization of 100 synthetic time series from the generative network
trained in the RCGAN framework.

Typeset in LaTeX
Gothenburg, Sweden 2018



Abstract
Autonomous vehicles rely on many sensors to perceive their surroundings. Consequently,
safety verification of the sensors is important. It is typically done by collecting data in
many different scenarios, which is time consuming and expensive. For this reason,
autonomous driving software companies are interested in virtual verification, where the
scenarios are simulated. In this thesis we have developed and used the Recurrent
Conditional Generative Adversarial Network (RCGAN) to model the longitudinal error of
sensors. The RCGAN is a modification of the original generative adversarial network (GAN)
framework that uses recurrent neural networks and conditions the networks on auxiliary
information. These changes allow the model to learn and generate realistic real-valued
multi-dimensional time series.

Keywords: Generative Model, Autonomous Vehicle, Recurrent Neural Networks, Generative
Adversarial Networks, Sensor Modelling, Time Series



Acknowledgements
First of all I would like to express my gratitude to Edvin Listo Zec and the Data Analysis
Team at Zenuity for giving me the opportunity to write my master's thesis at this inspiring
company and within a topic that I have a great interest in. Thank you for the discussions,
for giving me feedback continuously throughout the project and for showing so much interest
in my work. Furthermore, I would like to thank Zenuity for accommodating me in their
productive work environment and providing me with all the data and tools that were
needed for my thesis. Finally, I would like to extend my thanks to Edvin, Erik, Majid
and Magnus for all the ping-pong matches as well as the daily push-up challenge.



Contents

1 Introduction
  1.1 Background
  1.2 Problem description
  1.3 Thesis outline

2 Theory
  2.1 Generative Adversarial Networks
    2.1.1 Different types of generative adversarial nets
  2.2 Conditional GANs
  2.3 Recurrent Conditional GANs

3 Methods
  3.1 Implementation of the networks
  3.2 Network training
  3.3 Evaluation of the models
    3.3.1 Kullback–Leibler divergence
    3.3.2 Jensen-Shannon Divergence

4 Results
  4.1 Summary of performance
  4.2 Recurrent Conditional GANs
  4.3 Recurrent MDNs
  4.4 Network trained with mean absolute error
  4.5 Comparison of GAN extensions
    4.5.1 Original loss function
    4.5.2 f-GAN
    4.5.3 WGAN-GP

5 Discussion
  5.1 Comparison of models
  5.2 Comparison of training objective
  5.3 Training
  5.4 Model evaluation

6 Conclusion
  6.1 Future work

A Mixture Density Networks
  A.1 Recurrent Mixture Density Networks



1
Introduction

The past few years have shown an ever increasing popularity of machine learning
algorithms in many different research areas. These types of algorithms have many
applications within the automotive industry, such as advanced driving assistance systems
(ADAS), autonomous driving (AD) and manufacturing optimization [1]. A common
challenge in these areas is to process data and extract important information from it, which
deep learning methods excel at. Deep learning involves machine learning algorithms that
utilize large neural networks with many hidden layers for solving tasks such as
classification and prediction [2].

Zenuity is a company that develops software for autonomous vehicles, which
involves software for ADAS, highly automated driving (HAD) and AD. The vehicles are
equipped with many sensors which produce data that is processed in order for the vehicles
to perceive their surroundings. An important aspect of this is to evaluate the quality and
performance of the sensors. To do so, data has to be gathered from the sensors
in many different scenarios and conditions, which is both expensive and time consuming.

1.1 Background

In order to evaluate the safety of autonomous vehicles with a certain confidence, more
than 450 million kilometers have to be driven to provide good statistics on
a low fatality rate [3]. This is a very time consuming and expensive task which is not
feasible in practice. Thus, Zenuity and other manufacturers resort to virtual verification,
where models of the environment and sensors are made to match reality.

One aspect of these simulation environments, which is considered in this master's thesis, is
to model the error of the sensors as faithfully to the real world as possible. These errors
usually do not follow any known distribution and have an unknown correlation through
time, which makes them difficult to model. However, state-of-the-art generative models
have shown promising results in the generation of time series [4, 5]. One such class of
models is generative adversarial networks (GANs), which are specialized at generating
data that is similar to the real training data [6]. There are other approaches suitable
for generating real-valued time series data, which involve hidden Markov models or
state-based models [7, 8]. However, the focus in this thesis will be on neural network
based models.

GANs have been very successful in various image generation tasks, such as image-to-image
translation [9, 10, 11, 12], image super-resolution [13, 14] and text-to-image
synthesis [15, 16, 17], which shows that the network architecture is well suited to learning
two-dimensional distributions. This suggests that they are a viable choice for learning any
distribution given an appropriate network structure and framework. In this thesis we are
interested in sequential data distributions. GANs have previously been used for sequential
data generation, but these works typically focus on discrete outputs, such as in language
processing [18]. The amount of research on generating continuous real-valued time series is
limited; only two works are known to the author [19, 20].

1.2 Problem description

In this thesis we are interested in creating a generative model that is able to recreate
realistic time series of the kind that a sensor provided by Volvo Car Corporation produces.
The host car is equipped with the sensor as well as a LIDAR that provides the ground truth
reading. The setup of how the errors are recorded can be seen in figure 1.1 below. The error
of the sensor depends on properties of the specific scenario between the host and target,
such as relative position, angle and velocity.

Figure 1.1: The image represents the setting in which the sensor errors are obtained. The dashed square
around the target vehicle is the sensor reading, which is mismatched to the ground truth (GT) of the target
vehicle due to the error of the sensor. These errors, $\varepsilon_{\mathrm{lgt}}$ and $\varepsilon_{\mathrm{lat}}$, are the main focus of this thesis.

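The aim of this thesis is to create a model which, after training, is able to generate novel
samples that resemble the real time series of the errors shown in figure 1.1. The
longitudinal and lateral errors of the sensor are defined as:

$$
\begin{aligned}
\varepsilon_{\mathrm{lgt}} &= Y_{\mathrm{sensor}} - Y_{\mathrm{lidar}} \\
\varepsilon_{\mathrm{lat}} &= X_{\mathrm{sensor}} - X_{\mathrm{lidar}}
\end{aligned}
\tag{1.1}
$$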

In this project different extensions of generative adversarial networks will be trained using
data from sensors provided by Volvo Cars. In order to model the temporal aspect of
the data we will use recurrent neural networks, more specifically Long Short-Term Memory
(LSTM) cells [21], which give the network a "memory" so that previous inputs influence
how newer inputs are processed. Additional modifications have to be made to the original
GAN in order for it to be a suitable model for the given problem. These will be covered in
the next chapter.

The generative model has to be able to handle time series of arbitrary length, as the
lengths of the scenarios can differ substantially from one another. Furthermore, the
framework should be versatile enough that it can be tuned for time series with different
amounts of noise. Some signals may be cleaner than others and the network should still be
able to generate novel samples that resemble the real data.

One of the major desired characteristics of the generative model is that the output should
not be deterministic, i.e. two separate runs with the same input sequence should not
produce the same output. The reason is that the sensors have different intrinsic noise as
well as noise coming from the surroundings, both of which influence the signal. A
deterministic model would not be true to the noise aspect of the signal. However, the output
should still follow the trend of the real time series while being a unique time series each
time the same sequence of features is given to the model, in order to represent the noise
fairly.

1.3 Thesis outline

This thesis is composed of five main chapters. It starts with the Theory chapter, where
Generative Adversarial Networks and extensions and modifications thereof are presented.
Next is the Methods chapter, where the implementation, model evaluation and work process
are described. In the Results chapter, plots of synthetic time series are presented as well as
histograms in which the real and synthetic distributions can be compared. In the Discussion
chapter the results are discussed and explanations thereof are given. Finally, the Conclusion
chapter summarizes the main outcomes of this thesis together with reflections on future
work.



2
Theory

The generative model that was developed in this thesis is called a recurrent conditional
generative adversarial network (RCGAN), which follows the general framework of the
original GAN presented in [6]. A number of changes as well as extensions have been
implemented in order to create a generative model that is able to output sequences
of real-valued data subject to a conditional input. The foundation of GANs and the
extensions and modifications made in this thesis are presented in this chapter. Lastly, the
RCGAN framework is presented.

2.1 Generative Adversarial Networks

Generative adversarial networks (GANs) are a neural network framework aimed to gener-
ate new realistic samples given the distribution p(x) of the training data [6]. This kind of
neural network architecture has shown good results in generating realistic samples in many
different applications [9, 13, 15], where tasks involving images are predominant. In the
framework there are two different neural networks, a generator (G) and a discriminator
(D) network, which have conflicting objectives.

Figure 2.1: A schematic view of the GAN framework. The generator takes a randomly sampled latent
vector z as input and outputs a sample that is judged by the discriminator, which takes either a real
sample, x, or a fake sample, G(z), as input at a given instance. The output from D is the probability
that the sample is real.

The discriminator network is fed both real data and synthetic data produced by
the generator, and it aims to estimate the probability that a sample is real rather than
coming from G. The generative model, G, is trained to capture the data distribution by
trying to maximize the error of D. Backpropagation is applied to both networks, making
D better at telling the data apart while G becomes better at synthesizing realistic samples.
In mathematical terms the GAN is trained to solve the following optimization problem:

$$
\min_G \max_D \; \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))], \tag{2.1}
$$

where $x$ is real data from the distribution $P_r$ and $z$ are samples from a prior distribution
$p(z)$, which typically is $U(0, 1)$ or $\mathcal{N}(0, 1)$. That is, the loss function is
maximized for the discriminator and minimized for the generator. The global optimum of
the optimization is attained when $P_r = P_g$, in other words when the generator's output
distribution $P_g$ matches the real data distribution. The training of the networks is split
into two steps, where the discriminator is trained in one step and the generator in the other.

Starting with the discriminator, samples from the real data $x$ and from the latent space $z$
are drawn, and $z$ is fed through G. Both $x$ and $G(z)$ are fed through D, yielding $D(x)$
and $D(G(z))$. Next the loss $\mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$ is calculated,
the gradients are computed and the weights of D are updated.

When the generator is trained, samples from the latent space $z$ are drawn and fed through
G, whose output is fed through D. The output from the discriminator is then used in the
loss $\mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$ and the gradients are calculated, followed by an update
of the weights of G. It is possible to instead maximize $\mathbb{E}_{z \sim p(z)}[\log D(G(z))]$, which
provides stronger gradients early in learning while maintaining the same fixed point of D
and G [6].
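
The two training steps can be illustrated numerically. The following is a minimal sketch, assuming the discriminator outputs are already available as probabilities; the function names are illustrative and are not taken from the thesis implementation. It evaluates the discriminator objective from (2.1) and both generator variants, and compares the gradient magnitudes with respect to D(G(z)) when the discriminator confidently rejects a fake sample.

```python
import math

def discriminator_objective(d_real, d_fake):
    """Mean of log D(x) + log(1 - D(G(z))); D is trained to maximize this."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_saturating(d_fake):
    """Mean of log(1 - D(G(z))); G is trained to minimize this."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

def generator_non_saturating(d_fake):
    """Mean of log D(G(z)); G is trained to maximize this instead."""
    return sum(math.log(p) for p in d_fake) / len(d_fake)

# Early in training D easily rejects fakes, so D(G(z)) is close to 0.
p = 0.01
grad_saturating = -1.0 / (1.0 - p)   # derivative of log(1 - p) w.r.t. p
grad_non_saturating = 1.0 / p        # derivative of log(p) w.r.t. p
```

With p close to zero the second derivative is roughly a hundred times larger in magnitude than the first, which is the sense in which the alternative generator loss gives stronger gradients early on.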

In the training scheme used in the GANs the generator never sees the true data directly,
only indirectly through the gradient from the discriminator. This avoids having features
in the input directly copied to the parameters of the network which in turn may gain
statistical advantages compared with other models [6].

GANs using the standard implementation are known to have an unstable training
process, where a discriminator or generator that is too good can cause the model to
collapse and not output anything meaningful [22]. There are also issues with vanishing
gradients if the discriminator saturates. There is an abundance of previous research on
improving the training of the original GANs [22, 23, 24, 25]. A brief introduction to some
of the extensions is given in the next section.

2.1.1 Different types of generative adversarial nets

The common objective for many variations of the GANs is to minimize the distance
between two distributions, namely the real data and the synthetic data. There are a
number of metrics suitable for this purpose, an example of one family of such metrics is the
f-divergence metrics. These are incorporated in different GAN extension implementations
where the optimization objective is modified. The structure of the neural net can remain
the same for the different types of GANs with the exception of the activation function on
the final output neuron of the discriminator. The change that is primarily made is to the
loss function of the networks making it easy to implement several different extensions in
order to find a suitable one for the problem at hand.

One of the more popular extensions is the Wasserstein GAN, abbreviated WGAN, which
provides a modification to the framework such that one optimizes with respect to the
Earth-Mover (EM) distance instead. This approach yields cleaner gradients in all parts of
the space, unlike the original GAN, which can experience vanishing gradients in some parts
of the space. Furthermore, the discriminator has to be a K-Lipschitz function for some
K, which is enforced by clamping the weights to a fixed box, for example $W_D \in [-0.1, 0.1]$,
where $W_D$ are the weights of the discriminator [23]. The optimization objective for the
WGAN is:

$$
\min_G \max_D \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{z \sim p(z)}[D(G(z))] \tag{2.2}
$$

In the WGAN the activation of the last output neuron in the discriminator is removed
and is replaced by a linear activation. Enforcing the K-Lipschitz constraint by clipping
the weights is not an ideal approach and may lead to issues with the training process.
The authors of the WGAN have left options for enforcing the constraint for further in-
vestigation.
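
As an illustration of the weight clamping described above, the sketch below clips a list of weights to the box [-c, c] and evaluates the objective in (2.2) from raw discriminator outputs. It is a minimal example under the assumption that the weights are available as a flat list; the helper names are hypothetical.

```python
def clip_weights(weights, c=0.1):
    """Clamp every weight to the box [-c, c] after an update step."""
    return [max(-c, min(c, w)) for w in weights]

def wgan_critic_objective(d_real, d_fake):
    """Mean D(x) minus mean D(G(z)) from (2.2); the discriminator maximizes this."""
    return sum(d_real) / len(d_real) - sum(d_fake) / len(d_fake)

clipped = clip_weights([-0.5, 0.03, 0.2])
```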

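To address the downsides of weight clipping, [24] proposed the WGAN with
gradient penalty (WGAN-GP), which penalizes the gradient norm of the discriminator
instead of clipping the weights. The modified training objective is given by:

$$
L = \mathbb{E}_{z \sim p(z)}[D(G(z))] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[ \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2 \right] \tag{2.3}
$$

where $\hat{x} = \epsilon x + (1 - \epsilon) G(z)$, $\epsilon$ is drawn from $U(0, 1)$ and $\lambda$ is the penalty coefficient.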
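
The penalty term in (2.3) can be sketched for a case where the gradient is known in closed form. The example below assumes a hypothetical linear discriminator D(x) = w · x, whose gradient with respect to x equals w at every point, so no automatic differentiation is needed. The value λ = 10 is an illustrative assumption, not a setting reported in this chapter.

```python
import math
import random

def interpolate(x_real, x_fake, eps):
    """x_hat = eps * x + (1 - eps) * G(z), element-wise."""
    return [eps * r + (1.0 - eps) * f for r, f in zip(x_real, x_fake)]

def gradient_penalty(grad, lam=10.0):
    """lam * (||grad||_2 - 1)^2 for one interpolated sample."""
    norm = math.sqrt(sum(g * g for g in grad))
    return lam * (norm - 1.0) ** 2

rng = random.Random(0)
# The point where the gradient would be evaluated for a general discriminator.
x_hat = interpolate([1.0, 2.0], [0.0, -1.0], rng.random())

# For the assumed linear discriminator the gradient is w, independent of x_hat.
penalty_unit = gradient_penalty([0.6, 0.8])    # ||w||_2 = 1, penalty vanishes
penalty_large = gradient_penalty([3.0, 4.0])   # ||w||_2 = 5, penalty is 10 * 16
```

The penalty is zero exactly when the gradient norm is one, which is how the objective encourages the Lipschitz constraint without restricting the weights to a box.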

The final extension that was considered is the f-GAN which is a framework where any
f -divergence can be used for training the GANs. As with the previous extensions the
structure of the networks remains the same where the big changes lie in the output
activation of D and the loss functions. So, let T (x) = gf (v(x)), where v(x) is the linear
output from D and gf (·) is the output activation function. Then the training objective is
defined by (2.4) below.

$$
\min_G \max_D \; \mathbb{E}_{x \sim P_r}[T_\omega(x)] - \mathbb{E}_{\hat{x} \sim P_g}[f^*(T_\omega(\hat{x}))] \tag{2.4}
$$

By selecting specific $g_f(\cdot)$ and $f^*(\cdot)$ functions it is possible to recover the desired
f-divergences. Recommended choices of $g_f$ and $f^*$ are presented in table 2.1 below [25].
Four different f-divergences have been tested in order to evaluate whether they can
stabilize or improve the learning: Total variation, Pearson χ² (see footnote 1), Neyman χ²
and Squared Hellinger.

Table 2.1: Recommended final layer activation functions and conjugate functions for variations of the
f-GAN optimization objective [25].

Name                 Output activation g_f    Conjugate f*(t)
Total variation      (1/2) tanh(v)            t
Pearson χ²           v                        (1/4) t² + t
Neyman χ²            1 − exp(v)               2 − 2√(1 − t)
Squared Hellinger    1 − exp(v)               t / (1 − t)
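
The entries of table 2.1 can be plugged directly into the objective (2.4). The sketch below is a minimal illustration in which the linear discriminator outputs v(x) are given as plain lists; the dictionary and function names are hypothetical and only the four divergences of the table are included.

```python
import math

# Output activation g_f(v) and conjugate f*(t) for each divergence in table 2.1.
F_GAN_VARIANTS = {
    "total_variation": (lambda v: 0.5 * math.tanh(v),
                        lambda t: t),
    "pearson_chi2": (lambda v: v,
                     lambda t: 0.25 * t * t + t),
    "neyman_chi2": (lambda v: 1.0 - math.exp(v),
                    lambda t: 2.0 - 2.0 * math.sqrt(1.0 - t)),
    "squared_hellinger": (lambda v: 1.0 - math.exp(v),
                          lambda t: t / (1.0 - t)),
}

def f_gan_objective(name, v_real, v_fake):
    """Mean T(x) over real samples minus mean f*(T(x_hat)) over generated ones."""
    g_f, f_star = F_GAN_VARIANTS[name]
    real_term = sum(g_f(v) for v in v_real) / len(v_real)
    fake_term = sum(f_star(g_f(v)) for v in v_fake) / len(v_fake)
    return real_term - fake_term

value = f_gan_objective("pearson_chi2", [1.0, 3.0], [2.0])
```

Because only the activation and the conjugate change between variants, switching divergence amounts to selecting a different dictionary entry while the network structure stays the same.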

Footnote 1: There is another GAN modification called Least Squares GANs [26] whose minimization
objective also is the Pearson χ² divergence through a different set of functions. Due to this it was left out
in this thesis.

2.2 Conditional GANs

When the data is dependent on a set of features it is possible for the GANs to model this
by implementing a straightforward extension. The distribution $p(x|c)$ can be learned
by the network simply by adding the conditional feature vector $c$ as input to both G
and D [6]. The loss functions of the original implementation for the generator and
discriminator then take the following form:

$$
\begin{aligned}
L_D &= \log D[x, c] + \log(1 - D[G[z, c]; c]) \\
L_G &= \log D[G[z, c]; c]
\end{aligned}
\tag{2.5}
$$

Where x are real samples, c is the conditional input and z are random samples from
a latent space. Here we wish to maximize both LD and LG. As previously mentioned,
in the original implementation of GANs one tries to minimize log(1−D(G(z))) for the
generator, however, by using the form above in (2.5) a much stronger gradient is obtained
early in the learning process [6]. The modifications are done analogously for WGAN and
f -GAN as well, where c is added to both loss functions for the generator and discriminator
in the same manner as in (2.5).

2.3 Recurrent Conditional GANs

The model that has been developed during this thesis is based upon the original GAN
framework with some modifications. Firstly, both the generator and discriminator are
recurrent neural networks instead of multilayer perceptrons or convolutional nets. The
second modification is to include the conditional input to both networks so that they can
make informed decisions/predictions given the current input. This conditional input can
be in the form of one-hot encoding or other labels, but in this thesis it is a continuous signal
of different properties of the car-tracking scenario mentioned in the problem description
section of the report.

Figure 2.2 below shows schematic views of the architecture of the networks. In
the same manner as with the original GAN, the generator takes as input a latent vector $z$
sampled from some known distribution. Furthermore, we feed the discriminator either
a real or a fake time series, and the network outputs at each time frame a probability
that the sample is real. The predictions for all time frames are then used in
the loss function. If a GAN modification such as f-GAN is used, the interpretation of the
output is different, as the sigmoid activation is no longer used and the output is thus not a
probability. Finally, both networks also take a conditional input with a set of continuous
features at each time frame.


(a) Structure of the generator. It takes input from
a latent space as well as a conditional input at each
time frame.

(b) Structure of the discriminator. It takes either
a real or fake time series together with the condi-
tional input as input at each time frame.

Figure 2.2: The two images show the architecture of the framework. Both RNNs in the generator and
discriminator respectively are LSTM based.

The internal structure of the networks presented above, and how the data flows through
them, is non-trivial. Starting with the generator, it contains a "noise" network and a
"context" network, see figure 2.3a. It was found that when the noise and the conditional
input are given to the same deep RNN, similar to the structure in figure 2.3b, the noise
would become almost non-existent after a few thousand iterations. By isolating the noise
to its own RNN it is possible to control the amount that is applied to the output in much
greater detail by varying the noise distribution, the network size and the size of the
latent vector $z$. The context network gives the generator a memory of the sequence of
features that are given as the conditional input. Lastly, a skip connection was added for $c_t$
so that information about the input is not completely lost or distorted in the deep RNN.
A skip connection feeds the input to additional layers in the network and not
only to the first hidden layer; this can be seen in figure 2.3, where $c_t$ is skip connected to
a deeper layer of the network. This has for instance been used by [4], who saw improved
learning when training LSTM-based networks.

The discriminator has a simpler structure, where the data point(s) from either a real
or fake time series, $x_t$, are concatenated with the corresponding features at each time
frame and given to a deep RNN. A skip connection was also used in this network for the
same reason. By adding the skip connection for $c_t$ to the fully connected layer in both
networks, a more stable and faster learning was obtained during training.


(a) The generator takes a sample, z, from the latent space, which is passed to a single-layered RNN,
while the conditional input is fed to a multilayered RNN at each time frame t. The outputs from both
RNNs and the skip-connected conditional input are all concatenated and fed into a fully connected
layer, and the final output is a linear activation of the inputs from the fully connected layer.

(b) The discriminator takes a sample x_t from a time series, real or fake, together with the conditional
input at each time frame t. These two inputs are concatenated and passed to a multilayered RNN whose
output goes into a fully connected layer together with the skip-connected conditional input. Finally
there is one output neuron for the prediction of the specific time frame.

Figure 2.3: The two figures show the internal structure of the generator (left) and discriminator (right).

The loss function in (2.5) is modified slightly when transferred to the time series applic-
ation. The new loss, (2.6), is obtained by simply taking the mean of the conditional loss
across the time dimension where n is the length of the given time series. The modification
to the loss is done analogously for WGANs and f -GANs as well.

$$
\begin{aligned}
L_D &= \frac{1}{n} \sum_{i=1}^{n} \left\{ \log D[x_i, c_i] + \log(1 - D[G[z_i, c_i]; c_i]) \right\} \\
L_G &= \frac{1}{n} \sum_{i=1}^{n} \log D[G[z_i, c_i]; c_i]
\end{aligned}
\tag{2.6}
$$
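
Taking the mean of the conditional losses in (2.5) over the time dimension can be sketched as follows. The example assumes that the per-frame discriminator outputs are already computed and are probabilities in (0, 1); `rcgan_losses` is a hypothetical helper name, not a function from the thesis code.

```python
import math

def rcgan_losses(d_real, d_fake):
    """Time-averaged conditional losses for one series of n frames.

    d_real[i] stands for D[x_i, c_i] and d_fake[i] for D[G[z_i, c_i]; c_i].
    Both losses are to be maximized, as for (2.5).
    """
    n = len(d_real)
    loss_d = sum(math.log(r) + math.log(1.0 - f)
                 for r, f in zip(d_real, d_fake)) / n
    loss_g = sum(math.log(f) for f in d_fake) / n
    return loss_d, loss_g

short_series = rcgan_losses([0.9, 0.9], [0.1, 0.1])
long_series = rcgan_losses([0.9] * 200, [0.1] * 200)
```

Because of the division by n, a series of 2 frames and a series of 200 frames with the same per-frame predictions give the same loss, so the length of a series does not skew the objective.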



3
Methods

The main work has been divided into three parts, namely developing and implementing the
models, constructing a suitable training scheme and finally evaluating the trained models.
After these three steps the focus shifted to hyperparameter tuning in order to improve
the results of the models, which is the step where the majority of the time was spent during
this thesis.

3.1 Implementation of the networks

The models in this thesis have been implemented in Python 3.6 using TensorFlow 1.5.
Suitable network sizes that were used are 3 layers and 64 LSTM nodes in the deep RNNs,
followed by a fully connected layer with 64 neurons, for both the generator and
discriminator. The single-layered RNN in the generator also had 64 LSTM nodes, and the
elements of its input $z = [z_1, z_2, \ldots, z_{32}]^\top$ are drawn from $\mathcal{N}(0, 1)$. Although the main
focus of this thesis has been to implement the RCGAN, additional network types and
training schemes have been implemented in order to see how the performance of the RCGAN
stacks up against these. A supervised training scheme was implemented for the generator by
simply using its output in a mean absolute error loss function instead of pitting the generator
against the discriminator network in the GAN setting. An additional network type that
was implemented was a recurrent mixture density network (R-MDN) [27], which is a
type of network that outputs the parameters of a Gaussian mixture model (GMM) at
each time frame. See Appendix A for further explanation of the MDN and R-MDN. The
data points were sampled directly from the output GMMs at each time frame, thus not
including correlation between successive samples.

When training neural networks it is common to utilize GPU acceleration to decrease the time
spent on training the models. However, here all models were trained on the CPU, as
this was nearly ten times faster than on the GPU. There are several reasons for this.
One is that recurrent neural networks execute in a sequential fashion, which
limits the pace at which the GPU can work. Another is that batch training has
not been a viable option due to the varying length of the sequences in the data set, which
further limits the benefit of training on a GPU.


3.2 Network training

The data set used for training contains 8,639 unique time series which are longer than 50
time frames. In total there are 1,828,577 time frames in the training data. The validation
set contains 1,365 unique time series with a total of 202,848 time frames.

Two different strategies have been used when training the networks in this thesis. The
first one involves feeding an entire time series, which can be of any length, to the network
and calculating all the outputs. These outputs are then used in the loss function, and the
mean is taken so that the loss is not skewed by the length of the time series.
Then the gradients are calculated and the weights are updated. This method works well if
the data set contains mostly shorter time series, as issues with vanishing gradients are then
not as pronounced. Furthermore, when training on longer sequences the loss function can get
diluted, as individual time frames do not significantly impact the average of all the
predictions from the discriminator, see (2.6).

If the time series in the training set are longer or contain finer details, a different strategy
has been used. Here the time series are split into shorter series which are fed to the
network sequentially. Between successive partial time series the state of the recurrent
nodes has to be kept, so that the "memory" of the entire
time series is maintained when the next set of predictions is calculated. After each
partial sequence is fed, the loss function and gradients are calculated and the weights are
updated. Thus, the weights are updated much more often in this training scheme.
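
The second strategy can be illustrated with a toy recurrent cell. The sketch below uses a scalar tanh RNN as a stand-in for the LSTM layers (an assumption made to keep the example short) and shows that carrying the state between chunks reproduces the outputs of a pass over the whole series. The weight update that would follow each chunk is omitted so that the two passes remain comparable.

```python
import math

def rnn_forward(inputs, state, w_in=0.5, w_rec=0.8):
    """Run a scalar tanh RNN over the inputs; return the outputs and the final state."""
    outputs = []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        outputs.append(state)
    return outputs, state

def split_series(series, chunk_len):
    """Cut one time series into consecutive chunks of at most chunk_len frames."""
    return [series[i:i + chunk_len] for i in range(0, len(series), chunk_len)]

series = [0.1 * t for t in range(10)]

# First strategy: one forward pass over the whole series.
full_outputs, _ = rnn_forward(series, state=0.0)

# Second strategy: feed the chunks in order and keep the recurrent state.
state = 0.0
chunked_outputs = []
for chunk in split_series(series, chunk_len=4):
    outputs, state = rnn_forward(chunk, state)
    chunked_outputs.extend(outputs)
```

If the state were reset to zero at the start of every chunk, the outputs from the second chunk onward would differ, which is the loss of "memory" the text warns about.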

Before any training takes place the weights are initialized, which is done differently for
the LSTMs and the fully connected layers. The weights for the LSTMs are initialized
randomly where samples are drawn from a truncated normal distribution with mean 0
and standard deviation 0.1. Meanwhile, the weights of the fully connected layers are
initialized according to the Xavier initialization [28].
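
The two initialization schemes can be sketched with the standard library alone. Two details below are assumptions rather than statements from the text: the truncation redraws values further than two standard deviations from the mean, and the Xavier initialization is shown in its uniform variant with limit sqrt(6 / (fan_in + fan_out)).

```python
import math
import random

def truncated_normal(rng, mean=0.0, std=0.1, bound=2.0):
    """Draw from N(mean, std^2), redrawing values further than bound * std from the mean."""
    while True:
        value = rng.gauss(mean, std)
        if abs(value - mean) <= bound * std:
            return value

def xavier_uniform(rng, fan_in, fan_out):
    """Draw one weight from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit)

rng = random.Random(0)
lstm_weights = [truncated_normal(rng) for _ in range(1000)]
dense_weights = [xavier_uniform(rng, fan_in=64, fan_out=64) for _ in range(1000)]
```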

3.3 Evaluation of the models

Evaluating generative models is a difficult task as it is, for many generative models,
computationally intractable to calculate the likelihood [29]. Usually one tries to generate
visually appealing results that agree with the true distribution. However, visual inspection
is not a viable method in the long run, as it would be infeasible for an operator to manually
inspect every sample produced by the generative model. If the model is able to learn the
distribution of the true data implicitly it is considered successful. By evaluating a suitable
distance metric between the real and fake data distributions we can get an assessment of
the trained model. However, such a metric will not capture the temporal aspect of the data,
only the overall distribution. An estimate of how well the model captures the temporal
aspect can be obtained by calculating the first difference between sequential time frames for
both real and fake data and then calculating the distance metric between those distributions.
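
A minimal sketch of this preprocessing step is given below. It computes the first difference of a series and a normalized histogram over a fixed range, which could then be passed to a distance metric. The bin range and bin count are illustrative assumptions, and the helper names are hypothetical.

```python
def first_difference(series):
    """Return x[t+1] - x[t] for consecutive time frames."""
    return [b - a for a, b in zip(series, series[1:])]

def histogram(values, low, high, bins):
    """Normalized histogram over [low, high]; values outside the range are ignored."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        if low <= v <= high:
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

real = [0.0, 0.1, 0.3, 0.2, 0.2]
diffs = first_difference(real)
p_diff = histogram(diffs, low=-0.5, high=0.5, bins=4)
```

Using the same bin edges for the real and the synthetic data is what makes the two histograms comparable as discrete distributions.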


3.3.1 Kullback–Leibler divergence

The Kullback-Leibler (KL) divergence is the foundation of a useful metric which has been
used to score the models in this thesis. The KL divergence is defined as:

$$
K(P \| Q) = \sum_i P(i) \log\left( \frac{P(i)}{Q(i)} \right) \tag{3.1}
$$

for two discrete probability distributions P and Q. There are a number of drawbacks
with this metric. Firstly, it is asymmetric. More importantly, if there are outcomes with
nonzero probability in P that have zero probability in Q, these result in an infinite value
of the metric, which is undesired.
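
Both drawbacks can be seen in a small numerical example. The sketch below implements (3.1) for discrete distributions given as lists of probabilities; terms with P(i) = 0 are skipped under the usual convention that 0 log 0 = 0.

```python
import math

def kl_divergence(p, q):
    """K(P||Q) = sum_i P(i) log(P(i) / Q(i)) for discrete distributions.

    Terms with P(i) = 0 contribute nothing; P(i) > 0 with Q(i) = 0 gives infinity.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
forward = kl_divergence(p, q)    # finite
backward = kl_divergence(q, p)   # infinite, since q has mass where p has none
```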

3.3.2 Jensen-Shannon Divergence

The Jensen-Shannon divergence is obtained combining the KL-divergence cleverly in order
to create a metric that is symmetric and without the possibility of approaching infinity
The Jensen-Shannon divergence is defined as:

JSD(P ||Q) = 1
2K(P ||M) + 1

2K(Q||M), (3.2)

whereM = 1
2(P+Q) andK(·||·) is the Kullback–Leibler divergence. The Jensen-Shannon

divergence is limited to the interval [0, log 2] when using the natural logarithm, where 0 is
obtained if P and Q are identical distributions and log 2 is obtained if said distributions
are completely disjoint. The square root of the JSD is called the Jensen-Shannon distance
(JSd), which is the metric that has primarily been used to score the trained models. This
metric cannot be used on its own to evaluate the models as it does not consider the temporal
aspect of the data. Further evaluation is done by visual inspection of the synthetic time
series.
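
As a minimal sketch of (3.2), the Jensen-Shannon divergence and distance can be computed for two
normalized histograms as shown below. The helper names are assumptions made for this example
and the code is not taken from the thesis implementation.

```python
import math


def kl_divergence(p, q):
    """K(P||Q) with the natural logarithm; terms with P(i) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """JSD(P||Q) = K(P||M)/2 + K(Q||M)/2 with M = (P + Q)/2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    # M(i) > 0 wherever P(i) > 0 or Q(i) > 0, so both KL terms are finite.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_distance(p, q):
    """Jensen-Shannon distance (JSd): the square root of the divergence."""
    return math.sqrt(max(js_divergence(p, q), 0.0))


identical = js_divergence([0.2, 0.8], [0.2, 0.8])  # lower bound, 0
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])   # upper bound, log 2
```

With the natural logarithm the JSd therefore lies between 0 and the square root of log 2, which
is approximately 0.83.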


4
Results

The results from the different models as well as the different training objectives are presented
in this chapter. For each model we plot synthetic versions of the same 100 time series that are
shown in figure 4.1, which depicts the real counterparts of the upcoming synthetic plots.
Furthermore, histograms of the errors and of the first difference of the errors in the entire
validation set are plotted together with the generated errors and their first difference, for a
comparison between the distributions. Finally, the JSd between the real and synthetic data
distributions is given for all models.

Figure 4.1: The figure depicts the longitudinal error of 100 time series from the validation data set. All
errors have been rescaled due to confidentiality reasons.

All models have been trained using ADAM [30] with β1 = 0.5, β2 = 0.999 and a learning
rate of $10^{-5}$. Each GAN was trained for four epochs using a batch size of one time series.
The parameters β1 and β2 control the decay rates of the moving averages of the gradients and
squared gradients that are used in ADAM. Furthermore, a low dropout rate of 5% was applied to
the discriminator network. All models presented have been trained to reproduce the longitudinal
error, see figure 1.1. Lastly, the conditional feature vector contains a number of features such
as range to the target, relative angle between the vehicles and relative speed. All features
were normalized to zero mean and unit variance.
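
To make the role of β1 and β2 concrete, a single ADAM update for one scalar parameter can be
sketched as below. The update rule follows [30]; the value of eps is the default suggested
there and is an assumption, since it is not stated in this thesis. The models themselves were
trained with the optimizer provided by the deep learning framework, not with this code.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-5, beta1=0.5, beta2=0.999, eps=1e-8):
    """One ADAM update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad         # moving average of the gradient
    v = beta2 * v + (1.0 - beta2) * grad * grad  # moving average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Hypothetical starting point and gradient, for illustration only.
param, m, v = 1.0, 0.0, 0.0
param, m, v = adam_step(param, grad=2.0, m=m, v=v, t=1)
```

A lower β1 shortens the memory of the gradient average, so the update direction reacts faster
to recent gradients.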

4.1 Summary of performance

The JSd scores of the three different models are summarized in table 4.1 below. The
RCGAN that is proposed in this thesis is the best performing model by a significant margin.
When the generator in the RCGAN is instead trained in a supervised fashion by minimizing
a distance metric, a much higher JSd is obtained, i.e. the synthetic data distribution
does not match the real data distribution. Lastly, the R-MDN is able to capture the
data distribution fairly well but with too much noise. Due to the noisier time series, more
values are likely to be covered near the real error for the time series in the validation set,
causing the JSd score of the error to become better whereas that of the first difference becomes
worse. Plots of synthetic data with the same feature vector time series as in figure 4.1 are
given for each model for a visual comparison in the upcoming sections.

Table 4.1: The performance against the validation data set for the three different models that were looked
at in this thesis.

Network type    JSd               1st diff JSd
RNN (MAE)       0.377 ± 0.0006    0.508 ± 0.0004
R-MDN           0.153 ± 0.0008    0.158 ± 0.0016
RCGAN           0.076 ± 0.001     0.095 ± 0.0009

4.2 Recurrent Conditional GANs

In Figure 4.2 the same set of 100 time series has been plotted, now generated by the RCGAN.
We see that there is an initial transient at the beginning of every time series before it
settles. The characteristics are fairly similar to what is seen in figure 4.1, with similar
noise levels and magnitude of the errors. There are however a few exceptions: the initial
transient is similar for many different time series and not as diverse as in the real data.
There is also a limited number of unique features in the generated data, such as sudden peaks
or drops.

Figure 4.2: The figure depicts 100 time series of the longitudinal error generated by the RCGAN. The
features corresponding to the 100 time series plotted in figure 4.1 have been given to the generator. All
errors have been rescaled due to confidentiality reasons.

From the histograms in figure 4.3 below, together with the JSd values, we see that the generator
is able to grasp the overall trend of the data well, with a JSd of 0.075. The lack of
unique features is reflected in the histogram of the first difference, where the tails of the
true data are not quite captured by the generator.

(a) Histogram of the generated errors,
JSd = 0.075

(b) Histogram of the first difference,
JSd = 0.095

Figure 4.3: The histograms depict real (blue) and synthetic (sandy-brown) data. The histogram to the
left shows the distribution of the errors from all time series in the validation set. The right histogram
shows the distribution of the first difference of the same data.

In Figure 4.4, four time series from the validation set are plotted together with their synthetic
counterparts. The mean of 100 runs shows that the RCGAN is able to output trajectories
that are close to the real data each time, with a suitable noise level, as seen from the one
standard deviation fill around the mean of the 100 synthetic sequences.

(a) (b)

(c) (d)

Figure 4.4: The four plots show four different time series from the validation set. In the plots we see the
mean of 100 runs where the same sequence of features is fed to the generator each run, the fill around
the mean is ± one standard deviation. The black trajectory is one of the 100 synthetic sequences and
the orange is the actual sequence. All errors have been rescaled due to confidentiality reasons.
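
The mean and the standard deviation band shown in Figure 4.4 can be computed per time step
as in the sketch below. The function toy_generator is a hypothetical stand-in for the trained
generator (it simply adds Gaussian noise to its input) and the feature sequence is made up for
the example; only the aggregation over the 100 runs is the point here.

```python
import random
import statistics


def toy_generator(features, rng):
    """Hypothetical stand-in for a trained generator: input plus fresh noise."""
    return [f + rng.gauss(0.0, 0.1) for f in features]


def mean_and_std_band(runs):
    """Per-time-step mean and population standard deviation over equal-length runs."""
    columns = list(zip(*runs))  # one tuple per time step, holding all runs
    mean = [statistics.mean(c) for c in columns]
    std = [statistics.pstdev(c) for c in columns]
    return mean, std


rng = random.Random(0)
features = [0.01 * t for t in range(50)]

# The same feature sequence is fed to the generator in every run; only the noise differs.
runs = [toy_generator(features, rng) for _ in range(100)]
mean, std = mean_and_std_band(runs)
lower = [m - s for m, s in zip(mean, std)]
upper = [m + s for m, s in zip(mean, std)]
```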

4.3 Recurrent MDNs

In Figure 4.5 we see a different appearance of the 100 synthetic time series. With the
R-MDN model the noise on top of the error is much spikier compared with the real data.
Despite this, the overall trend of the time series seems to be captured well. The initial
parts of the generated sequences show a plausible initial spread compared with the true
data.

Figure 4.5: The figure depicts 100 time series of the longitudinal error that were sampled from the output
GMMs in each time step of the recurrent MDN model. The features corresponding to the 100 time series
plotted in figure 4.1 have been given to the network. All errors have been rescaled due to confidentiality
reasons.

In the histogram of the errors in Figure 4.6a we see that the R-MDN is not as expressive
as the real data, as the tails of the distribution are not captured. When looking
at how the noise is represented in Figure 4.6b we clearly see that there is too much of it,
as the tails of the synthetic distribution are larger than for the true data. Due to this
the tall peak at zero is not fully replicated.

(a) Histogram of the generated errors,
JSd = 0.152

(b) Histogram of the first difference,
JSd = 0.160

Figure 4.6: The histograms depict real (blue) and synthetic (sandy-brown) data. The histogram to the
left shows the distribution of the errors from all time series in the validation set. The right histogram
shows the distribution of the first difference of the same data.

4.4 Network trained with mean absolute error

The plot of the 100 time series in Figure 4.7 below shows the result when the generator
is trained using the mean absolute error (MAE) instead of the discriminator network. In other
words, the output of the generator network is directly used in the MAE loss function for
a supervised training of the generator. The model trained in this setting is not able to
capture the noise of the true data. Furthermore, the sinusoidal shaped initial transient is
present for some of the sequences but not for all. Additionally, the spread of the different
time series is much tighter than what is seen for the real data in figure 4.1.

Figure 4.7: The figure depicts 100 time series of the longitudinal error generated by the generator network
when trained using MAE instead of the discriminator network. The features corresponding to the 100
time series plotted in figure 4.1 have been given to the generator. All errors have been rescaled due to
confidentiality reasons.

The observations are further confirmed in the histograms in figure 4.8, where the synthetic
error sequences are clumped together at similar values with a short tail towards positive
values. The lack of expressiveness is reflected in the missing tails compared with the
real data as well as in the very tight spread of the first difference of the synthetic errors.

(a) Histogram of the generated errors,
JSd = 0.377

(b) Histogram of the first difference,
JSd = 0.508

Figure 4.8: The histograms depict real (blue) and synthetic (sandy-brown) data. The histogram to the
left shows the distribution of the errors from all time series in the validation set. The right histogram
shows the distribution of the first difference of the same data.

4.5 Comparison of GAN extensions

The networks here have a different setup of hyperparameters than the best performing
network whose results were presented in 4.2 above. The number of neurons in each layer is
lower in order to reduce the time spent on training when evaluating the different
training objectives. Between all networks presented in this section all hyperparameters
and input features have been kept the same for a fair comparison between the trained
networks. Table 4.2 below summarizes the JSd for the different objective functions
that were tested in this thesis. As previously mentioned, we cannot rely solely on the JSd
to judge the trained model as it does not cover the temporal aspect, so the 100 time series
plots are also given under each respective section.

Table 4.2: The JSd metrics of the RCGAN when trained with different loss objective functions.

Loss function                JSd                1st diff JSd
Original loss                0.0853 ± 0.001     0.114 ± 0.0008
WGAN-GP                      0.115 ± 0.0006     0.123 ± 0.0006
f-GAN (Total-variation)      0.132 ± 0.001      0.0916 ± 0.0007
f-GAN (Pearson χ2)           0.110 ± 0.001      0.0765 ± 0.0008
f-GAN (Neyman χ2)            0.0982 ± 0.0009    0.0769 ± 0.0008
f-GAN (Squared Hellinger)    0.180 ± 0.0009     0.194 ± 0.0006

4.5.1 Original loss function

The original loss function (2.6) remains one of the best performing training schemes. It
allows the generator to learn the overall distribution well together with a realistic
amount of noise, see figure 4.9. Furthermore, it is able to learn some finer details in the
data but not as many as can be found in the real data. Examining the histograms in
figure 4.10 gives a similar picture, where the overall error distribution agrees well between
the synthetic and real data but the lack of finer details, e.g. spikes, results
in a somewhat worse (higher) first difference JSd.

Figure 4.9: The figure depicts 100 time series of the longitudinal error generated by the RCGAN trained
using the original loss function. The features corresponding to the 100 time series plotted in figure 4.1
have been given to the generator. All errors have been rescaled due to confidentiality reasons.

(a) Histogram of the generated errors,
JSd = 0.0845

(b) Histogram of the first difference,
JSd = 0.116

Figure 4.10: The histograms depict real (blue) and synthetic (sandy-brown) data. The histogram to the
left shows the distribution of the errors from all time series in the validation set. The right histogram
shows the distribution of the first difference of the same data.

4.5.2 f-GAN

Training the RCGAN using the variational f-divergence objective functions, see table 2.1,
gives the model different characteristics depending on the selected function. In figure
4.11 we see the 100 time series samples from each of the four f-divergence optimization
objectives that were examined in this thesis. All four models share the same initial
transient characteristic, although it lasts for a different number of time frames for each
model. We see for both Total variation and Pearson χ2, figure 4.11a and 4.11d respectively, that
many time series end with a steep ascent. Lastly, the noise levels seem to agree well
with the true data for each of the training objectives, although Squared Hellinger is
slightly smoother and lacks unique noise features, e.g. spikes, that the others have. These
observations are reflected in the histograms in figure 4.12 and 4.13, where the histograms
agree fairly well for all except Squared Hellinger. The fast ascents at the end of time
series trained with the Pearson χ2 objective are reflected in the first difference histogram,
Figure 4.13d, where the tails are much wider for the synthetic distribution. These ascents
can be seen in Figure 4.11d where there are spikes approaching 0.4 between about 50 and
100 time frames.

(a) Total variation (b) Neyman χ2

(c) Squared Hellinger (d) Pearson χ2

Figure 4.11: The four figures depict 100 time series of the longitudinal error generated by the RCGAN
trained using the variational f-divergence minimization objectives. The features corresponding to the
100 time series plotted in figure 4.1 have been fed to the generator. All errors have been rescaled due to
confidentiality reasons.

(a) Total variation, JSd = 0.132 (b) Neyman χ2, JSd = 0.105

(c) Squared Hellinger, JSd = 0.179 (d) Pearson χ2, JSd = 0.109

Figure 4.12: Histograms of the validation set (blue) and the synthetic errors (red) from models with
different objective functions in the f -GAN setting.

(a) Total variation, JSd = 0.091 (b) Neyman χ2, JSd = 0.075

(c) Squared Hellinger, JSd = 0.194 (d) Pearson χ2, JSd = 0.078

Figure 4.13: Histograms of the first difference for the validation set (blue) and the synthetic errors (red)
from models with different objective functions in the f -GAN setting.

4.5.3 WGAN-GP

As with the previously presented models we have an initial transient that is very similar
for all time series, seen in figure 4.14. Furthermore, the values from the generator seem
to lie within a tighter spread compared with the previously presented model, with next to
no outlier values, unlike the real data. Lastly, the histograms in figure 4.15
show that the trained model is not quite able to generate errors in the upper range of the
positive valued errors. In addition, the first difference histogram further reflects the tight
spread of values seen in figure 4.14, where the tails of the synthetic distribution are not as
wide in comparison with the real distribution.

Figure 4.14: The figure depicts 100 time series of the longitudinal error generated by the RCGAN trained
using the WGAN with gradient penalty loss function. The features corresponding to the 100 time series
plotted in figure 4.1 have been given to the generator. All errors have been rescaled due to confidentiality
reasons.

(a) Histogram of the generated errors,
JSd = 0.116

(b) Histogram of the first difference,
JSd = 0.124

Figure 4.15: The histograms depict real (blue) and synthetic (sandy-brown) data. The histogram to the
left shows the distribution of the errors from all time series in the validation set. The right histogram
shows the distribution of the first difference of the same data.


5
Discussion

The resulting model from this thesis has proven successful at generating synthetic
time series. The network structure in figures 2.2 and 2.3 has shown great performance
when it comes to learning the general trends of the data as well as outputting a suitable
noise level on the signal. In this chapter we discuss the strengths and drawbacks of the
different models and training objectives of the recurrent conditional generative adversarial
network that were examined in this thesis.

5.1 Comparison of models

The recurrent conditional GAN is the best performing model on this data set. In the
data set there are sequences that follow two common trends: starting out at higher values
and going towards lower, or the other way around. The trained model is able to capture
both of these trends, see Figure 4.2 and 4.4. The real data has a highly auto-regressive
nature, i.e. the error is highly dependent on the error in previous time steps. In other
words, if the error has been large for a few time steps it is likely to be large in the next
time steps as well. We see that the RCGAN is able to capture this auto-regressive nature
of the real data with realistic looking noise. So it seems that the generator has learned
how much of the input noise z should be let through on top of the signal.

For all GANs in this thesis we have observed a substantial initial transient in all synthetic
time series, more pronounced for some of the trained models than for others. This
transient makes the RCGAN give poor predictions in the beginning of the synthetic
sequences compared with the real data. Furthermore, the transient is similar for many
different time series, which is an undesirable trait that does not reflect the real data well.
This issue comes mainly from the LSTM nodes in the network, whose internal states have
to be built up by feeding consecutive inputs. When enough time frames have been given
to the networks they start to grasp the context of the sequence and the output starts to
home in on the desired trajectory. Attempts to solve the issue with the initial transient
have been made by trying different initialization schemes, for example initializing the
LSTMs with noise from either uniform or normal distributions. This does indeed cause
the transient to behave differently, for better and for worse, but the main issue with
this is that the learning of the model is slowed down greatly and becomes even more
unstable.

The recurrent mixture density network is an interesting approach to the problem: instead
of injecting noise into the network, the randomness is inherent in the model since one
samples from the output distributions. The nature of the data set in this thesis is heavily
auto-regressive, which is completely missed when sampling independently from the output
distributions at each time frame. This is clearly seen in the 100 time series in Figure
4.5. One advantage of this network compared with the RCGAN is that the transient at
the beginning of the sequences only lasts a couple of time frames and seems to be more
diverse. This model could probably be improved significantly by, for example, coupling
it to an auto-regressive (AR) model where the parameters have to be optimized after the
R-MDN network has finished training.
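
The effect of independent sampling can be illustrated with a toy comparison, which is not the
R-MDN itself. In the sketch below, one sequence is drawn independently at every time step and
another follows an AR(1) process with the same marginal distribution. The coefficient phi, the
sequence length and the use of a unit Gaussian are assumptions made for the example.

```python
import random


def lag1_autocorrelation(x):
    """Sample autocorrelation of a sequence at lag 1."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t + 1] - mean) for t in range(n - 1))
    return cov / var


rng = random.Random(0)
n_steps = 2000
phi = 0.95  # hypothetical AR(1) coefficient

# Independent draws at each time step, as when sampling each output distribution separately.
independent = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]

# AR(1): x_t = phi * x_{t-1} + sqrt(1 - phi^2) * eps_t keeps a unit-variance marginal.
scale = (1.0 - phi ** 2) ** 0.5
auto_regressive = [rng.gauss(0.0, 1.0)]
for _ in range(n_steps - 1):
    auto_regressive.append(phi * auto_regressive[-1] + scale * rng.gauss(0.0, 1.0))

rho_independent = lag1_autocorrelation(independent)
rho_ar = lag1_autocorrelation(auto_regressive)
```

Both sequences have nearly the same histogram of values, but only the AR(1) sequence has a
lag-1 autocorrelation close to phi, while the independent draws have one close to zero.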

The network trained by minimizing the distance between synthetic and
real samples performs much worse than the one trained in the GAN framework. When training
this regular LSTM network the noise levels decay quickly, as the network learns not to
trust the one-layered noise network, which in a sense only increases the error of the
output. This makes the output nearly deterministic, with the result that the noise of the
real data is not captured in this approach. This method of training the generator was
only implemented in order to compare with and justify the GAN framework in this application.
With more time to spend on it, it would probably be possible to construct a more
meaningful loss function than mean absolute error or mean squared error.
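
Why a distance-based loss suppresses the noise can be seen in a simplified setting. Assume,
for illustration only, that the generator already predicts the noise-free part of the signal
perfectly and adds its own input noise z scaled by a weight w. Since z is independent of the
noise in the real data, any w different from zero increases the expected MAE. All names and
values in the sketch are hypothetical.

```python
import random

rng = random.Random(0)
n = 20000

target_noise = [rng.gauss(0.0, 1.0) for _ in range(n)]  # noise present in the real data
input_noise = [rng.gauss(0.0, 1.0) for _ in range(n)]   # independent generator noise z


def mae_for_noise_weight(w):
    """MAE between the real residual and the generator noise scaled by w."""
    return sum(abs(e - w * z) for e, z in zip(target_noise, input_noise)) / n


weights = [0.0, 0.25, 0.5, 1.0]
losses = {w: mae_for_noise_weight(w) for w in weights}
best_weight = min(losses, key=losses.get)
```

The smallest loss is obtained for w = 0, i.e. when the noise is ignored, which is consistent
with the nearly deterministic output observed for the MAE-trained network.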

5.2 Comparison of training objective

When comparing the different training objectives exclusively on how well the overall
distribution matches the real data, most of them perform similarly, the exception
being the Squared Hellinger divergence in the f-GAN setting. Of all the
different loss functions that were tested, the original loss function was able to learn the
error distribution best. When inspecting the synthetic sequences we see that it is able to
capture unique features of some sequences, e.g. the yellow peak at t ≈ 120, see Figure
4.9.

As previously mentioned, most work with GANs has been done on image generation and
manipulation, so the extensions do not necessarily work well with time
series. From the results section we see that the original loss objective proved to be the
best of the different loss functions that were tested. The interpretation of the original
loss, where the output is the probability that the sample is real for every time
step, intuitively feels like a meaningful loss metric, unlike the others whose interpretation
remains open for discussion.

Each of the models trained with different f-GAN objectives seems to produce different
characteristics in the synthetic time series. Both Total variation and Pearson χ2 have
problems learning the last parts of the sequences, where many “explode” towards the end.
Meanwhile, Squared Hellinger seems to regularize away the noise, producing smoother
time series than the three other objectives, and it also has a wider spread of the errors in
the initial parts of the sequences. The best performing optimization objective of the four
is the Neyman χ2 f-divergence, which had the best JSd scores of the f-GAN objectives. In
addition, the trained model was also able to produce the most visually pleasing sequences
of the four. Lastly, only a marginal improvement to the stability of the learning was
observed for Squared Hellinger and Neyman χ2, whereas Total variation and Pearson χ2 were
more unstable during training than the original loss function.

The WGAN-GP extension was not the top performer, but it still performed fairly well.
This is surprising as the loss metric was highly unstable and could be many orders of
magnitude larger for some time series during training. Despite this, we saw a more stable
output distribution and learning during the training of the model compared with the
original loss function.

The choice of the WGAN-GP extension may not be ideal, as the extension was developed to
improve the most common problem that GANs are used for, namely image processing, where
it performs well [24]. However, when it comes to the RCGAN implementation
in this thesis, with input time series of different lengths, problems with the gradient
penalty term arise. Firstly, taking the L2-norm of the gradients of time series with different
lengths skews the penalty term heavily for long time series compared with shorter ones.
Secondly, further investigation has to be made on how to deal with the conditional input
when calculating the gradient penalty term. Calculating the gradients with respect to
both x̂ in (2.3) and the conditional vector c was tested but no improvement was seen.
Despite this, we saw a more stable learning compared with the original GAN loss function
and the model learned the overall trend of the data fairly well. As with all models where
the LSTM states were initialized with zeroes, the sinusoidal transient in the beginning is
present.
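
The length dependence of the penalty term can be made concrete with a small calculation. The
sketch assumes the common form of the gradient penalty from [24], in which the L2-norm of the
gradient is pushed towards 1, and a made-up gradient whose magnitude is the same in every time
step. It is an illustration of the scaling only, not the loss code used in the thesis.

```python
import math


def l2_norm(grad):
    """L2-norm of a gradient taken over all time steps of one sequence."""
    return math.sqrt(sum(g * g for g in grad))


def gradient_penalty(grad, target=1.0):
    """Squared deviation of the gradient norm from the target norm."""
    return (l2_norm(grad) - target) ** 2


per_step = 0.1  # hypothetical gradient magnitude in each time step
for length in (50, 200, 800):
    grad = [per_step] * length
    # The norm grows as per_step * sqrt(length), so the penalty depends on the length.
    print(length, round(l2_norm(grad), 3), round(gradient_penalty(grad), 3))
```

With identical per-step gradients, the longest sequence receives by far the largest penalty,
purely because of its length.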

5.3 Training

Many articles [22, 23, 24, 25] have pointed out that the original implementation of GAN
is unstable to train. According to the literature, GANs often experience what is known as
“mode-collapse”, where the generator produces the same output no matter which input is
given to the network. If this happens the training procedure has to be reset. During the
work in this thesis mode-collapse has been experienced and has mainly been a result of
poor weight initialization or a too large learning rate. It was much less of a problem than
anticipated.

The main issue when training the models has been the instability with regard to convergence
to a stable minimum, which is the main reason for examining different training objectives,
e.g. f-GANs. The synthetic data distribution usually drifts back and forth during training.
Sometimes training for another 200 iterations will give the model a higher JSd than before
and sometimes not. There are a few reasons for this. Firstly, there is an inherent instability
in the training of GANs, which has been mentioned previously. Secondly, there are issues
with the data that is used for training, where some time series contain irregularities.
These may cause the loss to behave oddly for those series, which draws the generator away from
a well matching output distribution. This is further emphasized by the fact that no batch
training is used, as the sequences in the training data are of different lengths. The reason
for this is that there is no support in Tensorflow 1.5 for feeding the network sequences
of different lengths. Attempts to address this have been made by filtering out time series
with anomalies from the data set, such as when the error jumps a significant amount
between two consecutive time frames. The back and forth drift can also be combated to
an extent by using a relatively low learning rate, which seems to dampen the drift.
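
The filtering of anomalous time series mentioned above could look like the following sketch.
The threshold value and the function names are hypothetical; the thesis does not state which
jump size was considered significant.

```python
def max_jump(series):
    """Largest absolute change between two consecutive time frames."""
    return max((abs(b - a) for a, b in zip(series, series[1:])), default=0.0)


def filter_jumpy_series(series_list, threshold):
    """Keep only the series whose largest consecutive jump is within the threshold."""
    return [s for s in series_list if max_jump(s) <= threshold]


# Toy data: the second series contains a sudden jump of about 0.9.
data = [
    [0.10, 0.12, 0.11, 0.13],
    [0.10, 0.12, 1.02, 1.00],
    [0.30, 0.28, 0.27],
]
clean = filter_jumpy_series(data, threshold=0.5)  # the threshold is an assumed value
```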

Finally, it is possible to obtain a well-performing RCGAN model quite quickly in terms of
wall-clock time. Training the model on an Intel Core I7-7700HQ processor for less than
30 minutes can result in a model with a JSd of 0.15 and visually acceptable sequences.
Training the RCGAN in order to obtain a top performing model will take just a few hours
which is great for trying out different data sets and parameter settings.

5.4 Model evaluation

In the results section we see examples of why a good JSd value does not necessarily imply a
well trained model. A good example of this is the Pearson χ2 objective in f-GANs,
where both JSd values, for the synthetic data and for the first difference, are low, meaning
a good agreement between the distributions. Meanwhile, when the time series are inspected
visually in Figure 4.11d we see that some time series “explode” towards the end, which is a
behaviour that is not found in the real data. These quick ascents inflate the number of
values that fall in the tails of the histograms, which exploits the fact that the metric is
invariant under permutation of the time steps. Until better metrics for time series evaluation
are found, visual inspection of the generated sequences remains important.
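
The permutation invariance can be demonstrated with a toy sequence. In the sketch below a
smooth sequence and a randomly shuffled copy of it have exactly the same histogram of values,
so any histogram-based distance between them is zero, although the two look nothing alike as
time series. The sine sequence, the bin settings and the helper names are assumptions made for
the example.

```python
import math
import random


def histogram(values, lo, hi, n_bins):
    """Normalized histogram on [lo, hi]; out-of-range values go to the edge bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), n_bins - 1)] += 1
    return [c / len(values) for c in counts]


def mean_abs_step(series):
    """Average absolute first difference, a crude measure of roughness."""
    return sum(abs(b - a) for a, b in zip(series, series[1:])) / (len(series) - 1)


smooth = [math.sin(0.05 * t) for t in range(200)]
shuffled = smooth[:]
random.Random(0).shuffle(shuffled)  # same values, different order in time

same_value_histogram = histogram(smooth, -1.0, 1.0, 20) == histogram(shuffled, -1.0, 1.0, 20)
roughness_smooth = mean_abs_step(smooth)
roughness_shuffled = mean_abs_step(shuffled)
```

Here the value histograms are identical while the roughness differs by more than an order of
magnitude, which is why the first difference and visual inspection are used as complements.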


6
Conclusion

In this thesis a recurrent conditional generative adversarial network framework has been
developed. The framework fulfills the desired characteristics: the model can handle
time series of arbitrary length and is able to tune the noise levels to the specific
time series distribution that is to be learned. As it is possible to run the network
on arbitrarily long sequences, it is also possible to train the framework on data sets where
the sequences have different lengths. Furthermore, the network's output is able to follow
the trend of the real data, with a believable amount of noise applied to the signal.

The recurrent mixture density networks also perform quite well and are able to capture the
trend of the time series. However, since the samples are drawn independently from the
output GMMs in each time frame we miss out on the auto-regressive nature of the real
data.

Having the generator trained in a supervised fashion by removing the discriminator and
replacing it by a loss function whose purpose is to minimize the distance between synthetic
and real samples proved not to be a suitable approach. The unique time series criterion is
not fulfilled and the noise in the output signal is not satisfactory. The network structure
in the generator has proven to work well and yield good results when trained in the
RCGAN framework, so the network has the capability of learning the distribution, but this
capability is not realized in the supervised setting. This further promotes the use of the
RCGAN framework for training the generative model.

6.1 Future work

The RCGAN architecture presented in this thesis will continue to be developed as it has
shown great results for generating realistic time series. Firstly, a new initialization method
or modification to the internal architecture has to be developed that allows the model to
produce a more realistic initial transient. Secondly, further work has to be done in order
to make it possible to utilize mini-batch training for sequences of different lengths, which
is a change that may help stabilize the learning. Thirdly, the model will be implemented
in the simulation environments at Zenuity in order to further evaluate and improve the
model. It would also be interesting to find or develop other suitable scoring metrics in
order to improve the evaluation of the trained models.

Finally, it would be interesting to implement other state-of-the-art generative models in
order to compare them against the RCGAN architecture. One approach which has shown
impressive results in generating synthetic waveforms in speech-synthesis tasks is WaveNet
[5], which should be able to perform well in the error sequence task as well.


Bibliography

[1] A. L. et al., “Deep learning in the automotive industry: Applications and tools,”
IEEE International Conference on Big Data, pp. 3759–3768, 2016.

[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org.

[3] N. Kalra and S. M. Paddock, “Driving to Safety: How Many Miles of Driving
Would It Take to Demonstrate Autonomous Vehicle Reliability?” Santa Monica,
CA: RAND Corporation, 2016. https://www.rand.org/pubs/research_reports/RR1478.html.

[4] A. Graves, “Generating sequences with recurrent neural networks,” CoRR,
vol. abs/1308.0850, 2013.

[5] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves,
N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for
raw audio,” CoRR, vol. abs/1609.03499, 2016.

[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio, “Generative Adversarial Networks,” ArXiv e-prints,
June 2014.

[7] E. L. Zec, N. Mohammadiha, A. Schliep, “Modelling Autonomous Driving Sensors
Using Hidden Markov Models”, under review, 2018.

[8] E. Karlsson, N. Mohammadiha, “A Statistical GPS Error Model for Autonomous
Driving”, in Proc. IEEE Intelligent Vehicles (IV), June 2018.

[9] P. Isola, J. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with
conditional adversarial networks,” CoRR, vol. abs/1611.07004, 2016.

[10] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation
using cycle-consistent adversarial networks,” CoRR, vol. abs/1703.10593, 2017.

[11] Y. Taigman, A. Polyak, and L. Wolf, “Unsupervised cross-domain image generation,”
CoRR, vol. abs/1611.02200, 2016.

[12] M. Liu and O. Tuzel, “Coupled generative adversarial networks,” CoRR,
vol. abs/1606.07536, 2016.

[13] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang,
and W. Shi, “Photo-realistic single image super-resolution using a generative
adversarial network,” CoRR, vol. abs/1609.04802, 2016.

[14] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár, “Amortised MAP
inference for image super-resolution,” CoRR, vol. abs/1610.04490, 2016.

[15] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, “Learning what
and where to draw,” CoRR, vol. abs/1610.02454, 2016.

[16] S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative
adversarial text to image synthesis,” CoRR, vol. abs/1605.05396, 2016.

[17] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. N. Metaxas, “StackGAN:
Text to photo-realistic image synthesis with stacked generative adversarial networks,”
CoRR, vol. abs/1612.03242, 2016.

[18] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence generative adversarial
nets with policy gradient,” CoRR, vol. abs/1609.05473, 2016.

[19] O. Mogren, “C-RNN-GAN: continuous recurrent neural networks with adversarial
training,” CoRR, vol. abs/1611.09904, 2016.

[20] C. Esteban, S. L. Hyland, and G. Rätsch, “Real-valued (medical) time series
generation with recurrent conditional GANs,” CoRR, vol. abs/1706.02633, 2017.

[21] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput.,
vol. 9, pp. 1735–1780, Nov. 1997.

[22] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen,
“Improved techniques for training GANs,” CoRR, vol. abs/1606.03498, 2016.

[23] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” ArXiv e-prints, Jan.
2017.

[24] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved
training of Wasserstein GANs,” CoRR, vol. abs/1704.00028, 2017.

[25] R. Nock, Z. Cranko, A. K. Menon, L. Qu, and R. C. Williamson, “f-GANs in an
information geometric nutshell,” CoRR, vol. abs/1707.04385, 2017.

[26] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, and Z. Wang, “Multi-class generative adversarial
networks with the L2 loss function,” CoRR, vol. abs/1611.04076, 2016.

[27] C. M. Bishop, “Mixture density networks,” 1994.

[28] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward
neural networks,” in Proceedings of the thirteenth international conference on artifi-
cial intelligence and statistics, pp. 249–256, 2010.

[29] L. Theis, A. van den Oord, and M. Bethge, “A note on the evaluation of generative
models,” arXiv preprint arXiv:1511.01844, 2015.

[30] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR,
vol. abs/1412.6980, 2014.


A Mixture Density Networks

Mixture density networks (MDNs) are a type of neural network whose output parametrizes
a mixture model [4, 27]. Subsets of the outputs define different parameters of the mixture
model: one subset defines the mixture weights (π) and the rest define the individual
components. For example, if the output parametrizes a Gaussian mixture model (GMM), three
outputs are needed per mixture component: one for the mixture weight (π), one for the
location (µ) and one for the scale (σ). Figure A.1 shows a schematic view of an MDN whose
output is a GMM with two mixture components.

Figure A.1: A schematic view of an MDN where the output mixture is a GMM. For each input x
a new set of parameters {π_j, σ_j, µ_j}, j = 1, 2, is obtained. Here π denotes the mixture
weight, σ the scale parameter and µ the location parameter. Samples are drawn from this
output GMM.

For each parameter a suitable activation function is applied to bring its value into a
meaningful range, see (A.1) below. To obtain a valid discrete distribution, the mixture
weights are normalized with the softmax function, which makes them sum to unity. No
activation function (i.e. a linear activation) is applied to µ, so that the locations of
the mixture components are not restricted. Finally, the scale parameters are exponentiated
to restrict them to positive values.

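\begin{equation}
\begin{aligned}
\pi_j &= \frac{\exp(\hat{\pi}_j)}{\sum_{j'=1}^{M} \exp(\hat{\pi}_{j'})}
  &&\implies \pi_j \in (0, 1), \quad \sum_{j} \pi_j = 1 \\
\mu_j &= \hat{\mu}_j
  &&\implies \mu_j \in \mathbb{R} \\
\sigma_j &= \exp(\hat{\sigma}_j)
  &&\implies \sigma_j > 0,
\end{aligned}
\tag{A.1}
\end{equation}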


where j = 1, …, M indexes the mixture components, M is the number of mixture components
and hatted symbols denote the raw network outputs. The MDN is trained by minimizing the
negative log-likelihood, according to (A.2).

\begin{equation}
L(x) = -\log\left( \sum_{j} \pi_j \, \mathcal{N}(x \mid \mu_j, \sigma_j) \right),
\tag{A.2}
\end{equation}

where x is a real sample.
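
As an illustration of (A.1) and (A.2), the following Python sketch maps raw network
outputs to GMM parameters and evaluates the loss for a single sample. It is a minimal
sketch using only the standard library, not the implementation used in this thesis. The
function names (mdn_parameters, normal_pdf, mdn_nll) and the example values are
assumptions chosen for illustration, and subtracting the maximum inside the softmax is a
common numerical-stability choice rather than part of (A.1).

```python
import math


def mdn_parameters(pi_hat, mu_hat, sigma_hat):
    """Map raw network outputs to GMM parameters, following (A.1)."""
    # Softmax over the raw weights. Subtracting the maximum avoids overflow
    # and leaves the result unchanged (assumed implementation detail).
    shift = max(pi_hat)
    exps = [math.exp(p - shift) for p in pi_hat]
    total = sum(exps)
    pi = [e / total for e in exps]
    # Linear activation: the locations are used as they are.
    mu = list(mu_hat)
    # Exponentiation keeps the scales strictly positive.
    sigma = [math.exp(s) for s in sigma_hat]
    return pi, mu, sigma


def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with location mu and scale sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mdn_nll(x, pi, mu, sigma):
    """Negative log-likelihood of one real sample x, following (A.2)."""
    likelihood = sum(p * normal_pdf(x, m, s) for p, m, s in zip(pi, mu, sigma))
    return -math.log(likelihood)


# Hypothetical raw outputs for M = 2 mixture components.
pi, mu, sigma = mdn_parameters([0.0, 0.0], [-1.0, 1.0], [0.0, 0.0])
loss = mdn_nll(0.0, pi, mu, sigma)
```

With equal raw weights the softmax yields π = (0.5, 0.5), and raw scale outputs of zero
yield σ = 1 for both components. For samples far from every component the direct sum can
underflow, in which case evaluating the loss in log space is a possible alternative.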

A.1 Recurrent Mixture Density Networks

The mixture density network can be combined with recurrent neural networks into a
Recurrent Mixture Density Network (R-MDN), which allows the output mixture components to
be conditioned not only on the current input but also on previous inputs. Figure A.2
shows a schematic view of an R-MDN, where at each time step a sample S is drawn from the
output distribution.

Figure A.2: The structure of the R-MDN. At each time step a set of features is given to the network, which
outputs the parameters of the GMM from which samples S are drawn.
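
Drawing a sample S at each time step can be sketched as follows: a mixture component is
first chosen according to the mixture weights, and a value is then drawn from the
corresponding Gaussian. This is a hypothetical, standard-library-only illustration. The
recurrent network itself is not shown, and the per-step parameters in params are made-up
values standing in for its outputs.

```python
import random


def sample_gmm(pi, mu, sigma, rng):
    """Draw one sample S from a GMM with weights pi, locations mu, scales sigma."""
    # Choose a component index j with probability pi[j].
    j = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    # Draw from the Gaussian of the chosen component.
    return rng.gauss(mu[j], sigma[j])


def sample_sequence(params_per_step, seed=0):
    """Draw one sample per time step from a list of (pi, mu, sigma) tuples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [sample_gmm(pi, mu, sigma, rng) for pi, mu, sigma in params_per_step]


# Hypothetical GMM parameters for T = 3 time steps and M = 2 components.
params = [
    ([0.5, 0.5], [-1.0, 1.0], [0.1, 0.1]),
    ([0.9, 0.1], [0.0, 2.0], [0.2, 0.2]),
    ([1.0, 0.0], [3.0, -3.0], [0.05, 0.05]),
]
samples = sample_sequence(params)
```

Because the mixture parameters change from one time step to the next, each element of
samples is drawn from a different mixture.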

For the R-MDN a different mixture model is obtained at each successive time step, as
formulated in (A.3) below.

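\begin{equation}
\begin{aligned}
\pi_t^j &= \frac{\exp(\hat{\pi}_t^j)}{\sum_{j'=1}^{M} \exp(\hat{\pi}_t^{j'})}
  &&\implies \pi_t^j \in (0, 1), \quad \sum_{j} \pi_t^j = 1 \\
\mu_t^j &= \hat{\mu}_t^j
  &&\implies \mu_t^j \in \mathbb{R} \\
\sigma_t^j &= \exp(\hat{\sigma}_t^j)
  &&\implies \sigma_t^j > 0,
\end{aligned}
\tag{A.3}
\end{equation}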

where j indexes the mixture components and t = 1, …, T indexes the time steps of the
sequence. The training objective is also modified according to (A.4) to include the new
temporal aspect of the outputs.

\begin{equation}
L(x) = \sum_{t=1}^{T} -\log\left( \sum_{j} \pi_t^j \, \mathcal{N}(x_t \mid \mu_t^j, \sigma_t^j) \right)
\tag{A.4}
\end{equation}

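
The sequence loss (A.4) can be sketched in Python as follows, starting from the raw
network outputs and applying the activations of (A.3) at every time step. This is an
illustrative sketch under stated assumptions, not the thesis implementation. The names
rmdn_nll and logsumexp and the example values are hypothetical, and evaluating the
mixture in log space is an implementation choice made here for numerical stability.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def rmdn_nll(x, pi_hat, mu_hat, sigma_hat):
    """Sequence loss (A.4). x is indexed [t]; raw outputs are indexed [t][j]."""
    total = 0.0
    for x_t, p_hat, m_hat, s_hat in zip(x, pi_hat, mu_hat, sigma_hat):
        log_norm = logsumexp(p_hat)  # log of the softmax denominator in (A.3)
        terms = []
        for p, m, s in zip(p_hat, m_hat, s_hat):
            log_pi = p - log_norm      # log of the mixture weight at step t
            sigma = math.exp(s)        # scale: exponentiated raw output
            z = (x_t - m) / sigma      # location: raw output used as is
            log_pdf = -0.5 * z * z - s - LOG_SQRT_2PI  # log Gaussian density
            terms.append(log_pi + log_pdf)
        # Negative log of the mixture likelihood at time step t.
        total += -logsumexp(terms)
    return total


# Hypothetical sequence with T = 2 time steps and M = 2 components.
x = [0.0, 1.0]
pi_hat = [[0.0, 0.0], [0.0, 0.0]]
mu_hat = [[0.0, 0.0], [1.0, 1.0]]
sigma_hat = [[0.0, 0.0], [0.0, 0.0]]
loss = rmdn_nll(x, pi_hat, mu_hat, sigma_hat)
```

Since the logarithm of the scale equals the raw output, the log density can use the raw
scale output directly instead of taking the logarithm of an exponential.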