DEPARTMENT OF MECHANICS AND MARITIME SCIENCES
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2020
www.chalmers.se

Learning a Better Attitude
A Recurrent Neural Filter for Orientation Estimation
Master's Thesis in Systems, Control and Mechatronics
AMANDA FRANSSON
OSCAR LUNDSTRÖM

© AMANDA FRANSSON, OSCAR LUNDSTRÖM, 2020.

Supervisor: Niclas Berglind, CPAC Systems AB
Examiner: Peter Forsberg, Department of Mechanics and Maritime Sciences

Master's Thesis 2020:16
Department of Mechanics and Maritime Sciences
Division of Vehicle Engineering and Autonomous Systems
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Schematic illustration of a recurrent neural filter, as explained in Section 2.4.2.

Typeset in LaTeX, template by David Frisk
Gothenburg, Sweden 2020

Abstract

In the current paradigm of sensor fusion, orientation estimation from inertial measurement unit (IMU) sensor data is done using techniques derived with Bayesian statistics. These derivations are based on assumptions about noise distributions and hand-crafted equations describing the relevant system dynamics. Machine learning, and more specifically neural networks, may provide an alternative solution to the problem of orientation estimation in which no such assumptions or hand-crafted relationships are present.
This thesis aims to investigate whether a neural network-based filter can achieve performance comparable to, or exceeding, that of the more conventional extended Kalman filter. Two network architectures based on long short-term memory layers are proposed, trained, evaluated, and compared using data from the Oxford inertial odometry dataset. Of the two suggested model architectures, the so-called recurrent neural filter is found to give the better performance. The recurrent neural filter has a structure inspired by Bayesian filtering, with a prediction and an update step, allowing it to output a prediction in the event of missing data. Further, the evaluated models are trained to estimate orientation as well as a parameterized error covariance matrix. Our results show that the suggested recurrent neural filter outperforms the benchmark filter both in average root mean square error and in execution time. The results also indicate that machine learning-based approaches to sensor fusion problems may be an attractive alternative to hand-crafted filters in the future.

Keywords: sensor fusion, state estimation, absolute orientation estimation, recurrent neural filter, recurrent neural network, RNN, LSTM, IMU, MARG.

Acknowledgements

We would like to thank CPAC Systems for providing resources and giving us the opportunity to conduct our thesis with them. We would also like to thank Niclas Berglind, our supervisor, for supporting us and helping us get started with our work, and Peter Forsberg, our examiner, for the many hours spent giving us feedback and input on our thesis, despite the unexpected circumstances caused by the COVID-19 pandemic.

Amanda Fransson & Oscar Lundström, Gothenburg, May 2020

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Background
    1.1.1 Related Work
  1.2 Problem Statement and Aim
    1.2.1 Research Questions
  1.3 Delimitations
  1.4 Contribution
  1.5 Thesis Outline
2 Theory
  2.1 Representing Orientation
  2.2 The Inertial Measurement Unit
  2.3 Conventional IMU Data Fusion
    2.3.1 The Complementary Filter
    2.3.2 The Extended Kalman Filter
  2.4 Machine Learning and Neural Networks
    2.4.1 Recurrent Neural Networks and Long Short-Term Memory
    2.4.2 The Recurrent Neural Filter
    2.4.3 Estimating Covariance
    2.4.4 Regularization
      2.4.4.1 Drop-out
    2.4.5 Feature Scaling
    2.4.6 Domain Randomization
3 Methods
  3.1 Machine Learning Framework
  3.2 Separation into 2D and 3D Problem
  3.3 Datasets
    3.3.1 Synthetic 2D Data
    3.3.2 3D MARG Dataset
  3.4 Model Architectures
    3.4.1 The Recurrent Neural Network
    3.4.2 The Recurrent Neural Filter
  3.5 Loss Function and Error Metrics
    3.5.1 Loss Function
    3.5.2 Error Metrics
  3.6 Benchmarks
  3.7 Experiments
    3.7.1 The 2D Case Experiments
    3.7.2 The 3D Case Experiments
4 Results
  4.1 Results of the 2D Models
  4.2 Results of the 3D Experiments
    4.2.1 Comparison and Benchmarks
    4.2.2 Ablation Study
    4.2.3 Missing Data
5 Discussion
  5.1 Performance and Uncertainty of the Developed ML Models
    5.1.1 Research Methodology
  5.2 Relation and Contribution to the Field
  5.3 Outlook
6 Conclusion
Bibliography

List of Figures

2.1 Illustration of the axis-angle representation. From [29], CC0 1.0.
2.2 Block diagram of a complementary filter estimating angle and angular velocity.
2.3 Illustration of an RNN unfolded through time.
2.4 An illustration of the LSTM layer. Yellow boxes indicate linear layers with activation functions and pink ellipses indicate element-wise operations.
2.5 The Recurrent Neural Filter as presented in [24].
3.1 A 10 s data sample from the test set (OxIOD), showing readings from the accelerometer, gyroscope, and magnetometer in the sensor's body frame.
3.2 Illustration of the RNN architecture evaluated. The green box represents an LSTM layer and the blue boxes represent fully connected layers.
3.3 Illustration of the RNF architecture evaluated. The green boxes represent LSTM layers and the blue boxes represent fully connected layers.
4.1 Plot of a 2D test sequence with orientation estimates from a trained RNN and the benchmark EKF.
4.2 The estimated sequence of quaternions (left) and the estimation error in Euler angles (right) from the RNF and the EKF in 3D.
4.3 Plot showing an orientation sequence and estimation (left), and the estimation error (right) produced by the trained RNF when input data was missing every other time step.
4.4 Plot showing an orientation sequence and estimation (left), and the estimation error (right) produced by the trained RNF when input data was missing after 400 time steps.
5.1 An illustrative example of two univariate state estimators, with different SNR assumptions, applied to two sequences of different length. Ground truth is represented by the blue line, the estimate by the orange, and the measurements by crosses.
5.2 The dynamics covered in the source domain (OxIOD) might not cover the target domain sufficiently.
List of Tables

3.1 The distributions used for domain randomization of the noise parameters. These distributions are sampled for each generated 2D sequence. 1_2 is a vector of length two containing all ones.
4.1 Recorded test results from the trained 2D RNN model versions.
4.2 Recorded test results from the trained 2D RNF model versions.
4.3 Performance comparison of the benchmark EKF and the best instances of the proposed RNN and RNF for the 2D problem.
4.4 Recorded test results from the trained 3D RNN model versions.
4.5 Recorded test results from the trained 3D RNF model versions.
4.6 Comparison of best results in 3D along with the models' execution time.

1 Introduction

In this chapter, the background of the thesis is presented together with a summary of the current state of the field of machine learning-based orientation estimation. The research questions and the contributions of the thesis are then presented, followed by an overview of the outline of the report.

1.1 Background

Sensor fusion is, simply put, the practice of combining measurements from different sources to infer some property with greater certainty than would have been possible had the measurements been considered separately. Sometimes the property of interest is hidden, or latent, and cannot be estimated unless several different measurements are considered together. This inference of latent states from measurements of related quantities has a significant number of application areas and is essential for a range of technologies such as robotics, image processing, navigation, and situation awareness [1],[2],[3].

A commonly used filter in the field of sensor fusion is the Kalman filter (KF) [2],[4], mainly due to its relative simplicity and the optimality properties it provides.
The KF is a recursive Bayesian estimator that is optimal¹ under the conditions that i) the process and measurement noise is white², ii) the exact covariance of the noise is known, and iii) the model is linear and precisely known³. KFs are often tuned via trial and error until adequate performance is achieved [7], and thus the optimal estimate is not necessarily produced. Furthermore, KFs are generally not sufficient in cases with significantly non-linear processes, for which they tend to diverge. Work has been done on filters adapted to deal with cases where conditions i)-iii) above are not met. The extended Kalman filter (EKF) [8] and the unscented Kalman filter (UKF) are both examples of filters designed to deal with non-linear processes [4]. In applications where there is a plant⁴-model mismatch and/or the noise covariance is unknown, it is possible to make use of an adaptive Kalman filter (AKF) [9].

¹ Optimal in the sense that it minimizes the mean squared error [5].
² White noise here refers to noise that is zero-mean, uncorrelated over time, and of finite variance.
³ The KF is still the optimal linear filter even though better non-linear filters may exist [5]. Also, although linear model abstractions are powerful tools, linear systems are very rare in the real world [6].
⁴ Plant here refers to the model of the dynamics of a system.

Sensor fusion in general, and KFs in particular, are important techniques since they can convert noisy sensor data into useful and reliable information. Despite the proven capabilities of classical probabilistic sensor fusion techniques, such as the KF, machine learning (ML) methods have started to transform the field entirely [10]. The fundamental shift from manually describing the dynamics of the system (from which information is inferred) to learning it from examples opens the door to solving problems previously too complex to model [10].
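The predict/update recursion that underlies the KF and its variants can be illustrated with a minimal scalar example. This is a generic textbook sketch, not code from the thesis; the random-walk model and the noise variances `q` and `r` are illustrative assumptions.

```python
# Minimal sketch of the two-step Kalman filter recursion for a scalar
# random-walk state observed in white noise. All numbers (q, r, the
# measurement sequence) are illustrative, not values from the thesis.

def kf_step(x, p, z, q=0.01, r=0.5):
    """One predict/update cycle of a 1D Kalman filter.

    x, p : prior state estimate and its variance
    z    : new measurement
    q, r : assumed process and measurement noise variances
    """
    # Prediction: random-walk model x_k = x_{k-1} + w, w ~ N(0, q)
    x_pred = x
    p_pred = p + q
    # Update: fold in the measurement z = x + v, v ~ N(0, r)
    k = p_pred / (p_pred + r)        # Kalman gain
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new

# Filtering a short synthetic sequence around a constant true state of 1.0:
# the estimate moves toward the truth and the variance p shrinks.
x, p = 0.0, 1.0
for z in [1.2, 0.9, 1.1, 0.95, 1.05]:
    x, p = kf_step(x, p, z)
```

Note how the quality of the result hinges on `q` and `r` being correct, which is exactly the tuning problem discussed above.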
Looking at the progress made with neural networks (NNs) in recent decades, in tasks such as classification, regression, and sequence-to-sequence modeling between domains, it is not far-fetched to imagine employing NNs in filtering applications. Some researchers have headed in this direction, striving to combine NNs with KFs, or performing filtering with NNs without involving KFs at all [11],[12].

A specific sensor fusion problem where such an attempt can be investigated is that of estimating absolute orientation from an inertial measurement unit (IMU) and a magnetometer. This combination of sensors is sometimes referred to as a Magnetic, Angular Rate, and Gravity (MARG) sensor. An IMU alone does not provide enough information to infer absolute orientation. Instead, it has to be inferred from the measurements of acceleration, angular velocity, and an additional corrector, such as the magnetic field direction. Orientation estimation using IMU/MARG data is often done with EKFs, complementary filters, or more recent methods such as the explicit complementary filter (ECF) and the gradient descent based orientation filter (GDOF) [13],[14],[15]. Often the linear acceleration measured by the accelerometer is assumed to be influenced only by gravity, which is a reasonable assumption under many conditions. However, in the instances where the assumption does not hold, it can lead to significant errors in orientation estimation unless dealt with properly. Designing filters that handle the situation where this assumption does not hold is possible and has been done [16]. These filters often adapt the covariance of the measurement when the acceleration magnitude differs from the gravitational acceleration, or omit some measurements altogether. Common to both methods is that they add both computational and implementational complexity.
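The complementary-filter idea mentioned above (and treated further in Section 2.3.1) can be sketched for a single tilt angle: the gyroscope is integrated for high-frequency accuracy, and the accelerometer's gravity direction is blended in to cancel the drift. The blending factor `alpha`, the time step, and the sensor values below are illustrative assumptions, not values from the thesis.

```python
import math

# Minimal sketch of a complementary filter for one tilt angle. The gyro
# path is accurate at high frequency but drifts; the accelerometer path
# is noisy but drift-free; the filter blends the two.

def complementary_step(theta, gyro_rate, ax, az, dt=0.01, alpha=0.98):
    """theta: previous angle estimate [rad]; gyro_rate: angular rate [rad/s];
    ax, az: specific-force components used to recover tilt from gravity."""
    theta_gyro = theta + gyro_rate * dt    # high-frequency path: integrate gyro
    theta_acc = math.atan2(ax, az)         # low-frequency path: gravity direction
    return alpha * theta_gyro + (1 - alpha) * theta_acc

# A motionless sensor tilted 0.1 rad: even with a wrong initial estimate
# and a small gyro bias, the estimate converges toward the accelerometer
# angle instead of drifting away.
theta = 0.5
for _ in range(500):
    theta = complementary_step(theta, gyro_rate=0.002,  # small gyro bias
                               ax=math.sin(0.1) * 9.81, az=math.cos(0.1) * 9.81)
```

This also shows the weakness discussed above: `theta_acc` is only a valid tilt reference while the gravity-only assumption on the accelerometer holds.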
1.1.1 Related Work

There exists a variety of implementations that use NNs to replace, or work in unison with, the KF and versions thereof. Such implementations vary in architecture, application area, and overall approach. The applications found to be most relevant for this thesis are described below.

Neural Network Moderated Kalman Filter

Reference [12] proposes an NN working in conjunction with a KF. The NN acts as a supervisor to monitor and improve the prediction accuracy of the KF algorithm. In an experiment this setup is used to predict actual temperature from noisy sensor data, and the NN is used to estimate the extent of error in the current readings and to update the process error covariance in the KF. Compared to a stand-alone KF, where the process error covariance is static, the proposed implementation was able to reduce the root mean squared error (RMSE) of the prediction noticeably.

Neural Network as Extended Kalman Filter Substitute

Another example is given in reference [17], where a sensorless approach to estimating the shaft speed of a DC motor for closed-loop control using an NN is presented. The performance of the proposed NN algorithm was compared with that of an EKF and found to be better.

Deep Kalman

In reference [18], a hybrid solution called the Deep Kalman Filter is presented. The key purpose of the filter is to learn the system model, and the application area of the study was improved IMU positioning estimation through IMU error modeling. The IMU error was modeled using long short-term memory (LSTM) layers. This LSTM-based IMU error modeling is added to the filter alongside the standard filtering steps: prediction and update. The study showed promising results compared to conventional filtering with an EKF.

OriNet

OriNet is a recently proposed network based on recurrent neural networks (RNNs) that provides an orientation estimate from a single IMU sensor [19].
The network is trained with data from small flying robots⁵ and is the first network to provide an end-to-end ML solution for orientation estimation. The architecture of OriNet consists of two parts. The first part combines four LSTM channels and fuses the learned feature representations of the channels in a slow fusion gate [19]. The second part consists of another LSTM and fully connected gates that output the difference between the quaternion of the last time step and that of the present time step [19]. Allegedly⁶, OriNet improved orientation estimation accuracy by roughly 80% compared to reference [20] on the same dataset.

Sensor De-noising Network

The authors of reference [22] propose a method of de-noising IMU readings using a convolutional neural network (CNN). The achieved results are of such quality that, using dead reckoning on the data de-noised by their model, they still manage to outperform OriNet on test data from the same dataset.

⁵ The authors of reference [19] note that the data they use makes the assumption of stationary periods non-applicable.
⁶ The performance of OriNet is reported in comparison with the quaternion-based complementary filter presented in [20], alongside the previously mentioned GDOF. However, their presented results for the GDOF are seemingly extremely poor, something that does not coincide with the performance of the GDOF in other literature. Further, the GDOF is not intended to estimate absolute orientation in 3D without access to magnetometer measurements [14],[21], but to the best of our knowledge, this is exactly what was done in the OriNet paper. As a consequence, the 80% improvement is questionable even though the method is still promising.

IONet

Yet another example of an attempt to fuse sensor data with NNs is IONet⁷ [11], which estimates odometry from IMU readings. IONet is based on bi-directional LSTMs and performs windowed smoothing⁸ rather than filtering.
Recurrent Neural Filter

The recurrent neural filter (RNF) presented in [24] is an autoencoder/decoder architecture based on LSTMs. The RNF takes inspiration from conventional recursive Bayesian estimators such as the KF, where the update step and the prediction step are performed separately. The encoder portion of the architecture comprises LSTM layers responsible for the different Bayesian filter steps, while the decoder consists of a feed-forward NN. One of the significant novel contributions of this approach concerns the memory of the LSTM layers: instead of each layer using its own separate hidden state, the hidden and cell states of the LSTMs are shared horizontally between the filter steps. To a large extent this makes the structure similar to the distinct steps of the KF; the hidden state shared by the LSTM layers is, in this analogy, equivalent to the state estimate in the KF. A strong motivation for using the RNF structure is that it enables filtering when the system dynamics are unknown while still handling missing data points.

Field Prospects

To date, and to our knowledge, no studies proposing end-to-end ML solutions for estimation of absolute orientation using IMU/MARG data have been published. The previously mentioned IONet [11], OriNet [19], and the sensor de-noising network [22] all tackle orientation estimation relative to an initial orientation. Reference [10] concludes that the first direction of future studies within the field of data fusion with ML methods should be to explore more application areas. Equally important, also according to [10], artificial NNs are particularly suitable for modeling various non-linear relationships that are difficult to express using hand-crafted models.

1.2 Problem Statement and Aim

Although the KF and its adaptations for nonlinear systems (the EKF and UKF) are sufficient in many applications, they have some drawbacks.
As a first accessible example, the tuning is typically performed by weighting the diagonal elements of the covariance matrices. This tuning is often done arbitrarily until a "good enough" result is achieved [7]. This is, of course, implementation-dependent, and if the performance satisfies all demands the filter will be useful. Nevertheless, this manner of tuning does not ensure that the best possible solution is utilized; finding that solution would in turn be too time-consuming.

⁷ The authors behind IONet also published OxIOD [23], a dataset with IMU readings and ground truth (GT) estimates from a vision system.
⁸ Smoothing is the filtering of data with access to both past and future measurements. Windowed smoothing means smoothing over a fixed-length window, without sharing the hidden state between windows.

More severe is the Markovian assumption⁹ present in Bayesian filters such as the KF [18]. Even though this assumption simplifies the system modeling noticeably and makes the KF efficient, it also makes it insensitive to system behavior with longer correlation times. On the whole, complicated error models with correlation over long periods are neglected in the KF.

Further, a specific dynamic system can often operate in different regions of the dynamic state space, which can be hard to model in one state estimator with classical models. In some applications this is solved by dividing the dynamics into different modes, where each mode approximately describes a different region. A straightforward example of this is the trade-off between modeling a car with a constant velocity¹⁰ or a coordinated turn¹¹ model. Both of these models capture some aspects of the dynamics, but depending on whether the car is turning or traveling approximately straight, their ability to describe the dynamics in the current mode will differ.
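The two motion models in the car example can be sketched as one-step state propagators; which one describes the data better depends on the current driving mode. Both functions below are generic textbook discretizations with illustrative parameter values, not models used in the thesis.

```python
import math

# Sketch of the two motion models for a planar vehicle state
# (x, y, heading, speed). Each propagates the state one time step;
# dt and the state values are illustrative assumptions.

def constant_velocity_step(x, y, heading, v, dt=0.1):
    """Constant velocity: straight-line motion. Any actual turning
    ends up as model error, to be absorbed by the process noise."""
    return (x + v * math.cos(heading) * dt,
            y + v * math.sin(heading) * dt,
            heading, v)

def coordinated_turn_step(x, y, heading, v, omega, dt=0.1):
    """Coordinated turn: constant speed along a circle segment with
    turn rate omega. Changes in radius/speed become model error."""
    return (x + (v / omega) * (math.sin(heading + omega * dt) - math.sin(heading)),
            y + (v / omega) * (math.cos(heading) - math.cos(heading + omega * dt)),
            heading + omega * dt, v)

straight = constant_velocity_step(0.0, 0.0, 0.0, 10.0)           # heads along x
turning = coordinated_turn_step(0.0, 0.0, 0.0, 10.0, omega=0.5)  # curves off x
```

A multiple-model filter would run both propagators in parallel and weight their predictions; a single monolithic model has to cover both regimes at once.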
There exist approaches for systems that exhibit this kind of multifaceted dynamics, which exploit the fact that describing the dynamics of the modes separately requires less effort than modeling every aspect of interest in one monolithic model. Examples of filters that do this are Interacting Multiple Model filters and Switching Kalman Filters [25]. However, the manual labor, the required knowledge about the system dynamics, and the computational burden are somewhat prohibitive when using these filter alternatives.

Finally, when modeling the state error distributions in the KF, EKF, and UKF, one is limited to additive noise with multivariate white Gaussian distributions. The KF is also limited to modeling measurement and process noise as additive, although the EKF and UKF can, with minor modifications, be designed to deal with non-additive noise as well. Filter performance is generally adversely affected when the noise is not distributed in the way modeled, and outlier measurements¹² can have a large impact. An example of a method that explicitly deals with non-Gaussian distributions is the famed particle filter [26], which comes at the cost of computational expense.

This thesis aims to benefit from the applicable traits of ML, particularly NNs, in order to estimate absolute orientation using IMU and MARG sensor data. In theory, an ML approach will be able to capture altering system dynamics and noise distributions, as well as exploit system behavior correlated over time.

⁹ Given information about the most recent state of a system with the Markov property, information about any previous state will be uninformative [6].
¹⁰ A constant velocity model assumes constant velocity. The acceleration's effect on velocity is modeled as noise.
¹¹ A coordinated turn model assumes constant velocity along a circle segment. Changes in radius and velocity are modeled as noise.
¹² Outliers are samples that deviate from the distribution of the rest of the population. They can be due to natural variance in the population, or to some error in the observation.

1.2.1 Research Questions

• Is the application of machine learning approaches for sensor fusion purposes a viable alternative? Can indications of this, from related research, be supported?
• How well does an NN-based filter compare with an EKF in terms of quality (RMSE) and efficiency (computational expense) when estimating orientation from noisy IMU data?
• Can a measure of the certainty of the produced estimates, similar to the covariance of the EKF estimates, be found for an NN-based filter?

1.3 Delimitations

This thesis is limited to the problem of absolute orientation estimation with IMU and MARG sensors. Hence, there will be no attempt to estimate absolute position even if spatial translations are present, and no additional sensor data will be used. Further, the developed algorithms are benchmarked against a traditional EKF, and not against any of the mentioned extensions that address issues present in the EKF.

Furthermore, reference [10] establishes a list of criteria that a data fusion method ideally should fulfill, proposed to make a comprehensible evaluation of different data fusion algorithms more accessible. Based on the entries in that compilation, this thesis is limited to evaluating the developed models in terms of quality, efficiency, and robustness, tested on real-world data. Thus, evaluation of properties related to privacy, extensibility, and stability is outside the scope of this thesis.

1.4 Contribution

The primary contribution of this study is to suggest an ML method that estimates absolute orientation using IMU/MARG data. Secondly, we hope this thesis will contribute to extending the field of sensor fusion with ML and showcase the suitability of utilizing ML for data filtering.
1.5 Thesis Outline

This thesis consists of six chapters. The current chapter, the introduction, is the first of the six. In the second chapter, relevant theory is presented; more specifically, theory on orientation representations, IMUs, and ML is covered. In the third chapter, the methods used in this thesis are motivated and accounted for. In chapter four the results are compiled. In the fifth and sixth chapters, the presented results are first interpreted and discussed, and subsequently conclusions are drawn.

2 Theory

2.1 Representing Orientation

The mathematical description of orientation is based on attaching different frames of reference to the objects of interest. A simple example is two frames of reference: the body frame b and the world frame i. The body frame is fixed to the object whose orientation is to be modeled, and the world frame is stationary. The conversion between the two frames is most easily represented with a rotation matrix R^{ib} relating the vectors v in the two frames,

    v^i = R^{ib} v^b.    (2.1)

In equation (2.1), v^i is a vector in the world frame, and v^b is the same vector represented in the body frame. The rotation matrix R^{ib} applies the rotation in Euclidean space from frame b to frame i. Any rotation can be described using three consecutive elemental rotations. Thus, the rotation matrix R, parametrized as

    R = R_x(ϕ) R_y(ϑ) R_z(ψ),    (2.2)

can also describe any rotation in 3D space. R_x, R_y, and R_z are elemental rotations about the axes x, y, and z with angles ϕ, ϑ, and ψ respectively. This parametrization is referred to as Euler angles or Tait-Bryan angles. In some application fields the angles are called roll, pitch, and yaw (ϕ, ϑ, and ψ). Further, the parametrization of a rotation matrix into single-axis rotations has two different possible interpretations, depending on whether an intrinsic or extrinsic rotation convention is used.
As a result, the rotation matrix R can represent either the extrinsic rotation about the fixed axes z, y, x in that specific order, or the intrinsic rotation about the axes of rotation x′, y′, z′. Furthermore, applying the rotation gives different results in the two cases. From here on, only extrinsic representations will be used.

The use of Euler angles is appealing because they are somewhat visualizable and perhaps even the most intuitive representation for human comprehension. However, Euler angles bring ambiguities in the form of gimbal lock, which occurs when no unique solution can be found for the Euler representation. Formatting a machine learning problem in such a way that inputs map to outputs 1:1 makes modeling the output distribution as Gaussian viable. In problems where one input can map to multiple correct outputs, the usage of sum-of-squared-errors loss functions tends to result in models that predict the mean [27]. This motivates the employment of a quaternion representation, which is free of the gimbal lock problem.

Figure 2.1: Illustration of the axis-angle representation. From [29], CC0 1.0.

A quaternion q with basis {1, i, j, k} is expressed as

    q = a + bi + cj + dk.    (2.3)

The real part of the quaternion, a, is often referred to as the scalar part, and the imaginary part, bi + cj + dk, is called the vector part of the quaternion. Quaternions that are used to represent orientation should be of unit length, i.e. |q| = 1 should be satisfied. A quaternion can be seen as an encoding of an axis-angle representation. The quaternion can then be interpreted as

    q = cos(θ/2) + e_x sin(θ/2) i + e_y sin(θ/2) j + e_z sin(θ/2) k,    (2.4)

where e = [e_x e_y e_z]^T is the unit vector that gives the direction of the axis around which the angle θ is applied.
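Equation (2.4), and the canonization discussed below, can be sketched directly; the axis and angle chosen here are illustrative.

```python
import math

# Sketch of equation (2.4): building a unit quaternion (a, b, c, d) from
# an axis-angle pair, plus canonization (flipping the sign so the scalar
# part is non-negative, since q and -q encode the same rotation).

def quat_from_axis_angle(e, theta):
    """e: unit rotation axis (ex, ey, ez); theta: rotation angle [rad]."""
    s = math.sin(theta / 2.0)
    return (math.cos(theta / 2.0), e[0] * s, e[1] * s, e[2] * s)

def canonize(q):
    """Resolve the q / -q ambiguity by keeping the scalar part >= 0."""
    return q if q[0] >= 0 else tuple(-x for x in q)

q = quat_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2)   # 90 deg about z
norm = math.sqrt(sum(x * x for x in q))                  # unit by construction
```

Because the axis is unit length and sin²+cos²=1, the constructed quaternion satisfies the |q| = 1 constraint without explicit re-normalization.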
An illustration of the axis-angle rotation can be viewed in Figure 2.1. Note that keeping the unit constraint on the quaternion also ensures that the axis e remains of unit length, which in turn keeps the represented rotation interpretable [28]. The unit constraint on orientation quaternions is one of the few drawbacks of this representation. The constraint demands re-normalization after any operation that is not guaranteed to preserve it, and it can also bring difficulties in optimization algorithms since the unit-norm constraint is quadratic [28]. Furthermore, the quaternion representation is redundant in that q and −q represent the same rotation. To explain, a rotation of magnitude θ around an axis a is equal to a rotation of −θ around the axis −a pointing in the opposite direction. To remove this redundancy it is possible to use canonized quaternions, in which the scalar part of the quaternion is always non-negative.

There are a few ways in which a metric of the error between two quaternions can be expressed. In this thesis, we will have an estimate q̂ of the ground truth orientation quaternion q. A metric of how close the estimate is to the ground truth is necessary for evaluating different models' performance. As a first alternative, the Euclidean distance between two 3D orientations, δq, can be used [30],

    δq = min(‖q − q̂‖, ‖q + q̂‖),    (2.5)

where ‖·‖ is the 2-norm and δq ∈ [0, √2]. A second alternative is the quaternion orientation error, or difference rotation quaternion [31],

    q_e = q ⊕ q̂⁻¹.    (2.6)

Here, ⊕ denotes the Hamilton product [32] and q̂⁻¹ is the quaternion inverse, which for unit quaternions equals the conjugate. Note that the difference rotation quaternion is still a unit quaternion and, as the name indicates, is interpreted as the rotation difference between q and q̂.
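The two metrics can be sketched as follows, assuming quaternions stored as `(a, b, c, d)` tuples; the Hamilton product and conjugate formulas are standard, and the test orientation is illustrative. Note in particular how q and −q correctly yield zero distance.

```python
import math

# Sketch of equations (2.5) and (2.6): the Euclidean orientation distance
# and the difference rotation quaternion, for unit quaternions (a, b, c, d).

def hamilton(p, q):
    """Hamilton product p ⊕ q of two quaternions."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

def conj(q):
    """Conjugate; equals the inverse for unit quaternions."""
    return (q[0], -q[1], -q[2], -q[3])

def euclidean_dist(q, q_hat):
    """Equation (2.5): min over the q / -q sign ambiguity."""
    d_minus = math.sqrt(sum((x - y) ** 2 for x, y in zip(q, q_hat)))
    d_plus = math.sqrt(sum((x + y) ** 2 for x, y in zip(q, q_hat)))
    return min(d_minus, d_plus)

q_true = (math.sqrt(0.5), 0.0, 0.0, math.sqrt(0.5))
q_est = tuple(-x for x in q_true)        # -q represents the same rotation
d = euclidean_dist(q_true, q_est)        # 0: the orientations are identical
q_e = hamilton(q_true, conj(q_est))      # equation (2.6); identity up to sign
```

For identical orientations the difference quaternion is the identity rotation (up to the q/−q sign), i.e. its scalar part has magnitude 1 and its vector part vanishes.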
As an extension of the difference rotation quaternion, the angle of the difference rotation θ_qe can also be calculated. According to equation (2.4) the scalar part of q_e will be equal to cos(θ_qe/2) and further, by expansion of equation (2.6), the scalar part of q_e will also be equal to the dot product of q and q̂, which gives [30]

θ_qe = 2 arccos(|q · q̂|), (2.7)

where θ_qe ∈ [0, π]. In conclusion, we have the Euclidean distance δq, the difference rotation quaternion q_e, and finally, the angle of the difference quaternion θ_qe as possible error metrics. None of these alternatives is directly comprehensible, and as a complement, one can also calculate the error in Euler angles from the difference rotation quaternion. The error in Euler angles is primarily meant to act as a visualization tool, and is most easily retrieved by simply transforming q_e to Euler angles. The transformation from quaternions to Euler angles is, for example, derived in [28].

2.2 The Inertial Measurement Unit

A device containing a three-axis gyroscope and a three-axis accelerometer is commonly referred to as an IMU. There is a wide range of IMU application areas, such as robotics, navigation systems and virtual reality [33]. Much of the widespread use of IMUs is a result of the microelectromechanical system (MEMS) technology with which IMUs have become both space efficient and cheap [9]. The gyroscope measures angular velocity while the accelerometer measures specific force, i.e. acceleration relative to free-fall. Specific force is different from the more conventional linear acceleration: the specific force of a stationary object on the surface of the earth will be a constant 1 g upwards, while the linear acceleration in the same case will be zero.

When using IMUs in applications to estimate position and orientation, one major concern is the presence of integration drift [14].
Since the measurements are derivatives of the desired observation, the usage of dead reckoning, where the state is tracked by integrating the derivatives, is appealing. However, measurements are imperfect and the raw sensor readings will include both noise and biases. Integrating these measurements will lead to accumulation of errors over time. A signal with a constant error, integrated once, will lead to an estimation error that grows linearly. Integrating once more, the error will exhibit a quadratic growth over time [34]. In orientation estimation, assumptions about periods of zero linear acceleration are common, for the purpose of utilizing the accelerometer data to act as a corrector for the roll and pitch angles and thus prevent the drift caused by integrating the gyro. Still, the drift issue remains for the yaw angle, which necessitates the use of additional sensors such as a magnetometer or a global navigation satellite system (GNSS) in order to include a correctional term for the remaining angle. However, magnetometers are sensitive to disturbances from ferromagnetic objects and electrical devices in their surroundings [16], while GNSS performs poorly when the satellite signal is obstructed. Thus, neither is suitable in every application. In contrast, the measurements from the gyroscope and accelerometer are independent of any such environmental attributes [35]. Alongside the drift issue, IMUs also suffer from complicated error sources that have computational (numeric range and resolution), random, as well as systematic origins [18]. Generally, the collected IMU error sources are [36]:

• Input Range
The input range is the maximum magnitude the IMU can measure. This error source is application dependent and needs to be considered in the design process rather than in the fusion process, specifically if the application has a high dynamic range and there is a risk of saturation.
Also to be considered is the amount of vibration, since substantial vibrations can lead to saturation if the device is close to its dynamic-range border. Another aspect of the input range is the bandwidth, which can cause aliasing.

• Bias Repeatability [deg/hr], [m/s²]
Due to a number of different influences, such as altering physical properties and initial conditions, the initial bias of the IMU will be different each time the device restarts. However, if the bias is very similar on each restart (i.e. the repeatability is high), the bias can to some extent be compensated for by tuning the device. Bias repeatability is frequently also referred to as Run-to-Run or Turn-On to Turn-On bias.

• Bias Stability [deg/hr/hr], [m/s²/hr]
During the IMU run-time the initial bias, which is sometimes simplified as a constant, changes. Generally, this is caused by changing temperature or mechanical stress, and it is common that manufacturers add temperature compensation to the IMU, which increases the stability. This error source also goes by the name In-Run Bias.

• Scale Factor [ppm]
The scale factor error describes the ratio between the sensed input and measured output of the sensor. The ideal ratio is, of course, one-to-one, but when this is not the case, the output is proportional to the input with a scale factor. Additionally, the scale factor also has a non-linear part that can be described separately, but often is described with one collective value.

• Sensor Misalignment [mrad]
The six degrees of freedom (DoF) IMU has three gyroscopes and three accelerometers set orthogonal to each other. The placement is, however, imperfect, causing correlation between sensors. Sensor misalignment errors originate both from non-orthogonality between accelerometers or gyroscopes (within sensor sets) as well as between sensor sets. There is also the phenomenon of package misalignment, where the attachment of the IMU is misaligned with the object that it is attached to.
• G Dependency [deg/hr], [m/s²]
When a Coriolis-effect-based gyroscope is subjected to acceleration along its oscillation axis, the acceleration will have an impact on the sensor readings. This effect is called G dependency, because gravity is often the main source of acceleration.

• Latency
When IMU sensors are aided by cooperating sensors, latency can arise. In other words, the latency error source emerges when the IMU and the aiding sensor both sense the same motion but their timing disagrees. If this timing disagreement is substantial, this error can become problematic.

2.3 Conventional IMU Data Fusion

There are many possible filter implementations that can be used to estimate the absolute orientation from IMU or MARG data. In this section two approachable ways will be covered.

2.3.1 The Complementary Filter

The complementary filter provides a straightforward solution to attain orientation estimates from IMU data. The accelerometer is a good indicator of orientation under the assumption of stationary, or at least slow-moving, conditions. Furthermore, even though the accelerometer output is noisy, it is accurate over long periods of time and does not suffer from accumulating drift when estimating orientation [16]. In contrast, the gyroscope provides a reliable measurement of the angular rate at each time step, but integration will cause errors to accumulate over time. Hence, the accuracy properties of the accelerometer and the gyroscope complement each other to assure accuracy on long and short timescales, respectively [16]. The main idea of the complementary filter is to low-pass filter the angle estimates from the long-term reliable accelerometer readings while high-pass filtering the integrated estimates from the short-term reliable gyroscope readings. Despite the relatively simple implementation, the complementary filter can provide smooth and accurate estimates and demands less computation than, for instance, the EKF [37].
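A minimal one-dimensional sketch of this idea, blending the integrated gyro rate with the accelerometer-derived angle (the blending constant `alpha` and the sample time are illustrative assumptions, not values from this thesis):

```python
import numpy as np

def complementary_filter(acc_angle, gyro_rate, dt=0.01, alpha=0.98):
    """1D complementary filter: high-pass the integrated gyro rate,
    low-pass the accelerometer angle. alpha close to 1 trusts the gyro
    on short timescales and the accelerometer on long ones."""
    theta = acc_angle[0]
    estimates = []
    for a, w in zip(acc_angle, gyro_rate):
        # blend integrated gyro (short-term) with accelerometer (long-term)
        theta = alpha * (theta + w * dt) + (1.0 - alpha) * a
        estimates.append(theta)
    return np.array(estimates)
```

Because the accelerometer term keeps a small but persistent weight (1 − alpha), a constant gyro bias only produces a bounded offset instead of an unbounded drift.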
The structure of this filtering technique can be viewed, more in-depth, in Figure 2.2. Further, the complementary filter using only an IMU can only estimate roll and pitch, but given MARG measurements the complementary filter can also estimate yaw; such an implementation can be found in [16].

Figure 2.2: Block diagram of a complementary filter estimating angle and angular velocity.

2.3.2 The Extended Kalman Filter

As mentioned in the introduction, the EKF is a commonly used filter in the field of sensor fusion, with the aim to take nonlinear effects into account [16]. The EKF is a two-step algorithm, constituted of the prediction and update steps. The non-linearity of the EKF lies in the motion model f(·) and the measurement model h(·), which can be non-linear as long as they are differentiable. Here, in equations (2.8) and (2.9), F_k and H_k are defined as the Jacobians of f(·) and h(·) with respect to the state x. The Jacobians are evaluated at the current state estimate, and in the case of F_k also at the current control signal.

F_k = ∂f/∂x |_{x̂_{k−1|k−1}, u_k} (2.8)

H_k = ∂h/∂x |_{x̂_{k|k−1}} (2.9)

Prediction step

State prediction:
x̂_{k|k−1} = f(x̂_{k−1|k−1}, u_k). (2.10)

Covariance prediction:
P_{k|k−1} = F_k P_{k−1|k−1} F_k^T + Q. (2.11)

In the prediction step, the predicted state vector x̂_{k|k−1} is computed via the propagation of the best state estimate from the previous time step, x̂_{k−1|k−1}, and the control vector of the current time step, u_k, through the motion model f(·) as given in equation (2.10) [38]. Also in the prediction, the covariance prediction matrix P_{k|k−1} is calculated as in equation (2.11), where Q denotes the process noise covariance matrix [38].
Update step

Innovation:
v_k = y_k − h(x̂_{k|k−1}) (2.12)

Innovation covariance:
S_k = H_k P_{k|k−1} H_k^T + R (2.13)

Kalman gain:
K_k = P_{k|k−1} H_k^T S_k^{−1} (2.14)

State update:
x̂_{k|k} = x̂_{k|k−1} + K_k v_k (2.15)

Covariance update:
P_{k|k} = P_{k|k−1} − K_k H_k P_{k|k−1} (2.16)

The goal of the update step is to calculate the best state estimate and covariance given new observations. First, the innovation v_k is attained as in equation (2.12) and can be understood as the difference between the new observation y_k and the expected observation given the state prediction x̂_{k|k−1}. Second, the innovation covariance S_k is calculated as in (2.13), where R is the measurement noise covariance matrix [38]. Third, the Kalman gain K_k is obtained according to (2.14). The Kalman gain decides the relative impact of the innovation compared to the state prediction, which can be viewed in equation (2.15), where the new updated state estimate x̂_{k|k} is computed. Finally, the covariance is also updated, as in equation (2.16).

Up until now, the general steps of the EKF have been described, and in order to adapt the EKF to produce orientation estimates, a number of design choices are required. Firstly, the state vector can be designed with different orientation conventions, such as Euler angles or unit quaternions. Moreover, the state can be composed of either only the actual orientation, or the orientation and the rotation rate, or be augmented to also allow estimation of sensor biases. Then, depending on the design of the state vector, motion and measurement models have to be developed for the chosen state representation. Furthermore, the covariance matrices Q and R introduce tuning parameters. The measurement noise covariance R can usually be set based on the uncertainties of the sensors. The process noise covariance Q is trickier, however, and should reflect the certainty of the chosen motion model. The ratio between the traces of the two matrices also affects the performance.
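Equations (2.8)–(2.16) can be collected into one compact iteration. In this sketch the models f, h and their Jacobians F, H are supplied by the caller; all names are generic placeholders rather than the specific design used in this thesis:

```python
import numpy as np

def ekf_step(x, P, u, y, f, h, F, H, Q, R):
    """One EKF iteration following equations (2.8)-(2.16)."""
    # Prediction step, eqs (2.10)-(2.11)
    Fk = F(x, u)
    x_pred = f(x, u)
    P_pred = Fk @ P @ Fk.T + Q

    # Update step, eqs (2.12)-(2.16)
    Hk = H(x_pred)
    v = y - h(x_pred)                      # innovation
    S = Hk @ P_pred @ Hk.T + R             # innovation covariance
    K = P_pred @ Hk.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ v
    P_new = P_pred - K @ Hk @ P_pred
    return x_new, P_new

# Trivial 1D linear example: identity motion and measurement models.
x, P = ekf_step(np.array([0.0]), np.eye(1), None, np.array([1.0]),
                lambda x, u: x, lambda x: x,
                lambda x, u: np.eye(1), lambda x: np.eye(1),
                0.01 * np.eye(1), np.eye(1))
```

With equal prior and measurement covariances, the update lands roughly halfway between prediction and observation, illustrating how the Kalman gain weighs the innovation.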
The ratio tr(Q)/tr(R) is sometimes referred to as the signal-to-noise ratio (SNR), and has large implications on the convergence rate [5]. The specific design choices made in this thesis are stated in the methods chapter, Section 3.6, but as can be understood, there exists a variety of possible implementations. A few examples of implementations which adopt unit quaternions as the orientation encoding can be found in [13], [16], [39].

2.4 Machine Learning and Neural Networks

NNs have recently become the most prominent approach in ML due to their ability to model complex relationships in data. An NN consists of a layered structure of operations that maps input features to some output. One of the simplest NN layers is the linear feed-forward layer. The linear feed-forward layer performs a linear mapping ℝⁿ → ℝᵐ of a feature vector x ∈ ℝⁿ. It does this by taking the matrix-vector product of a weight matrix W ∈ ℝ^{m×n} and x and adding a bias term b ∈ ℝᵐ,

z(x, W, b) = Wx + b. (2.17)

However, layering many of these linear feed-forward layers is of no interest, as they can be reduced to a single layer operation. Hence, to model more complex relationships, a key feature of NNs is their non-linearities. In the context of NNs, non-linearities are referred to as activation functions¹, since the general feature of the non-linearity is (typically) to suppress small input and let through large input, i.e. be inactive or activate. One such function is the sigmoid function,

σ(z) = e^z / (e^z + 1), (2.18)

which will be close to zero for z < 0 and close to 1 for z > 0. The output of an activation function is referred to as an activation,

a = σ(z(x, W, b)). (2.19)

Alternating linear feed-forward layers and activation functions gives rise to models that, if the different weight matrices and bias vectors are carefully selected, can model almost any relationship. NNs consisting of only linear feed-forward layers and activation functions are sometimes referred to as multi-layer perceptrons (MLPs).
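A forward pass through such an MLP, alternating equation (2.17) with the sigmoid activation (2.18)–(2.19), can be sketched as follows (the layer sizes and random weights are purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return np.exp(z) / (np.exp(z) + 1.0)     # equation (2.18)

def dense(x, W, b):
    return W @ x + b                         # equation (2.17)

def mlp_forward(x, layers):
    """Forward pass through an MLP: alternate linear layers with
    sigmoid activations. `layers` is a list of (W, b) pairs."""
    a = x
    for W, b in layers:
        a = sigmoid(dense(a, W, b))          # equation (2.19)
    return a

# A small 2 -> 3 -> 1 network with arbitrary weights:
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((3, 2)), np.zeros(3)),
          (rng.standard_normal((1, 3)), np.zeros(1))]
y = mlp_forward(np.array([1.0, -1.0]), layers)
```

Note that without the `sigmoid` call the two layers would collapse into a single matrix product, which is exactly why the non-linearity is essential.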
The most common way of selecting weights and biases is to randomly initialize them, and subsequently tune them using stochastic gradient descent (SGD) based optimization techniques, utilizing gradients computed with back propagation [40]. This is called training and is often done using examples of input with known desired output, which is referred to as supervised learning. A specific SGD-based optimization technique, which has been widely adopted since its release in 2014, is the ADAM optimizer, whose key feature is adaptive moment estimation [41]. A loss function is a measure of the error between the model output f(x, W) and the desired output y, L(f(x, W), y). It is this function that is minimized with respect to the model parameters (i.e. W and b) during training. Minimizing L with SGD-based methods requires the computation of ∂L/∂W, which is done with back propagation. The model weights are updated by subtracting the gradient, scaled by some factor. The scaling factor applied to the gradient is referred to as the learning rate. In regression models, some simple examples of loss functions are the mean square error and the mean absolute error. Apart from the mentioned sigmoid activation function,

tanh(z) = 2/(1 + e^{−2z}) − 1, (2.20)

and the Rectified Linear Unit,

ReLU(z) = max(0, z), (2.21)

will be used in the following sections.

¹The term activation function stems from the fact that artificial NNs are inspired by biological NNs. Biological neurons fire, or activate, when they are excited sufficiently to overcome inhibitory input. Likewise, an artificial neuron modeled as σ(z(x, W, b)) will fire if weighted input excites the artificial neuron sufficiently.

2.4.1 Recurrent Neural Networks and Long Short-Term Memory

While MLP networks are static networks, unaffected by the input of the previous time step, RNNs are able to process time series information [42].
RNNs achieve this by using a hidden state h which stores information about the previous inputs x to the network, enabling the detection of temporal correlations between events separated in time [43]. This ability motivates why RNNs are useful in forecasting applications using sequence data. In short, RNNs add memory to the system, and in the most simple structure of an RNN, the activation a at time t is available to the same layer at the next time step, as illustrated in Figure 2.3.

Figure 2.3: Illustration of an RNN unfolded through time.

As mentioned, NNs are often trained using SGD-based optimization methods. Due to the hidden state, the gradient methods need to be adjusted, and one well-established way to train RNNs is to use Back Propagation Through Time (BPTT) [44]. The problem with BPTT, for RNNs, is that the gradient forms a product of Jacobian matrices for the different time steps [43]. Hence, if the elements in the Jacobians are less than one, this will cause the norm of the gradient to decrease exponentially towards zero. Unfortunately, the opposite can also occur, and the norm of the gradient will then grow exponentially. These issues make it impractical to train RNNs and are referred to as the vanishing and exploding gradients problems, respectively [43].

As a solution to the problem with vanishing as well as exploding gradients, the LSTM unit was proposed in 1997 [45]. In the original paper the LSTM unit consisted of two multiplicative gate units, namely the input gate and the output gate. Shortly after that, in 1999 [46], an additional forget gate was proposed, and today the architecture where each LSTM unit contains all three gates is the most common. The version with a forget gate is shown in Figure 2.4.

Figure 2.4: An illustration of the LSTM layer. Yellow boxes indicate linear layers with activation functions and pink ellipses indicate element-wise operations.
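For reference, the simple RNN recurrence of Figure 2.3, where the previous activation is fed back into the layer, can be written as a few lines (weight names and sizes are illustrative assumptions):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a simple RNN: the previous activation h_prev
    is fed back together with the new input x_t."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def rnn_unroll(xs, h0, W_x, W_h, b):
    """Unfold the recurrence through time over a sequence of inputs."""
    h, hs = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        hs.append(h)
    return np.array(hs)
```

The repeated multiplication by W_h inside the unrolled loop is the source of the product of Jacobians discussed above, and hence of the vanishing and exploding gradients.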
At each time step the LSTM unit is fed information from three sources: i) the input vector x_t, ii) the hidden state vector of the last time step, h_{t−1}, which is also the output of the last time step, and iii) the cell state from the last time step, c_{t−1}. In the original paper [45], the cell state is referred to as the constant error carousel, which runs back and creates a connection to the past. The most important functions of the cell state are, firstly, that it can store information over arbitrarily long periods of time, which gives the LSTM its memory capabilities. Secondly, by construction, the gradient of the cell state has an additive property, instead of multiplicative as in the case of the original RNN, which prevents the gradient from vanishing during backpropagation [44]. As mentioned, the LSTM unit includes the three multiplicative gates, whose purposes are to preserve and regulate the cell state. Firstly, the activation from the forget gate f_t enables the network to learn what information to forget and what to remember. The activation is given by [46]

f_t = σ(W_f [x_t; h_{t−1}] + b_f), (2.22)

where W_f is the learned weight matrix and b_f the learned bias of the forget gate. The σ-function assures that the activation f_t will be bounded between 0 and 1. Hence, when f_t is element-wise multiplied with the cell state, the value of f_t determines how much of each element in c_{t−1} will be remembered. One can say that the forget gate's activation can reset the cell state [44].

Secondly, the input gate activation vector i_t determines from which elements of the current input x_t information should be added to the cell state. The input gate activation vector is given by

i_t = σ(W_i [x_t; h_{t−1}] + b_i), (2.23)

which is similar to the calculation of f_t, only with a different weight matrix and bias vector. The input gate activation is then multiplied with the input activation vector c̃_t, which contains information from the current input.
c̃_t = tanh(W_c [x_t; h_{t−1}] + b_c) (2.24)

In short, the input gate and c̃_t enable writing to the memory of the LSTM, and altogether, equations (2.22), (2.23) and (2.24) determine what the updated cell state will be [45], [46]:

c_t = f_t · c_{t−1} + i_t · c̃_t. (2.25)

Here, "·" implies element-wise multiplication. Lastly, the activation of the output gate o_t decides what information from the current cell state to output, and consequently enables reading from the cell state:

o_t = σ(W_o [x_t; h_{t−1}] + b_o) (2.26)

Moreover, the new hidden state, which is also the output of the current time step, is given by

h_t = o_t · tanh(c_t). (2.27)

All in all, the multiplicative gates and the cell state constitute the majority of what is required to understand the basics of the LSTM unit [45], [46].

2.4.2 The Recurrent Neural Filter

The RNF, presented in reference [24], is an autoencoder/decoder architecture based on LSTMs and inspired by Bayesian filtering. Instead of replacing the conventional Bayesian filtering framework with a single "block" network, the key idea behind the RNF is to benefit from the structure of the Bayesian filtering steps. The motivation behind this is that it enables filtering when the system dynamics are unknown while still handling missing data points. The encoder portion of the architecture is comprised of LSTM layers responsible for the different Bayesian filter steps. These LSTMs share a single cell state and hidden state², h, which they modify sequentially. The architecture of the RNF is illustrated in Figure 2.5. The decoder consists of an MLP, and is responsible for converting the network state representation to the desired estimate.

²Here h refers to both the cell state and the hidden state.

Figure 2.5: The Recurrent Neural Filter as presented in [24].

The RNF in [24] has three LSTM layers: propagation, input dynamics and error correction.
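Equations (2.22)–(2.27) can be collected into a single cell-step function; a minimal NumPy sketch, with each gate acting on the stacked vector [x_t; h_{t−1}]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    """One LSTM time step implementing equations (2.22)-(2.27)."""
    z = np.concatenate([x_t, h_prev])   # stacked [x_t; h_{t-1}]
    f = sigmoid(Wf @ z + bf)            # forget gate, eq (2.22)
    i = sigmoid(Wi @ z + bi)            # input gate, eq (2.23)
    c_tilde = np.tanh(Wc @ z + bc)      # input activation, eq (2.24)
    c = f * c_prev + i * c_tilde        # cell state update, eq (2.25)
    o = sigmoid(Wo @ z + bo)            # output gate, eq (2.26)
    h = o * np.tanh(c)                  # new hidden state, eq (2.27)
    return h, c
```

The additive form of the cell state update in equation (2.25) is visible directly in the code: c_prev enters the new cell state through a sum, not a repeated matrix product.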
The propagation step and the input dynamics step are together the equivalent of the prediction step in a Bayesian filter. The propagation LSTM is solely intended for propagation and does not take any input beyond the shared hidden state. The input dynamics LSTM takes the control signal, or input signal, u, as input and is intended to adjust the prediction made by the propagation LSTM. The last of the three horizontal LSTMs, the error corrector, takes the measurements, x, and adjusts the prediction to take the measurements into account. This is analogous to the update step in the Bayesian framework. To enforce a consistent representation of the state estimate in the hidden state, h, two measures are taken: i) the three LSTMs are trained with the same decoder, and ii) skip training is employed. In skip training the elements of h̃_t and h_t are randomly overwritten with values from previous states, h̃′_t and h̃_t respectively. The probability of this skip occurring is a tunable hyperparameter called the skip rate.

2.4.3 Estimating Covariance

Knowing the error covariance of the estimates produced by the filter networks is useful for fusing the estimated orientation with other estimates, and also for evaluating performance. Reference [47] explores the modeling of uncertainty in state estimates from neural networks. The k-dimensional state estimate is modeled as a multivariate Gaussian,

p(y|x) = (2π)^{−k/2} |Σ(x)|^{−1/2} e^{−½ (y − f(x))^T Σ(x)^{−1} (y − f(x))}, (2.28)

in the same way the state estimate in a Kalman filter is modeled. Here y is the ground truth. The networks are trained to output the mean, µ ∈ ℝ^k, and a parameterized covariance matrix Σ(s, r), where s ∈ ℝ^k is the vector parameterization of the diagonal elements and r ∈ ℝ^{k(k−1)/2} is a vector parameterization of the off-diagonal elements, see equation (2.30). The mean and covariance can be expressed as functions of the network input x,

µ = f(x), Σ(s, r) = Σ(x). (2.29)

The parameterization of the covariance matrix,

Σ_ij(s, r) = { σ_i² = e^{s_i},                                  i = j,
            { ρ_ij σ_i σ_j = tanh(r_ij) √(e^{s_i} e^{s_j}),     j < i,   (2.30)

describes the diagonal and one triangular portion of the matrix, with i and j being row and column indices both ranging from 1 to k. The covariance matrix is symmetric, and the remaining triangular portion follows from the first. Minimizing the negative log likelihood of the parameterized multivariate Gaussian distribution (see eq. (2.28)) is what allows for simultaneously training the model to estimate the mean as well as the error covariance. Hence

L(x, y) = ½ (y − f(x))^T Σ(x)^{−1} (y − f(x)) + ½ ln |Σ(x)| (2.31)

is used as the loss function. The authors of reference [47] also suggest two minor alterations in order to improve numerical stability when using equation (2.31) as the loss during training. The first suggestion is multiplying r with a constant (1 − ε), with ε ≈ 0.001. The second suggestion is to multiply the input to tanh with a small constant, α ≈ 0.05, to avoid saturating the activation function.

2.4.4 Regularization

To prevent overfitting of model parameters on noise in the training data, regularization techniques are employed. There are many different ways to achieve regularizing effects. Some of the common approaches are the following:

Early Stopping
When improvements are seen in the training loss but not in the validation loss, it is a key indication of poor generalization to data in the validation set. Early stopping is the practice of stopping the iterative training when no improvements are seen in some chosen metric, or in the calculated loss, on the validation set [48].

Mini Batches
Mini batches is the term commonly used to refer to the batching of training data. The key idea is to consider several samples in a single update step during training. This is done by computing the gradient of the loss function, evaluated over all the samples in the mini batch, with respect to the model parameters.
The impact of noise in the data will be smaller, the gradient will be a better approximation of the gradient computed using the entire dataset, and more general solutions are achieved. Batch processing also improves the efficiency of operations during training [49].

Weight Decay
Weight decay is the introduction of a penalty on the weights into the loss function. The most common way of doing this is to use L1 or L2 regularization, where the L1 or L2 norm of the weights is added to the loss function, scaled by some constant. The reasoning behind this is that solutions with smaller weights tend to generalize better and be more robust to noise [50].

Batch Normalization
Batch normalization is the practice of normalizing the activations of intermediate layers in a neural network based on metrics of a batch. This is done to deal with what is referred to as covariate shift. Covariate shift is when the changes in the weights of a layer during training cause the distribution of its activations to shift. Subsequently, the inputs to the next layer represent new features, which its weights are not tuned for [51].

2.4.4.1 Drop-out
One approach that almost always improves model performance is to train several similar models and average their predictions. However, this is an expensive approach. Drop-out is a method inspired by this averaging of many different models, and is done by dropping out a subset of the model, resulting in a "thinned" network. By sampling a different subset network during each training step, and only updating the weights of this subset, a regularizing effect is achieved [52].

2.4.5 Feature Scaling
The numerical range of the input features affects the rate of convergence of NNs during training. If the features are ill-conditioned³, SGD-based optimization will be slow to converge. This can be attributed both to oscillating network weights and to the differing magnitude of the update steps for the first layer and the consecutive layers.
To improve the convergence rate, the input features and the output labels are often re-scaled [53]. Two common ways to do feature scaling are min-max normalization, equation (2.34), and standardization, equation (2.33) [54].

std(x) = √( (1/N) Σ_{i=1}^{N} (x_i − mean(x))² ) (2.32)

z = (x − mean(x)) / std(x) (2.33)

z = (x − min(x)) / (max(x) − min(x)) (2.34)

³Features with magnitudes that differ vastly result in the gradient of the loss function with respect to the model parameters varying in magnitude as well. As a consequence, a learning rate appropriate for one model parameter will be much too large for another, causing either unstable or slow training. Features with large magnitude can also have a disproportionate influence on predictions.

The mean and standard deviation, or the min and max, are computed from the training data. All data fed to the network is then scaled in the same way, both during training and inference. Min-max normalization of features is an attempt to ensure that the scaled features are constrained to [0, 1]. Likewise, standardization is an attempt to ensure that the distribution of the scaled features has zero mean and unit variance.

2.4.6 Domain Randomization

When training ML models on synthetic data there is often a reality gap, where the simulated data differs from data captured in the real world. A simple way to deal with some of these differences is to use domain randomization (DR). In DR, relevant parameters in the simulation are randomized to reflect the uncertainty of their real-world counterparts [55].

3 Methods

This chapter will cover the methods used to answer the questions posed in Section 1.2.1. First, the tools used throughout the process are described, followed by a description of the problem set-up and the datasets used. Then, the model architectures, as well as the loss function and metrics used when training and evaluating the models, are described.
Subsequently, the benchmarks used for performance evaluation of the model architectures are presented, and the considerations made when choosing them are accounted for. Finally, the experiments performed and the hyperparameters used during training are reported.

3.1 Machine Learning Framework

TensorFlow is a machine learning interface that supports both the training and inference phases. The interface is developed open source by Google and is suitable for most machine learning application areas. The flexibility of TensorFlow lies in that it allows for experimenting with new models and for research, with sufficiently high performance, while still being robust enough to allow for the deployment of machine learning models [56]. TensorFlow is categorized as a low-level deep learning framework, while its application programming interface Keras can be considered a high-level framework. Keras is most commonly used with TensorFlow as backend, although it supports a few other low-level frameworks [57].

3.2 Separation into 2D and 3D Problem

The main focus of this thesis is to estimate absolute orientation utilizing ML approaches. Related literature, [11] and [19], employs supervised learning, which is an accessible approach, and is therefore also adopted in this thesis. Here, a separation of the problem into two parts will be established. To investigate the general suitability of utilizing an ML approach for sensor fusion purposes, a simplified 2-dimensional problem, with fewer DoFs, is set up. Here, the sensor device is allowed to rotate around one single axis, and therefore only one angle will be estimated. In order to avoid the discontinuity arising where angles transition between +π and −π, the angles fed to and from the network are represented using [sin θ, cos θ]. This representation also establishes a normalization of the inputs and outputs of the NNs. Further, the representation can easily be converted back into an angle using θ = arctan2(sin θ, cos θ).
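This encoding and its inverse can be sketched as follows (function names are ours):

```python
import numpy as np

def encode_angle(theta):
    """Represent an angle as [sin(theta), cos(theta)] to avoid the
    discontinuity at the +pi / -pi wrap-around."""
    return np.array([np.sin(theta), np.cos(theta)])

def decode_angle(enc):
    """Recover the angle in (-pi, pi] via the four-quadrant arctangent."""
    return np.arctan2(enc[0], enc[1])

# An angle just past the wrap-around decodes to its equivalent
# in (-pi, pi]:
theta = decode_angle(encode_angle(np.pi + 0.1))   # ~= -pi + 0.1
```

Because sin and cos are continuous across the wrap-around, a small change in orientation always produces a small change in the encoded representation, which is exactly the property a regression target needs.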
As a second task, the problem of retrieving an absolute orientation estimate in 3D shall be investigated. As explained in Section 2.1, a quaternion representation of the absolute orientation can be beneficial in the 3D case. Further, the quaternions used will be canonized such that the scalar part of the orientation quaternion is always positive. The reason for adopting canonized quaternions is that it simplifies the problem for the network: the network does not have to learn two representations for the same orientation, which would require more complex models of the output distributions [27]. In short, the 3D networks will be trained to estimate the canonized orientation quaternion representation with supervised learning.

3.3 Datasets

Two different datasets are used for the two estimation problems: 2D orientation and absolute orientation in 3D, respectively. In particular, the availability of a sufficient number of correctors, in 2D and 3D, has influenced the choice of datasets.

3.3.1 Synthetic 2D Data

Gnss-ins-sim [58] is a simulation tool for generating 3D IMU, MARG, and GNSS sensor data. It is also capable of generating 2D data, which is how it is used in the following experiments. The tool consists of a Python module that enables specific settings to mimic the true behavior of IMU sensor noise (bias, random walk, and bias stability). The module gives access to the ground truth both in quaternions and Euler angles. In 2D, it is sufficient to represent the orientation with a single-axis rotation, and therefore the quaternion representation will not be necessary. A training set containing 60 000 sequences, each consisting of 50 samples sampled at 100 Hz, is generated with DR of the noise parameters (see Table 3.1). The sequence length had to be limited mainly to reduce computation per update step, but also to allow for a large variety of sequences relative to the dataset size.
Similarly, a test set containing 500 sequences, each consisting of 1 000 samples sampled at the same frequency, is generated in the same noise domain. In these simulations, the initial orientation is sampled from U(−π, π), and the initial angular velocity is set to zero. The data is generated without linear acceleration. During training, 5% of the training set is used for validation.

Table 3.1: The distributions used for domain randomization of the noise parameters. These distributions are sampled for each generated 2D sequence. 1₂ is a vector of length two containing all ones.

Error source | Unit | Typ.¹ | Distribution
gyro bias | [deg/hr] | 1440 | U(±[400, 2000])
gyro angle random walk | [deg/√hr] | 0.15 | U(0.1, 0.2)
gyro bias instability | [deg/hr] | 2 | U(1, 3)
gyro bias instability corr.² | - | - | 100
acc. bias | [m/s²] | 1.37·10⁻² | U(1₂·10⁻², 1₂·2·10⁻²)
acc. vel. random walk | [m/s/√hr] | 0.012 | U(±1₂[0.008, 0.022])
acc. bias instability | [m/s²] | 3.6·10⁻⁵ | U(1₂·2·10⁻⁵, 1₂·5·10⁻⁵)
acc. bias instability corr.² | - | - | 1₂·200

¹ These are typical values, based on [59].
² This is the correlation coefficient used by gnss-ins-sim to model the bias as a first-order Gauss-Markov process [60].

3.3.2 3D MARG Dataset

For training the models in the 3D case, the OxIOD dataset [23] is used. In contrast to the 2D dataset, OxIOD is not synthetic. It consists of 158 sequences totaling 43 km in traveled distance. Each sequence contains MARG sensor readings as well as a reference orientation from a computer vision system. The dataset is provided with a suggested training/testing split of the data³, which is used in the following 3D experiments. The authors of the OxIOD dataset compare it with similar datasets and report a higher ground truth accuracy than the other datasets; however, they only compare translational accuracy [23]. The reported accuracy is 0.5 mm at 100 Hz⁴. OxIOD was chosen for its larger size compared to similar datasets, while still providing an absolute orientation.
Further, the dataset includes different types of motion, making it somewhat diverse. The data sequences were recorded using a smartphone attached to different parts of a person⁵. There are, for example, sequences with the phone in a walking person's pocket, hand, handbag, and on a trolley. There are also variations where the person is running, walking, and walking slowly. An example of a sequence, with a person carrying the phone in a handbag whilst walking, can be seen in Figure 3.1.

³ The suggested train/test split for OxIOD is to use specific sequences for training and testing, covering each of the different types of motion.
⁴ For reference, the datasets with the second-highest accuracy are TUM IV (1 mm, 200 Hz) [61] and the dataset that OriNet is trained on, EuRoC MAV (1 mm, 200 Hz) [62].
⁵ To the best of our knowledge, no PhD student was severely harmed during the data collection.

Figure 3.1: A 10 s data sample from the test set (OxIOD), showing readings from the accelerometer (specific force), gyroscope (angular velocity), and magnetometer (magnetic field strength) in the sensor's body frame.

All of the data in OxIOD is sampled at 100 Hz. We processed the data by dividing it into shorter sequences: a training set containing 51 784 sequences, each consisting of 50 samples, and a test set containing 342 sequences, each consisting of 1 000 samples. Of the training data, 5% is used for validation.

3.4 Model Architectures

Two main types of model structures are developed for the purpose of estimating orientation. Both architectures follow an encoder-decoder structure with a recurrent encoder consisting of LSTM units. The model outputs both the estimate and a parameterized covariance matrix, as described in Section 2.4.3.
3.4.1 The Recurrent Neural Network

Given the time-series nature of the IMU/MARG data and the dynamics of the orientation state, some sort of RNN is a natural choice, since RNNs do not require a fixed look-back window. More specifically, LSTMs were chosen in favor of simpler RNNs, due to their robustness towards vanishing gradients, as described in Section 2.4.1. Also, despite the success of attention-based models in fields such as natural language processing and image captioning, the fact that they rely on a discretization of the output makes them unsuitable when dealing with continuous outputs [24].

The proposed RNN models, designed and evaluated in this thesis, have a simple encoder-decoder structure. The encoder consists of a single LSTM layer, while the decoder consists of an MLP whose first layer has double the size of the hidden state. The second layer of the MLP has 2k + k(k−1)/2 units, with k being the dimension of the estimate (k = 2 in the 2D case and k = 4 in the 3D case).

Figure 3.2: Illustration of the RNN architecture evaluated. The green box represents an LSTM layer and the blue boxes represent fully connected layers.

3.4.2 The Recurrent Neural Filter

A variation of the RNF presented in Section 2.4.2 is evaluated, in which only two of the three separate LSTM encoders are used. Since there is no natural control signal in the case of orientation estimation from IMU/MARG data, the input dynamics layer is omitted and no control signal is used. This results in a slightly different RNF than the one proposed in [24], as the part of the RNF that models input dynamics has been removed. The resulting structure is illustrated in Figure 3.3. The two steps then constitute a prediction and an update step, where all measurements are given to the update part. The RNF model uses the same type of decoder as the previously described RNN.
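As a sketch of what the 2k + k(k−1)/2 decoder outputs can encode: k units for the estimate, and the remaining k + k(k−1)/2 for a lower-triangular Cholesky factor of the error covariance. The exact parameterization used is the one from Section 2.4.3; the version below, with an exponential applied to the diagonal to guarantee positive definiteness, is a common choice and only an assumption here:

```python
import numpy as np

def split_decoder_output(out, k):
    """Split a decoder output of length 2k + k(k-1)/2 into an estimate of
    dimension k and an error covariance matrix. The covariance is built
    from a lower-triangular Cholesky factor L: k log-diagonal entries
    (passed through exp) and k(k-1)/2 off-diagonal entries."""
    out = np.asarray(out, dtype=float)
    est = out[:k]
    L = np.diag(np.exp(out[k:2 * k]))
    L[np.tril_indices(k, -1)] = out[2 * k:]
    return est, L @ L.T  # L L^T is symmetric positive definite
```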
During training, the decoder decodes the hidden states of both the LSTM comprising the propagation step and the LSTM comprising the error correction step [24].

Figure 3.3: Illustration of the RNF architecture evaluated. The green boxes represent LSTM layers and the blue boxes represent fully connected layers.

3.5 Loss Function and Error Metrics

The negative log likelihood loss function and the root mean square Euclidean distance metrics used relate directly to the performance of the ML model. While the loss function has a direct connection to the training of the network, where it is minimized, the metric is detached from the training and can be chosen more freely.

3.5.1 Loss Function

Both the 2D and 3D experiments have been performed with the negative log likelihood function (equation (2.31)), which was described in Section 2.4.3. The benefit of estimating the error covariance of the produced estimates is twofold: it is useful both for fusing the estimate with other information and for evaluating the performance of the estimates.

3.5.2 Error Metrics

Two types of error metrics are used to measure and compare the performance of the trained models over the entire test set. Given a time series error vector δ, the root mean square error (RMSE) can be calculated as

RMSE(\delta) = \sqrt{\frac{1}{n}\left(\delta_1^2 + \ldots + \delta_n^2\right)}, (3.1)

where δ_i is the scalar error at time i ∈ [1, n]. Averaging the RMSE over all test sequences gives our first error metric, the average root mean square error (aRMSE). Similarly, we obtain our second error metric by averaging the quantity

RMSE_{20}(\delta) = \sqrt{\frac{1}{n-20}\left(\delta_{21}^2 + \ldots + \delta_n^2\right)}. (3.2)

Including the errors from before the estimate has converged makes it difficult to compare performance on sequences of different length. The second metric is referred to as the aRMSE20 and is utilized to reduce the impact of sequence length on the metric, as well as to provide a metric in which the EKF is not favoured due to being provided the ground truth as prior in the first time step.
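Equations (3.1) and (3.2), and their averaging into the aRMSE and aRMSE20 metrics, can be sketched as (hypothetical function names):

```python
import numpy as np

def rmse(delta, skip=0):
    """RMSE of a scalar error sequence, equation (3.1); with skip=20 the
    first 20 time steps are excluded, giving RMSE_20 of equation (3.2)."""
    d = np.asarray(delta, dtype=float)[skip:]
    return np.sqrt(np.mean(d ** 2))

def armse(error_sequences, skip=0):
    """Average the sequence-wise RMSE over a test set: aRMSE / aRMSE_20."""
    return float(np.mean([rmse(d, skip) for d in error_sequences]))
```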
Twenty time steps were found to be sufficient to achieve robustness towards the sequence length.

In the 2D case, the network only estimates one angle. The error,

\delta = \theta - \hat{\theta}, (3.3)

then consists of the difference between the ground truth θ and the estimate θ̂ in degrees. This error δ is then used to calculate the aRMSE and the aRMSE20 as described in equations (3.1) and (3.2).

From Section 2.1 we have the Euclidean distance δ_q, the difference error quaternion q_e, and the angle of the error quaternion θ_{q_e} as possible error metrics for 3D rotations represented by quaternions. However, since the difference error quaternion is not a scalar, it is inconvenient to use as a metric. Furthermore, the angle of the error quaternion requires a more complex calculation than the Euclidean distance. Therefore, the Euclidean distance δ_q is chosen as the 3D error metric, from which we calculate the aRMSE and the aRMSE20 as in equations (3.1) and (3.2).

3.6 Benchmarks

Two EKFs, one for 2D orientation estimation and one for 3D orientation estimation, are used as benchmarks for model performance. Section 2.3.2 introduced the general theory behind the EKF; here, the specific design choices of each of the benchmark EKFs are presented.

2D EKF

The EKF benchmark used in the 2D case utilizes a state vector x consisting of the angle θ and its derivative ω,

x = \begin{bmatrix} \theta \\ \omega \end{bmatrix}. (3.4)

The chosen motion model, f(x), assumes constant angular velocity and models the angle by integrating the velocity over time. Further, the motion model constrains θ by wrapping it to the range [−π, π), in order to avoid the ambiguity of θ and θ + 2π representing the same angle. The wrapping function w_{[−π,π)}(·) is defined as

w_{[-\pi,\pi)}(x) = \mathrm{mod}(x + \pi, 2\pi) - \pi. (3.5)

The motion model and the Jacobian of the motion model, F(x), are then given by

f(x) = \begin{bmatrix} w_{[-\pi,\pi)}(\theta + \Delta t\,\omega) \\ \omega \end{bmatrix}, \quad F(x) = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}. (3.6)

The update step of the 2D EKF is rather straightforward.
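The wrapping function (3.5) and the motion model (3.6) above can be sketched as:

```python
import numpy as np

def wrap(x):
    """Wrap an angle to the range [-pi, pi), equation (3.5)."""
    return np.mod(x + np.pi, 2 * np.pi) - np.pi

def predict_2d(x, dt):
    """Constant angular velocity motion model f(x) of equation (3.6),
    with state x = [theta, omega]."""
    theta, omega = x
    return np.array([wrap(theta + dt * omega), omega])
```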
The chosen measurement vector y consists of the gyroscope reading ω_g, and the sin(·) and cos(·) of the angle θ_a retrieved from the normalized accelerometer reading,

y = \begin{bmatrix} \omega_g \\ \sin(\theta_a) \\ \cos(\theta_a) \end{bmatrix}. (3.7)

The measurement model h(x) and its Jacobian H(x) are defined as

h(x) = \begin{bmatrix} \omega \\ \sin(\theta) \\ \cos(\theta) \end{bmatrix}, \quad H(x) = \begin{bmatrix} 0 & 1 \\ \cos(\theta) & 0 \\ -\sin(\theta) & 0 \end{bmatrix}. (3.8)

Given the measurement and motion model, the 2D EKF then follows the steps as explained in Section 2.3.2.

3D EKF

In the 3D case, an EKF estimating the absolute orientation in quaternions is used to benchmark the developed ML models. The chosen state vector x is

x_k = q_k, (3.9)

where q is the unit quaternion encoding the true orientation at the discrete time-step k. Further, since the derivative of q_k is excluded from the state, the angular rate ω_k is instead included in the control signal⁶ u_k. Note that the control signal is expressed in the body frame and constitutes a three-axis rotation in Euclidean space.

Other design alternatives are, for example, to use an augmented state vector to estimate the biases of the gyro and the accelerometer. It is also possible to include the quaternion rates as a part of the state vector. Both of these alternatives could potentially improve the estimation accuracy of the EKF, but they also increase the complexity of the implementation.

The continuous-time derivative of the orientation quaternion, q̇, with respect to ω (see reference [31]) is

\dot{q} = \frac{1}{2} \underbrace{\begin{bmatrix} 0 & -\omega_x & -\omega_y & -\omega_z \\ \omega_x & 0 & \omega_z & -\omega_y \\ \omega_y & -\omega_z & 0 & \omega_x \\ \omega_z & \omega_y & -\omega_x & 0 \end{bmatrix}}_{=\Omega(\omega)} q, (3.10)

which can also be re-written as

\dot{q} = \frac{1}{2} \underbrace{\begin{bmatrix} -q_1 & -q_2 & -q_3 \\ q_0 & -q_3 & q_2 \\ q_3 & q_0 & -q_1 \\ -q_2 & q_1 & q_0 \end{bmatrix}}_{=\bar{\Omega}(q)} \omega. (3.11)

Given the derivative, the discretized motion model f can be chosen as

f\left(\hat{x}_{k-1|k-1}, u_k, v_k\right) = \left(I + \frac{\Delta t}{2} \Omega(u_k)\right) \hat{x}_{k-1|k-1} + \frac{\Delta t}{2} \bar{\Omega}\left(\hat{x}_{k-1|k-1}\right) v_k, (3.12)

where ∆t is the sampling time, and v_k is the process noise.
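The noise-free part of the discretized motion model in equation (3.12), i.e. with v_k = 0, can be sketched as follows. The renormalization at the end is a common practical safeguard and not part of equation (3.12):

```python
import numpy as np

def omega_matrix(w):
    """The 4x4 matrix Omega(omega) of equation (3.10)."""
    wx, wy, wz = w
    return np.array([[0.0, -wx, -wy, -wz],
                     [wx,  0.0,  wz, -wy],
                     [wy, -wz,  0.0,  wx],
                     [wz,  wy, -wx,  0.0]])

def predict_quaternion(q, u, dt):
    """One prediction step: q_k = (I + dt/2 * Omega(u_k)) q_{k-1},
    followed by renormalization to keep q a unit quaternion."""
    q = (np.eye(4) + 0.5 * dt * omega_matrix(u)) @ q
    return q / np.linalg.norm(q)
```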
The process noise is modeled as an additive noise contribution to the angular rate ω in the body frame, i.e.,

u_k = \omega_k + v_k. (3.13)

Thereby, the gyroscope measurements can be utilized as the control vector, and the gyroscope measurement noise will be a part of the process noise.

⁶ The signal from the gyro is relatively accurate and rapidly changing. Including it in the state vector and subsequently estimating it will generally not result in large improvements in its estimation. However, estimating it results in a larger state vector and subsequently larger system matrices and more computation.

In order to simplify the expressions in equation (3.12), we introduce

F_k = \left(I + \frac{\Delta t}{2} \Omega(u_k)\right), \quad G_k = \frac{\Delta t}{2} \bar{\Omega}\left(\hat{x}_{k-1|k-1}\right). (3.14)

By assuming v_k ∼ N(0, Q_k), the prediction step of the EKF, analogous to equations (2.10)-(2.11), is given as

\hat{x}_{k|k-1} = F_k \hat{x}_{k-1|k-1}, (3.15)
P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + G_k Q_k G_k^T. (3.16)

Note that v_k is given in the body frame and translated to the world frame via G_k. Thus, G_k is also applied to the process covariance Q_k in equation (3.16).

The update step of the EKF is divided into two separate steps, which apply the measurement models of the accelerometer and the magnetometer, respectively. This separation is introduced mainly to add adaptivity to the algorithm, since the EKF could potentially be run without the magnetometer data. Furthermore, the separation makes it possible to turn off either of the measurement models, which increases the transparency of the verification and debugging process.

The accelerometer measurements y_{a,k} can be modeled as

y_{a,k} = \left(R(q_k)\right)^T \left(g_0 + f_{a,k}\right) + e_{a,k}, (3.17)

where R(q_k) is the function that transforms the current orientation q_k to its corresponding rotation matrix. Further, g_0 is the nominal gravitation vector, and f_{a,k} is the linear acceleration. Both g_0 and f_{a,k} are expressed in the world frame; added together, they form the specific force.
Thus, the transposed rotation matrix, R(q_k)^T, converts the specific force to the body frame. Lastly, e_{a,k} denotes the accelerometer measurement error, also in the body frame.

Equation (3.17) can be approximated using a Taylor expansion, and with an assumption of zero linear acceleration, f_{a,k} = 0, the accelerometer measurement model can be written as

y_{a,k} \approx \left(R\left(\hat{x}_{k|k-1}\right)\right)^T g_0 + \left(\frac{dR}{dx}\left(x_k - \hat{x}_{k|k-1}\right)\right)^T g_0 + e_{a,k}. (3.18)

Then let

H_k = \left(\frac{dR}{dx}\left(\hat{x}_{k|k-1}\right)\right)^T g_0, (3.19)

and the accelerometer measurement update of the EKF, similar to equations (2.12)-(2.16) in Section 2.3.2, becomes

v_k = y_{a,k} - \left(R\left(\hat{x}_{k|k-1}\right)\right)^T g_0, (3.20)
S_k = H_k P_{k|k-1} H_k^T + R_{a,k}, (3.21)
K_k = P_{k|k-1} H_k^T S_k^{\dagger}, (3.22)
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k v_k, (3.23)
P_{k|k} = P_{k|k-1} - K_k S_k K_k^T. (3.24)

Similar to how the measurement model of the accelerometer has been derived, the magnetometer part of the update step can be obtained. The magnetometer measurement model is developed from

y_{m,k} = \left(R(q_k)\right)^T \left(m_0 + f_{m,k}\right) + e_{m,k}, (3.25)

where m_0 is the earth's magnetic field strength in the world frame, f_{m,k} constitutes other magnetic fields present, and e_{m,k} is the measurement error.

The tuning used in this thesis, which was found to give the best performance on the test sequences, was

Q_k = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad R_{a,k} = \begin{bmatrix} 150 & 0 & 0 \\ 0 & 150 & 0 \\ 0 & 0 & 150 \end{bmatrix}, \quad R_{m,k} = \begin{bmatrix} 200 & 0 & 0 \\ 0 & 200 & 0 \\ 0 & 0 & 200 \end{bmatrix}. (3.26)

As a reminder, Q_k is the process noise covariance, R_{a,k} is the accelerometer measurement covariance, and R_{m,k} is the magnetometer measurement covariance. Further, the initial state of the EKF was chosen close to the ground truth for each test sequence. This was done in order to ensure that the EKF would converge in all sequences.

3.7 Experiments

A quantitative approach is taken with regard to training the developed models. Here, different variations of the two main types of architectures, the RNN and the RNF from Section 3.4, are trained and evaluated in both 2D and 3D.
The approach is quantitative in that several models are trained with the same hyperparameter settings but different, randomly initialized, weights. The performance of these multiple instantiations is then evaluated in terms of best and average performance, i.e. the aRMSE and aRMSE20 metrics. Different settings are then compared to each other to indicate which parameter setting works better. This systematic parameter testing is commonly referred to as grid search [63], and the parameters evaluated in the experiments are the hidden state size and the drop-out rate. Other parameters, like the batch size, early stopping, learning rate, and skip-rate, are varied and chosen by engineering judgement. Other relevant information for the experiments is i) that the Adam optimizer is chosen, ii) gradient norm scaling is applied with the maximum L2 norm set to 0.5, iii) the training data is shuffled after each epoch to ensure unbiased subsequent updates, and iv) the training/validation split is resampled for each trained model.

3.7.1 The 2D Case Experiments

In the 2D case, three different RNN types with different recurrent state sizes are trained with three different drop-out rates, resulting in nine different versions of RNNs. For each RNN version, 20 instances are trained with randomly initialized weights, to verify consistent results. Similarly, three different RNF types are trained, also with varying recurrent state size, but without drop-out due to problems with divergence. These models (the RNNs and the RNFs) are all trained with a learning rate of 10⁻³. Training is done for 60 epochs, or until the loss function does not improve for three consecutive epochs (early stopping). The synthetic gyroscope and accelerometer measurements are all standardized using the mean and standard deviation computed from the entire training set.

3.7.2 The 3D Case Experiments

The models designed to estimate orientation in 3D are trained to predict the canonized quaternion.
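The standardization of the sensor measurements mentioned above can be sketched as (a hypothetical helper; the key point is that the statistics come from the training set only):

```python
import numpy as np

def standardize(data, mean=None, std=None):
    """Standardize sensor channels to zero mean and unit variance.
    Pass the training-set mean/std when transforming validation or
    test data, so that no test statistics leak into the scaling."""
    data = np.asarray(data, dtype=float)
    if mean is None:
        mean = data.mean(axis=0)
    if std is None:
        std = data.std(axis=0)
    return (data - mean) / std, mean, std
```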
For reasons explained in Section 2.4.5, feature scaling can be beneficial for the convergence rate of the developed RNN and RNF. Therefore, in this thesis the accelerometer, magnetometer, and gyro data are feature scaled using standardization. The ground truth labels, however, are already scaled due to the constraint of the quaternions being unit length. The 3D models are trained on sequences 50 time steps long, corresponding to half a second worth of sensor readings. In total, the models are trained on 51 784 sequences, whereof 5% are used for validation. The test set consists of 342 sequences with a length of 1 000 time steps. Further, the models are trained with a learning rate of 10⁻² for the first five epochs, followed by 10⁻⁵ during the rest of the training. Training is done for 60 epochs, or until the loss function does not improve for three consecutive epochs (early stopping).

4 Results

In this chapter the results of the performed experiments are presented. Firstly, the results of the proposed 2D RNN and RNF models are displayed and compared. Secondly, the results of the 3D models are presented and compared to the benchmark EKF. Finally, results from an ablation study and a missing-data test with the 3D RNF are reported.

4.1 Results of the 2D Models

Table 4.1 provides a comparison of the performance between the different RNN versions trained for the 2D problem, in terms of the aRMSE and the aRMSE20. The errors reported from the 2D models are in degrees, and twenty instances of each version were trained. Additionally, the best result achieved for each version is also reported in Table 4.1, in the columns min(aRMSE) and min(aRMSE20). The results in Table 4.1 show that two parameter settings, versions 1 and 8, deliver the best performance. This result is somewhat counter-intuitive since these parameter settings are farthest apart in the table.

Table 4.1: Recorded test results from the trained 2D RNN model versions.
Version | Hidden State Size | Drop Out | max aRMSE | aRMSE | min aRMSE | max aRMSE20 | aRMSE20 | min aRMSE20
1 | 8 | 0.0 | 3.16 | 0.5613 | 0.2297 | 1.573 | 0.2803 | 0.1226
2 | 8 | 0.2 | 2.5 | 0.5464 | 0.1711 | 1.591 | 0.3108 | 0.134
3 | 8 | 0.4 | 6.326 | 0.6758 | 0.1946 | 6.381 | 0.5176 | 0.1153
4 | 16 | 0.0 | 2.538 | 0.4743 | 0.2564 | 2.52 | 0.3695 | 0.1742
5 | 16 | 0.2 | 0.6635 | 0.3889 | 0.1982 | 0.5844 | 0.3274 | 0.1584
6 | 16 | 0.4 | 1.692 | 0.459 | 0.1992 | 1.69 | 0.3841 | 0.1674
7 | 32 | 0.0 | 1.663 | 0.4024 | 0.1909 | 1.628 | 0.3697 | 0.1832
8 | 32 | 0.2 | 1.317 | 0.358 | 0.1655 | 1.307 | 0.3396 | 0.1653
9 | 32 | 0.4 | 1.341 | 0.4319 | 0.212 | 1.326 | 0.3825 | 0.1871

Figure 4.1 shows a test sequence of the generated 2D data, estimates from a trained RNN model, and estimates from the benchmark EKF. Notably, the proposed RNN manages to estimate the ground truth of the test sequence fairly well.

Figure 4.1: Plot of a 2D test sequence with orientation estimates from a trained RNN and the benchmark EKF.

Similarly to Table 4.1, Table 4.2 shows a comparison of performance, in terms of aRMSE and aRMSE20, but for the proposed 2D RNF models. Also here, 20 instances of each version have been tested, and the best instance of each version is indicated in terms of min(aRMSE) and min(aRMSE20). Note that the RNFs where the dropout was different from zero failed to converge in the 2D experiments, and are thus not reported in the table. Recall that the motivation behind the 2D tests is to investigate the suitability of applying ML to sensor fusion problems; finding an indication of which parameter setting works better is thus not crucial in the 2D case.

Table 4.2: Recorded test results from the trained 2D RNF model versions.

Version | Hidden State Size | Drop Out | max aRMSE | aRMSE | min aRMSE | max aRMSE20 | aRMSE20 | min aRMSE20
1 | 8 | 0.0 | 2.772 | 0.6884 | 0.2575 | 2.638 | 0.4963 | 0.1593
2 | 16 | 0.0 | 0.5101 | 0.3052 | 0.1597 | 0.4786 | 0.2777 | 0.1453
3 | 32 | 0.0 | 0.4433 | 0.2799 | 0.1499 | 0.445 | 0.2798 | 0.1504

In Table 4.3, the orientation estimation errors of the benchmark EKF and of the best instances of the proposed 2D RNN and RNF can be compared.

Table 4.3: Performance comparison of the benchmark EKF and the best instances of the proposed RNN and RNF for the 2D problem.

Model | aRMSE | aRMSE20
RNN | 0.1655 | 0.1226
RNF | 0.1499 | 0.1453
EKF | 2.187 | 0.0884

4.2 Results of the 3D Experiments

In Tables 4.4 and 4.5, the aRMSE and aRMSE20 for the trained instances of the proposed RNN and RNF models are presented. For the RNN architecture, the best parameter setting was found for version 9, while for the RNF architecture version 6 gave the best results. Note that the RNFs with the smaller hidden state sizes, 8 and 16, were not trained with dropout, since it was concluded that larger hidden state sizes were needed.

Table 4.4: Recorded test results from the trained 3D RNN model versions.

Version | State Size | Drop Out | max aRMSE | aRMSE | min aRMSE | max aRMSE20 | aRMSE20 | min aRMSE20
1 | 8 | 0.0 | 0.2708 | 0.1694 | 0.1442 | 0.2706 | 0.1689 | 0.1439
2 | 8 | 0.2 | 0.1796 | 0.1613 | 0.1489 | 0.1795 | 0.1610 | 0.1486
3 | 8 | 0.4 | 0.2108 | 0.1668 | 0.1422 | 0.2108 | 0.1664 | 0.1411
4 | 16 | 0.0 | 0.3139 | 0.1586 | 0.1326 | 0.3139 | 0.1582 | 0.1321
5 | 16 | 0.2 | 0.1695 | 0.1468 | 0.1330 | 0.1694 | 0.1464 | 0.1325
6 | 16 | 0.4 | 0.1669 | 0.1410 | 0.1271 | 0.1667 | 0.1405 | 0.1265
7 | 32 | 0.0 | 0.2129 | 0.1394 | 0.1170 | 0.2121 | 0.1390 | 0.1164
8 | 32 | 0.2 | 0.1619 | 0.1354 | 0.1189 | 0.1618 | 0.1350 | 0.1183
9 | 32 | 0.4 | 0.1613 | 0.1328 | 0.1148 | 0.1609 | 0.1323 | 0.1141

Table 4.5: Recorded test results from the trained 3D RNF model versions.

Version | State Size | Drop Out | max aRMSE | aRMSE | min aRMSE | max aRMSE20 | aRMSE20 | min aRMSE20
1 | 8 | 0.0 | 0.3218 | 0.1615 | 0.1304 | 0.3222 | 0.1613 | 0.1302
2 | 16 | 0.0 | 0.297 | 0.1305 | 0.1099 | 0.2974 | 0.1304 | 0.1097
3 | 32 | 0.0 | 0.119 | 0.106 | 0.0949 | 0.1188 | 0.1059 | 0.0947
4 | 32 | 0.1055 | 0.1152 | 0.1067 | 0.096 | 0.1151 | 0.1065 | 0.0959
5 | 32 | 0.2254 | 0.1332 | 0.1175 | 0.1065 | 0.1331 | 0.1174 | 0.1064
6 | 64 | 0.0 | 0.1056 | 0.09569 | 0.082 | 0.1055 | 0.09553 | 0.0817
7 | 64 | 0.1055 | 0.1035 | 0.09704 | 0.0871 | 0.1034 | 0.09691 | 0.0869
8 | 64 | 0.2254 | 0.108 | 0.09955 | 0.0921 | 0.1079 | 0.09941 | 0.092

In Figure 4.2, a plot of a ground truth orientation quaternion and estimates of it can be viewed. Alongside the orientation quaternion, the estimation errors in Euler angles can also be viewed. The estimates are produced by the best performing instance of RNF model version 6 and by the benchmark EKF. Also visible in the plot is the estimated error covariance of the RNF estimates and the calculated error covariance of the EKF estimates. The test sequence chosen for Figure 4.2 was selected with the intention to reflect the average performance of the best RNF version; the plot shows neither the best nor the worst performance.

Figure 4.2: The estimated sequence of quaternions (left) and the estimation error in Euler angles (right) from the RNF and the EKF in 3D.

4.2.1 Comparison and Benchmarks

The best results of the 3D experiments are compared with the benchmark EKF in Table 4.6. A closer inspection of the table shows that the proposed RNF performs better than both the EKF and the RNN for the chosen metrics. Also visible in the table are the number of trainable weights for the NN models, and the average one-step computation time for all three models. The average time to compute one step with the 3D EKF is 0.3368 ms on the hardware used¹. The same figure for the RNN and the RNF is 0.07017 ms and 0.1929 ms, respectively.

¹ All of the models and filters reported on are run on a Lenovo ThinkPad T490s with an Intel® Core™ i7-8565U CPU @ 1.80 GHz × 8. The TensorFlow models are run on the CPU.

Table 4.6: Comparison of the best results in 3D, along with the models' execution times.

Model | aRMSE | aRMSE20 | Trainable Weights | Execution Time [ms]
RNN | 0.1148 | 0.1141 | 8 398 | 0.07017
RNF | 0.08200 | 0.08170 | 45 282 | 0.1929
EKF | 0.1017 | 0.1021 | - | 0.3368

4.2.2 Ablation Study

Removing parts of a network architecture to evaluate the impact this has on performance is referred to as an ablation study² [64]. The two presented model architecture types are very similar. The proposed RNN architecture is essentially the same as the proposed RNF architecture, trained without separate LSTMs for propagation and error correction. Any differences in performance of the two model types can thus be attributed to the LSTM layer responsible for propagation. Comparing the performance of 3D RNN version 7 and 3D RNF version 3, two models with no dropout and identical state size, the RNF has a 23% lower aRMSE and aRMSE20, with an increase in trainable weights from 8 398 to 13 774 (roughly a 65% increase).

² The term ablation study comes from experimental biology and refers to the removal of brain tissue, with the intent to study the impact its removal has, and infer its function.

4.2.3 Missing Data

In Section 2.4.2, the RNF model's potential capability to handle missing data points was introduced. Figures 4.3 and 4.4 show the same sequence, estimated with the same model, as in Figure 4.2. The only difference is that in Figure 4.3 sensor data was given to the proposed RNF only every other time step, and in Figure 4.4 no sensor data was given to the RNF after 400 time steps.

Figure 4.3: Plot showing an orientation sequence and estimate (left), and the estimation error (right) produced by the trained RNF when input data was missing every other time step.

Figure 4.4: Plot showing an orientation sequence and estimate (left), and the estimation error (right) produced by the trained RNF when input data was missing after 400 time steps.

5 Discussion

The aim of this study was to investigate the viability of using ML-based models to estimate absolute orientation given MARG data.
Two models with different attributes have been developed, namely an RNN-based and an RNF-based design. In this chapter, the results reported in Chapter 4 are discussed, followed by some insights into the future potential of similar ML-based orientation estimation approaches.

5.1 Performance and Uncertainty of the Developed ML Models

An initial objective of the project was to verify that the problem is at all tractable with an ML approach. To answer this, simpler experiments in 2D were executed, where the current study found that the 2D models were able to out-perform the EKF on synthetic sequences without linear acceleration in the aRMSE metric. However, in the aRMSE20 metric, the EKF still performs better than the ML models (see Table 4.3). The demonstrated convergence of the presented models in 2D indicates the feasibility of an ML approach that estimates orientation.

The presented ML models have been shown to perform similarly to the EKF benchmarks, with an aRMSE of comparable magnitude. More specifically, for the 3D results, this thesis has reported that the best instance of the RNF model performs better than the EKF in both the aRMSE and the aRMSE20 metric (see Table 4.6). This result indicates that the proposed RNF model is a better choice of architecture compared to the RNN-based one. Another feature that argues in favor of the RNF is its potential to handle occurrences of missing data points, see Figures 4.3 and 4.4. In real-time applications, this property may be highly valuable.

Another finding was that the execution time was considerably shorter for the proposed RNF compared to the EKF, and the proposed RNN was faster still. A shorter execution time implies a less computationally expensive filter, which speaks in favor of the ML models. Intuitively, it is easy to believe the NNs to be heavy to run due to the number of weights and the non-linearities found in the activation functions.
This is not the case here, and a contributing factor is that the proposed networks are not very deep. A note regarding the comparison of the proposed models and the benchmark EKF is that it is highly dependent on the specific EKF implementation used. Still, a reasonable execution time is necessary for the developed models to be considered feasible at all.

The third research question regarded the possibility of quantifying the certainty of the estimate. As a solution, the error covariance estimation was included in the loss function presented in Section 2.4.3. The plot in Figure 4.2 shows the result of the implemented covariance estimation. Furthermore, the covariance estimation itself also appears to be of reasonable magnitude in relation to the estimate and its error. At this stage, the correctness of the covariance estimation has not been sufficiently evaluated. Still, the inclusion of error covariance estimation is a highly desirable trait in many applications and should be explored further in future work. The relatively good aRMSE20 of 2D RNN version 1 might be an indication of the small state size forcing the network to learn a much simpler, less rich, representation of the system dynamics.

5.1.1 Research Methodology

A number of factors affect the findings of this study. As mentioned in Section 3.6, the benchmark EKF is given a prior close to the ground truth to ensure convergence on all test sequences. The proposed NN models, however, are not given any good prior; their hidden state is initialized with all zeros at the start of a sequence. Potentially, the presented ML models can learn to perform well both when the priors are accurate and when they are poor. Training the models with short sequences, as described in Section 3.7, has potentially optimized them to converge at a rate that is beneficial given the length of the training sequences.
Converging fast will have a greater impact on decreasing the loss on shorter sequences, due to the relative contribution of the errors at each time step. If an NN does not have enough parameters to model the relationships that are required to make a good estimate both when its prior is poor and when its prior is good, there has to be a trade-off. An illustration of the potential trade-off between training on long versus short sequences can be found in Figure 5.1. Since the models are evaluated on test sequences 20 times longer than the training sequences, if this trade-off is at play, different lengths of training sequences might alter performance¹. A way to steer this trade-off in the desired direction (fast convergence versus accurate estimates over long time) could be to weight the loss. By weighting the loss of the first time steps higher, a fast convergence could be encouraged. Alternatively, a lower weight of the loss at the first time steps could encourage more accurate estimates after convergence. Nevertheless, it cannot be excluded that the performance of the developed models could improve with longer training sequences.

¹ If minimizing the metric on the test data is the main concern, choosing training sequences of the same length as the test sequences will most likely yield the best result. However, doing so is computationally expensive, and the resulting model might still not perform better in real-time applications with near-infinite length sequences.

Figure 5.1: An illustrative example of two univariate state estimators, with different SNR assumptions, applied on two sequences of different length. Ground truth is represented by the blue line, the estimate by the orange, and the measurements by crosses. (RMSE per panel — high SNR assumption: 0.32 on the short sequence, 0.20 on the long; low SNR assumption: 0.36 on the short sequence, 0.17 on the long.)
Another aspect that may influence the way the results are perceived is the benchmark EKFs. It can be noted that the EKFs were not developed with the intention to make them "perfect", but rather to achieve satisfactory orientation estimates. Further, the EKFs are given a quite unfair advantage in their priors, as previously mentioned.