DEPARTMENT OF MECHANICS AND MARITIME SCIENCES
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2020
www.chalmers.se

Learning a Better Attitude
A Recurrent Neural Filter for Orientation Estimation
Master's Thesis in Systems, Control and Mechatronics
AMANDA FRANSSON
OSCAR LUNDSTRÖM

© AMANDA FRANSSON, OSCAR LUNDSTRÖM, 2020.

Supervisor: Niclas Berglind, CPAC Systems AB
Examiner: Peter Forsberg, Department of Mechanics and Maritime Sciences

Master's Thesis 2020:16
Department of Mechanics and Maritime Sciences
Division of Vehicle Engineering and Autonomous Systems
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Schematic illustration of a recurrent neural filter, as explained in Section 2.4.2.

Typeset in LaTeX, template by David Frisk
Gothenburg, Sweden 2020

Abstract

In the current paradigm of sensor fusion, orientation estimation from inertial measurement unit (IMU) sensor data is done using techniques derived with Bayesian statistics. These derivations are based on assumptions about noise distributions and hand-crafted equations describing the relevant system dynamics. Machine learning, and more specifically neural networks, may provide an alternative solution to the problem of orientation estimation in which no such assumptions or hand-crafted relationships are present.
This thesis aims to investigate whether a neural network-based filter can achieve performance comparable to, or exceeding, that of the more conventional extended Kalman filter. Two network architectures based on long short-term memory layers are proposed, trained, evaluated, and compared using data from the Oxford inertial odometry dataset. Of the two suggested model architectures, the so-called recurrent neural filter is found to give the better performance. The recurrent neural filter has a structure inspired by Bayesian filtering, with a prediction and an update step, allowing it to output a prediction in the event of missing data. Further, the evaluated models are trained to estimate orientation as well as a parameterized error covariance matrix. Our results show that the suggested recurrent neural filter outperforms the benchmark filter both in average root mean square error and in execution time. The results also indicate that machine learning-based approaches to sensor fusion problems may be an attractive alternative to hand-crafted filters in the future.

Keywords: sensor fusion, state estimation, absolute orientation estimation, recurrent neural filter, recurrent neural network, RNN, LSTM, IMU, MARG.

Acknowledgements

We would like to thank CPAC Systems for providing resources and giving us the opportunity to conduct our thesis with them. We would also like to thank Niclas Berglind, our supervisor, for supporting us and helping us get started with our work, and Peter Forsberg, our examiner, for the many hours spent giving us feedback and input on our thesis, despite the unexpected circumstances caused by the COVID-19 pandemic.

Amanda Fransson & Oscar Lundström, Gothenburg, May 2020

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Background
    1.1.1 Related Work
  1.2 Problem Statement and Aim
    1.2.1 Research Questions
  1.3 Delimitations
  1.4 Contribution
  1.5 Thesis Outline
2 Theory
  2.1 Representing Orientation
  2.2 The Inertial Measurement Unit
  2.3 Conventional IMU Data Fusion
    2.3.1 The Complementary Filter
    2.3.2 The Extended Kalman Filter
  2.4 Machine Learning and Neural Networks
    2.4.1 Recurrent Neural Networks and Long Short-Term Memory
    2.4.2 The Recurrent Neural Filter
    2.4.3 Estimating Covariance
    2.4.4 Regularization
      2.4.4.1 Drop-out
    2.4.5 Feature Scaling
    2.4.6 Domain Randomization
3 Methods
  3.1 Machine Learning Framework
  3.2 Separation into 2D and 3D Problem
  3.3 Datasets
    3.3.1 Synthetic 2D Data
    3.3.2 3D MARG Dataset
  3.4 Model Architectures
    3.4.1 The Recurrent Neural Network
    3.4.2 The Recurrent Neural Filter
  3.5 Loss Function and Error Metrics
    3.5.1 Loss Function
    3.5.2 Error Metrics
  3.6 Benchmarks
  3.7 Experiments
    3.7.1 The 2D Case Experiments
    3.7.2 The 3D Case Experiments
4 Results
  4.1 Results of the 2D Models
  4.2 Results of the 3D Experiments
    4.2.1 Comparison and Benchmarks
    4.2.2 Ablation Study
    4.2.3 Missing Data
5 Discussion
  5.1 Performance and Uncertainty of the Developed ML Models
    5.1.1 Research Methodology
  5.2 Relation and Contribution to the Field
  5.3 Outlook
6 Conclusion
Bibliography

List of Figures

2.1 Illustration of the axis-angle representation. From [29], CC0 1.0.
2.2 Block diagram of a complementary filter estimating angle and angular velocity.
2.3 Illustration of an RNN unfolded through time.
2.4 An illustration of the LSTM layer. Yellow boxes indicate linear layers with activation functions and pink ellipses indicate element-wise operations.
2.5 The Recurrent Neural Filter as presented in [24].
3.1 A 10 s data sample from the test set (OxIOD), showing readings from the accelerometer, gyroscope, and magnetometer in the sensor's body frame.
3.2 Illustration of the RNN architecture evaluated. The green box represents an LSTM layer and the blue boxes represent fully connected layers.
3.3 Illustration of the RNF architecture evaluated. The green boxes represent LSTM layers and the blue boxes represent fully connected layers.
4.1 Plot of a 2D test sequence with orientation estimates from a trained RNN and the benchmark EKF.
4.2 The estimated sequence of quaternions (left) and the estimation error in Euler angles (right) from the RNF and the EKF in 3D.
4.3 Plot showing an orientation sequence and estimation (left), and the estimation error (right) produced by the trained RNF when input data was missing every other time step.
4.4 Plot showing an orientation sequence and estimation (left), and the estimation error (right) produced by the trained RNF when input data was missing after 400 time steps.
5.1 An illustrative example of two univariate state estimators, with different SNR assumptions, applied to two sequences of different length. Ground truth is represented by the blue line, the estimate by the orange, and the measurements by crosses.
5.2 The dynamics covered in the source domain (OxIOD) might not cover the target domain sufficiently.
List of Tables

3.1 The distributions used for domain randomization of the noise parameters. These distributions are sampled for each generated 2D sequence. 1_2 is a vector of length two containing all ones.
4.1 Recorded test results from the trained 2D RNN model versions.
4.2 Recorded test results from the trained 2D RNF model versions.
4.3 Performance comparison of the benchmark EKF and the best instances of the proposed RNN and RNF for the 2D problem.
4.4 Recorded test results from the trained 3D RNN model versions.
4.5 Recorded test results from the trained 3D RNF model versions.
4.6 Comparison of best results in 3D along with the models' execution time.

1 Introduction

In this chapter, the background of the thesis is presented together with a summary of the current state of the field of machine learning-based orientation estimation. The research questions and the contributions of the thesis are then presented, followed by an overview of the outline of the report.

1.1 Background

Sensor fusion is, simply put, the practice of combining measurements from different sources to infer some property with greater certainty than would have been possible had the measurements been considered separately. Sometimes the property of interest is hidden, or latent, and cannot be estimated unless several different measurements are considered together. This inference of latent states from measurements of related quantities has a significant number of application areas and is essential for a range of technologies such as robotics, image processing, navigation, and situation awareness [1],[2],[3].

A commonly used filter in the field of sensor fusion is the Kalman filter (KF) [2],[4], mainly due to its relative simplicity and the optimality properties it provides.
The KF is a recursive Bayesian estimator that is optimal¹ under the conditions that i) the process and measurement noise is white², ii) the exact covariance of the noise is known, and iii) the model is linear and precisely known³. KFs are often tuned via trial and error until adequate performance is achieved [7], and thus the optimal estimate is not necessarily produced. Furthermore, KFs are generally not sufficient in cases with significantly non-linear processes, for which they tend to diverge. Work has been done on filters adapted to deal with cases where conditions i)-iii) above are not met. The extended Kalman filter (EKF) [8] and the unscented Kalman filter (UKF) are both examples of filters designed to deal with non-linear processes [4]. In applications where there is a plant⁴-model mismatch and/or the noise covariance is unknown, it is possible to make use of an adaptive Kalman filter (AKF) [9].

¹ Optimal in the sense that it minimizes the mean squared error [5].
² White noise here refers to noise that is zero-mean, uncorrelated over time, and of finite variance.
³ The KF is still the optimal linear filter even though better non-linear filters may exist [5]. Also, although linear model abstractions are powerful tools, linear systems are very rare in the real world [6].
⁴ Plant here refers to the model of the dynamics of a system.

Sensor fusion in general, and KFs in particular, are important techniques since they can convert noisy sensor data into useful and reliable information. Despite the proven capabilities of classical probabilistic sensor fusion techniques, such as the KF, machine learning (ML) methods have started to transform the field entirely [10]. The fundamental shift from manually describing the dynamics of the system (from which information is inferred) to learning it from examples opens the door to solving problems previously too complex to model [10].
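The predict/update recursion that underlies the KF and its variants can be illustrated with a minimal scalar example. This is a generic textbook sketch, not code from the thesis; the random-walk model and the noise variances `q` and `r` are illustrative assumptions.

```python
# Minimal sketch of the two-step Kalman filter recursion for a scalar
# random-walk state observed in white noise. All numbers (q, r, the
# measurement sequence) are illustrative, not values from the thesis.

def kf_step(x, p, z, q=0.01, r=0.5):
    """One predict/update cycle of a 1D Kalman filter.

    x, p : prior state estimate and its variance
    z    : new measurement
    q, r : assumed process and measurement noise variances
    """
    # Prediction: random-walk model x_k = x_{k-1} + w, w ~ N(0, q)
    x_pred = x
    p_pred = p + q
    # Update: fold in the measurement z = x + v, v ~ N(0, r)
    k = p_pred / (p_pred + r)        # Kalman gain
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new

# Filtering a short synthetic sequence around a constant true state of 1.0:
# the estimate moves toward the truth and the variance p shrinks.
x, p = 0.0, 1.0
for z in [1.2, 0.9, 1.1, 0.95, 1.05]:
    x, p = kf_step(x, p, z)
```

Note how the quality of the result hinges on `q` and `r` being correct, which is exactly the tuning problem discussed above.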
Looking at the progress made with neural networks (NNs) in recent decades, in tasks such as classification, regression, and sequence-to-sequence modeling between domains, it is not far-fetched to imagine employing NNs in filtering applications. Some researchers have headed in this direction, striving to combine NNs with KFs, or performing filtering with NNs without involving KFs at all [11],[12].

A specific sensor fusion problem where such an attempt can be investigated is that of estimating absolute orientation from an inertial measurement unit (IMU) and a magnetometer. This combination of sensors is sometimes referred to as a Magnetic, Angular Rate, and Gravity (MARG) sensor. An IMU alone does not provide enough information to infer absolute orientation. Instead, it has to be inferred from the measurements of acceleration, angular velocity, and an additional corrector, such as the magnetic field direction. Orientation estimation using IMU/MARG data is often done with EKFs, complementary filters, or more recent methods such as the explicit complementary filter (ECF) and the gradient descent based orientation filter (GDOF) [13],[14],[15]. Often the linear acceleration measured by the accelerometer is assumed to be influenced only by gravity, which is a reasonable assumption under many conditions. However, in the instances where the assumption does not hold, it can lead to significant errors in orientation estimation unless dealt with properly. Designing filters that handle the situation where this assumption does not hold is possible and has been done [16]. These filters often adapt the covariance of the measurement when the acceleration magnitude differs from the gravitational acceleration, or omit some measurements altogether. Common to both methods is that they add both computational and implementational complexity.
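The complementary-filter idea mentioned above (and treated further in Section 2.3.1) can be sketched for a single tilt angle: the gyroscope is integrated for high-frequency accuracy, and the accelerometer's gravity direction is blended in to cancel the drift. The blending factor `alpha`, the time step, and the sensor values below are illustrative assumptions, not values from the thesis.

```python
import math

# Minimal sketch of a complementary filter for one tilt angle. The gyro
# path is accurate at high frequency but drifts; the accelerometer path
# is noisy but drift-free; the filter blends the two.

def complementary_step(theta, gyro_rate, ax, az, dt=0.01, alpha=0.98):
    """theta: previous angle estimate [rad]; gyro_rate: angular rate [rad/s];
    ax, az: specific-force components used to recover tilt from gravity."""
    theta_gyro = theta + gyro_rate * dt    # high-frequency path: integrate gyro
    theta_acc = math.atan2(ax, az)         # low-frequency path: gravity direction
    return alpha * theta_gyro + (1 - alpha) * theta_acc

# A motionless sensor tilted 0.1 rad: even with a wrong initial estimate
# and a small gyro bias, the estimate converges toward the accelerometer
# angle instead of drifting away.
theta = 0.5
for _ in range(500):
    theta = complementary_step(theta, gyro_rate=0.002,  # small gyro bias
                               ax=math.sin(0.1) * 9.81, az=math.cos(0.1) * 9.81)
```

This also shows the weakness discussed above: `theta_acc` is only a valid tilt reference while the gravity-only assumption on the accelerometer holds.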
1.1.1 Related Work

There exists a variety of implementations that use NNs to replace, or work in unison with, the KF and versions thereof. Such implementations vary in architecture, application area, and overall approach. The applications found to be most relevant for this thesis are described below.

Neural Network Moderated Kalman Filter

Reference [12] proposes an NN working in conjunction with a KF. The NN acts as a supervisor to monitor and improve the prediction accuracy of the KF algorithm. In an experiment this setup is used to predict actual temperature from noisy sensor data, and the NN is used to estimate the extent of error in the current readings and to update the process error covariance in the KF. Compared to a stand-alone KF, where the process error covariance is static, the proposed implementation was able to reduce the root mean squared error (RMSE) of the prediction noticeably.

Neural Network as Extended Kalman Filter Substitute

Another example is given in reference [17], where a sensorless approach to estimating the shaft speed of a DC motor for closed-loop control using an NN is presented. The performance of the proposed NN algorithm was compared with that of an EKF and found to be better.

Deep Kalman

In reference [18], a hybrid solution called the Deep Kalman Filter is presented. The key purpose of the filter is to learn the system model, and the application area of the study was improved IMU positioning estimation through IMU error modeling. The IMU error was modeled using long short-term memory (LSTM) layers. This LSTM-based IMU error modeling is added to the filter alongside the standard filtering steps: prediction and update. The study showed promising results compared to conventional filtering with an EKF.

OriNet

OriNet is a recently proposed network based on recurrent neural networks (RNNs) that provides an orientation estimate from a single IMU sensor [19].
The network is trained with data from small flying robots⁵ and is the first network to provide an end-to-end ML solution for orientation estimation. The architecture of OriNet consists of two parts. The first part combines four LSTM channels and fuses the learned feature representations of the channels in a slow fusion gate [19]. The second part consists of another LSTM and fully connected gates that output the difference between the quaternion of the last time step and that of the present time step [19]. Allegedly⁶, OriNet improved orientation estimation accuracy by roughly 80% compared to reference [20] on the same dataset.

Sensor De-noising Network

The authors of reference [22] propose a method of de-noising IMU readings using a convolutional neural network (CNN). The achieved results are of such quality that, using dead reckoning on the data de-noised by their model, they still manage to outperform OriNet on test data from the same dataset.

⁵ The authors of reference [19] note that the data they use makes the assumption of stationary periods non-applicable.
⁶ The performance of OriNet is reported in comparison with the quaternion-based complementary filter presented in [20], alongside the previously mentioned GDOF. However, their presented results for the GDOF are seemingly extremely poor, something that does not coincide with the performance of the GDOF in other literature. Further, the GDOF is not intended to estimate absolute orientation in 3D without access to magnetometer measurements [14],[21], but to the best of our knowledge, this is exactly what was done in the OriNet paper. As a consequence, the 80% improvement is questionable even though the method is still promising.

IONet

Yet another example of an attempt to fuse sensor data with NNs is IONet⁷ [11], which estimates odometry from IMU readings. IONet is based on bi-directional LSTMs and performs windowed smoothing⁸ rather than filtering.
Recurrent Neural Filter

The recurrent neural filter (RNF) presented in [24] is an autoencoder/decoder architecture based on LSTMs. The RNF takes inspiration from conventional recursive Bayesian estimators such as the KF, where the update step and the prediction step are performed separately. The encoder portion of the architecture comprises LSTM layers responsible for the different Bayesian filter steps, while the decoder consists of a feed-forward NN. One of the significant novel contributions of this approach concerns the memory of the LSTM layers: instead of each layer using its own separate hidden state, the hidden and cell states of the LSTMs are shared horizontally between the filter steps. To a large extent this makes the structure similar to the distinct steps of the KF; the hidden state shared by the LSTM layers is, in this analogy, equivalent to the state estimate in the KF. A strong motivation for using the RNF structure is that it enables filtering when the system dynamics are unknown while still handling missing data points.

Field Prospects

To date, and to our knowledge, no studies proposing end-to-end ML solutions for estimation of absolute orientation using IMU/MARG data have been published. The previously mentioned IONet [11], OriNet [19], and the sensor de-noising network [22] all tackle orientation estimation relative to an initial orientation. Reference [10] concludes that the first direction of future studies within the field of data fusion with ML methods should be to explore more application areas. Equally important, also according to [10], artificial NNs are particularly suitable for modeling various non-linear relationships that are difficult to express using hand-crafted models.

1.2 Problem Statement and Aim

Although the KF and its adaptations for nonlinear systems (the EKF and UKF) are sufficient in many applications, they have some drawbacks.
As a first accessible example, the tuning is typically performed by weighting the diagonal elements of the covariance matrices. This tuning is often done arbitrarily until a "good enough" result is achieved [7]. This is, of course, implementation-dependent, and if the performance satisfies all demands the filter will be useful. Nevertheless, this manner of tuning does not ensure that the best possible solution is utilized; finding that solution would in turn be too time-consuming.

⁷ The authors behind IONet also published OxIOD [23], a dataset with IMU readings and ground truth (GT) estimates from a vision system.
⁸ Smoothing is the filtering of data with access to both past and future measurements. Windowed smoothing means smoothing over a fixed-length window, without sharing the hidden state between windows.

More severe is the Markovian assumption⁹ present in Bayesian filters such as the KF [18]. Even though this assumption simplifies the system modeling noticeably and makes the KF efficient, it also makes it insensitive to system behavior with longer correlation times. On the whole, complicated error models with correlation over long periods are neglected in the KF.

Further, a specific dynamic system can often operate in different regions of the dynamic state space, which can be hard to model in one state estimator with classical models. In some applications this is solved by dividing the dynamics into different modes, where each mode approximately describes a different region. A straightforward example of this is the trade-off between modeling a car with a constant velocity¹⁰ or a coordinated turn¹¹ model. Both of these models capture some aspects of the dynamics, but depending on whether the car is turning or traveling approximately straight, their ability to describe the dynamics in the current mode will differ.
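The two motion models in the car example can be sketched as one-step state propagators; which one describes the data better depends on the current driving mode. Both functions below are generic textbook discretizations with illustrative parameter values, not models used in the thesis.

```python
import math

# Sketch of the two motion models for a planar vehicle state
# (x, y, heading, speed). Each propagates the state one time step;
# dt and the state values are illustrative assumptions.

def constant_velocity_step(x, y, heading, v, dt=0.1):
    """Constant velocity: straight-line motion. Any actual turning
    ends up as model error, to be absorbed by the process noise."""
    return (x + v * math.cos(heading) * dt,
            y + v * math.sin(heading) * dt,
            heading, v)

def coordinated_turn_step(x, y, heading, v, omega, dt=0.1):
    """Coordinated turn: constant speed along a circle segment with
    turn rate omega. Changes in radius/speed become model error."""
    return (x + (v / omega) * (math.sin(heading + omega * dt) - math.sin(heading)),
            y + (v / omega) * (math.cos(heading) - math.cos(heading + omega * dt)),
            heading + omega * dt, v)

straight = constant_velocity_step(0.0, 0.0, 0.0, 10.0)           # heads along x
turning = coordinated_turn_step(0.0, 0.0, 0.0, 10.0, omega=0.5)  # curves off x
```

A multiple-model filter would run both propagators in parallel and weight their predictions; a single monolithic model has to cover both regimes at once.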
There exist approaches for systems that exhibit this kind of multifaceted dynamics, which exploit the fact that describing the dynamics of the modes separately requires less effort than modeling every aspect of interest in one monolithic model. Examples of filters that do this are Interacting Multiple Model filters and Switching Kalman Filters [25]. However, the manual labor, the required knowledge about the system dynamics, and the computational burden are somewhat prohibitive when using these filter alternatives.

Finally, when modeling the state error distributions in the KF, EKF, and UKF, one is limited to additive noise with multivariate white Gaussian distributions. The KF is also limited to modeling measurement and process noise as additive, although the EKF and UKF can, with minor modifications, be designed to deal with non-additive noise as well. Filter performance is generally adversely affected when the noise is not distributed in the way modeled, and outlier measurements¹² can have a large impact. An example of a method that explicitly deals with non-Gaussian distributions is the famed particle filter [26], which comes at the cost of computational expense.

This thesis aims to benefit from the applicable traits of ML, particularly NNs, in order to estimate absolute orientation using IMU and MARG sensor data. In theory, an ML approach will be able to capture altering system dynamics and noise distributions, as well as exploit system behavior correlated over time.

⁹ Given information about the most recent state of a system with the Markov property, information about any previous state will be uninformative [6].
¹⁰ A constant velocity model assumes constant velocity. The acceleration's effect on velocity is modeled as noise.
¹¹ A coordinated turn model assumes constant velocity along a circle segment. Changes in radius and velocity are modeled as noise.
¹² Outliers are samples that deviate from the distribution of the rest of the population. They can be due to natural variance in the population, or to some error in the observation.

1.2.1 Research Questions

• Is the application of machine learning approaches for sensor fusion purposes a viable alternative? Can indications of this, from related research, be supported?
• How well does an NN-based filter compare with an EKF in terms of quality (RMSE) and efficiency (computational expense) when estimating orientation from noisy IMU data?
• Can a measure of the certainty of the produced estimates, similar to the covariance of the EKF estimates, be found for an NN-based filter?

1.3 Delimitations

This thesis is limited to the problem of absolute orientation estimation with IMU and MARG sensors. Hence, there will be no attempt to estimate absolute position even if spatial translations are present, and no additional sensor data will be used. Further, the developed algorithms are benchmarked against a traditional EKF, and not against any of the mentioned extensions that address issues present in the EKF.

Furthermore, reference [10] establishes a list of criteria that a data fusion method ideally should fulfill, proposed to make a comprehensible evaluation of different data fusion algorithms more accessible. Based on the entries in that compilation, this thesis is limited to evaluating the developed models in terms of quality, efficiency, and robustness, tested on real-world data. Thus, evaluation of properties related to privacy, extensibility, and stability is outside the scope of this thesis.

1.4 Contribution

The primary contribution of this study is to suggest an ML method that estimates absolute orientation using IMU/MARG data. Secondly, we hope this thesis will contribute to extending the field of sensor fusion with ML and showcase the suitability of utilizing ML for data filtering.
1.5 Thesis Outline

This thesis consists of six chapters. The current chapter, the introduction, is the first of the six. In the second chapter, relevant theory is presented; more specifically, theory on orientation representations, IMUs, and ML is covered. In the third chapter, the methods used in this thesis are motivated and accounted for. In chapter four the results are compiled. In the fifth and sixth chapters, the presented results are first interpreted and discussed, and subsequently conclusions are drawn.

2 Theory

2.1 Representing Orientation

The mathematical description of orientation is based on attaching different frames of reference to the objects of interest. A simple example is two frames of reference: the body frame b and the world frame i. The body frame is fixed to the object whose orientation is to be modeled, and the world frame is stationary. The conversion between the two frames is most easily represented with a rotation matrix R^{ib} relating the vectors v in the two frames,

    v^i = R^{ib} v^b.    (2.1)

In equation (2.1), v^i is a vector in the world frame, and v^b is the same vector represented in the body frame. The rotation matrix R^{ib} applies the rotation in Euclidean space from frame b to frame i. Any rotation can be described using three consecutive elemental rotations. Thus, the rotation matrix R, parametrized as

    R = R_x(ϕ) R_y(ϑ) R_z(ψ),    (2.2)

can also describe any rotation in 3D space. R_x, R_y, and R_z are elemental rotations about the axes x, y, and z with angles ϕ, ϑ, and ψ respectively. This parametrization is referred to as Euler angles or Tait-Bryan angles. In some application fields the angles are called roll, pitch, and yaw (ϕ, ϑ, and ψ). Further, the parametrization of a rotation matrix into single-axis rotations has two different possible interpretations, depending on whether an intrinsic or extrinsic rotation convention is used.
As a result, the rotation matrix R can represent either the extrinsic rotation about the fixed axes z, y, x in that specific order, or the intrinsic rotation about the axes of rotation x′, y′, z′. Furthermore, applying the rotation gives different results in the two cases. From here on, only extrinsic representations will be used.

The use of Euler angles is appealing because they are somewhat visualizable and perhaps even the most intuitive representation for human comprehension. However, Euler angles bring ambiguities in the form of gimbal lock, which occurs when no unique solution can be found for the Euler representation. Formatting a machine learning problem in such a way that inputs map to outputs 1:1 makes modeling the output distribution as Gaussian viable. In problems where one input can map to multiple correct outputs, the usage of sum-of-squared-errors loss functions tends to result in models that predict the mean [27]. This motivates the employment of a quaternion representation, which is free of the gimbal lock problem.

Figure 2.1: Illustration of the axis-angle representation. From [29], CC0 1.0.

A quaternion q with basis {1, i, j, k} is expressed as

    q = a + bi + cj + dk.    (2.3)

The real part of the quaternion, a, is often referred to as the scalar part, and the imaginary part, bi + cj + dk, is called the vector part of the quaternion. Quaternions that are used to represent orientation should be of unit length, i.e. |q| = 1 should be satisfied. A quaternion can be seen as an encoding of an axis-angle representation. The quaternion can then be interpreted as

    q = cos(θ/2) + e_x sin(θ/2) i + e_y sin(θ/2) j + e_z sin(θ/2) k,    (2.4)

where e = [e_x e_y e_z]^T is the unit vector that gives the direction of the axis around which the angle θ is applied.
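Equation (2.4), and the canonization discussed below, can be sketched directly; the axis and angle chosen here are illustrative.

```python
import math

# Sketch of equation (2.4): building a unit quaternion (a, b, c, d) from
# an axis-angle pair, plus canonization (flipping the sign so the scalar
# part is non-negative, since q and -q encode the same rotation).

def quat_from_axis_angle(e, theta):
    """e: unit rotation axis (ex, ey, ez); theta: rotation angle [rad]."""
    s = math.sin(theta / 2.0)
    return (math.cos(theta / 2.0), e[0] * s, e[1] * s, e[2] * s)

def canonize(q):
    """Resolve the q / -q ambiguity by keeping the scalar part >= 0."""
    return q if q[0] >= 0 else tuple(-x for x in q)

q = quat_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2)   # 90 deg about z
norm = math.sqrt(sum(x * x for x in q))                  # unit by construction
```

Because the axis is unit length and sin²+cos²=1, the constructed quaternion satisfies the |q| = 1 constraint without explicit re-normalization.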
An illustration of the axis-angle rotation can be viewed in Figure 2.1. Note that keeping the unit constraint on the quaternion also ensures that the axis e remains of unit length, which in turn keeps the represented rotation interpretable [28]. The unit constraint on orientation quaternions is one of the few drawbacks of this representation. The constraint demands re-normalization after any operation that is not guaranteed to preserve it, and it can also bring difficulties in optimization algorithms since the unit-norm constraint is quadratic [28]. Furthermore, the quaternion representation is redundant in that q and −q represent the same rotation. To explain, a rotation of magnitude θ around an axis a is equal to a rotation of −θ around the axis −a pointing in the opposite direction. To remove this redundancy it is possible to use canonized quaternions, in which the scalar part of the quaternion is always non-negative.

There are a few ways in which a metric of the error between two quaternions can be expressed. In this thesis, we will have an estimate q̂ of the ground truth orientation quaternion q. A metric of how close the estimate is to the ground truth is necessary for evaluating different models' performance. As a first alternative, the Euclidean distance between two 3D orientations, δq, can be used [30],

    δq = min(‖q − q̂‖, ‖q + q̂‖),    (2.5)

where ‖·‖ is the 2-norm and δq ∈ [0, √2]. A second alternative is the quaternion orientation error, or difference rotation quaternion [31],

    q_e = q ⊕ q̂⁻¹.    (2.6)

Here, ⊕ denotes the Hamilton product [32] and q̂⁻¹ is the quaternion inverse, which for unit quaternions equals the conjugate. Note that the difference rotation quaternion is still a unit quaternion and, as the name indicates, is interpreted as the rotation difference between q and q̂.
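The two metrics can be sketched as follows, assuming quaternions stored as `(a, b, c, d)` tuples; the Hamilton product and conjugate formulas are standard, and the test orientation is illustrative. Note in particular how q and −q correctly yield zero distance.

```python
import math

# Sketch of equations (2.5) and (2.6): the Euclidean orientation distance
# and the difference rotation quaternion, for unit quaternions (a, b, c, d).

def hamilton(p, q):
    """Hamilton product p ⊕ q of two quaternions."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

def conj(q):
    """Conjugate; equals the inverse for unit quaternions."""
    return (q[0], -q[1], -q[2], -q[3])

def euclidean_dist(q, q_hat):
    """Equation (2.5): min over the q / -q sign ambiguity."""
    d_minus = math.sqrt(sum((x - y) ** 2 for x, y in zip(q, q_hat)))
    d_plus = math.sqrt(sum((x + y) ** 2 for x, y in zip(q, q_hat)))
    return min(d_minus, d_plus)

q_true = (math.sqrt(0.5), 0.0, 0.0, math.sqrt(0.5))
q_est = tuple(-x for x in q_true)        # -q represents the same rotation
d = euclidean_dist(q_true, q_est)        # 0: the orientations are identical
q_e = hamilton(q_true, conj(q_est))      # equation (2.6); identity up to sign
```

For identical orientations the difference quaternion is the identity rotation (up to the q/−q sign), i.e. its scalar part has magnitude 1 and its vector part vanishes.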
As an extension of the difference rotation quaternion, the angle of the difference rotation θ_qe can also be calculated. According to equation (2.4) the scalar part of q_e will be equal to cos(θ_qe/2) and further, by expansion of equation (2.6), the scalar part of q_e will also be equal to the dot product of q and q̂, which gives [30]

θ_qe = 2 arccos(|q · q̂|), (2.7)

where θ_qe ∈ [0, π]. In conclusion, we have the Euclidean distance δq, the difference rotation quaternion q_e, and finally, the angle of the difference quaternion θ_qe as possible error metrics. None of these alternatives is directly comprehensible, and as a complement, one can also calculate the error in Euler angles from the difference rotation quaternion. The error in Euler angles is primarily meant to act as a visualization tool, and is most easily retrieved by simply transforming q_e to Euler angles. The transformation from quaternions to Euler angles is, for example, derived in [28].

2.2 The Inertial Measurement Unit

A device containing a three-axis gyroscope and a three-axis accelerometer is commonly referred to as an IMU. There is a wide range of IMU application areas, such as robotics, navigation systems and virtual reality [33]. Much of the widespread use of IMUs is a result of the microelectromechanical system (MEMS) technology with which IMUs have become both space efficient and cheap [9]. The gyroscope measures angular velocity while the accelerometer measures specific force, i.e. acceleration relative to free-fall. Specific force is different from the more conventional linear acceleration: the specific force of a stationary object on the surface of the earth will be a constant 1 g upwards, while the linear acceleration in the same case will be zero.

When using IMUs in applications to estimate position and orientation, one major concern is the presence of integration drift [14].
Since the measurements are derivatives of the desired observation, the usage of dead reckoning, where the state is tracked by integrating the derivatives, is appealing. However, measurements are imperfect and the raw sensor readings will include both noise and biases. Integrating these measurements will lead to accumulation of errors over time. A signal with a constant error, integrated once, will lead to an estimation error that grows linearly. Integrating once more, the error will exhibit a quadratic growth over time [34]. In orientation estimation, assumptions about periods of zero linear acceleration are common, for the purpose of utilizing the accelerometer data to act as a corrector for the roll and pitch angles and thus prevent the drift caused by integrating the gyro. Still, the drift issue remains for the yaw angle, which necessitates the use of additional sensors such as a magnetometer or a global navigation satellite system (GNSS) in order to include a correctional term for the remaining angle. However, magnetometers are sensitive to disturbances from ferromagnetic objects and electrical devices in their surroundings [16], while GNSS performs poorly when the satellite signal is obstructed. Thus, neither is suitable in every application. In contrast, the measurements from the gyroscope and accelerometer are independent of any such environmental attributes [35]. Alongside the drift issue, IMUs also suffer from complicated error sources that have computational (numeric range and resolution), random, as well as systematic origins [18]. Generally, the collected IMU error sources are [36]:

• Input Range
The input range is the maximum magnitude the IMU can measure. This error source is application dependent and needs to be considered in the design process rather than in the fusion process, specifically if the application has a high dynamic range and there is a risk of saturation.
Also to be considered is the amount of vibration, since substantial vibrations can lead to saturation if the device is close to its dynamic-range border. Another aspect of the input range is the bandwidth, which can cause aliasing.

• Bias Repeatability [deg/hr], [m/s²]
Due to a number of different influences, such as altering physical properties and initial conditions, the initial bias of the IMU will be different each time the device restarts. However, if the bias is very similar on each restart (i.e. the repeatability is high), the bias can to some extent be compensated for by tuning the device. Bias repeatability is frequently also referred to as Run-to-Run or Turn-On to Turn-On bias.

• Bias Stability [deg/hr/hr], [m/s²/hr]
During the IMU run-time the initial bias, which is sometimes simplified as a constant, changes. Generally, this is caused by changing temperature or mechanical stress, and it is common that manufacturers add temperature compensation to the IMU, which increases the stability. This error source also goes by the name In-Run Bias.

• Scale Factor [ppm]
The scale factor error describes the ratio between the sensed input and measured output of the sensor. The ideal ratio is, of course, one-to-one, but when this is not the case, the output is proportional to the input with a scale factor. Additionally, the scale factor also has a non-linear part that can be described separately, but often is described with one collective value.

• Sensor Misalignment [mrad]
The six degrees of freedom (DoF) IMU has three gyroscopes and three accelerometers set orthogonal to each other. The placement is, however, imperfect, causing correlation between sensors. Sensor misalignment errors originate both from non-orthogonality between accelerometers or gyroscopes (within sensor sets) as well as between sensor sets. There is also the phenomenon of package misalignment, where the attachment of the IMU is misaligned with the object that it is attached to.
• G Dependency [deg/hr], [m/s²]
When a Coriolis-effect-based gyroscope is subjected to acceleration along its oscillation axis, the acceleration will have an impact on the sensor readings. This effect is called G dependency, because gravity is often the main source of acceleration.

• Latency
When IMU sensors are aided by cooperating sensors, latency can arise. In other words, the latency error source emerges when the IMU and the aiding sensor both sense the same motion but their timing disagrees. If this timing disagreement is substantial, this error can become problematic.

2.3 Conventional IMU Data Fusion

There are many possible filter implementations that can be used to estimate the absolute orientation from IMU or MARG data. In this section two approachable ways will be covered.

2.3.1 The Complementary Filter

The complementary filter provides a straightforward solution to attain orientation estimates from IMU data. The accelerometer is a good indicator of orientation under the assumption of stationary, or at least slow-moving, conditions. Furthermore, even though the accelerometer output is noisy, it is accurate over long periods of time and does not suffer from accumulating drift when estimating orientation [16]. In contrast, the gyroscope provides a reliable measurement of the angular rate at each time step, but integration will cause errors to accumulate over time. Hence, the accuracy properties of the accelerometer and the gyroscope complement each other to assure accuracy on long and short timescales, respectively [16]. The main idea of the complementary filter is to low-pass filter the angle estimates from the long-term reliable accelerometer readings while high-pass filtering the integrated estimates from the short-term reliable gyroscope readings. Despite the relatively simple implementation, the complementary filter can provide smooth and accurate estimates and demands less computation than, for instance, the EKF [37].
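A minimal one-dimensional sketch of this idea, blending the integrated gyro rate with the accelerometer-derived angle (the blending constant `alpha` and the sample time are illustrative assumptions, not values from this thesis):

```python
import numpy as np

def complementary_filter(acc_angle, gyro_rate, dt=0.01, alpha=0.98):
    """1D complementary filter: high-pass the integrated gyro rate,
    low-pass the accelerometer angle. alpha close to 1 trusts the gyro
    on short timescales and the accelerometer on long ones."""
    theta = acc_angle[0]
    estimates = []
    for a, w in zip(acc_angle, gyro_rate):
        # blend integrated gyro (short-term) with accelerometer (long-term)
        theta = alpha * (theta + w * dt) + (1.0 - alpha) * a
        estimates.append(theta)
    return np.array(estimates)
```

Because the accelerometer term keeps a small but persistent weight (1 − alpha), a constant gyro bias only produces a bounded offset instead of an unbounded drift.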
The structure of this filtering technique can be viewed, more in-depth, in Figure 2.2. Further, the complementary filter using only an IMU can only estimate roll and pitch, but given MARG measurements the complementary filter can also estimate yaw; such an implementation can be found in [16].

Figure 2.2: Block diagram of a complementary filter estimating angle and angular velocity.

2.3.2 The Extended Kalman Filter

As mentioned in the introduction, the EKF is a commonly used filter in the field of sensor fusion, with the aim to take nonlinear effects into account [16]. The EKF is a two-step algorithm, constituted of the prediction and update steps. The non-linearity of the EKF lies in the motion model f(·) and the measurement model h(·), which can be non-linear as long as they are differentiable. Here, in equations (2.8) and (2.9), F_k and H_k are defined as the Jacobians of f(·) and h(·) with respect to the state x. The Jacobians are evaluated at the current state estimate, and in the case of F_k also at the current control signal.

F_k = ∂f/∂x |_{x̂_{k−1|k−1}, u_k} (2.8)

H_k = ∂h/∂x |_{x̂_{k|k−1}} (2.9)

Prediction step

State prediction:
x̂_{k|k−1} = f(x̂_{k−1|k−1}, u_k). (2.10)

Covariance prediction:
P_{k|k−1} = F_k P_{k−1|k−1} F_k^T + Q. (2.11)

In the prediction step, the predicted state vector x̂_{k|k−1} is computed via the propagation of the best state estimate from the previous time step, x̂_{k−1|k−1}, and the control vector of the current time step, u_k, through the motion model f(·) as given in equation (2.10) [38]. Also in the prediction, the covariance prediction matrix P_{k|k−1} is calculated as in equation (2.11), where Q denotes the process noise covariance matrix [38].
Update step

Innovation:
v_k = y_k − h(x̂_{k|k−1}) (2.12)

Innovation covariance:
S_k = H_k P_{k|k−1} H_k^T + R (2.13)

Kalman gain:
K_k = P_{k|k−1} H_k^T S_k^{−1} (2.14)

State update:
x̂_{k|k} = x̂_{k|k−1} + K_k v_k (2.15)

Covariance update:
P_{k|k} = P_{k|k−1} − K_k H_k P_{k|k−1} (2.16)

The goal of the update step is to calculate the best state estimate and covariance given new observations. First, the innovation v_k is attained as in equation (2.12) and can be understood as the difference between the new observation y_k and the expected observation given the state prediction x̂_{k|k−1}. Second, the innovation covariance S_k is calculated as in (2.13), where R is the measurement noise covariance matrix [38]. Third, the Kalman gain K_k is obtained according to (2.14). The Kalman gain decides the relative impact of the innovation compared to the state prediction, which can be viewed in equation (2.15), where the new updated state estimate x̂_{k|k} is computed. Finally, the covariance is also updated, as in equation (2.16).

Up until now, the general steps of the EKF have been described, and in order to adapt the EKF to produce orientation estimates, a number of design choices are required. Firstly, the state vector can be designed with different orientation conventions, such as Euler angles or unit quaternions. Moreover, the state can be composed of either only the actual orientation, or the orientation and the rotation rate, or be augmented to also allow estimation of sensor biases. Then, depending on the design of the state vector, motion and measurement models have to be developed for the chosen state representation. Furthermore, the covariance matrices Q and R introduce tuning parameters. The measurement noise covariance R can usually be set based on the uncertainties of the sensors. The process noise covariance Q is trickier, however, and should reflect the certainty of the chosen motion model. The ratio between the traces of the two matrices also affects the performance.
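Equations (2.8)–(2.16) can be collected into one compact iteration. In this sketch the models f, h and their Jacobians F, H are supplied by the caller; all names are generic placeholders rather than the specific design used in this thesis:

```python
import numpy as np

def ekf_step(x, P, u, y, f, h, F, H, Q, R):
    """One EKF iteration following equations (2.8)-(2.16)."""
    # Prediction step, eqs (2.10)-(2.11)
    Fk = F(x, u)
    x_pred = f(x, u)
    P_pred = Fk @ P @ Fk.T + Q

    # Update step, eqs (2.12)-(2.16)
    Hk = H(x_pred)
    v = y - h(x_pred)                      # innovation
    S = Hk @ P_pred @ Hk.T + R             # innovation covariance
    K = P_pred @ Hk.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ v
    P_new = P_pred - K @ Hk @ P_pred
    return x_new, P_new

# Trivial 1D linear example: identity motion and measurement models.
x, P = ekf_step(np.array([0.0]), np.eye(1), None, np.array([1.0]),
                lambda x, u: x, lambda x: x,
                lambda x, u: np.eye(1), lambda x: np.eye(1),
                0.01 * np.eye(1), np.eye(1))
```

With equal prior and measurement covariances, the update lands roughly halfway between prediction and observation, illustrating how the Kalman gain weighs the innovation.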
The ratio tr(Q)/tr(R) is sometimes referred to as the signal-to-noise ratio (SNR), and has large implications on the convergence rate [5]. The specific design choices made in this thesis are stated in the methods chapter, Section 3.6, but as can be understood, there exists a variety of possible implementations. A few examples of implementations which adopt unit quaternions as the orientation encoding can be found in [13], [16], [39].

2.4 Machine Learning and Neural Networks

NNs have recently become the most prominent approach in ML due to their ability to model complex relationships in data. An NN consists of a layered structure of operations that maps input features to some output. One of the simplest NN layers is the linear feed-forward layer. The linear feed-forward layer performs a linear mapping ℝⁿ → ℝᵐ of a feature vector x ∈ ℝⁿ. It does this by taking the matrix-vector product of a weight matrix W ∈ ℝ^{m×n} and x and adding a bias term b ∈ ℝᵐ,

z(x, W, b) = Wx + b. (2.17)

However, layering many of these linear feed-forward layers is of no interest, as they can be reduced to a single layer operation. Hence, to model more complex relationships, a key feature of NNs is their non-linearities. In the context of NNs, non-linearities are referred to as activation functions¹, since the general feature of the non-linearity is (typically) to suppress small input and let through large input, i.e. be inactive or activate. One such function is the sigmoid function,

σ(z) = e^z / (e^z + 1), (2.18)

which will be close to zero for z < 0 and close to 1 for z > 0. The output of an activation function is referred to as an activation,

a = σ(z(x, W, b)). (2.19)

Alternating linear feed-forward layers and activation functions gives rise to models that, if the different weight matrices and bias vectors are carefully selected, can model almost any relationship. NNs consisting of only linear feed-forward layers and activation functions are sometimes referred to as multi-layer perceptrons (MLPs).
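A forward pass through such an MLP, alternating equation (2.17) with the sigmoid activation (2.18)–(2.19), can be sketched as follows (the layer sizes and random weights are purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return np.exp(z) / (np.exp(z) + 1.0)     # equation (2.18)

def dense(x, W, b):
    return W @ x + b                         # equation (2.17)

def mlp_forward(x, layers):
    """Forward pass through an MLP: alternate linear layers with
    sigmoid activations. `layers` is a list of (W, b) pairs."""
    a = x
    for W, b in layers:
        a = sigmoid(dense(a, W, b))          # equation (2.19)
    return a

# A small 2 -> 3 -> 1 network with arbitrary weights:
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((3, 2)), np.zeros(3)),
          (rng.standard_normal((1, 3)), np.zeros(1))]
y = mlp_forward(np.array([1.0, -1.0]), layers)
```

Note that without the `sigmoid` call the two layers would collapse into a single matrix product, which is exactly why the non-linearity is essential.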
The most common way of selecting weights and biases is to randomly initialize them, and subsequently tune them using stochastic gradient descent (SGD) based optimization techniques, utilizing gradients computed with back propagation [40]. This is called training and is often done using examples of input with known desired output, which is referred to as supervised learning. A specific SGD-based optimization technique, which has been widely adopted since its release in 2014, is the ADAM optimizer, whose key feature is adaptive moment estimation [41]. A loss function is a measure of the error between the model output f(x, W) and the desired output y, L(f(x, W), y). It is this function that is minimized with respect to the model parameters (i.e. W and b) during training. Minimizing L with SGD-based methods requires the computation of ∂L/∂W, which is done with back propagation. The model weights are updated by subtracting the gradient, scaled by some factor. The scaling factor applied to the gradient is referred to as the learning rate. In regression models, some simple examples of loss functions are the mean square error and the mean absolute error. Apart from the mentioned sigmoid activation function,

tanh(z) = 2/(1 + e^{−2z}) − 1, (2.20)

and the Rectified Linear Unit,

ReLU(z) = max(0, z), (2.21)

will be used in the following sections.

¹The term activation function stems from the fact that artificial NNs are inspired by biological NNs. Biological neurons fire, or activate, when they are excited sufficiently to overcome inhibitory input. Likewise, an artificial neuron modeled as σ(z(x, W, b)) will fire if weighted input excites the artificial neuron sufficiently.

2.4.1 Recurrent Neural Networks and Long Short-Term Memory

While MLP networks are static networks, unaffected by the input of the previous time step, RNNs are able to process time series information [42].
RNNs achieve this by using a hidden state h which stores information about the previous inputs x to the network, enabling the detection of temporal correlations between events separated in time [43]. This ability motivates why RNNs are useful in forecasting applications using sequence data. In short, RNNs add memory to the system, and in the most simple structure of an RNN, the activation a at time t is available to the same layer at the next time step, as illustrated in Figure 2.3.

Figure 2.3: Illustration of an RNN unfolded through time.

As mentioned, NNs are often trained using SGD-based optimization methods. Due to the hidden state, the gradient methods need to be adjusted, and one well-established way to train RNNs is to use Back Propagation Through Time (BPTT) [44]. The problem with BPTT, for RNNs, is that the gradient forms a product of Jacobian matrices for the different time steps [43]. Hence, if the elements in the Jacobians are less than one, this will cause the norm of the gradient to decrease exponentially towards zero. Unfortunately, the opposite can also occur, and the norm of the gradient will then grow exponentially. These issues make it impractical to train RNNs and are referred to as the vanishing and exploding gradients problems, respectively [43].

As a solution to the problem with vanishing as well as exploding gradients, the LSTM unit was proposed in 1997 [45]. In the original paper the LSTM unit consisted of two multiplicative gate units, namely the input gate and the output gate. Shortly after that, in 1999 [46], an additional forget gate was proposed, and today the architecture where each LSTM unit contains all three gates is the most common. The version with a forget gate is shown in Figure 2.4.

Figure 2.4: An illustration of the LSTM layer. Yellow boxes indicate linear layers with activation functions and pink ellipses indicate element-wise operations.
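For reference, the simple RNN recurrence of Figure 2.3, where the previous activation is fed back into the layer, can be written as a few lines (weight names and sizes are illustrative assumptions):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a simple RNN: the previous activation h_prev
    is fed back together with the new input x_t."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def rnn_unroll(xs, h0, W_x, W_h, b):
    """Unfold the recurrence through time over a sequence of inputs."""
    h, hs = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        hs.append(h)
    return np.array(hs)
```

The repeated multiplication by W_h inside the unrolled loop is the source of the product of Jacobians discussed above, and hence of the vanishing and exploding gradients.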
At each time step the LSTM unit is fed information from three sources: i) the input vector x_t, ii) the hidden state vector of the last time step, h_{t−1}, which is also the output of the last time step, and iii) the cell state from the last time step, c_{t−1}. In the original paper [45], the cell state is referred to as the constant error carousel, which runs back and creates a connection to the past. The most important functions of the cell state are, firstly, that it can store information over arbitrarily long periods of time, which gives the LSTM its memory capabilities. Secondly, by construction, the gradient of the cell state has an additive property, instead of multiplicative as in the case of the original RNN, which prevents the gradient from vanishing during backpropagation [44]. As mentioned, the LSTM unit includes the three multiplicative gates, whose purposes are to preserve and regulate the cell state. Firstly, the activation from the forget gate f_t enables the network to learn what information to forget and what to remember. The activation is given by [46]

f_t = σ(W_f [x_t; h_{t−1}] + b_f), (2.22)

where W_f is the learned weight matrix and b_f the learned bias of the forget gate. The σ-function assures that the activation f_t will be bounded between 0 and 1. Hence, when f_t is element-wise multiplied with the cell state, the value of f_t determines how much of each element in c_{t−1} will be remembered. One can say that the forget gate's activation can reset the cell state [44].

Secondly, the input gate activation vector i_t determines from which elements of the current input x_t information should be added to the cell state. The input gate activation vector is given by

i_t = σ(W_i [x_t; h_{t−1}] + b_i), (2.23)

which is similar to the calculation of f_t, only with a different weight matrix and bias vector. The input gate activation is then multiplied with the input activation vector c̃_t, which contains information from the current input.
c̃_t = tanh(W_c [x_t; h_{t−1}] + b_c) (2.24)

In short, the input gate and c̃_t enable writing to the memory of the LSTM, and altogether, equations (2.22), (2.23) and (2.24) determine what the updated cell state will be [45], [46]:

c_t = f_t · c_{t−1} + i_t · c̃_t. (2.25)

Here, "·" implies element-wise multiplication. Lastly, the activation of the output gate o_t decides what information from the current cell state to output, and consequently enables reading from the cell state:

o_t = σ(W_o [x_t; h_{t−1}] + b_o) (2.26)

Moreover, the new hidden state, which is also the output of the current time step, is given by

h_t = o_t · tanh(c_t). (2.27)

All in all, the multiplicative gates and the cell state constitute the majority of what is required to understand the basics of the LSTM unit [45], [46].

2.4.2 The Recurrent Neural Filter

The RNF, presented in reference [24], is an autoencoder/decoder architecture based on LSTMs and inspired by Bayesian filtering. Instead of replacing the conventional Bayesian filtering framework with a single "block" network, the key idea behind the RNF is to benefit from the structure of the Bayesian filtering steps. The motivation behind this is that it enables filtering when the system dynamics are unknown while still handling missing data points. The encoder portion of the architecture is comprised of LSTM layers responsible for the different Bayesian filter steps. These LSTMs share a single cell state and hidden state², h, which they modify sequentially. The architecture of the RNF is illustrated in Figure 2.5. The decoder consists of an MLP, and is responsible for converting the network state representation to the desired estimate.

²Here h refers to both the cell state and the hidden state.

Figure 2.5: The Recurrent Neural Filter as presented in [24].

The RNF in [24] has three LSTM layers: propagation, input dynamics and error correction.
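Equations (2.22)–(2.27) can be collected into a single cell-step function; a minimal NumPy sketch, with each gate acting on the stacked vector [x_t; h_{t−1}]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    """One LSTM time step implementing equations (2.22)-(2.27)."""
    z = np.concatenate([x_t, h_prev])   # stacked [x_t; h_{t-1}]
    f = sigmoid(Wf @ z + bf)            # forget gate, eq (2.22)
    i = sigmoid(Wi @ z + bi)            # input gate, eq (2.23)
    c_tilde = np.tanh(Wc @ z + bc)      # input activation, eq (2.24)
    c = f * c_prev + i * c_tilde        # cell state update, eq (2.25)
    o = sigmoid(Wo @ z + bo)            # output gate, eq (2.26)
    h = o * np.tanh(c)                  # new hidden state, eq (2.27)
    return h, c
```

The additive form of the cell state update in equation (2.25) is visible directly in the code: c_prev enters the new cell state through a sum, not a repeated matrix product.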
The propagation step and the input dynamics step are together the equivalent of the prediction step in a Bayesian filter. The propagation LSTM is solely intended for propagation and does not take any input beyond the shared hidden state. The input dynamics LSTM takes the control signal, or input signal, u, as input and is intended to adjust the prediction made by the propagation LSTM. The last of the three horizontal LSTMs, the error corrector, takes the measurements, x, and adjusts the prediction to take the measurements into account. This is analogous to the update step in the Bayesian framework. To enforce a consistent representation of the state estimate in the hidden state, h, two measures are taken: i) the three LSTMs are trained with the same decoder, and ii) skip training is employed. In skip training the elements of h̃_t and h_t are randomly overwritten with values from previous states, h̃′_t and h̃_t respectively. The probability of this skip occurring is a tunable hyperparameter called the skip rate.

2.4.3 Estimating Covariance

Knowing the error covariance of the estimates produced by the filter networks is useful for fusing the estimated orientation with other estimates, and also for evaluating performance. Reference [47] explores the modeling of uncertainty in state estimates from neural networks. The k-dimensional state estimate is modeled as a multivariate Gaussian,

p(y|x) = (2π)^{−k/2} |Σ(x)|^{−1/2} e^{−½ (y − f(x))^T Σ(x)^{−1} (y − f(x))}, (2.28)

in the same way the state estimate in a Kalman filter is modeled. Here y is the ground truth. The networks are trained to output the mean, µ ∈ ℝ^k, and a parameterized covariance matrix Σ(s, r), where s ∈ ℝ^k is the vector parameterization of the diagonal elements and r ∈ ℝ^{k(k−1)/2} is a vector parameterization of the off-diagonal elements, see equation (2.30). The mean and covariance can be expressed as functions of the network input x,

µ = f(x), Σ(s, r) = Σ(x). (2.29)

The parameterization of the covariance matrix,

Σ_ij(s, r) = { σ_i² = e^{s_i},                                  i = j,
            { ρ_ij σ_i σ_j = tanh(r_ij) √(e^{s_i} e^{s_j}),     j < i,   (2.30)

describes the diagonal and one triangular portion of the matrix, with i and j being row and column indices both ranging from 1 to k. The covariance matrix is symmetric, and the remaining triangular portion follows from the first. Minimizing the negative log likelihood of the parameterized multivariate Gaussian distribution (see eq. (2.28)) is what allows for simultaneously training the model to estimate the mean as well as the error covariance. Hence

L(x, y) = ½ (y − f(x))^T Σ(x)^{−1} (y − f(x)) + ½ ln |Σ(x)| (2.31)

is used as the loss function. The authors of reference [47] also suggest two minor alterations in order to improve numerical stability when using equation (2.31) as the loss during training. The first suggestion is multiplying r with a constant (1 − ε), with ε ≈ 0.001. The second suggestion is to multiply the input to tanh with a small constant, α ≈ 0.05, to avoid saturating the activation function.

2.4.4 Regularization

To prevent overfitting of model parameters on noise in the training data, regularization techniques are employed. There are many different ways to achieve regularizing effects. Some of the common approaches are the following:

Early Stopping
When improvements are seen in the training loss but not in the validation loss, it is a key indication of poor generalization to data in the validation set. Early stopping is the practice of stopping the iterative training when no improvements are seen in some chosen metric, or in the calculated loss, on the validation set [48].

Mini Batches
Mini batches is the term commonly used to refer to the batching of training data. The key idea is to consider several samples in a single update step during training. This is done by computing the gradient of the loss function, evaluated over all the samples in the mini batch, with respect to the model parameters.
The impact of noise in the data will be smaller, the gradient will be a better approximation of the gradient computed using the entire dataset, and more general solutions are achieved. Batch processing also improves the efficiency of operations during training [49].

Weight Decay
Weight decay is the introduction of a penalty on the weights into the loss function. The most common way of doing this is to use L1 or L2 regularization, where the L1 or L2 norm of the weights is added to the loss function, scaled by some constant. The reasoning behind this is that solutions with smaller weights tend to generalize better and be more robust to noise [50].

Batch Normalization
Batch normalization is the practice of normalizing the activations of intermediate layers in a neural network based on metrics of a batch. This is done to deal with what is referred to as covariate shift. Covariate shift is when the changes in the weights of a layer during training cause the distribution of its activations to shift. Subsequently, the inputs to the next layer represent new features, which its weights are not tuned for [51].

2.4.4.1 Drop-out
One approach that almost always improves model performance is to train several similar models and average their predictions. However, this is an expensive approach. Drop-out is a method inspired by this averaging of many different models, and is done by dropping out a subset of the model, resulting in a "thinned" network. By sampling a different subset network during each training step, and only updating the weights of this subset, a regularizing effect is achieved [52].

2.4.5 Feature Scaling
The numerical range of the input features affects the rate of convergence of NNs during training. If the features are ill-conditioned³, SGD-based optimization will be slow to converge. This can be attributed both to oscillating network weights and to the differing magnitude of the update steps for the first layer and the consecutive layers.
To improve the convergence rate, the input features and the output labels are often re-scaled [53]. Two common ways to do feature scaling are min-max normalization, equation (2.34), and standardization, equation (2.33) [54].

std(x) = √( (1/N) Σ_{i=1}^{N} (x_i − mean(x))² ) (2.32)

z = (x − mean(x)) / std(x) (2.33)

z = (x − min(x)) / (max(x) − min(x)) (2.34)

³Features with magnitudes that differ vastly result in the gradient of the loss function with respect to the model parameters varying in magnitude as well. As a consequence, a learning rate appropriate for one model parameter will be much too large for another, causing either unstable or slow training. Features with large magnitude can also have a disproportionate influence on predictions.

The mean and standard deviation, or the min and max, are computed from the training data. All data fed to the network is then scaled in the same way, both during training and inference. Min-max normalization of features is an attempt to ensure that the scaled features are constrained to [0, 1]. Likewise, standardization is an attempt to ensure that the distribution of the scaled features has zero mean and unit variance.

2.4.6 Domain Randomization

When training ML models on synthetic data there is often a reality gap, where the simulated data differs from data captured in the real world. A simple way to deal with some of these differences is to use domain randomization (DR). In DR, relevant parameters in the simulation are randomized to reflect the uncertainty of their real-world counterparts [55].

3 Methods

This chapter will cover the methods used to answer the questions posed in Section 1.2.1. First, the tools used throughout the process are described, followed by a description of the problem set-up and the datasets used. Then, the model architectures, as well as the loss function and metrics used when training and evaluating the models, are described.
Subsequently, the benchmarks used for performance evaluation of the model architectures are presented, and the considerations made when choosing them are accounted for. Finally, the experiments performed and the hyperparameters used during training are reported.

3.1 Machine Learning Framework

TensorFlow is a machine learning interface that supports both the training and inference phases. The interface is developed open source by Google and is suitable for most machine learning application areas. The flexibility of TensorFlow lies in that it allows for experimenting with new models and for research, with sufficiently high performance, while still being robust enough to allow for the deployment of machine learning models [56]. TensorFlow is categorized as a low-level deep learning framework, while its application programming interface Keras can be considered a high-level framework. Keras is most commonly used with TensorFlow as backend, although it supports a few other low-level frameworks [57].

3.2 Separation into 2D and 3D Problem

The main focus of this thesis is to estimate absolute orientation utilizing ML approaches. Related literature, [11] and [19], employs supervised learning, which is an accessible approach, and is therefore also adopted in this thesis. Here, a separation of the problem into two parts will be established. To investigate the general suitability of utilizing an ML approach for sensor fusion purposes, a simplified 2-dimensional problem, with fewer DoFs, is set up. Here, the sensor device is allowed to rotate around one single axis, and therefore only one angle will be estimated. In order to avoid the discontinuity arising where angles transition between +π and −π, the angles fed to and from the network are represented using [sin θ, cos θ]. This representation also establishes a normalization of the inputs and outputs of the NNs. Further, the representation can easily be converted back into an angle using θ = arctan2(sin θ, cos θ).
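This encoding and its inverse can be sketched as follows (function names are ours):

```python
import numpy as np

def encode_angle(theta):
    """Represent an angle as [sin(theta), cos(theta)] to avoid the
    discontinuity at the +pi / -pi wrap-around."""
    return np.array([np.sin(theta), np.cos(theta)])

def decode_angle(enc):
    """Recover the angle in (-pi, pi] via the four-quadrant arctangent."""
    return np.arctan2(enc[0], enc[1])

# An angle just past the wrap-around decodes to its equivalent
# in (-pi, pi]:
theta = decode_angle(encode_angle(np.pi + 0.1))   # ~= -pi + 0.1
```

Because sin and cos are continuous across the wrap-around, a small change in orientation always produces a small change in the encoded representation, which is exactly the property a regression target needs.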
As a second task, the problem of retrieving an absolute orientation estimate in 3D shall be investigated. As explained in Section 2.1, a quaternion representation of the absolute orientation can be beneficial in the 3D case. Further, the quaternions used will be canonized such that the scalar part of the orientation quaternion is always positive. The reason for adopting canonized quaternions is that it simplifies the problem for the network: the network does not have to learn two representations for the same orientation, which would require more complex models of the output distributions [27]. In short, the 3D networks will be trained to estimate the canonized orientation quaternion representation with supervised learning.

3.3 Datasets

Two different datasets are used for the two estimation problems: 2D orientation and absolute orientation in 3D, respectively. In particular, the availability of a sufficient number of correctors, in 2D and 3D, has influenced the choice of datasets.

3.3.1 Synthetic 2D Data

Gnss-ins-sim [58] is a simulation tool for generating 3D IMU, MARG, and GNSS sensor data. It is also capable of generating 2D data, which is how it is used in the following experiments. The tool consists of a Python module that enables specific settings to mimic the true behavior of IMU sensor noise (bias, random walk, and bias stability). The module gives access to the ground truth both in quaternions and Euler angles. In 2D, it is sufficient to represent the orientation with a single-axis rotation, and therefore the quaternion representation will not be necessary. A training set containing 60 000 sequences, each consisting of 50 samples sampled at 100 Hz, is generated with DR of the noise parameters (see Table 3.1). The sequence length had to be limited mainly to reduce computation per update step, but also to allow for a large variety of sequences relative to the dataset size.
Similarly, a test set containing 500 sequences, each consisting of 1 000 samples sampled at the same frequency, is generated in the same noise domain. In these simulations, the initial orientation is sampled from U(−π, π), and the initial angular velocity is set to zero. The data is generated without linear acceleration. During training, 5% of the training set is used for validation.

Table 3.1: The distributions used for domain randomization of the noise parameters. These distributions are sampled for each generated 2D sequence. 1₂ is a vector of length two containing all ones.

Error source | Unit | Typ.¹ | Distribution
gyro bias | [deg/hr] | 1440 | U(±[400, 2000])
gyro angle random walk | [deg/√hr] | 0.15 | U(0.1, 0.2)
gyro bias instability | [deg/hr] | 2 | U(1, 3)
gyro bias instability corr.² | - | - | 100
acc. bias | [m/s²] | 1.37·10⁻² | U(1₂·10⁻², 1₂·2·10⁻²)
acc. vel. random walk | [m/s/√hr] | 0.012 | U(±1₂[0.008, 0.022])
acc. bias instability | [m/s²] | 3.6·10⁻⁵ | U(1₂·2·10⁻⁵, 1₂·5·10⁻⁵)
acc. bias instability corr.² | - | - | 1₂·200

¹ These are typical values, based on [59].
² This is the correlation coefficient used by gnss-ins-sim to model the bias as a first-order Gauss-Markov process [60].

3.3.2 3D MARG Dataset

For training the models in the 3D case, the OxIOD dataset [23] is used. In contrast to the 2D dataset, OxIOD is not synthetic. It consists of 158 sequences totaling 43 km in traveled distance. Each sequence contains MARG sensor readings as well as a reference orientation from a computer vision system. The dataset is provided with a suggested training/testing split of the data³, which is used in the following 3D experiments. The authors of the OxIOD dataset compare it with similar datasets and report a higher ground truth accuracy than the other datasets; however, they only compare translational accuracy [23]. The reported accuracy is 0.5 mm at 100 Hz⁴. OxIOD was chosen for its larger size compared to similar datasets, while still providing an absolute orientation.
Further, the dataset includes different types of motion, making it somewhat diverse. The data sequences were recorded using a smartphone attached to different parts of a person⁵. There are, for example, sequences with the phone in a walking person's pocket, hand, handbag, and on a trolley. There are also variations where the person is running, walking, and walking slowly. An example of a sequence, with a person carrying the phone in a handbag whilst walking, can be seen in Figure 3.1.

³ The suggested train/test split for OxIOD is to use specific sequences for training and testing, covering each of the different types of motion.
⁴ For reference, the datasets with the second-highest accuracy are TUM IV (1 mm, 200 Hz) [61] and the dataset that OriNet is trained on, EuRoC MAV (1 mm, 200 Hz) [62].
⁵ To the best of our knowledge, no PhD student was severely harmed during the data collection.

Figure 3.1: A 10 s data sample from the test set (OxIOD), showing readings from the accelerometer (specific force), gyroscope (angular velocity), and magnetometer (magnetic field strength) in the sensor's body frame.

All of the data in OxIOD is sampled at 100 Hz. We processed the data by dividing it into shorter sequences: a training set containing 51 784 sequences, each consisting of 50 samples, and a test set containing 342 sequences, each consisting of 1 000 samples. Of the training data, 5% is used for validation.

3.4 Model Architectures

Two main types of model structures are developed for the purpose of estimating orientation. Both architectures follow an encoder-decoder structure with a recurrent encoder consisting of LSTM units. The model outputs both the estimate and a parameterized covariance matrix, as described in Section 2.4.3.
3.4.1 The Recurrent Neural Network

Given the time-series nature of the IMU/MARG data and the dynamics of the orientation state, some sort of RNN is a natural choice, since RNNs do not require a fixed look-back window. More specifically, LSTMs were chosen in favor of simpler RNNs, due to their robustness towards vanishing gradients, as described in Section 2.4.1. Also, despite the success of attention-based models in fields such as natural language processing and image captioning, the fact that they rely on a discretization of the output makes them unsuitable when dealing with continuous outputs [24].

The proposed RNN models, designed and evaluated in this thesis, have a simple encoder-decoder structure. The encoder consists of a single LSTM layer, while the decoder consists of an MLP whose first layer has double the size of the hidden state. The second layer of the MLP has 2k + k(k−1)/2 units, with k being the dimension of the estimate (k = 2 in the 2D case and k = 4 in the 3D case).

Figure 3.2: Illustration of the RNN architecture evaluated. The green box represents an LSTM layer and the blue boxes represent fully connected layers.

3.4.2 The Recurrent Neural Filter

A variation of the RNF presented in Section 2.4.2 is evaluated, in which only two of the three separate LSTM encoders are used. Since there is no natural control signal in the case of orientation estimation from IMU/MARG data, the input dynamics layer is omitted and no control signal is used. This results in a slightly different RNF than the one proposed in [24], as the part of the RNF that models input dynamics has been removed. The resulting structure is illustrated in Figure 3.3. The two steps then constitute a prediction and an update step, where all measurements are given to the update part. The RNF model uses the same type of decoder as the previously described RNN.
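As a sketch of what the 2k + k(k−1)/2 decoder outputs can encode: k units for the estimate, and the remaining k + k(k−1)/2 for a lower-triangular Cholesky factor of the error covariance. The exact parameterization used is the one from Section 2.4.3; the version below, with an exponential applied to the diagonal to guarantee positive definiteness, is a common choice and only an assumption here:

```python
import numpy as np

def split_decoder_output(out, k):
    """Split a decoder output of length 2k + k(k-1)/2 into an estimate of
    dimension k and an error covariance matrix. The covariance is built
    from a lower-triangular Cholesky factor L: k log-diagonal entries
    (passed through exp) and k(k-1)/2 off-diagonal entries."""
    out = np.asarray(out, dtype=float)
    est = out[:k]
    L = np.diag(np.exp(out[k:2 * k]))
    L[np.tril_indices(k, -1)] = out[2 * k:]
    return est, L @ L.T  # L L^T is symmetric positive definite
```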
During training, the decoder decodes the hidden states of both the LSTM comprising the propagation step and the LSTM comprising the error correction step [24].

Figure 3.3: Illustration of the RNF architecture evaluated. The green boxes represent LSTM layers and the blue boxes represent fully connected layers.

3.5 Loss Function and Error Metrics

The negative log likelihood loss function and the root mean square Euclidean distance metrics used relate directly to the performance of the ML model. While the loss function has a direct connection to the training of the network, where it is minimized, the metric is detached from the training and can be chosen more freely.

3.5.1 Loss Function

Both the 2D and 3D experiments have been performed with the negative log likelihood function (equation (2.31)), which was described in Section 2.4.3. The benefit of estimating the error covariance of the produced estimates is twofold: it is useful both for fusing the estimate with other information and for evaluating the performance of the estimates.

3.5.2 Error Metrics

Two types of error metrics are used to measure and compare the performance of the trained models over the entire test set. Given a time series error vector δ, the root mean square error (RMSE) can be calculated as

RMSE(\delta) = \sqrt{\frac{1}{n}\left(\delta_1^2 + \ldots + \delta_n^2\right)}, (3.1)

where δ_i is the scalar error at time i ∈ [1, n]. Averaging the RMSE over all test sequences gives our first error metric, the average root mean square error (aRMSE). Similarly, we obtain our second error metric by averaging the quantity

RMSE_{20}(\delta) = \sqrt{\frac{1}{n-20}\left(\delta_{21}^2 + \ldots + \delta_n^2\right)}. (3.2)

Including the errors from before the estimate has converged makes it difficult to compare performance on sequences of different length. The second metric is referred to as the aRMSE20 and is utilized to reduce the impact of sequence length on the metric, as well as to provide a metric in which the EKF is not favoured due to being provided the ground truth as prior in the first time step.
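Equations (3.1) and (3.2), and their averaging into the aRMSE and aRMSE20 metrics, can be sketched as (hypothetical function names):

```python
import numpy as np

def rmse(delta, skip=0):
    """RMSE of a scalar error sequence, equation (3.1); with skip=20 the
    first 20 time steps are excluded, giving RMSE_20 of equation (3.2)."""
    d = np.asarray(delta, dtype=float)[skip:]
    return np.sqrt(np.mean(d ** 2))

def armse(error_sequences, skip=0):
    """Average the sequence-wise RMSE over a test set: aRMSE / aRMSE_20."""
    return float(np.mean([rmse(d, skip) for d in error_sequences]))
```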
Twenty time steps were found to be sufficient to achieve robustness towards the sequence length.

In the 2D case, the network only estimates one angle. The error,

\delta = \theta - \hat{\theta}, (3.3)

then consists of the difference between the ground truth θ and the estimate θ̂ in degrees. This error δ is then used to calculate the aRMSE and the aRMSE20 as described in equations (3.1) and (3.2).

From Section 2.1 we have the Euclidean distance δ_q, the difference error quaternion q_e, and the angle of the error quaternion θ_{q_e} as possible error metrics for 3D rotations represented by quaternions. However, since the difference error quaternion is not a scalar, it is inconvenient to use as a metric. Furthermore, the angle of the error quaternion requires a more complex calculation than the Euclidean distance. Therefore, the Euclidean distance δ_q is chosen as the 3D error metric, from which we calculate the aRMSE and the aRMSE20 as in equations (3.1) and (3.2).

3.6 Benchmarks

Two EKFs, one for 2D orientation estimation and one for 3D orientation estimation, are used as benchmarks for model performance. Section 2.3.2 introduced the general theory behind the EKF; here, the specific design choices of each of the benchmark EKFs are presented.

2D EKF

The EKF benchmark used in the 2D case utilizes a state vector x consisting of the angle θ and its derivative ω,

x = \begin{bmatrix} \theta \\ \omega \end{bmatrix}. (3.4)

The chosen motion model, f(x), assumes constant angular velocity and models the angle by integrating the velocity over time. Further, the motion model constrains θ by wrapping it to the range [−π, π), in order to avoid the ambiguity of θ and θ + 2π representing the same angle. The wrapping function w_{[−π,π)}(·) is defined as

w_{[-\pi,\pi)}(x) = \mathrm{mod}(x + \pi, 2\pi) - \pi. (3.5)

The motion model and the Jacobian of the motion model, F(x), are then given by

f(x) = \begin{bmatrix} w_{[-\pi,\pi)}(\theta + \Delta t\,\omega) \\ \omega \end{bmatrix}, \quad F(x) = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}. (3.6)

The update step of the 2D EKF is rather straightforward.
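The wrapping function (3.5) and the motion model (3.6) above can be sketched as:

```python
import numpy as np

def wrap(x):
    """Wrap an angle to the range [-pi, pi), equation (3.5)."""
    return np.mod(x + np.pi, 2 * np.pi) - np.pi

def predict_2d(x, dt):
    """Constant angular velocity motion model f(x) of equation (3.6),
    with state x = [theta, omega]."""
    theta, omega = x
    return np.array([wrap(theta + dt * omega), omega])
```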
The chosen measurement vector y consists of the gyroscope reading ω_g, and the sin(·) and cos(·) of the angle θ_a retrieved from the normalized accelerometer reading,

y = \begin{bmatrix} \omega_g \\ \sin(\theta_a) \\ \cos(\theta_a) \end{bmatrix}. (3.7)

The measurement model h(x) and its Jacobian H(x) are defined as

h(x) = \begin{bmatrix} \omega \\ \sin(\theta) \\ \cos(\theta) \end{bmatrix}, \quad H(x) = \begin{bmatrix} 0 & 1 \\ \cos(\theta) & 0 \\ -\sin(\theta) & 0 \end{bmatrix}. (3.8)

Given the measurement and motion model, the 2D EKF then follows the steps as explained in Section 2.3.2.

3D EKF

In the 3D case, an EKF estimating the absolute orientation in quaternions is used to benchmark the developed ML models. The chosen state vector x is

x_k = q_k, (3.9)

where q is the unit quaternion encoding the true orientation at the discrete time-step k. Further, since the derivative of q_k is excluded from the state, the angular rate ω_k is instead included in the control signal⁶ u_k. Note that the control signal is expressed in the body frame and constitutes a three-axis rotation in Euclidean space.

Other design alternatives are, for example, to use an augmented state vector to estimate the biases of the gyro and the accelerometer. It is also possible to include the quaternion rates as a part of the state vector. Both of these alternatives could potentially improve the estimation accuracy of the EKF, but they also increase the complexity of the implementation.

The continuous-time derivative of the orientation quaternion, q̇, with respect to ω (see reference [31]) is

\dot{q} = \frac{1}{2} \underbrace{\begin{bmatrix} 0 & -\omega_x & -\omega_y & -\omega_z \\ \omega_x & 0 & \omega_z & -\omega_y \\ \omega_y & -\omega_z & 0 & \omega_x \\ \omega_z & \omega_y & -\omega_x & 0 \end{bmatrix}}_{=\Omega(\omega)} q, (3.10)

which can also be re-written as

\dot{q} = \frac{1}{2} \underbrace{\begin{bmatrix} -q_1 & -q_2 & -q_3 \\ q_0 & -q_3 & q_2 \\ q_3 & q_0 & -q_1 \\ -q_2 & q_1 & q_0 \end{bmatrix}}_{=\bar{\Omega}(q)} \omega. (3.11)

Given the derivative, the discretized motion model f can be chosen as

f\left(\hat{x}_{k-1|k-1}, u_k, v_k\right) = \left(I + \frac{\Delta t}{2} \Omega(u_k)\right) \hat{x}_{k-1|k-1} + \frac{\Delta t}{2} \bar{\Omega}\left(\hat{x}_{k-1|k-1}\right) v_k, (3.12)

where ∆t is the sampling time, and v_k is the process noise.
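The noise-free part of the discretized motion model in equation (3.12), i.e. with v_k = 0, can be sketched as follows. The renormalization at the end is a common practical safeguard and not part of equation (3.12):

```python
import numpy as np

def omega_matrix(w):
    """The 4x4 matrix Omega(omega) of equation (3.10)."""
    wx, wy, wz = w
    return np.array([[0.0, -wx, -wy, -wz],
                     [wx,  0.0,  wz, -wy],
                     [wy, -wz,  0.0,  wx],
                     [wz,  wy, -wx,  0.0]])

def predict_quaternion(q, u, dt):
    """One prediction step: q_k = (I + dt/2 * Omega(u_k)) q_{k-1},
    followed by renormalization to keep q a unit quaternion."""
    q = (np.eye(4) + 0.5 * dt * omega_matrix(u)) @ q
    return q / np.linalg.norm(q)
```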
The process noise is modeled as an additive noise contribution to the angular rate ω in the body frame, i.e.,

u_k = \omega_k + v_k. (3.13)

Thereby, the gyroscope measurements can be utilized as the control vector, and the gyroscope measurement noise will be a part of the process noise.

⁶ The signal from the gyro is relatively accurate and rapidly changing. Including it in the state vector and subsequently estimating it will generally not result in large improvements in its estimation. However, estimating it results in a larger state vector and subsequently larger system matrices and more computation.

In order to simplify the expressions in equation (3.12), we introduce

F_k = \left(I + \frac{\Delta t}{2} \Omega(u_k)\right), \quad G_k = \frac{\Delta t}{2} \bar{\Omega}\left(\hat{x}_{k-1|k-1}\right). (3.14)

By assuming v_k ∼ N(0, Q_k), the prediction step of the EKF, analogous to equations (2.10)-(2.11), is given as

\hat{x}_{k|k-1} = F_k \hat{x}_{k-1|k-1}, (3.15)
P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + G_k Q_k G_k^T. (3.16)

Note that v_k is given in the body frame and translated to the world frame via G_k. Thus, G_k is also applied to the process covariance Q_k in equation (3.16).

The update step of the EKF is divided into two separate steps, which apply the measurement models of the accelerometer and the magnetometer, respectively. This separation is introduced mainly to add adaptivity to the algorithm, since the EKF could potentially be run without the magnetometer data. Furthermore, the separation makes it possible to turn off either of the measurement models, which increases the transparency of the verification and debugging process.

The accelerometer measurements y_{a,k} can be modeled as

y_{a,k} = \left(R(q_k)\right)^T \left(g_0 + f_{a,k}\right) + e_{a,k}, (3.17)

where R(q_k) is the function that transforms the current orientation q_k to its corresponding rotation matrix. Further, g_0 is the nominal gravitation vector, and f_{a,k} is the linear acceleration. Both g_0 and f_{a,k} are expressed in the world frame; added together, they form the specific force.
Thus, the transposed rotation matrix, R(q_k)^T, converts the specific force to the body frame. Lastly, e_{a,k} denotes the accelerometer measurement error, also in the body frame.

Equation (3.17) can be approximated using a Taylor expansion, and with an assumption of zero linear acceleration, f_{a,k} = 0, the accelerometer measurement model can be written as

y_{a,k} \approx \left(R\left(\hat{x}_{k|k-1}\right)\right)^T g_0 + \left(\frac{dR}{dx}\left(x_k - \hat{x}_{k|k-1}\right)\right)^T g_0 + e_{a,k}. (3.18)

Then let

H_k = \left(\frac{dR}{dx}\left(\hat{x}_{k|k-1}\right)\right)^T g_0, (3.19)

and the accelerometer measurement update of the EKF, similar to equations (2.12)-(2.16) in Section 2.3.2, becomes

v_k = y_{a,k} - \left(R\left(\hat{x}_{k|k-1}\right)\right)^T g_0, (3.20)
S_k = H_k P_{k|k-1} H_k^T + R_{a,k}, (3.21)
K_k = P_{k|k-1} H_k^T S_k^{\dagger}, (3.22)
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k v_k, (3.23)
P_{k|k} = P_{k|k-1} - K_k S_k K_k^T. (3.24)

Similar to how the measurement model of the accelerometer has been derived, the magnetometer part of the update step can be obtained. The magnetometer measurement model is developed from

y_{m,k} = \left(R(q_k)\right)^T \left(m_0 + f_{m,k}\right) + e_{m,k}, (3.25)

where m_0 is the earth's magnetic field strength in the world frame, f_{m,k} constitutes other magnetic fields present, and e_{m,k} is the measurement error.

The tuning used in this thesis, which was found to give the best performance on the test sequences, was

Q_k = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad R_{a,k} = \begin{bmatrix} 150 & 0 & 0 \\ 0 & 150 & 0 \\ 0 & 0 & 150 \end{bmatrix}, \quad R_{m,k} = \begin{bmatrix} 200 & 0 & 0 \\ 0 & 200 & 0 \\ 0 & 0 & 200 \end{bmatrix}. (3.26)

As a reminder, Q_k is the process noise covariance, R_{a,k} is the accelerometer measurement covariance, and R_{m,k} is the magnetometer measurement covariance. Further, the initial state of the EKF was chosen close to the ground truth for each test sequence. This was done in order to ensure that the EKF would converge in all sequences.

3.7 Experiments

A quantitative approach is taken with regard to training the developed models. Here, different variations of the two main types of architectures, the RNN and the RNF from Section 3.4, are trained and evaluated in both 2D and 3D.
The approach is quantitative in that several models are trained with the same hyperparameter settings but different, randomly initialized, weights. The performance of these multiple instantiations is then evaluated in terms of best and average performance, i.e. the aRMSE and aRMSE20 metrics. Different settings are then compared to each other to indicate which parameter setting works better. This systematic parameter testing is commonly referred to as grid search [63], and the parameters evaluated in the experiments are the hidden state size and the drop-out rate. Other parameters, like the batch size, early stopping, learning rate, and skip-rate, are varied and chosen by engineering judgement. Other relevant information for the experiments is i) that the Adam optimizer is chosen, ii) gradient norm scaling is applied with the maximum L2 norm set to 0.5, iii) the training data is shuffled after each epoch to ensure unbiased subsequent updates, and iv) the training/validation split is resampled for each trained model.

3.7.1 The 2D Case Experiments

In the 2D case, three different RNN types with different recurrent state sizes are trained with three different drop-out rates, resulting in nine different versions of RNNs. For each RNN version, 20 instances are trained with randomly initialized weights, to verify consistent results. Similarly, three different RNF types are trained, also with varying recurrent state size, but without drop-out due to problems with divergence. These models (the RNNs and the RNFs) are all trained with a learning rate of 10⁻³. Training is done for 60 epochs, or until the loss function does not improve for three consecutive epochs (early stopping). The synthetic gyroscope and accelerometer measurements are all standardized using the mean and standard deviation computed from the entire training set.

3.7.2 The 3D Case Experiments

The models designed to estimate orientation in 3D are trained to predict the canonized quaternion.
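The standardization of the sensor measurements mentioned above can be sketched as (a hypothetical helper; the key point is that the statistics come from the training set only):

```python
import numpy as np

def standardize(data, mean=None, std=None):
    """Standardize sensor channels to zero mean and unit variance.
    Pass the training-set mean/std when transforming validation or
    test data, so that no test statistics leak into the scaling."""
    data = np.asarray(data, dtype=float)
    if mean is None:
        mean = data.mean(axis=0)
    if std is None:
        std = data.std(axis=0)
    return (data - mean) / std, mean, std
```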
For reasons explained in Section 2.4.5, feature scaling can be beneficial for the convergence rate of the developed RNN and RNF. Therefore, in this thesis the accelerometer, magnetometer, and gyro data are feature scaled using standardization. The ground truth labels, however, are already scaled due to the constraint of the quaternions being unit length. The 3D models are trained on sequences 50 time steps long, corresponding to half a second worth of sensor readings. In total, the models are trained on 51 784 sequences, whereof 5% are used for validation. The test set consists of 342 sequences with a length of 1 000 time steps. Further, the models are trained with a learning rate of 10⁻² for the first five epochs, followed by 10⁻⁵ during the rest of the training. Training is done for 60 epochs, or until the loss function does not improve for three consecutive epochs (early stopping).

4 Results

In this chapter the results of the performed experiments are presented. Firstly, the results of the proposed 2D RNN and RNF models are displayed and compared. Secondly, the results of the 3D models are presented and compared to the benchmark EKF. Finally, results from an ablation study and a missing-data test with the 3D RNF are reported.

4.1 Results of the 2D Models

Table 4.1 provides a comparison of the performance between the different RNN versions trained for the 2D problem, in terms of the aRMSE and the aRMSE20. The errors reported from the 2D models are in degrees, and twenty instances of each version were trained. Additionally, the best result achieved for each version is also reported in Table 4.1, in the columns min(aRMSE) and min(aRMSE20). The results in Table 4.1 show that two parameter settings, versions 1 and 8, deliver the best performance. This result is somewhat counter-intuitive since these parameter settings are farthest apart in the table.

Table 4.1: Recorded test results from the trained 2D RNN model versions.
Version | Hidden State Size | Drop Out | max aRMSE | aRMSE | min aRMSE | max aRMSE20 | aRMSE20 | min aRMSE20
1 | 8 | 0.0 | 3.16 | 0.5613 | 0.2297 | 1.573 | 0.2803 | 0.1226
2 | 8 | 0.2 | 2.5 | 0.5464 | 0.1711 | 1.591 | 0.3108 | 0.134
3 | 8 | 0.4 | 6.326 | 0.6758 | 0.1946 | 6.381 | 0.5176 | 0.1153
4 | 16 | 0.0 | 2.538 | 0.4743 | 0.2564 | 2.52 | 0.3695 | 0.1742
5 | 16 | 0.2 | 0.6635 | 0.3889 | 0.1982 | 0.5844 | 0.3274 | 0.1584
6 | 16 | 0.4 | 1.692 | 0.459 | 0.1992 | 1.69 | 0.3841 | 0.1674
7 | 32 | 0.0 | 1.663 | 0.4024 | 0.1909 | 1.628 | 0.3697 | 0.1832
8 | 32 | 0.2 | 1.317 | 0.358 | 0.1655 | 1.307 | 0.3396 | 0.1653
9 | 32 | 0.4 | 1.341 | 0.4319 | 0.212 | 1.326 | 0.3825 | 0.1871

Figure 4.1 shows a test sequence of the generated 2D data, estimates from a trained RNN model, and estimates from the benchmark EKF. Notably, the proposed RNN manages to estimate the ground truth of the test sequence fairly well.

Figure 4.1: Plot of a 2D test sequence with orientation estimates from a trained RNN and the benchmark EKF.

Similarly to Table 4.1, Table 4.2 shows a comparison of performance, in terms of aRMSE and aRMSE20, but for the proposed 2D RNF models. Also here, 20 instances of each version have been tested, and the best instance of each version is indicated in terms of min(aRMSE) and min(aRMSE20). Note that the RNFs where the dropout was different from zero failed to converge in the 2D experiments, and are thus not reported in the table. Recall that the motivation behind the 2D tests is to investigate the suitability of applying ML to sensor fusion problems; finding an indication of which parameter setting works better is thus not crucial in the 2D case.

Table 4.2: Recorded test results from the trained 2D RNF model versions.

Version | Hidden State Size | Drop Out | max aRMSE | aRMSE | min aRMSE | max aRMSE20 | aRMSE20 | min aRMSE20
1 | 8 | 0.0 | 2.772 | 0.6884 | 0.2575 | 2.638 | 0.4963 | 0.1593
2 | 16 | 0.0 | 0.5101 | 0.3052 | 0.1597 | 0.4786 | 0.2777 | 0.1453
3 | 32 | 0.0 | 0.4433 | 0.2799 | 0.1499 | 0.445 | 0.2798 | 0.1504

In Table 4.3, the orientation estimation errors of the benchmark EKF and of the best instances of the proposed 2D RNN and RNF can be compared.

Table 4.3: Performance comparison of the benchmark EKF and the best instances of the proposed RNN and RNF for the 2D problem.

Model | aRMSE | aRMSE20
RNN | 0.1655 | 0.1226
RNF | 0.1499 | 0.1453
EKF | 2.187 | 0.0884

4.2 Results of the 3D Experiments

In Tables 4.4 and 4.5, the aRMSE and aRMSE20 for the trained instances of the proposed RNN and RNF models are presented. For the RNN architecture, the best parameter setting was found for version 9, while for the RNF architecture version 6 gave the best results. Note that the RNFs with the smaller hidden state sizes, 8 and 16, were not trained with dropout, since it was concluded that larger hidden state sizes were needed.

Table 4.4: Recorded test results from the trained 3D RNN model versions.

Version | State Size | Drop Out | max aRMSE | aRMSE | min aRMSE | max aRMSE20 | aRMSE20 | min aRMSE20
1 | 8 | 0.0 | 0.2708 | 0.1694 | 0.1442 | 0.2706 | 0.1689 | 0.1439
2 | 8 | 0.2 | 0.1796 | 0.1613 | 0.1489 | 0.1795 | 0.1610 | 0.1486
3 | 8 | 0.4 | 0.2108 | 0.1668 | 0.1422 | 0.2108 | 0.1664 | 0.1411
4 | 16 | 0.0 | 0.3139 | 0.1586 | 0.1326 | 0.3139 | 0.1582 | 0.1321
5 | 16 | 0.2 | 0.1695 | 0.1468 | 0.1330 | 0.1694 | 0.1464 | 0.1325
6 | 16 | 0.4 | 0.1669 | 0.1410 | 0.1271 | 0.1667 | 0.1405 | 0.1265
7 | 32 | 0.0 | 0.2129 | 0.1394 | 0.1170 | 0.2121 | 0.1390 | 0.1164
8 | 32 | 0.2 | 0.1619 | 0.1354 | 0.1189 | 0.1618 | 0.1350 | 0.1183
9 | 32 | 0.4 | 0.1613 | 0.1328 | 0.1148 | 0.1609 | 0.1323 | 0.1141

Table 4.5: Recorded test results from the trained 3D RNF model versions.

Version | State Size | Drop Out | max aRMSE | aRMSE | min aRMSE | max aRMSE20 | aRMSE20 | min aRMSE20
1 | 8 | 0.0 | 0.3218 | 0.1615 | 0.1304 | 0.3222 | 0.1613 | 0.1302
2 | 16 | 0.0 | 0.297 | 0.1305 | 0.1099 | 0.2974 | 0.1304 | 0.1097
3 | 32 | 0.0 | 0.119 | 0.106 | 0.0949 | 0.1188 | 0.1059 | 0.0947
4 | 32 | 0.1055 | 0.1152 | 0.1067 | 0.096 | 0.1151 | 0.1065 | 0.0959
5 | 32 | 0.2254 | 0.1332 | 0.1175 | 0.1065 | 0.1331 | 0.1174 | 0.1064
6 | 64 | 0.0 | 0.1056 | 0.09569 | 0.082 | 0.1055 | 0.09553 | 0.0817
7 | 64 | 0.1055 | 0.1035 | 0.09704 | 0.0871 | 0.1034 | 0.09691 | 0.0869
8 | 64 | 0.2254 | 0.108 | 0.09955 | 0.0921 | 0.1079 | 0.09941 | 0.092

In Figure 4.2, a plot of a ground truth orientation quaternion and estimates of it can be viewed. Alongside the orientation quaternion, the estimation errors in Euler angles can also be viewed. The estimates are produced by the best performing instance of RNF model version 6 and by the benchmark EKF. Also visible in the plot is the estimated error covariance of the RNF estimates and the calculated error covariance of the EKF estimates. The test sequence chosen for Figure 4.2 was selected with the intention to reflect the average performance of the best RNF version; the plot shows neither the best nor the worst performance.

Figure 4.2: The estimated sequence of quaternions (left) and the estimation error in Euler angles (right) from the RNF and the EKF in 3D.

4.2.1 Comparison and Benchmarks

The best results of the 3D experiments are compared with the benchmark EKF in Table 4.6. A closer inspection of the table shows that the proposed RNF performs better than both the EKF and the RNN for the chosen metrics. Also visible in the table are the number of trainable weights for the NN models, and the average one-step computation time for all three models. The average time to compute one step with the 3D EKF is 0.3368 ms on the hardware used¹. The same figure for the RNN and the RNF is 0.07017 ms and 0.1929 ms, respectively.

¹ All of the models and filters reported on are run on a Lenovo ThinkPad T490s with an Intel® Core™ i7-8565U CPU @ 1.80 GHz × 8. The TensorFlow models are run on the CPU.

Table 4.6: Comparison of the best results in 3D, along with the models' execution times.

Model | aRMSE | aRMSE20 | Trainable Weights | Execution Time [ms]
RNN | 0.1148 | 0.1141 | 8 398 | 0.07017
RNF | 0.08200 | 0.08170 | 45 282 | 0.1929
EKF | 0.1017 | 0.1021 | - | 0.3368

4.2.2 Ablation Study

Removing parts of a network architecture to evaluate the impact this has on performance is referred to as an ablation study² [64]. The two presented model architecture types are very similar. The proposed RNN architecture is essentially the same as the proposed RNF architecture, trained without separate LSTMs for propagation and error correction. Any differences in performance of the two model types can thus be attributed to the LSTM layer responsible for propagation. Comparing the performance of 3D RNN version 7 and 3D RNF version 3, two models with no dropout and identical state size, the RNF has a 23% lower aRMSE and aRMSE20, with an increase in trainable weights from 8 398 to 13 774 (roughly a 65% increase).

² The term ablation study comes from experimental biology and refers to the removal of brain tissue, with the intent to study the impact its removal has, and infer its function.

4.2.3 Missing Data

In Section 2.4.2, the RNF model's potential capability to handle missing data points was introduced. Figures 4.3 and 4.4 show the same sequence, estimated with the same model, as in Figure 4.2. The only difference is that in Figure 4.3 sensor data was given to the proposed RNF only every other time step, and in Figure 4.4 no sensor data was given to the RNF after 400 time steps.

Figure 4.3: Plot showing an orientation sequence and estimate (left), and the estimation error (right) produced by the trained RNF when input data was missing every other time step.

Figure 4.4: Plot showing an orientation sequence and estimate (left), and the estimation error (right) produced by the trained RNF when input data was missing after 400 time steps.

5 Discussion

The aim of this study was to investigate the viability of using ML-based models to estimate absolute orientation given MARG data.
Two models with different attributes have been developed, namely an RNN-based and an RNF-based design. In this chapter, the results reported in Chapter 4 are discussed, followed by some insights into the future potential of similar ML-based orientation estimation approaches.

5.1 Performance and Uncertainty of the Developed ML Models

An initial objective of the project was to verify that the problem is at all tractable with an ML approach. To answer this, simpler experiments in 2D were executed, where the current study found that the 2D models were able to out-perform the EKF on synthetic sequences without linear acceleration in the aRMSE metric. However, in the aRMSE20 metric, the EKF still performs better than the ML models (see Table 4.3). The demonstrated convergence of the presented models in 2D indicates the feasibility of an ML approach that estimates orientation.

The presented ML models have been shown to perform similarly to the EKF benchmarks, with an aRMSE of comparable magnitude. More specifically, for the 3D results, this thesis has reported that the best instance of the RNF model performs better than the EKF in both the aRMSE and the aRMSE20 metric (see Table 4.6). This result indicates that the proposed RNF model is a better choice of architecture compared to the RNN-based one. Another feature that argues in favor of the RNF is its potential to handle occurrences of missing data points, see Figures 4.3 and 4.4. In real-time applications, this property may be highly valuable.

Another finding was that the execution time was considerably shorter for the proposed RNF compared to the EKF, and the proposed RNN was faster still. A shorter execution time implies a less computationally expensive filter, which speaks in favor of the ML models. Intuitively, it is easy to believe the NNs to be heavy to run due to the number of weights and the non-linearities found in the activation functions.
This is not the case here, and a contributing factor is that the proposed networks are not very deep. A note regarding the comparison of the proposed models and the benchmark EKF is that it is highly dependent on the specific EKF implementation used. Still, a reasonable execution time is necessary for the developed models to be considered feasible at all.

The third research question regarded the possibility of quantifying the certainty of the estimate. As a solution, the error covariance estimation was included in the loss function presented in Section 2.4.3. The plot in Figure 4.2 shows the result of the implemented covariance estimation. Furthermore, the covariance estimation itself also appears to be of reasonable magnitude in relation to the estimate and its error. At this stage, the correctness of the covariance estimation has not been sufficiently evaluated. Still, the inclusion of error covariance estimation is a highly desirable trait in many applications and should be explored further in future work. The relatively good aRMSE20 of 2D RNN version 1 might be an indication of the small state size forcing the network to learn a much simpler, less rich, representation of the system dynamics.

5.1.1 Research Methodology

A number of factors affect the findings of this study. As mentioned in Section 3.6, the benchmark EKF is given a prior close to the ground truth to ensure convergence on all test sequences. The proposed NN models, however, are not given any good prior; their hidden state is initialized with all zeros at the start of a sequence. Potentially, the presented ML models can learn to perform well both when the priors are accurate and when they are poor. Training the models with short sequences, as described in Section 3.7, has potentially optimized them to converge at a rate that is beneficial given the length of the training sequences.
Converging fast will have a greater impact on decreasing the loss on shorter sequences, due to the relative contribution of the errors at each time step. If an NN does not have enough parameters to model the relationships that are required to make a good estimate both when its prior is poor and when its prior is good, there has to be a trade-off. An illustration of the potential trade-off between training on long versus short sequences can be found in Figure 5.1. Since the models are evaluated on test sequences 20 times longer than the training sequences, if this trade-off is at play, different lengths of training sequences might alter performance¹. A way to steer this trade-off in the desired direction (fast convergence versus accurate estimates over long time) could be to weight the loss. By weighting the loss of the first time steps higher, a fast convergence could be encouraged. Alternatively, a lower weight of the loss at the first time steps could encourage more accurate estimates after convergence. Nevertheless, it cannot be excluded that the performance of the developed models could improve with longer training sequences.

¹ If minimizing the metric on the test data is the main concern, choosing training sequences of the same length as the test sequences will most likely yield the best result. However, doing so is computationally expensive, and the resulting model might still not perform better in real-time applications with near-infinite length sequences.

Figure 5.1: An illustrative example of two univariate state estimators, with different SNR assumptions, applied on two sequences of different length. Ground truth is represented by the blue line, the estimate by the orange, and the measurements by crosses. (RMSE per panel — high SNR assumption: 0.32 on the short sequence, 0.20 on the long; low SNR assumption: 0.36 on the short sequence, 0.17 on the long.)
Another aspect that may influence the way the results are perceived is the benchmark EKFs. It can be noted that the EKFs were not developed with the intention to make them "perfect", but rather to achieve satisfactory orientation estimates. Further, the EKFs are given a quite unfair advantage in their priors, as previously mentioned.